D: 18 Mar 2024
Data management

Generate binary fileset

--make-pgen ['vzs'] ['format='<code>] ['trim-alts'] ['erase-phase']
--make-bed ['vzs'] ['trim-alts']

--make-pgen creates a new PLINK 2 binary fileset, after applying sample/variant filters and other operations below. For example,

  plink2 --bgen input.bgen --maf 0.05 --make-pgen --out binary_fileset

does the following:
In contrast, the fileset left behind by --keep-autoconv is just the result of step 1. --make-bed creates a PLINK 1 binary fileset instead, while --make-bpgen creates a hybrid fileset (main genotype table is in PLINK 2 format, sample and variant files use the PLINK 1 representation) loadable with --bpfile. Other notes:
By default, --make-[b]pgen/--make-bed (as well as --make-just-pvar/--make-just-bim below) do not resort the variants, and they'll error out if the input file is not at least sorted by chromosome. (This is a change from PLINK 1.x.) However, if you add --sort-vars, the variants will be resorted by chromosome code, then position, then ID. The following string-comparison modes are supported:
Regular chromosomes are sorted (in numeric code order; for humans, PAR1 has an effective numeric code of 22.5, PAR2 23.5) before other contigs. The --sort-vars setting also controls --pmerge[-list]'s variant output order.

--make-just-pvar ['zs'] ['cols='<column set descriptor>]
--make-just-bim ['zs']

--make-just-pvar is a variant of --make-pgen which only generates a .pvar file, and --make-just-psam plays the same role for .psam files. Similarly, --make-just-bim just generates a .bim file, and --make-just-fam just generates a .fam file. Unlike most other PLINK commands, these do not require genotype data (though you won't have access to many filtering flags when using these in no-genotype mode).

Use these cautiously. It is very easy to desynchronize your binary genotype data and your .pvar/.psam indexes if you use these commands improperly. If you have any doubt, stick with --make-[b]pgen/--make-bed.

Export non-PLINK 2 fileset

--export <output format(s)...> [{01 | 12}] ['bgz'] ['id-delim='<char>]
--export-allele <filename>

--export creates a new fileset, after sample/variant filters have been applied. The following output formats are currently supported:
(Use --make-bed + PLINK 1.9 --recode to export other formats for now.)

For example,

  plink2 --pfile binary_fileset --export bgen-1.1 --out new_text_fileset

generates new_text_fileset.bgen and new_text_fileset.sample from the data in binary_fileset.pgen + .pvar + .psam, while

  plink2 --pfile binary_fileset --recode vcf id-paste=iid --out new

generates new.vcf from the same data, removing family IDs in the process.

In addition,
Irregular output coding

--output-chr <MT code>

PLINK 1.9 and 2.0 support seven chromosome coding schemes in output files. You can select between them by providing the desired human mitochondrial code:
PLINK correctly interprets all of these encodings in input files.

--output-missing-genotype <char>

--output-missing-genotype allows you to change the character (default '.' unless that breaks PLINK 1.9, in which case it's '0' [1]) used to represent missing genotypes in PLINK-format files generated by --make-[b]pgen/--make-bed/--export, while --output-missing-phenotype changes the string (default 'NA' for .psam, '-9' for older formats) representing missing phenotypes. Note that these defaults are mostly different from PLINK 1.x.

These flags do not affect --pmerge[-list] or the autoconverters, since they generate files that may be reloaded during the same run. Add --make-[b]pgen/--make-bed if you want to change missing genotype/phenotype coding when performing those operations.

[1]: This applies to "--export ped" and "--export tped".

Invalid genotypes

--set-invalid-haploid-missing ['keep-dosage']

--set-invalid-haploid-missing (--set-hh-missing before alpha 6) causes heterozygous haploid hardcalls and all female chrY calls to be erased during --make-[b]pgen/--make-bed.
--set-mixed-mt-missing ['keep-dosage']

Mitochondrial DNA is subject to heteroplasmy, so PLINK 2 normally saves MT dosages near 0.5 as 'heterozygous' genotypes, and these are not erased by --set-hh-missing. However, some analytical methods don't use these mixed MT genotype calls, and instead assume that they don't exist. The --set-mixed-mt-missing flag can be used with --make-[b]pgen/--make-bed to generate a dataset with mixed MT hardcalls erased.

X chromosome pseudo-autosomal region

--split-par <last bp position of head> <first bp position of tail>

PLINK 2 prefers to represent the X chromosome's pseudo-autosomal region as 'PAR1' and 'PAR2' regions; this removes the need for special handling of male X heterozygous calls. This has a major computational advantage over PLINK 1.x's 'XY' convention: splitting and remerging no longer require resorting of the variants. Thus, PLINK 1.9's --split-x flag has been retired in favor of --split-par, which takes the base-pair boundaries of the pseudo-autosomal regions, and treats all chrX variants in those regions as if their chromosome codes were PAR1/PAR2 instead.

As (typo-resistant) shorthand, you can pass one of the following build codes to --split-par:
--split-par errors out if the dataset already contains a PAR1 or PAR2 region. Conversely, --merge-par treats all variants in PAR1/PAR2 as if their chromosome code were X.

Note that "--export vcf" has special-case logic for chrX/PAR1/PAR2: chromosome codes are all saved as chrX, but male ploidies are rendered using the PAR1/PAR2 boundaries. It should not be combined with --merge-par.

--merge-x

To import PLINK 1.x-style data with 'XY' codes,
Unknown-sex samples on chrY

--y-nosex-missing-stats

As of the 18 Oct 2023 alpha 6 build, PLINK 2 includes unknown-sex samples for most purposes on chrY: usually, most or all of the genotypes for actually-female samples are missing, and that's not true for the actually-male samples, so results are similar or even identical to what you'd get with complete sex information. However, missingness-rate is an exception; otherwise "--geno 0.1" and similar filters would break. (Het-haploid-rate is also an exception since --geno/--mind may take it into account.)

When you do want unknown-sex samples to be included in chrY missingness-rate and het-haploid-rate computations, use the --y-nosex-missing-stats flag. This affects --geno/--mind, --genotyping-rate, and --missing. (It does not affect --freq or --geno-counts.)

Update variant information

--set-missing-var-ids <template string>
--set-all-var-ids <template string>
--new-id-max-allele-len <len> [{error | missing | truncate}]

Whole-exome and whole-genome sequencing results frequently contain variants which have not been assigned standard IDs. If you don't want to throw out all of that data, you'll usually want to assign them chromosome-and-position-based IDs. --set-missing-var-ids (which just replaces missing IDs) and --set-all-var-ids (which overwrites everything) provide one way to do this.

The parameter taken by these flags is a special template string, with a '@' where the chromosome code should go, and a '#' where the base-pair position belongs. (Exactly one @ and one # must be present.) For example, given a .pvar file starting with

  #CHROM  POS     ID  REF  ALT
  chr1    10583   .   G    A
  chr1    886817  .   T    C
  chr1    886817  .   C    CATTTT

"--set-missing-var-ids @:#[b37]" would name the first variant 'chr1:10583[b37]', the second variant 'chr1:886817[b37]'... and the third variant also gets the name 'chr1:886817[b37]'.

To maintain unique IDs in this situation, you can include '$r'/'$a' in your template string to refer to the REF/first ALT allele.
So, if we're using a bash shell, we can try again with

  --set-missing-var-ids @:#[b37]\$r,\$a

which would name the first variant 'chr1:10583[b37]G,A', the second variant 'chr1:886817[b37]T,C', and the third variant 'chr1:886817[b37]C,CATTTT'. Note the extra backslashes: they are necessary in bash because '$' is a reserved character there.

(PLINK 1.9's '$1'/'$2' syntax for referring to those two alleles in ASCII-sort order is still supported as well, and it has a place when no reference genome exists. However, we recommend avoiding it most of the time, since it does not distinguish between deletions and insertions in some cases, whereas '$r'/'$a' doesn't have that problem.)

In combination with either flag above, --var-id-multi can be used to specify a special template to use for just multiallelic variants (since it may not make sense to mention the first ALT allele in this case), and --var-id-multi-nonsnp does the same for variants that are both multiallelic and not SNPs (i.e. at least one allele code has length > 1).

Allele names associated with indels are occasionally very, very long, and the synthetic variant ID names which would be generated from such long alleles are very inconvenient to work with. As a result, if any allele codes are longer than 23 characters, PLINK 2 requires you to use --new-id-max-allele-len to explicitly specify how they should be handled. Its first parameter is a length threshold, and its optional second parameter specifies how allele codes longer than the length threshold should be handled (default is now 'error'; 'missing' causes such variants to be assigned the unnamed-variant ID, while 'truncate' does what it sounds like and is a bit dangerous).

As of alpha 6, these flags are applied to all --pmerge[-list] inputs.
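The template expansion described above can be illustrated with a small Python sketch. This is not PLINK 2's implementation; the function name expand_var_id and its parameters are hypothetical, and the assumption that the allele-length check only applies when the template mentions '$r' or '$a' is ours. Only the '@', '#', '$r', and '$a' placeholders are handled ('$1'/'$2' and the multiallelic templates are left out).

```python
import re

# Placeholders from the text: '@' = chromosome, '#' = position,
# '$r' = REF allele, '$a' = first ALT allele.
_PLACEHOLDER = re.compile(r"@|#|\$r|\$a")


def expand_var_id(template, chrom, pos, ref, alt1,
                  max_allele_len=23, on_long="error", missing_code="."):
    """Expand a variant-ID template (hypothetical helper, illustration only)."""
    if template.count("@") != 1 or template.count("#") != 1:
        raise ValueError("template must contain exactly one '@' and one '#'")

    uses_alleles = "$r" in template or "$a" in template
    # Assumption: the length limit only matters if alleles appear in the ID.
    if uses_alleles and max(len(ref), len(alt1)) > max_allele_len:
        if on_long == "error":
            raise ValueError("allele code longer than the length threshold")
        if on_long == "missing":
            return missing_code
        if on_long == "truncate":
            ref, alt1 = ref[:max_allele_len], alt1[:max_allele_len]
        else:
            raise ValueError("on_long must be error, missing, or truncate")

    values = {"@": chrom, "#": str(pos), "$r": ref, "$a": alt1}
    # Single-pass substitution, so substituted text is never re-scanned.
    return _PLACEHOLDER.sub(lambda m: values[m.group(0)], template)


if __name__ == "__main__":
    for row in [("chr1", 10583, "G", "A"),
                ("chr1", 886817, "T", "C"),
                ("chr1", 886817, "C", "CATTTT")]:
        print(expand_var_id("@:#[b37]$r,$a", *row))
```

In Python no backslash is needed before '$'; the escaping discussed above is purely a shell concern.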
--recover-var-ids <.pvar/VCF/.bim filename> ['strict-bim-order'] ['partial']

--recover-var-ids provides a simple way to invert a --set-all-var-ids (or other variant-ID-changing) operation: given a .pvar/VCF/.bim file with the original IDs, it replaces the current IDs with the originals whenever there is an unambiguous CHROM+POS+alleles match. Allele order is also required to match, unless a .bim file is provided; and in the latter subcase, you can specify 'strict-bim-order' to require A1=ALT, A2=REF.

If any variant has multiple matching records in the original-ID file, and the IDs conflict, --recover-var-ids writes the affected (current) ID(s) to plink2.recoverid.dup, and normally errors out. If the original-ID file has the same number of variants in the same order, you can still recover the old IDs with the 'rigid' modifier in this case (or with a simple bash script, but this is still slightly more convenient). Alternatively, to proceed and assign the missing-ID code to these ambiguous variants, add the 'force' modifier. (The .recoverid.dup file is still written when 'rigid' or 'force' is specified; we strongly suggest using e.g. --rm-dup to resolve the ambiguities when you have time.)

--recover-var-ids normally expects to replace all variant IDs, and errors out if any are left untouched. Add the 'partial' modifier when you actually want to update just a proper subset.

--missing-var-code <missing ID string>

'.' is the default missing-variant-ID code. You can use --missing-var-code to change this; e.g. "--missing-var-code NA" would be appropriate for a .pvar file starting with

  #CHROM POS ID REF ALT

--update-chr <filename> [chr col. number] [variant ID col.] [skip]

--update-chr, --update-cm, --update-map, and --update-name update variant chromosomes, centimorgan positions, base-pair positions, and IDs, respectively.
By default, the new value is read from column 2 and the (old) variant ID from column 1, but you can adjust these positions with the second and third parameters. The optional fourth 'skip' parameter is either a nonnegative integer, in which case it indicates the number of lines to skip at the top of the file, or a single nonnumeric character, which causes each line with that leading character to be skipped. (Note that, if you want to specify '#' as the skip character, you need to surround it with single- or double-quotes in some Unix shells.)

For example, if the --update-name file is

  SNP_A-1919191   rs123456
  SNP_A-64646464  rs222222

and no column numbers are specified, SNP_A-1919191 will be renamed to rs123456, and SNP_A-64646464 will be renamed to rs222222. (Note that "--update-name <filename> 1 2" would invert the operation if all variant IDs are unique.)

Strictly speaking, you can use Unix tail, cut, paste, and/or sed to perform the same job (albeit with more time and hassle) as the three optional parameters we have introduced. If you have not used these Unix commands before, we recommend that you familiarize yourself with what they do because they are still likely to come in handy in other scenarios.

You can combine --update-chr, --update-cm, and/or --update-map in the same run. (However, to avoid confusion regarding whether old or new variant IDs apply, we force --update-name to be run separately.)

When invoking --update-chr, you must use --make-bed/--make-[b]pgen and --sort-vars in the same run, and no other output commands. Otherwise, we still recommend that you use --make-bed/--make-[b]pgen once instead of --update-... over and over, but it's not absolutely required.

--update-alleles updates variant allele codes. Its input should have the following five fields:
For example, if the --update-alleles file is

  rs10001 A B G T
  rs10002 A B A C

allele A for rs10001 will be changed to G, allele B for rs10001 will be changed to T, allele A for rs10002 will be unchanged, and allele B for rs10002 will be changed to C.

Note that, if you just want to change REF/ALT allele assignments in the .pvar/.bim files without changing the real genotype data, you must use a flag like --ref-allele instead.

--allele1234 interprets and/or recodes A/C/G/T alleles in the input as 1/2/3/4, while --alleleACGT does the reverse. With the 'multichar' modifier, these will translate multi-character alleles as well, e.g. '--allele1234 multichar' converts 'TT' to '44'.

Update sample information

--update-ids <filename>
--update-sex <filename> ['col-num='<n>] ['male0']

These update sample IDs, parental codes, and sexes, respectively. --update-parents also updates founder/nonfounder status in the current run when appropriate.

--update-ids expects a file with old sample IDs and new sample IDs.
For example, if the --update-ids file is

  1001 I0001
  1002 I0002

the sample with FID=0, IID=1001 will have its IID changed to I0001, and the sample with FID=0, IID=1002 will have its IID changed to I0002.

To avoid confusion regarding whether old or new IDs should be used in the latter files, we do not allow --update-ids to be used in the same run as --update-parents or --update-sex.

--update-parents expects a file with sample IDs in front, followed by parental ID columns.
--update-sex expects a file with sample IDs in front, and a sex information column.
Set REF/ALT alleles

--ref-allele ['force'] <filename> [REF col. number] [variant ID col.] [skip]
--ref-from-fa ['force']

--ref-allele sets all alleles specified in the file to REF, while --alt1-allele does the same for the first ALT allele. Column and skip parameters work the same way as with --update-chr and friends.

In combination with a FASTA file, --ref-from-fa sets REF alleles when it can be done unambiguously. (Note that this is never possible for deletions and some insertions.)
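The column and 'skip' parameters shared by --update-chr, --update-name, --ref-allele, and related flags can be pictured with the following Python sketch. It is only an illustration of the documented convention (new value in column 2, variant ID in column 1, 'skip' as either a line count or a leading character); the function read_update_file is hypothetical, and details such as whitespace splitting and ignoring blank lines are our assumptions rather than documented PLINK 2 behavior.

```python
def read_update_file(lines, value_col=2, id_col=1, skip=None):
    """Map variant ID -> new value (hypothetical helper, illustration only).

    value_col and id_col are 1-based column numbers, as on the command line.
    skip is None, a nonnegative integer (number of leading lines to skip),
    or a single nonnumeric character (skip lines starting with it).
    """
    skip_count, skip_char = 0, None
    if skip is not None:
        text = str(skip)
        if text.isdigit():
            skip_count = int(text)
        elif len(text) == 1:
            skip_char = text
        else:
            raise ValueError("skip must be an integer or a single character")

    mapping = {}
    for line_idx, line in enumerate(lines):
        if line_idx < skip_count:
            continue
        if skip_char is not None and line.startswith(skip_char):
            continue
        fields = line.split()  # assumption: whitespace-delimited columns
        if not fields:
            continue  # assumption: blank lines are ignored
        mapping[fields[id_col - 1]] = fields[value_col - 1]
    return mapping


if __name__ == "__main__":
    rows = ["#old new", "SNP_A-1919191 rs123456", "SNP_A-64646464 rs222222"]
    # Default columns, skipping '#'-prefixed lines.
    print(read_update_file(rows, skip="#"))
    # Swapped columns ("1 2" on the command line) invert the mapping.
    print(read_update_file(rows, value_col=1, id_col=2, skip=1))
```

Passing the columns as "1 2" swaps the roles of the two fields, which is why the inverse rename only works when all variant IDs are unique.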
--maj-ref ['force']

--maj-ref sets major alleles to REF, like PLINK 1.x automatically did. (This is now opt-in instead of opt-out; --keep-allele-order is no longer necessary to prevent allele-swapping.)
--real-ref-alleles

When a PLINK 1 fileset is loaded, PLINK 2 normally treats its A2 alleles as provisional-REF. Use --real-ref-alleles to specify that they're from a real reference genome.

Left-normalization

--normalize ['list'] ['adjust-overlapping-deletions']

In combination with a FASTA file, --normalize tries to left-normalize all variants, using the algorithm described in Tan A, Abecasis GR, Kang HM (2015) Unified representation of genetic variants. It currently assumes no differences in capitalization between the FASTA and the allele codes, and skips variants with one or more symbolic alleles (starting with '<'). The 'list' modifier causes the IDs of all modified variants to be written to plink2.normalized.

By default, variants with a '*' overlapping-deletion allele are left alone. (This was not true before 25 Apr 2022.) The 'adjust-overlapping-deletions' modifier allows such variants to be normalized based on the other alleles; this is usually valid, but it can occasionally overshoot the left end of the deletion, and in some contexts it can lose a bit of information.

Note that left-normalization has a "blind spot" when it comes to non-tandem-repeat deletions of differing lengths ending at the same position: they won't end up in the same multiallelic variant after split + left-normalize + join. Consider handling this case separately.

In the order of operations, --normalize happens before the --make-[b]pgen/--make-bed step where variant-splitting occurs when specified. Unfortunately, it is the other order of operations that is usually desired here, and when it is, it's necessary to split the job across two PLINK 2 runs. This detail is not obvious to most "bcftools norm" users, so a warning (which will be upgraded to an error in a future build) is now printed when such a job is not split. You can disable this warning/error with --allow-normalize-with-split.
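For readers unfamiliar with left-normalization, here is a minimal Python sketch of the trim-and-extend procedure from the Tan et al. paper cited above, applied to a single variant. It is a sketch under stated assumptions, not PLINK 2's code: the function name left_normalize is hypothetical, the reference is passed as an in-memory string for one contig, positions are 1-based, and variants with symbolic or '*' alleles are simply returned unchanged, mirroring the default behavior described above.

```python
def left_normalize(ref_seq, pos, alleles):
    """Left-normalize one variant (hypothetical helper, illustration only).

    ref_seq: reference sequence of the contig, as a string
    pos:     1-based position of the first base of the REF allele
    alleles: [REF, ALT1, ...]; returns (new_pos, new_alleles)
    """
    alleles = list(alleles)
    if any(a == "*" or a.startswith("<") for a in alleles):
        return pos, alleles  # left alone, as in the default described above
    if len(set(alleles)) < 2 or not all(alleles):
        raise ValueError("need at least two distinct, nonempty alleles")
    if ref_seq[pos - 1:pos - 1 + len(alleles[0])] != alleles[0]:
        raise ValueError("REF allele does not match the reference sequence")

    # Step 1: while all alleles end with the same base, drop that base;
    # if an allele becomes empty, extend every allele one base to the left.
    while len({a[-1] for a in alleles}) == 1:
        if min(len(a) for a in alleles) == 1 and pos == 1:
            break  # cannot extend past the start of the contig
        alleles = [a[:-1] for a in alleles]
        if any(a == "" for a in alleles):
            pos -= 1
            alleles = [ref_seq[pos - 1] + a for a in alleles]

    # Step 2: drop shared leading bases while every allele keeps >= 1 base.
    while (all(len(a) >= 2 for a in alleles)
           and len({a[0] for a in alleles}) == 1):
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


if __name__ == "__main__":
    reference = "GGGCACACACAGGG"
    # Three spellings of the same CA deletion in the CA repeat.
    for pos, ref, alt in [(6, "CAC", "C"), (3, "GCACA", "GCA"), (2, "GGCA", "GG")]:
        print(left_normalize(reference, pos, [ref, alt]))
```

All three inputs in the usage example describe the same deletion, and the sketch maps each of them to one representation, which is the point of normalizing before comparing or merging variants.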
Sort by FID/IID

--indiv-sort <mode name> [filename]

This allows you to specify how samples should be sorted when generating new datasets. The four modes are:
Covariate files

--write-covar ['cols='<column set descriptor>]

If covariates are defined, an updated version (with all filters applied) is automatically written to plink2.cov whenever --make-pgen, --make-just-psam, --export, or a similar command is present. However, if you do not wish to simultaneously generate a new sample file, you can use --write-covar to just produce a pruned covariate file. The following column sets are supported:
The default is maybefid,maybesid.

Phenotype/covariate transformations

--variance-standardize [phenotype/covariate name(s)...]

--variance-standardize linearly transforms named quantitative phenotypes and covariates to mean-zero, variance 1. If no parameters are provided, all quantitative phenotypes and covariates are affected. --covar-variance-standardize does the same for just quantitative covariates.

--quantile-normalize [phenotype/covariate name(s)...]

--quantile-normalize forces named quantitative phenotypes and covariates to a N(0, 1) distribution, preserving only the original rank orders; if no parameters are provided, all quantitative phenotypes and covariates are affected. --pheno-quantile-normalize does the same for just quantitative phenotypes, while --covar-quantile-normalize does this for just quantitative covariates.

--split-cat-pheno [{omit-most | omit-last}] ['covar-01']

--split-cat-pheno splits n-category phenotype(s) into n (or n-1 if 'omit-most' or 'omit-last' is used to exclude one category) binary phenotypes, with names of the form '<original phenotype name>=<category name>'. (As a consequence, affected phenotypes and categories are not permitted to contain the '=' character.)
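The three transformations above can be sketched in a few lines of Python. This is a rough illustration rather than PLINK 2's implementation, and several details are assumptions on our part: the helper names are hypothetical, the standardization uses the sample (n-1) variance, the rank-to-quantile formula is (rank + 0.5)/n with ties broken by input order, category order for 'omit-last' is plain string sort order, and the binary phenotypes are shown as booleans rather than any particular numeric coding.

```python
from collections import Counter
from statistics import NormalDist, mean, stdev


def variance_standardize(values):
    """Linearly rescale to mean 0, variance 1 (assumes sample variance)."""
    center, scale = mean(values), stdev(values)
    return [(v - center) / scale for v in values]


def quantile_normalize(values):
    """Replace each value by a N(0, 1) quantile that depends only on its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # ties: input order
    result = [0.0] * n
    for rank, idx in enumerate(order):
        # Assumed rank-to-quantile rule; other conventions exist.
        result[idx] = NormalDist().inv_cdf((rank + 0.5) / n)
    return result


def split_cat_pheno(name, values, omit=None):
    """Split a categorical phenotype into '<name>=<category>' indicators."""
    if "=" in name or any("=" in v for v in values):
        raise ValueError("'=' is not permitted in phenotype or category names")
    categories = sorted(set(values))  # assumed category order
    if omit == "omit-last":
        categories = categories[:-1]
    elif omit == "omit-most":
        categories.remove(Counter(values).most_common(1)[0][0])
    elif omit is not None:
        raise ValueError("omit must be None, 'omit-most', or 'omit-last'")
    return {f"{name}={cat}": [v == cat for v in values] for cat in categories}


if __name__ == "__main__":
    print(variance_standardize([1.0, 2.0, 3.0]))
    print(quantile_normalize([3.0, 1.0, 2.0]))
    print(split_cat_pheno("pop", ["EUR", "AFR", "EUR", "EAS"], omit="omit-most"))
```

Note how quantile normalization discards everything except rank order, whereas variance standardization preserves the shape of the original distribution.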
Merge filesets

--pmerge <.pgen/.bed filename> <.pvar/.bim> <.psam/.fam>
--pmerge-list-dir <dir>
--merge-mode <mode>
--merge-parents-mode <mode>

(Only handles concatenation-like jobs for now.)

--pmerge merges one binary fileset with the main fileset. The 'vzs' modifier works as with --pfile.

--pmerge-list merges all of the filesets specified in the given file. If there is a main fileset, it's also included in the merge (as if it were the first entry in the --pmerge-list file). The lines of the --pmerge-list file are interpreted as follows:
In both cases, the result is written to plink2.pgen + .pvar[.zst] + .psam (unless a later operation in the same run would overwrite one of these files, in which case the prefix is plink2-merge). The .pvar is normally uncompressed, but you can request compression with --pmerge-output-vzs. Merge tends to be a much more expensive operation than e.g. VCF autoconversion, so (unlike the case with VCF autoconversion) PLINK 1.9 and 2.0 default to keeping its output files around. You can use --delete-pmerge-result to request deletion at the end of the run. By default, --pmerge[-list] performs "outer joins": the merged fileset contains the union of the samples, variants, and phenotypes in the input filesets. To specify intersections instead, use the --sample-inner-join, --variant-inner-join, and/or --pheno-inner-join flags. --merge-mode, --merge-parents-mode, --merge-sex-mode, and --merge-pheno-mode define conflict resolution behavior for genotypes/dosages, parental IDs, sexes, and phenotypes, respectively. The following modes are supported for these flags:
Note that PLINK 1.x's --merge-mode 6/7 has been replaced by --pgen-diff. (Tip: to find all genotype-conflict positions in a multiway merge, you can perform both "--merge-mode nm-match" and "--merge-mode nm-first" merges, and then run a "--pgen-diff include-missing" comparison between them.) --merge-xheader-mode defines conflict resolution behavior for .pvar header entries. (For '##' header lines where the first '=' character is followed by a '<', the key is everything up to the first comma (or '>' if there is none) in the '<' expression, and the value is everything after the comma; otherwise, the key is everything up to the '=' and the value is everything after.) The following modes are supported:
--merge-qual-mode, --merge-filter-mode, --merge-info-mode, and --merge-cm-mode define conflict resolution behavior for QUAL, FILTER, INFO, and CM entries, respectively. The following modes are supported:
--merge-pheno-sort and --merge-info-sort define the phenotype column and INFO key sort orders, respectively. The following modes are supported:
--merge-max-allele-ct causes a merged variant to be excluded from the result if it has more than the specified number of alleles. Other notes:
Sample/variant filtering results

--write-samples
--write-snplist ['zs'] ['allow-dups']

--write-samples writes IDs of all samples which pass the filters and inclusion thresholds you've specified to plink2.id, while --write-snplist does the same for variants (output filename plink2.snplist[.zst]).

By default, --write-samples (and almost all other .id-generating commands) includes a header line in the output file. You can use --no-id-header to generate headerless .id file(s) instead. This normally forces two-column FID/IID output; add the 'iid-only' modifier to produce single-column IID output instead.

Meanwhile, since the actual variants referred to by the .snplist file can be ambiguous when duplicate variant IDs are present, --write-snplist now errors out in that case unless you specify 'allow-dups'.