D: 3 Dec 2024
Standard data input

Most of PLINK's calculations operate on tables of samples and variant calls. The following flags are available for defining the form and location of this input, and associated metadata.

PLINK 1 binary

--bfile <prefix> ['vzs']

The --bfile flag normally causes the binary fileset prefix.bed + prefix.bim + prefix.fam to be referenced. (The structure of these files is described in the PLINK 1.9 file formats appendix.) PLINK 2 supports Zstandard compression of large text files. You can use the 'vzs' modifier to tell PLINK to look for prefix.bim.zst instead of a plain .bim file (the 'v' indicates that it's the variant info file which has the different file extension).

--bed <filename>

--bed, --bim, and --fam let you specify the full name of one part of the PLINK 1 binary fileset, taking precedence over --bfile. For example,

  plink2 --bfile toy --bed bob --freq

would reference the binary fileset bob.bed + toy.bim + toy.fam. (.fam files are also present in some other fileset types. The --fam flag has the same function when loading them.)

--no-fid

These allow you to use .fam and .ped files which lack family ID, parental ID, and/or sex columns. (See also --no-pheno below.)

What's PROVISIONAL_REF?

PLINK 2 treats the A2-allele (usually 6th) column in a .bim file as REF, and the A1-allele (usually 5th) column as ALT. However, this is only expected to be correct ~95-99% of the time, because PLINK 1.x usually sets A2=major whenever writing a .bed+.bim+.fam fileset, and human reference genomes contain some minor alleles. To distinguish these possibly-not-REF alleles from actually-known-to-be-REF alleles, PLINK 2 tracks a "PROVISIONAL_REF" flag for each variant, and most commands that generate a file with a REF output column also add a PROVISIONAL_REF? column when any variant has this flag set.
To address this, use --ref-allele or --ref-from-fa to set REF alleles correctly, and the PLINK 2 fileset format (--pfile / --make-pgen instead of --bfile / --make-bed) to keep track of them across runs.

PLINK 2 binary

--pfile <prefix> ['vzs']

The --pfile flag usually causes the binary fileset prefix.pgen + prefix.pvar + prefix.psam to be referenced, while --pgen/--pvar/--psam let you fully name one file at a time. New features supported by these formats include:
See the draft specification for more details.

With --pfile, you can use the 'vzs' modifier to tell PLINK to look for prefix.pvar.zst instead of a plain .pvar file.

By default, if the .pgen file does not have an embedded index, the index file is assumed to be the .pgen filename with ".pgi" appended; you can specify a different path with --pgi.

--bpfile <prefix> ['vzs']

If you only need the first four features, you can also use --bpfile, which references prefix.pgen + prefix.bim + prefix.fam. Note that the .pgen file tracks whether the .bim file's sixth column can be trusted to contain only REF alleles. You can use the 'vzs' modifier to tell PLINK to look for prefix.bim.zst instead of a plain .bim file.

PLINK 1 binary and PLINK 2 binary are PLINK 2's preferred input formats. Most other formats are automatically converted to PLINK 2 binary before the main loading sequence[1]; as a consequence, if you're performing multiple operations on the same otherwise-formatted files, you may want to keep the autoconversion products and work with them, instead of repeating the conversion on every run. PLINK gives you several ways to handle this situation.

1. If you just want to convert your data, don't use any other flags besides --out. For example:

  plink2 --vcf my.vcf --out binary_fileset

This entirely skips the main loading sequence, so filters like --extract, --hwe, and --snps-only are not permitted (you'll get an error if you attempt to use them).

2. You can produce a binary fileset which is a filtered version of your text data. Use --make-pgen (or --make-bed/--make-bpgen) for this.

3. You can directly analyze the text fileset. In this case, the autoconversion products are silently deleted at the end of the run[2], to avoid clogging your drive with unwanted files. For example, the following command writes an allele frequency report to results.afreq, and doesn't leave any other files behind besides results.log:

  plink2 --vcf my.vcf --freq --out results

4.
You can analyze the text fileset while specifying (with --keep-autoconv) that you also want to keep the autoconversion products. So the following command leaves behind results.pgen, results.pvar and results.psam as well as results.afreq and results.log:

  plink2 --vcf my.vcf --freq --keep-autoconv --out results

1: Since binary files are so much smaller than the equivalent text files, we expect that this will not put undue pressure on your available disk space. This architectural choice allows PLINK's core to focus entirely on efficient streaming processing of binary data; we hope the memory usage, development speed, and performance benefits we're able to deliver as a result are worth any slight inconvenience.

Variant Call Format

--vcf <filename> ['dosage='<field>]

--vcf loads a genotype VCF file, extracting information which can be represented by the PLINK 2 binary format and ignoring everything else (after applying the load filters described below); --bcf does the same thing for binary-VCF files. For example, per-call read depths and quality scores are discarded, but you can filter on them first.

You can combine these with --fam/--psam. If you do, PLINK 2 will verify the sample IDs match and appear in the same order in the two files, and the sample information will be loaded. (It may be necessary to use --double-id/--const-fid/--id-delim to get the IDs to match.) Otherwise, an empty sample information file will be generated.

If your file contains chrX, you should usually include --split-par in your import command; otherwise, if there are pseudoautosomal regions at the beginning and end of chrX which contain diploid variant calls for males, they won't be handled properly. See the chrX import section for more discussion.

By default, dosage information is not imported. To import the GP field (a posterior probability per possible genotype, not phred scaled), add 'dosage=GP' (or 'dosage=GP-force', see below).
To import Minimac4-style DS+HDS phased dosage, add 'dosage=HDS'. 'dosage=DS' (or anything else for now) causes the named field to be interpreted as a Minimac3-style dosage.

Note that, in the dosage=GP case, PLINK 2 collapses the probabilities down to dosages; you cannot use PLINK 2 to losslessly convert VCF FORMAT/GP data to e.g. BGEN format. To make this more obvious, PLINK 2 now errors out when dosage=GP is used without --import-dosage-certainty on a file with a FORMAT/DS header line, since dosage=DS extracts the same information more quickly in this situation. You can suppress this error with 'dosage=GP-force'.

In all of these cases, hardcalls are now regenerated from scratch from the dosages, using the --hard-call-threshold logic described below. As a consequence, variants with no GT field can now be imported; they will be assumed to contain only diploid calls when HDS is also absent.

'Sites-only' VCF files, such as those released by the gnomAD project, can also be loaded with --vcf, as long as you aren't performing any operations which require sample or genotype information.

By default, when the GT field is absent, the variant is kept and all genotypes are set to missing (unless dosages are present). To skip all variants with no GT field instead, use --vcf-require-gt.

--vcf-min-dp <val>

--vcf-min-gq excludes all genotype calls with GQ below the given (nonnegative, decimal values permitted) threshold. Missing GQ values are not treated as being below the threshold. --vcf-min-dp does the same for a DP threshold, while --vcf-max-dp excludes genotype calls with DP above the given threshold (this often corresponds to an unwanted variant calling artifact).

The current VCF standard does not specify how '0/.' and similar GT values should be interpreted. By default (mode 'error'/'e'), PLINK errors out and reports the line number of the anomaly.
Should the half-call be intentional, though (this can be the case with Complete Genomics data), you can request the following other modes:
The VCF standard does not permit the REF allele to be missing. As a consequence, PLINK converts missing REF alleles (which can appear in e.g. data imported from PLINK 1 .ped files) to 'N' when exporting VCF files. If these aren't mixed with variants where the REF allele is genuinely supposed to be 'N', you can invert this conversion with --vcf-ref-n-missing when re-importing the affected VCF.

Some VCF-generating tools, such as GATK GenotypeGVCFs, ordinarily exclude variants with no ALT alleles observed in the current dataset. When a single-sample VCF is generated in this manner, it is not suitable input for PLINK, which has many commands that assume homozygous-REF genotypes are not underrepresented. As a consequence, --vcf without dosage= now reports a warning when it encounters a single-sample VCF with no 0, 0/0, or 0|0 GT values, out of 1000+ scanned variants with non-missing GT. This will be upgraded to an error in a future build.

The warning/error usually means you should backtrack and regenerate the VCF in a manner that doesn't exclude homozygous-REF genotypes. With GATK GenotypeGVCFs, you should add the --include-non-variant-sites flag.

There are rare edge cases, e.g. most of your variants have 10+ ALT alleles, where you might legitimately have no homozygous-REFs out of 1000+ non-missing genotype calls. In these cases, you can use --vcf-allow-no-nonvar to suppress the warning/error.

Oxford-format genotype

--data <prefix> <REF/ALT mode> ['gzs']
--sample <filename>
--lax-bgen-import

--data normally causes the Oxford-format text genotype fileset prefix.gen + prefix.sample to be imported. The following modes are supported:
'gzs' tells PLINK 2 to look for prefix.gen.zst instead of a plain .gen file. --bgen, --gen, and --sample allow you to specify the filenames separately; --bgen is necessary for BGEN-format files, and --gen is necessary if your genomic data file has a .gen.gz extension. Note that PLINK 2 collapses the raw probabilities stored in .gen/.bgen files down to dosages; you cannot use PLINK 2 to losslessly convert between e.g. BGEN sub-formats. (But if the next program in your pipeline is e.g. BOLT-LMM, which only cares about dosages, it's fine to use PLINK 2 for conversion.) Additional notes:
--missing-code [comma-separated list of values]

--missing-code lets you specify the set of strings to interpret as missing phenotype values in a .sample file. For example, "--missing-code -9,0,NA,na" would cause '-9', '0', 'NA', and 'na' to all be interpreted as missing phenotypes. (Note that no spaces are currently permitted between the strings.) By default, only 'NA' is interpreted as missing.

Oxford-format phased reference panel

--haps <filename> [{ref-first | ref-last}]

--haps causes the named Oxford-format phased haplotype file to be imported. If (and only if) the --haps file does not contain header columns, it is also necessary to use --legend to specify the chromosome code and the .legend file with the rest of the variant information.

When this isn't used with --sample, the new sample IDs are of the form 'per#', starting with 'per0'.

PLINK 1 text genotype

--pedmap <prefix>
--ped <filename>
--tfile <prefix>
--tped <filename>

The --pedmap flag causes the PLINK 1 sample-major text genotype fileset prefix.ped + prefix.map to be imported. --ped and --map allow you to specify the filenames separately.

Similarly, the --tfile flag causes the PLINK 1 variant-major text genotype fileset prefix.tped + prefix.tfam to be imported; --tped and --tfam allow you to specify the filenames separately.

Note that the old flag for importing PLINK 1 sample-major text filesets, '--file', causes PLINK 2 to error out; it is not automatically translated to '--pedmap'. This is because continued usage of .ped + .map filesets is usually a mistake. The format is simultaneously highly inefficient, even relative to other text formats, and limited in scope (unobserved minor allele codes can't be stored).

PLINK 1 dosage

--import-dosage <file> ['noheader'] ['id-delim='<char>] ['skip0='<i>]

--import-dosage causes the named PLINK 1 dosage file to be imported.
For example, a .traw file exported by PLINK 2 can usually be imported with

  plink2 --import-dosage prefix.traw skip0=1 skip1=2 id-delim=_ chr-col-num=1 pos-col-num=4 ref-first

Other formats

--lfile <prefix>
--lgen <filename>
--23file <filename> [family ID] [indiv ID] [sex] [phenotype] [paternal ID]

These import functions are not yet implemented in PLINK 2.0. For now, use PLINK 1.9 to convert them to PLINK 1 binary.

Single-part sample ID import

--double-id
--iid-sid

PLINK 2 allows sample IDs to have 1-3 components.
VCF and .bgen files contain single-part sample IDs. As implied above, when PLINK 2 encounters a single-part sample ID, its default behavior is to set the IID to that value; the FID and SID are implicitly treated as '0'. However, unlike the case with e.g. --keep where you can add extra column(s) to the input file, there is no standard way to specify nonzero FIDs or SIDs in VCF or .bgen files. To address this limitation, the following flags let you convert VCF/.bgen sample IDs into multiple parts:
--export's id-delim= and id-paste= modifiers let you control conversion in the other direction.

Since PLINK sample IDs cannot contain spaces, an error is normally reported when there's a space in a VCF/.bgen sample ID. To work around this, you can use --idspace-to to convert all spaces in sample IDs to another character. This happens before regular parsing, so when the --idspace-to and --id-delim characters are identical, both space and the original --id-delim character are interpreted as FID/IID/SID delimiters. If you only want the space character to function as a delimiter, use "--id-delim ' '". (This is not compatible with --rerun.)

Dosage import settings

--hard-call-threshold <max distance from nearest hardcall>
--dosage-erase-threshold <max distance from nearest hardcall>

The PLINK 2 binary file format supports allelic dosages, with ~4 decimal place precision. However, some of PLINK 2's commands do not make use of dosage data. Thus, when importing dosage data, PLINK 2 also saves (possibly missing) hardcalls for those commands to use.

By default, a proper hardcall is saved when the distance from the nearest hardcall, defined as

  0.5 * sum_i |x_i - round(x_i)|

(where the x_i's are 0..2 allele dosages, even on haploid chromosomes), is not greater than 0.1; i.e. for biallelic variants, the alternate allele dosage needs to be in [0, 0.1], [0.9, 1.1], or [1.9, 2.0]. Otherwise, a missing hardcall is saved. You can change the acceptable distance with --hard-call-threshold.

The --hard-call-threshold value must be less than 0.5, so dosages of exactly 0.5 and 1.5 will always be translated to missing hardcalls. When this is a problem, see --make-[b]pgen's 'fill-missing-from-dosage' modifier.

In some cases, e.g. when your reference allele dosage is 1.999, you may be willing to throw away the raw dosage value and only save a hardcall. --dosage-erase-threshold erases all dosage values which are no further than the specified distance from the nearest hardcall.
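The distance rule above can be sketched in a few lines of Python. This is only an illustration of the stated formula for the biallelic case, not PLINK 2's implementation: the function names (hardcall_distance, alt_hardcall, keep_dosage) are hypothetical, and because it uses ordinary floating-point arithmetic, inputs lying exactly on a threshold boundary (such as 0.9 or 1.1) may not compare the way PLINK 2's own fixed-precision dosage representation would.

```python
def hardcall_distance(dosages):
    """Distance from the nearest hardcall: 0.5 * sum_i |x_i - round(x_i)|.

    `dosages` holds one 0..2 dosage per allele (hypothetical helper).
    """
    return 0.5 * sum(abs(x - round(x)) for x in dosages)


def alt_hardcall(alt_dosage, threshold=0.1):
    """Biallelic hardcall as an ALT allele count, or None if missing.

    The 0.1 default mirrors the default described in the text; the
    threshold must be less than 0.5, as --hard-call-threshold requires.
    """
    if not 0.0 <= threshold < 0.5:
        raise ValueError("threshold must be in [0, 0.5)")
    # REF and ALT dosages sum to 2 for a biallelic variant.
    dosages = (2.0 - alt_dosage, alt_dosage)
    if hardcall_distance(dosages) > threshold:
        return None  # too far from any hardcall: save a missing hardcall
    return round(alt_dosage)


def keep_dosage(alt_dosage, erase_threshold):
    """False when the dosage is within erase_threshold of a hardcall,
    i.e. when the --dosage-erase-threshold rule would erase it."""
    dosages = (2.0 - alt_dosage, alt_dosage)
    return hardcall_distance(dosages) > erase_threshold
```

Under these assumptions, alt_hardcall(1.08) gives a heterozygous call, alt_hardcall(0.5) gives a missing call at any permitted threshold, and keep_dosage(1.999, 0.005) reports that the raw dosage would be dropped.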
--hard-call-threshold and --dosage-erase-threshold can also be used with --make-[b]pgen and --make-bed; it is not necessary to perform a full-blown import to adjust the dosage-to-hardcall mapping.

--import-dosage-certainty <min certainty>

Some dosage formats include separate probabilities for every possible genotype, e.g. {P(0/0)=0.2, P(0/1)=0.52, P(1/1)=0.28} or {P(0/0)=0.005, P(0/1)=0.91, P(1/1)=0.085}. By default, PLINK 2 treats these two sets of probabilities in the same manner: in both cases, the expected alternate allele dosage is 1.08, which is in [0.9, 1.1] so a heterozygous hardcall is saved. However, the first call is far less certain than the second; you'll frequently prefer to not save a dosage at all when the highest genotype probability is only 0.52, whereas 0.91 is a different story. During import, you can use --import-dosage-certainty to make this distinction.

Chromosome X

--lax-chrx-import

Chromosome X data has extra subtleties:
In order for PLINK 2 to treat this data correctly, you may need to provide (or at least impute) sex information and/or the --split-par flag when importing it. PLINK 2 builds from August 2022 onward print a warning or error when it looks like chrX data is being mis-imported. You can disable this warning/error with the --lax-chrx-import flag, but you should first double-check what you're doing.

Ploidy > 2

--polyploid-mode <mode>

VCF and .bgen files can store genotypes with ploidy > 2, which are not supported by the PLINK 2 file format. The following import modes are currently supported:
Randomized data

--dummy <#samples> <#SNPs> [missing geno/dosage freq(s)] [missing pheno freq]

This tells PLINK to generate a simple dataset from scratch (useful for basic software testing), with the specified number of samples and SNPs. All generated samples are females with random genotype and phenotype values, and all SNPs are on chromosome 1 with positions 0, 1, 2, etc.
Genotype encoding

--input-missing-genotype <char>

'.' is always interpreted as a missing genotype code in input files. By default, '0' also is; you can change this second missing code with --input-missing-genotype.

Nonstandard chromosome IDs

--strict-extra-chr
--allow-extra-chr ['0']

When --strict-extra-chr is on, PLINK 2.0 normally reports an error if the input data contains unrecognized chromosome codes (such as hg19 haplotype chromosomes or unplaced contigs). If none of the additional codes start with a digit, you can permit them with the --allow-extra-chr flag. (These contigs are ignored by most analyses which skip unplaced regions.)

--allow-extra-chr's '0' modifier causes these unrecognized chromosome codes to be treated as if they had been set to zero. This is sometimes necessary to produce reports readable by older software. Note that POS values for the affected variants tend to be rendered useless.

--chr-set <autosome ct> ['no-x'] ['no-y'] ['no-xy'] ['no-mt']
--cow

--chr-set changes the chromosome set. The first parameter specifies the number of diploid autosome pairs if positive, or haploid chromosomes if negative. (Polyploid and aneuploid data are not supported, and there is currently no special handling of sex or mitochondrial chromosomes in all-haploid chromosome sets.)

Given diploid autosomes, the remaining modifiers let you indicate the absence of specific non-autosomal chromosomes, as an extra sanity check on the input data. Note that, when there are n autosome pairs, the X chromosome is assigned numeric code n+1, Y is n+2, XY (old representation of pseudo-autosomal region of X) is n+3, and MT (mitochondria) is n+4. PAR1/PAR2 do not have numeric codes associated with them, and are disabled by 'no-xy'.

n is currently limited to 95, so if you're working with adder's-tongue fern genomes, you're out of luck[3].

The other flags support PLINK 1.07 and GCTA semantics:
3: Just kidding. Contact us, and we'll send you a build supporting a higher autosome limit. Note that this isn't necessary if you're dealing with a draft assembly with lots of contigs, rather than actual autosomes; the standard build can handle that if you name your contigs 'contig1', 'contig2', etc. and use the --allow-extra-chr flag.

--chr-override ['file']

PLINK 2 saves nonhuman chromosome set information to .pvar and VCF files (in a ##chrSet header line). Thus, after you've run --dog + --make-pgen once, it is not necessary to include --dog in subsequent commands involving that dataset. (This did not work properly before 23 Jan 2021.)

By default, if a chromosome set was explicitly specified on the command line, and it conflicts with an input file ##chrSet header line, PLINK 2 will error out. --chr-override with no parameter causes the command line to take precedence, while "--chr-override file" defers to the file.

--human can now be used to explicitly specify the human chromosome set. This can be useful as an additional sanity check: if you attempt to use --human on a fileset generated with --dog + --make-pgen (or vice versa), PLINK 2 will error out instead of accepting the incorrect chromosome set.

Allele frequencies

When allele frequency estimates are needed, PLINK defaults to using empirical frequencies from the immediate dataset (with a pseudocount added when --af-pseudocount is specified). This is unsatisfactory when processing a small subset of a larger dataset or population.

--read-freq <.afreq/.acount/.gcount/.freq/.frq/.frq.count/.frqx filename>
--error-on-freq-calc

--read-freq loads allele frequency estimates from a --freq (PLINK 1.x ok), --geno-counts, or PLINK 1.9 --freqx report, instead of imputing them from the immediate dataset.
Regarding that last point, it can be useful for computational efficiency to assert that a run doesn't invoke the allele frequency calculation; this is the function of the --error-on-freq-calc flag. (This can be combined with --read-freq; it then verifies that all necessary frequencies/counts were present in the --read-freq file.) Runs involving the following flags may be blocked:
--bad-freqs

When PLINK 2 needs decent allele frequencies, it normally errors out if they aren't provided by --read-freq and fewer than 50 founders are available to impute them from. Use --bad-freqs to force PLINK 2 to proceed in this case. (As the flag name suggests, this is not recommended; it is almost always better to use --read-freq on an appropriate source instead.)

Phenotypes

--pheno ['iid-only'] <filename>
--pheno-name <column ID(s)/range(s)...>
--not-pheno <phenotype ID(s)...>

--pheno causes (additional) phenotype values to be read from the specified space- or tab-delimited file. The first columns of that file must be either FID/IID or just IID (in which case the FID is assumed to be 0). A primary header line is required when using --pheno-name, and optional without it (if it's present, it should begin with 'FID', '#FID', 'IID', or '#IID'). Additional header lines (beginning with '#', not immediately followed by 'FID'/'IID') are permitted before the primary header line. For example:

  ## * If you also need PLINK 1.9 to read this file, add an FID column in front,

PLINK 2 defaults to analyzing all phenotypes in the sample information and --pheno files (as if PLINK 1.x's --all-pheno flag was always in effect). --pheno-name lets you specify a subset of phenotypes to load from the --pheno file (or if no --pheno file was specified, from the sample information file), by column name; separate multiple column names with spaces or commas, and use dashes to designate ranges. (Spaces are not permitted immediately before or after a range-denoting dash.) --pheno-col-nums lets you do the same thing with --pheno file column numbers instead. --no-psam-pheno tells PLINK 2 to ignore all phenotype data in the sample information file and allows .fam files with no phenotype column to be loaded.

When no primary header line is present, phenotypes are assigned the names 'PHENO1', 'PHENO2', etc., and the first two columns are normally assumed to be FID/IID.
Add the 'iid-only' modifier if you want only the first column to be interpreted as ID in this case.

--not-pheno can be used to exclude phenotype(s) by name.

Phenotype encoding

--no-categorical
--input-missing-phenotype <integer>
--no-input-missing-phenotype

Missing case/control or quantitative phenotypes are expected to be encoded as 'NA'/'nan' (any capitalization) or -9. By default, other strings which don't start with a number are now interpreted as categorical phenotype/covariate values; to force them to be interpreted as missing numeric values instead, use --no-categorical.

You can change the numeric missing phenotype code to another integer with --input-missing-phenotype, or just disable -9 with --no-input-missing-phenotype.

Relatedly, when neither --input-missing-phenotype nor --no-input-missing-phenotype is specified, and a phenotype/covariate value in [-8, -9) or (-9, 10] is present, PLINK 2 now errors out when -9 is also present: in this context it is too likely that -9 does not represent a missing value. --neg9-pheno-really-missing suppresses the error.

Case/control phenotypes are expected to be encoded as 1=unaffected (control), 2=affected (case); 0 is accepted as an alternate missing value encoding. If you use the --1 flag, 0 is interpreted as unaffected status instead, while 1 maps to affected. Note that this only affects interpretation of input files; output files still use 1=control/2=case encoding. (Unlike PLINK 1.x, this does not force all phenotypes to be interpreted as case/control.)

Covariates

--covar ['iid-only'] <filename>
--covar-name <column ID(s)/range(s)...>
--not-covar <covariate ID(s)...>

--covar designates the file to load covariates from. The file format is the same as for --pheno (example). The main phenotype is no longer set to missing when a covariate value is missing; instead, this only happens to the temporary phenotype copies used by e.g. the linear/logistic regression routine.

Categorical covariates are now directly supported.
Any nonnumeric string ('NA' and 'nan' are considered to be numbers for this purpose) is treated as a categorical covariate name.

--covar-name works like --pheno-name. It can now be used without --covar (in which case the --pheno or .psam file is the target). --covar-col-nums works like --pheno-col-nums, and refers to the --pheno file when --covar is not specified.

When no primary header line is present, covariates are assigned the names 'COVAR1', 'COVAR2', etc.

--not-covar can be used to exclude covariate(s) by name.

'Cluster' import

--within <filename> [new phenotype name]

--within constructs a PLINK 2 categorical phenotype out of a PLINK 1.x 'cluster' file.
--family causes FID to be treated as a categorical phenotype. As with --within, if no new phenotype name is given, it defaults to 'CATPHENO'.

--missing-catname <string>

By default, missing categorical phenotype/covariate values are encoded as 'NONE'; you can change this with --missing-catname. (This value is case-sensitive.)

--family-missing-catname can be used to specify another FID for --family to treat as missing.

Reference genome

--fa <FASTA file>

Most PLINK operations don't require the full reference genome sequence behind the coordinates in the .pvar/.bim file. However, there are a few exceptions; use --fa to specify a FASTA-formatted reference genome file when it's needed. This must match the FASTA used for alignment (unless the coordinates have since been lifted over to another reference genome build, anyway). If you're unsure what that is, this blog post by Heng Li lists common choices (and provides suggestions). hs37d5 and GRCh38_full_analysis_set_plus_decoy_hla FASTA files can be downloaded from the Resources page.