D: 4 Oct 2024

Standard data input

Most of PLINK's calculations operate on tables of samples and variant calls. The following flags are available for defining the form and location of this input, and associated metadata.

PLINK 1 binary

--bfile <prefix> ['vzs']

The --bfile flag normally causes the binary fileset prefix.bed + prefix.bim + prefix.fam to be referenced. (The structure of these files is described in the PLINK 1.9 file formats appendix.)

PLINK 2 supports Zstandard compression of large text files. You can use the 'vzs' modifier to tell PLINK to look for prefix.bim.zst instead of a plain .bim file (the 'v' indicates that it's the variant info file which has the different file extension).

--bed <filename>
--bim <filename>
--fam <filename>

--bed, --bim, and --fam let you specify the full name of one part of the PLINK 1 binary fileset, taking precedence over --bfile. For example,

plink2 --bfile toy --bed bob --freq

would reference the binary fileset bob + toy.bim + toy.fam, since --bed takes a full filename rather than a prefix.

(.fam files are also present in some other fileset types. The --fam flag has the same function when loading them.)
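The precedence rule can be illustrated with a short Python sketch. The function name and signature are hypothetical and are not part of PLINK; the sketch only assumes what is stated above, namely that an explicitly named file replaces the corresponding prefix-derived name and that 'vzs' changes the variant file's extension.

```python
def resolve_plink1_fileset(bfile=None, bed=None, bim=None, fam=None, vzs=False):
    """Return the (bed, bim, fam) filenames a run would reference.

    Hypothetical helper: explicit filenames are used verbatim and take
    precedence over names derived from the --bfile prefix.
    """
    def pick(explicit, suffix):
        if explicit is not None:
            return explicit
        if bfile is None:
            raise ValueError("no prefix and no explicit filename for " + suffix)
        return bfile + suffix

    # 'vzs' only affects the variant info (.bim) file.
    bim_suffix = ".bim.zst" if vzs else ".bim"
    return (pick(bed, ".bed"), pick(bim, bim_suffix), pick(fam, ".fam"))


print(resolve_plink1_fileset(bfile="toy", bed="other_dir/geno.bed"))
print(resolve_plink1_fileset(bfile="toy", vzs=True))
```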

--no-fid
--no-parents
--no-sex

These allow you to use .fam and .ped files which lack family ID, parental ID, and/or sex columns. (See also --no-pheno below.)

What's PROVISIONAL_REF?

PLINK 2 treats the A2-allele (usually 6th) column in a .bim file as REF, and the A1-allele (usually 5th) column as ALT. However, this is only expected to be correct ~95-99% of the time, because PLINK 1.x usually sets A2=major whenever writing a .bed+.bim+.fam fileset, and human reference genomes contain some minor alleles.

To distinguish these possibly-not-REF alleles from actually-known-to-be-REF alleles, PLINK 2 tracks a "PROVISIONAL_REF" flag for each variant, and most commands that generate a file with a REF output column also add a PROVISIONAL_REF? column when any variant has this flag set.

To address this, use --ref-allele or --ref-from-fa to set REF alleles correctly, and the PLINK 2 fileset format (--pfile / --make-pgen instead of --bfile / --make-bed) to keep track of them across runs.
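As an illustration of the column interpretation described above, here is a minimal Python sketch that reads one .bim line and labels the alleles the way PLINK 2 is described as doing. The function name and the returned dictionary layout are hypothetical. Taking the last two whitespace-separated fields as A1 and A2 is an assumption made to cover the "usually 5th/6th" wording.

```python
def parse_bim_line(line):
    """Interpret one .bim line: A2 becomes provisional REF, A1 becomes ALT.

    Hypothetical helper; assumes A1 and A2 are the last two fields and
    that the base-pair position is the field just before them.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("too few columns in .bim line")
    a1, a2 = fields[-2], fields[-1]
    return {
        "chrom": fields[0],
        "id": fields[1],
        "pos": int(fields[-3]),
        "ref": a2,                # A2 column is treated as REF
        "alt": a1,                # A1 column is treated as ALT
        "provisional_ref": True,  # not yet confirmed against a reference
    }


print(parse_bim_line("1 rs123 0 1000 A G"))
```

The provisional flag stays set in this sketch because a .bim file alone cannot confirm which allele matches the reference genome.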

PLINK 2 binary

--pfile <prefix> ['vzs']
--pgen <filename>
--pgi <filename>
--pvar <filename>
--psam <filename>

The --pfile flag usually causes the binary fileset prefix.pgen + prefix.pvar + prefix.psam to be referenced, while --pgen/--pvar/--psam let you fully name one file at a time. New features supported by these formats include:

  • Reliable tracking of REF vs. ALT alleles.
  • Computationally efficient compression of low-MAF variants and high-LD adjacent variant pairs.
  • Phased genotypes.
  • Dosages.
  • VCF-style header information (including species-specific chromosome info, so you don't have to constantly use --chr-set).
  • Multiallelic variants.
  • Multiple phenotypes.
  • Named categorical phenotypes (a phenotype string which doesn't start with a number is interpreted as a category name).

See the draft specification for more details.

With --pfile, you can use the 'vzs' modifier to tell PLINK to look for prefix.pvar.zst instead of a plain .pvar file.

By default, if the .pgen file does not have an embedded index, the index file is assumed to be the .pgen filename with ".pgi" appended; you can specify a different path with --pgi.

--bpfile <prefix> ['vzs']

If you only need the first four features, you can also use --bpfile, which references prefix.pgen + prefix.bim + prefix.fam. Note that the .pgen file tracks whether the .bim file's sixth column can be trusted to contain only REF alleles.

You can use the 'vzs' modifier to tell PLINK to look for prefix.bim.zst instead of a plain .bim file.

--keep-autoconv ['vzs']

PLINK 1 binary and PLINK 2 binary are PLINK 2's preferred input formats. Most other formats are automatically converted to PLINK 2 binary before the main loading sequence[1]. As a consequence, if you're performing multiple operations on the same otherwise-formatted files, you may want to keep the autoconversion products and work with them instead of repeating the conversion on every run. PLINK gives you several ways to handle this situation.

1. If you just want to convert your data, don't use any other flags besides --out. For example:

plink2 --vcf my.vcf --out binary_fileset

This entirely skips the main loading sequence, so filters like --extract, --hwe, and --snps-only are not permitted (you'll get an error if you attempt to use them).

2. You can produce a binary fileset which is a filtered version of your text data. Use --make-pgen (or --make-bed/--make-bpgen) for this.

3. You can directly analyze the text fileset. In this case, the autoconversion products are silently deleted at the end of the run[2], to avoid clogging your drive with unwanted files. For example, the following command writes an allele frequency report to results.afreq, and doesn't leave any other files behind besides results.log:

plink2 --vcf my.vcf --freq --out results

4. You can analyze the text fileset while specifying (with --keep-autoconv) that you also want to keep the autoconversion products. So the following command leaves behind results.pgen, results.pvar and results.psam as well as results.afreq and results.log:

plink2 --vcf my.vcf --freq --keep-autoconv --out results

1: Since binary files are so much smaller than the equivalent text files, we expect that this will not put undue pressure on your available disk space. This architectural choice allows PLINK's core to focus entirely on efficient streaming processing of binary data; we hope the memory usage, development speed, and performance benefits we're able to deliver as a result are worth any slight inconvenience.
2: If you interrupt PLINK with e.g. Ctrl-C, or the program crashes, the files will not be deleted. You can use "rm *-temporary.*" (or "del *-temporary.*" on Windows) to clean up the mess.

Variant Call Format

--vcf <filename> ['dosage='<field>]
--bcf <filename> ['dosage='<field>]

--vcf loads a genotype VCF file, extracting information which can be represented by the PLINK 2 binary format and ignoring everything else (after applying the load filters described below); --bcf does the same thing for binary-VCF files. For example, per-call read depths and quality scores are discarded, but you can filter on them first.

You can combine these with --fam/--psam. If you do, PLINK 2 will verify the sample IDs match and appear in the same order in the two files, and the sample information will be loaded. (It may be necessary to use --double-id/--const-fid/--id-delim to get the IDs to match.) Otherwise, an empty sample information file will be generated.

If your file contains chrX, you should usually include --split-par in your import command; otherwise, if there are pseudoautosomal regions at the beginning and end of chrX which contain diploid variant calls for males, they won't be handled properly. See the chrX import section for more discussion.

By default, dosage information is not imported. To import the GP field (a posterior probability per possible genotype, not phred scaled), add 'dosage=GP' (or 'dosage=GP-force', see below). To import Minimac4-style DS+HDS phased dosage, add 'dosage=HDS'. 'dosage=DS' (or anything else for now) causes the named field to be interpreted as a Minimac3-style dosage.

Note that, in the dosage=GP case, PLINK 2 collapses the probabilities down to dosages; you cannot use PLINK 2 to losslessly convert VCF FORMAT/GP data to e.g. BGEN format. To make this more obvious, PLINK 2 now errors out when dosage=GP is used without --import-dosage-certainty on a file with a FORMAT/DS header line, since dosage=DS extracts the same information more quickly in this situation. You can suppress this error with 'dosage=GP-force'.

In all of these cases, hardcalls are now regenerated from scratch from the dosages, using the --hard-call-threshold logic described below. As a consequence, variants with no GT field can now be imported; they will be assumed to contain only diploid calls when HDS is also absent.

'Sites-only' VCF files, such as those released by the gnomAD project, can also be loaded with --vcf, as long as you aren't performing any operations which require sample or genotype information.

--vcf-require-gt

By default, when the GT field is absent, the variant is kept and all genotypes are set to missing (unless dosages are present). To skip all variants with no GT field instead, use --vcf-require-gt.

--vcf-min-gq <val>

--vcf-min-dp <val>
--vcf-max-dp <val>

--vcf-min-gq excludes all genotype calls with GQ below the given (nonnegative, decimal values permitted) threshold. Missing GQ values are not treated as being below the threshold.

--vcf-min-dp does the same for a DP threshold, while --vcf-max-dp excludes genotype calls with DP above the given threshold (this often corresponds to an unwanted variant calling artifact).
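The per-call filtering rules can be sketched in Python as follows. The function name is hypothetical. Treating a missing DP value as passing is an assumption carried over from the stated GQ behavior; the text above only makes that explicit for GQ.

```python
def keep_genotype_call(gq=None, dp=None, min_gq=None, min_dp=None, max_dp=None):
    """Return True if a genotype call survives the GQ/DP thresholds.

    Hypothetical helper. None means "value missing" for gq/dp and
    "filter not requested" for the thresholds.
    """
    # A missing GQ is not treated as being below the threshold.
    if min_gq is not None and gq is not None and gq < min_gq:
        return False
    # Assumption: a missing DP is handled the same way as a missing GQ.
    if min_dp is not None and dp is not None and dp < min_dp:
        return False
    if max_dp is not None and dp is not None and dp > max_dp:
        return False
    return True


print(keep_genotype_call(gq=15, min_gq=20))         # below the GQ threshold
print(keep_genotype_call(gq=None, min_gq=20))       # missing GQ is kept
print(keep_genotype_call(dp=500, max_dp=200))       # suspiciously deep call
```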

--vcf-half-call <mode>

The current VCF standard does not specify how '0/.' and similar GT values should be interpreted. By default (mode 'error'/'e'), PLINK errors out and reports the line number of the anomaly. Should the half-call be intentional, though (this can be the case with Complete Genomics data), you can request the following other modes:

  • 'haploid'/'h': Treat half-calls as haploid/homozygous (the PLINK 2 file format does not distinguish between the two).
  • 'missing'/'m': Treat half-calls as missing.
  • 'reference'/'r': Treat the missing part as reference.
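The four modes can be illustrated with a small Python sketch that resolves a single diploid GT string. The function name and return convention (a list of allele codes, or None for a missing genotype) are hypothetical; this is a reading of the mode descriptions above, not PLINK's parser.

```python
def resolve_half_call(gt, mode="error"):
    """Interpret a VCF GT string such as '0/.' under a half-call mode.

    Hypothetical helper: returns a list of allele codes, or None when the
    genotype is treated as missing.
    """
    alleles = gt.replace("|", "/").split("/")
    missing = [a == "." for a in alleles]
    if all(missing):
        return None      # fully missing call, not a half-call
    if not any(missing):
        return alleles   # ordinary call, nothing to resolve
    if mode in ("error", "e"):
        raise ValueError("half-call encountered: " + gt)
    if mode in ("haploid", "h"):
        # Keep only the observed allele(s).
        return [a for a in alleles if a != "."]
    if mode in ("missing", "m"):
        return None
    if mode in ("reference", "r"):
        # Fill the missing part with the reference allele code.
        return ["0" if a == "." else a for a in alleles]
    raise ValueError("unknown mode: " + mode)


print(resolve_half_call("0/.", "haploid"))
print(resolve_half_call("./1", "reference"))
print(resolve_half_call("1/.", "missing"))
```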

--vcf-ref-n-missing

The VCF standard does not permit the REF allele to be missing. As a consequence, PLINK converts missing REF alleles (which can appear in e.g. data imported from PLINK 1 .ped files) to 'N' when exporting VCF files. If these aren't mixed with variants where the REF allele is genuinely supposed to be 'N', you can invert this conversion with --vcf-ref-n-missing when re-importing the affected VCF.

--vcf-allow-no-nonvar

Some VCF-generating tools, such as GATK GenotypeGVCFs, ordinarily exclude variants with no ALT alleles observed in the current dataset. When a single-sample VCF is generated in this manner, it is not suitable input for PLINK, which has many commands that assume homozygous-REF genotypes are not underrepresented. As a consequence, --vcf without dosage= now reports a warning when it encounters a single-sample VCF with no 0, 0/0, or 0|0 GT values, out of 1000+ scanned variants with non-missing GT. This will be upgraded to an error in a future build.

The warning/error usually means you should backtrack and regenerate the VCF in a manner that doesn't exclude homozygous-REF genotypes. With GATK GenotypeGVCFs, you should add the --include-non-variant-sites flag.

There are rare edge cases, e.g. most of your variants have 10+ ALT alleles, where you might legitimately have no homozygous-REFs out of 1000+ non-missing genotype calls. In these cases, you can use --vcf-allow-no-nonvar to suppress the warning/error.

Oxford-format genotype

--data <prefix> <REF/ALT mode> ['gzs']
--bgen <filename> <REF/ALT mode> ['snpid-chr']
--gen <filename> <REF/ALT mode>

--sample <filename>
--oxford-single-chr <chromosome code>

--lax-bgen-import

--data normally causes the Oxford-format text genotype fileset prefix.gen + prefix.sample to be imported.

The following modes are supported:

  • 'ref-first': The first allele for each variant is REF.
  • 'ref-last': The last allele for each variant is REF.
  • 'ref-unknown': The last allele for each variant is treated as provisional-REF.

'gzs' tells PLINK 2 to look for prefix.gen.zst instead of a plain .gen file.

--bgen, --gen, and --sample allow you to specify the filenames separately; --bgen is necessary for BGEN-format files, and --gen is necessary if your genomic data file has a .gen.gz extension.

Note that PLINK 2 collapses the raw probabilities stored in .gen/.bgen files down to dosages; you cannot use PLINK 2 to losslessly convert between e.g. BGEN sub-formats. (But if the next program in your pipeline is e.g. BOLT-LMM, which only cares about dosages, it's fine to use PLINK 2 for conversion.)

Additional notes:

  • The original .gen specification had 5 leading columns, but this was later amended to 6. Both flavors are now supported; PLINK 1.9 and 2.0 builds before 16 Apr 2021 did not support the 6-leading-column flavor.
  • With .bgen input, use the 'snpid-chr' modifier to specify that chromosome codes should be read from the "SNP ID" field. (The "SNP ID" field is usually ignored.)
  • If a BGEN v1.2+ file contains sample IDs, it may be imported without a companion .sample file.
  • With .gen input, the first column is normally assumed to contain chromosome codes. To import a single-chromosome .gen with an ignorable first column (or ignore the chromosome and SNP ID fields when importing a .bgen), use --oxford-single-chr.
  • .bgen files from IMPUTE5 may have an overstated variant count in the header. Use --lax-bgen-import to make PLINK 2 tolerate this instead of erroring out.
  • In the .sample file:
    • Sample IDs can be represented as either an ID_1/ID_2 column pair (interpreted as FID/IID), or a single ID column. In the latter case, --double-id/--const-fid/etc. apply.
    • If parental IDs are present, they must be in columns titled 'father'/'mother' (capital letters ok) of type 'D' (discrete covariate), to be loaded properly.
    • If sex information is present, it must be in a column titled 'sex' (capital letters ok) of type 'D' (discrete covariate), and be coded in the usual 1=male/2=female/{0,NA}=unknown manner, to be loaded properly.
    • Other discrete covariate columns are permitted to contain only positive integers, for compatibility with the original .sample specification; in this case, 'C' is prepended to all category names during import. Otherwise, numeric values are not permitted.
    • Binary phenotypes are converted from 1/0 to 2/1 coding.
    • This behavior was updated on 31 Aug 2020. Older builds only support the original .sample specification, without QCTOOLv2 extensions.

--missing-code [comma-separated list of values]
  (alias: --missing_code)

--missing-code lets you specify the set of strings to interpret as missing phenotype values in a .sample file. For example, "--missing-code -9,0,NA,na" would cause '-9', '0', 'NA', and 'na' to all be interpreted as missing phenotypes. (Note that no spaces are currently permitted between the strings.) By default, only 'NA' is interpreted as missing.
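For illustration, the comma-separated argument can be modeled in Python as a plain set of strings. Both function names are hypothetical, and the sketch deliberately does no whitespace trimming, mirroring the note that spaces are not permitted between the strings.

```python
def parse_missing_codes(arg=None):
    """Return the set of strings treated as missing phenotype values.

    Hypothetical helper: with no argument, only 'NA' counts as missing.
    """
    if arg is None:
        return {"NA"}
    # No spaces are expected between entries, so nothing is stripped.
    return set(arg.split(","))


def is_missing_phenotype(value, codes):
    """Exact, case-sensitive string match against the missing-code set."""
    return value in codes


codes = parse_missing_codes("-9,0,NA,na")
print(sorted(codes))
print(is_missing_phenotype("-9", codes), is_missing_phenotype("1.5", codes))
```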

Oxford-format phased reference panel

--haps <filename> [{ref-first | ref-last}]
--legend <filename> <chromosome code>

--haps causes the named Oxford-format phased haplotype file to be imported. If (and only if) the --haps file does not contain header columns, it is also necessary to use --legend to specify the chromosome code and the .legend file with the rest of the variant information.

When this isn't used with --sample, the new sample IDs are of the form 'per#', starting with 'per0'.

PLINK 1 text genotype

--pedmap <prefix>

--ped <filename>
--map <filename>

--tfile <prefix>

--tped <filename>
--tfam <filename>

The --pedmap flag causes the PLINK 1 sample-major text genotype fileset prefix.ped + prefix.map to be imported. --ped and --map allow you to specify the filenames separately.

Similarly, the --tfile flag causes the PLINK 1 variant-major text genotype fileset prefix.tped + prefix.tfam to be imported, while --tped and --tfam allow you to specify the filenames separately.

Note that the old flag for importing PLINK 1 sample-major text filesets, '--file', causes PLINK 2 to error out; it is not automatically translated to '--pedmap'. This is because continued usage of .ped + .map filesets is usually a mistake. The format is simultaneously highly inefficient, even relative to other text formats, and limited in scope (unobserved minor allele codes can't be stored).

PLINK 1 dosage

--import-dosage <file> ['noheader'] ['id-delim='<char>] ['skip0='<i>]
                       ['skip1='<j>] ['skip2='<k>] ['dose1'] ['format='<mode>]
                       [{ref-first | ref-last}] ['single-chr='<code>]
                       ['chr-col-num='<col #>] ['pos-col-num='<col #>]

--import-dosage causes the named PLINK 1 dosage file to be imported.

  • You must also specify a sample information file with --fam/--psam.
  • By default, PLINK assumes that the file contains a header line, which has 'SNP' in (1-based) column i+1, 'A1' in column i+j+2, 'A2' in column i+j+3, and sample FID/IIDs starting from column i+j+k+4. (i/j/k are normally zero, but can be changed with 'skip0', 'skip1', and 'skip2' respectively. FID/IID are normally assumed to be separate tokens, but if they're merged into a single token you can specify the delimiter with 'id-delim='.) If such a header line is not present, use the 'noheader' modifier; samples will then be assumed to appear in the same order as they do in the .psam/.fam file.
  • You may specify a companion .map file with --map. If you do not,
    • 'single-chr=' can be used to specify that all variants are on the named chromosome. Otherwise, you can use 'chr-col-num=' to read chromosome codes from the given (1-based) column number.
    • 'pos-col-num=' causes base-pair coordinates to be read from the given column number.
  • The 'format=' modifier lets you specify the number of values used to represent each dosage. 'format=1' normally indicates a single 0..2 allele 1 expected count; 'dose1' modifies this to a 0..1 frequency. 'format=2' indicates a 0..1 homozygous A1 likelihood followed by a 0..1 het likelihood. 'format=3' indicates 0..1 hom A1, 0..1 het, 0..1 hom A2. 'format=infer' (the default) infers the format from the number of columns in the first nonheader line.
    Note that, for 'format=3', the third value in each triplet is not actually parsed by PLINK; as a result, '0 0 0' is interpreted as homozygous A2 rather than a missing call.
  • By default, the A2 allele is treated as the provisional reference; 'ref-first' and 'ref-last' have the usual effect.

For example, a .traw file exported by PLINK 2 can usually be imported with

plink2 --import-dosage prefix.traw skip0=1 skip1=2 id-delim=_ chr-col-num=1 pos-col-num=4 ref-first
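The header column arithmetic described above is easy to get wrong by one, so here is a small Python sketch of it. The function name is hypothetical; the formulas are the i+1, i+j+2, i+j+3, and i+j+k+4 expressions from the header-line description, with i, j, and k set by 'skip0', 'skip1', and 'skip2'.

```python
def dosage_header_columns(skip0=0, skip1=0, skip2=0):
    """Return the expected 1-based header column numbers.

    Hypothetical helper implementing the i/j/k offsets from the
    --import-dosage header description.
    """
    i, j, k = skip0, skip1, skip2
    return {
        "SNP": i + 1,
        "A1": i + j + 2,
        "A2": i + j + 3,
        "first_sample": i + j + k + 4,
    }


print(dosage_header_columns())                  # default layout
print(dosage_header_columns(skip0=1, skip1=2))  # offsets used in the .traw example
```

With skip0=1 and skip1=2, the variant ID is expected in column 2, the two alleles in columns 5 and 6, and sample columns from column 7 onward. That leaves columns 1 and 4 free for the chromosome and position fields that 'chr-col-num=1' and 'pos-col-num=4' point at.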

Other formats

--lfile <prefix>

--lgen <filename>
--reference <filename>
--allele-count

--23file <filename> [family ID] [indiv ID] [sex] [phenotype] [paternal ID]
                    [maternal ID]

These import functions are not yet implemented in PLINK 2.0. For now, use PLINK 1.9 to convert them to PLINK 1 binary.

Single-part sample ID import

--double-id
--const-fid [FID]
--id-delim [delimiter]

--iid-sid

PLINK 2 allows sample IDs to have 1-3 components.

  • An individual ID ("IID") is required.
  • A family ID ("FID") is optional. When it's explicitly defined, it must be positioned before the IID. If it's undefined, it's treated as '0'.
  • A source ID ("SID") is optional. When it's explicitly defined, it must be positioned after the IID. If it's undefined, it's treated as '0'.

VCF and .bgen files contain single-part sample IDs. As implied above, when PLINK 2 encounters a single-part sample ID, its default behavior is to set the IID to that value; the FID and SID are implicitly treated as '0'. However, unlike the case with e.g. --keep where you can add extra column(s) to the input file, there is no standard way to specify nonzero FIDs or SIDs in VCF or .bgen files. To address this limitation, the following flags let you convert VCF/.bgen sample IDs into multiple parts:

  • --double-id causes both family and individual IDs to be set to the sample ID.
  • --const-fid converts sample IDs to individual IDs while setting all family IDs to a single value (default '0'). Again, "--const-fid 0" is now the default behavior when none of these three flags is present; this is a change from PLINK 1.9's --double-id default.
  • --id-delim normally causes single-delimiter sample IDs to be parsed as <FID><delimiter><IID>, and double-delimiter IDs as <FID><delim><IID><delim><SID>; the default delimiter is '_'. With --iid-sid (which also affects --sample-diff), single-delimiter IDs are parsed as <IID><delimiter><SID> instead.
    --id-delim can no longer be used with --double-id/--const-fid; it will error out if any ID lacks the delimiter.
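The conversion rules above can be summarized in a Python sketch that maps one single-part sample ID to an (FID, IID, SID) triple. The function name and keyword arguments are hypothetical stand-ins for the flags. Raising an error for IDs with more than two delimiters is an assumption, since only the zero-delimiter case is stated explicitly.

```python
def parse_sample_id(raw, double_id=False, const_fid=None, id_delim=None,
                    iid_sid=False):
    """Split a VCF/.bgen sample ID into (FID, IID, SID).

    Hypothetical helper; '0' stands for an undefined FID or SID.
    """
    if id_delim is not None:
        parts = raw.split(id_delim)
        if len(parts) == 2:
            if iid_sid:
                return ("0", parts[0], parts[1])   # <IID><delim><SID>
            return (parts[0], parts[1], "0")       # <FID><delim><IID>
        if len(parts) == 3:
            return (parts[0], parts[1], parts[2])  # <FID><delim><IID><delim><SID>
        # Zero delimiters is an error per the text; more than two is assumed to be.
        raise ValueError("unexpected delimiter count in sample ID: " + raw)
    if double_id:
        return (raw, raw, "0")
    # Default behavior is equivalent to "--const-fid 0".
    fid = const_fid if const_fid is not None else "0"
    return (fid, raw, "0")


print(parse_sample_id("NA12878"))
print(parse_sample_id("NA12878", double_id=True))
print(parse_sample_id("FAM1_S1", id_delim="_"))
print(parse_sample_id("S1_blood", id_delim="_", iid_sid=True))
```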

--export's id-delim= and id-paste= modifiers let you control conversion in the other direction.

--idspace-to <character>

Since PLINK sample IDs cannot contain spaces, an error is normally reported when there's a space in a VCF/.bgen sample ID. To work around this, you can use --idspace-to to convert all spaces in sample IDs to another character. This happens before regular parsing, so when the --idspace-to and --id-delim characters are identical, both space and the original --id-delim character are interpreted as FID/IID/SID delimiters.

If you only want the space character to function as a delimiter, use "--id-delim ' '". (This is not compatible with --rerun.)

Dosage import settings

--hard-call-threshold <max distance from nearest hardcall>

--dosage-erase-threshold <max distance from nearest hardcall>

The PLINK 2 binary file format supports allelic dosages, with ~4 decimal place precision. However, some of PLINK 2's commands do not make use of dosage data. Thus, when importing dosage data, PLINK 2 also saves (possibly missing) hardcalls for those commands to use.

By default, a proper hardcall is saved when the distance from the nearest hardcall, defined as

   0.5 * sum_i |x_i - round(x_i)|

(where the x_i's are 0..2 allele dosages, even on haploid chromosomes), is not greater than 0.1; i.e. for biallelic variants, the alternate allele dosage needs to be in [0, 0.1], [0.9, 1.1], or [1.9, 2.0]. Otherwise, a missing hardcall is saved. You can change the acceptable distance with --hard-call-threshold.

The --hard-call-threshold value must be less than 0.5, so dosages of exactly 0.5 and 1.5 will always be translated to missing hardcalls. When this is a problem, see --make-[b]pgen's 'fill-missing-from-dosage' modifier.
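The distance rule can be checked numerically with a short Python sketch for the biallelic case. The function names are hypothetical. The small floating-point tolerance is an assumption of this sketch, added so that boundary values such as 0.1 behave as described; it is not a statement about how PLINK 2 stores or compares dosages.

```python
def hardcall_distance(dosages):
    """Half the summed distance of each 0..2 allele dosage from the nearest integer."""
    return 0.5 * sum(abs(x - round(x)) for x in dosages)


def biallelic_hardcall(alt_dosage, threshold=0.1, eps=1e-9):
    """Return the ALT allele count (0, 1, or 2), or None for a missing hardcall.

    Hypothetical helper; eps is a float-rounding tolerance assumed here.
    """
    if not 0 <= threshold < 0.5:
        raise ValueError("threshold must be in [0, 0.5)")
    # For a biallelic variant the two allele dosages sum to 2.
    dist = hardcall_distance([2.0 - alt_dosage, alt_dosage])
    if dist <= threshold + eps:
        return round(alt_dosage)
    return None


for d in (0.05, 0.3, 1.08, 1.95, 0.5):
    print(d, biallelic_hardcall(d))
```

A dosage of exactly 0.5 has distance 0.5, so it is reported as missing for every permitted threshold.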

In some cases, e.g. when your reference allele dosage is 1.999, you may be willing to throw away the raw dosage value and only save a hardcall. --dosage-erase-threshold erases all dosage values which are no further than the specified distance from the nearest hardcall.

--hard-call-threshold and --dosage-erase-threshold can also be used with --make-[b]pgen and --make-bed; it is not necessary to perform a full-blown import to adjust the dosage-to-hardcall mapping.

--import-dosage-certainty <min certainty>

Some dosage formats include separate probabilities for every possible genotype, e.g. {P(0/0)=0.2, P(0/1)=0.52, P(1/1)=0.28} or {P(0/0)=0.005, P(0/1)=0.91, P(1/1)=0.085}. By default, PLINK 2 treats these two sets of probabilities in the same manner: in both cases, the expected alternate allele dosage is 1.08, which is in [0.9, 1.1] so a heterozygous hardcall is saved.

However, the first call is far less certain than the second; you'll frequently prefer to not save a dosage at all when the highest genotype probability is only 0.52, whereas 0.91 is a different story. During import, you can use --import-dosage-certainty to make this distinction.
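The two example probability sets can be worked through in Python. The function names are hypothetical. The rule that a call is dropped when its highest genotype probability falls below the threshold is this sketch's reading of the paragraph above, stated as an assumption rather than as PLINK 2's exact logic.

```python
def alt_dosage_from_probs(p_hom_ref, p_het, p_hom_alt):
    """Expected ALT allele count: 0*P(0/0) + 1*P(0/1) + 2*P(1/1)."""
    return p_het + 2.0 * p_hom_alt


def import_dosage(probs, min_certainty=None):
    """Return the expected ALT dosage, or None if the call is too uncertain.

    Hypothetical helper; assumes the certainty of a call is its highest
    genotype probability.
    """
    if min_certainty is not None and max(probs) < min_certainty:
        return None
    return alt_dosage_from_probs(*probs)


uncertain = (0.2, 0.52, 0.28)
confident = (0.005, 0.91, 0.085)
print(import_dosage(uncertain), import_dosage(confident))
print(import_dosage(uncertain, 0.9), import_dosage(confident, 0.9))
```

Both sets collapse to a dosage of about 1.08 without a threshold; with a threshold of 0.9, only the second one is retained in this sketch.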

Chromosome X

--lax-chrx-import

Chromosome X data has extra subtleties:

  • Males are haploid, while females are diploid.
  • There may be pseudoautosomal regions (PARs) where everyone should be treated as diploid.

In order for PLINK 2 to treat this data correctly, you may need to provide sex information and/or the --split-par flag when importing it. PLINK 2 builds from August 2022 onward print a warning when it looks like chrX data is being mis-imported. You can disable this warning (which will become an error in the future) with the --lax-chrx-import flag, but you should first double-check what you're doing.

Ploidy > 2

--polyploid-mode <mode>

VCF and .bgen files can store genotypes with ploidy > 2, which are not supported by the PLINK 2 file format. The following import modes are currently supported:

  • 'error'/'e' (default): Report an error.
  • 'missing'/'m': Treat the entire genotype as missing.

Randomized data

--dummy <#samples> <#SNPs> [missing geno/dosage freq(s)] [missing pheno freq]
                           [{acgt | 1234 | 12}] ['pheno-ct='<count>]
                           ['scalar-pheno'] ['phase-freq='<rate>]
                           ['dosage-freq='<rate>]

This tells PLINK to generate a simple dataset from scratch (useful for basic software testing), with the specified number of samples and SNPs. All generated samples are females with random genotype and phenotype values, and all SNPs are on chromosome 1 with positions 0, 1, 2, etc.

  • Hardcall (though not dosage) allele frequencies are uniformly distributed in [0, 1], and some explicit LD is simulated. This is a change from PLINK 1.x (and PLINK 2.0 builds before 24 Apr 2022), which only generated variants with AF=0.5 and didn't simulate LD.
  • By default, the missing dosage frequency is zero. This can be changed by providing a comma-separated list of possible frequencies (one frequency in the list is selected for each variant).
  • By default, the missing phenotype frequency is zero. This can be changed by providing both a missing geno/dosage frequency (the 3rd argument) and a 4th numeric argument; the 4th argument is then interpreted as the missing phenotype frequency.
  • By default, allele codes are As and Bs. The 'acgt' modifier causes alleles to be randomly selected from {A, C, G, T}, while '1234' causes them to be selected from {1, 2, 3, 4}, and '12' makes them 1s and 2s.
  • By default, one binary phenotype is generated. 'pheno-ct=' can be used to change the number of phenotypes, and 'scalar-pheno' causes these phenotypes to be N(0, 1) scalars.
  • By default, all genotypes/dosages are unphased. To phase some of them, use 'phase-freq='.
  • By default, only hardcall genotypes are present. To generate some decimal dosages, use 'dosage-freq='. (These dosages are affected by --hard-call-threshold and --dosage-erase-threshold.)

Genotype encoding

--input-missing-genotype <char>

'.' is always interpreted as a missing genotype code in input files. By default, '0' also is; you can change this second missing code with --input-missing-genotype.

Nonstandard chromosome IDs

--strict-extra-chr

--allow-extra-chr ['0']
  (alias: --aec)

When --strict-extra-chr is on, PLINK 2.0 normally reports an error if the input data contains unrecognized chromosome codes (such as hg19 haplotype chromosomes or unplaced contigs). If none of the additional codes start with a digit, you can permit them with the --allow-extra-chr flag. (These contigs are ignored by most analyses which skip unplaced regions.)

--allow-extra-chr's '0' modifier causes these unrecognized chromosome codes to be treated as if they had been set to zero. This is sometimes necessary to produce reports readable by older software. Note that POS values for the affected variants tend to be rendered useless.

--chr-set <autosome ct> ['no-x'] ['no-y'] ['no-xy'] ['no-mt']

--cow
--dog
--horse
--mouse
--rice
--sheep
--autosome-num <value>

--chr-set changes the chromosome set. The first parameter specifies the number of diploid autosome pairs if positive, or haploid chromosomes if negative. (Polyploid and aneuploid data are not supported, and there is currently no special handling of sex or mitochondrial chromosomes in all-haploid chromosome sets.)

Given diploid autosomes, the remaining modifiers let you indicate the absence of specific non-autosomal chromosomes, as an extra sanity check on the input data. Note that, when there are n autosome pairs, the X chromosome is assigned numeric code n+1, Y is n+2, XY (old representation of pseudo-autosomal region of X) is n+3, and MT (mitochondria) is n+4. PAR1/PAR2 do not have numeric codes associated with them, and are disabled by 'no-xy'.
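The numeric code assignment can be written out as a short Python sketch. The function name and keyword arguments are hypothetical stand-ins for the --chr-set modifiers, and only the diploid (positive autosome count) case is covered.

```python
def special_chromosome_codes(autosome_ct, no_x=False, no_y=False,
                             no_xy=False, no_mt=False):
    """Map non-autosomal chromosome names to their numeric codes.

    Hypothetical helper for a diploid chromosome set with n autosome
    pairs: X = n+1, Y = n+2, XY = n+3, MT = n+4.
    """
    if autosome_ct < 1:
        raise ValueError("this sketch only covers diploid autosome sets")
    codes = {}
    if not no_x:
        codes["X"] = autosome_ct + 1
    if not no_y:
        codes["Y"] = autosome_ct + 2
    if not no_xy:
        codes["XY"] = autosome_ct + 3
    if not no_mt:
        codes["MT"] = autosome_ct + 4
    return codes


print(special_chromosome_codes(22))                             # human
print(special_chromosome_codes(19, no_xy=True, no_mt=True))     # 19 autosome pairs
```

With 22 autosome pairs this gives the familiar human codes 23 (X), 24 (Y), 25 (XY), and 26 (MT).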

n is currently limited to 95, so if you're working with adder's-tongue fern genomes, you're out of luck[3].

The other flags support PLINK 1.07 and GCTA semantics:

  • --cow = "--chr-set 29 no-xy"
  • --dog = "--chr-set 38"
  • --horse = "--chr-set 31 no-xy no-mt"
  • --mouse = "--chr-set 19 no-xy no-mt"
  • --rice[4] = "--chr-set -12"
  • --sheep = "--chr-set 26 no-xy no-mt"
  • --autosome-num <value> = "--chr-set <value> no-y no-xy no-mt"

3: Just kidding. Contact us, and we'll send you a build supporting a higher autosome limit. Note that this isn't necessary if you're dealing with a draft assembly with lots of contigs, rather than actual autosomes—the standard build can handle that if you name your contigs 'contig1', 'contig2', etc. and use the --allow-extra-chr flag.
4: Rice genomes are actually diploid, but breeding programs frequently work with doubled haploids.

--chr-override ['file']
--human

PLINK 2 saves nonhuman chromosome set information to .pvar and VCF files (in a ##chrSet header line). Thus, after you've run --dog + --make-pgen once, it is not necessary to include --dog in subsequent commands involving that dataset. (This did not work properly before 23 Jan 2021.)

By default, if a chromosome set was explicitly specified on the command line, and it conflicts with an input file ##chrSet header line, PLINK 2 will error out. --chr-override with no parameter causes the command line to take precedence, while "--chr-override file" defers to the file.

--human can now be used to explicitly specify the human chromosome set. This can be useful as an additional sanity check: if you attempt to use --human on a fileset generated with --dog + --make-pgen (or vice versa), PLINK 2 will error out instead of accepting the incorrect chromosome set.

Allele frequencies

When allele frequency estimates are needed, PLINK defaults to using empirical frequencies from the immediate dataset (with a pseudocount added when --af-pseudocount is specified). This is unsatisfactory when processing a small subset of a larger dataset or population.

--read-freq <.afreq/.acount/.gcount/.freq/.frq/.frq.count/.frqx filename>

--error-on-freq-calc

--read-freq loads allele frequency estimates from a --freq (PLINK 1.x ok), --geno-counts, or PLINK 1.9 --freqx report, instead of imputing them from the immediate dataset.

  • When a minor allele code is missing from the main dataset but present in the --read-freq file, it is not filled in by PLINK 2.
  • When a variant entry does not contain enough information to determine all allele frequencies (e.g. the variant is triallelic, but only the REF allele frequency is specified), the entire entry is skipped, and those allele frequencies are estimated from the immediate dataset instead.
  • When you're repeatedly running the same allele-frequency-using command on a very large dataset, and not filtering out any samples, you can often speed things up by running "--freq counts" once, and then using its output file as input to --read-freq in subsequent runs. This lets PLINK 2 skip the allele frequency estimation step.

Regarding that last point, it can be useful for computational efficiency to assert that a run doesn't invoke the allele frequency calculation; this is the function of the --error-on-freq-calc flag. (This can be combined with --read-freq; it then verifies that all necessary frequencies/counts were present in the --read-freq file.) Runs involving the following flags may be blocked:

--bad-freqs

When PLINK 2 needs decent allele frequencies, it normally errors out if they aren't provided by --read-freq and fewer than 50 founders are available to impute them from. Use --bad-freqs to force PLINK 2 to proceed in this case. (As the flag name suggests, this is not recommended; it is almost always better to use --read-freq on an appropriate source instead.)
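As a rough illustration of this check, the decision can be written as follows. The function name and its boolean parameters are hypothetical, and the sketch simplifies --read-freq to "all needed frequencies were supplied".

```python
MIN_FOUNDERS = 50  # threshold stated in the text above


def frequency_source(n_founders, have_read_freq, bad_freqs=False):
    """Return where allele frequencies come from, or raise if the run is refused."""
    if have_read_freq:
        return "read-freq"
    if n_founders >= MIN_FOUNDERS or bad_freqs:
        return "dataset"  # estimated from the immediate dataset
    raise RuntimeError(
        "too few founders for decent allele frequencies; "
        "use --read-freq (preferred) or --bad-freqs"
    )
```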

Phenotypes

--pheno ['iid-only'] <filename>
--pheno-col-nums <1-based column number(s)/range(s)...>

--pheno-name <column ID(s)/range(s)...>
--no-psam-pheno
  (aliases: --no-pheno, --no-fam-pheno)

--not-pheno <phenotype ID(s)...>
  (alias: --phenoExcludeList)

--pheno causes (additional) phenotype values to be read from the specified space- or tab-delimited file. The first columns of that file must be either FID/IID or just IID (in which case the FID is assumed to be 0). A primary header line is required when using --pheno-name, and optional without it (if it's present, it should begin with 'FID', '#FID', 'IID', or '#IID'). Additional header lines (beginning with '#', not immediately followed by 'FID'/'IID') are permitted before the primary header line. For example:

## * If you also need PLINK 1.9 to read this file, add an FID column in front,
##   and fill it with zeroes.
## * 'site' is a categorical variable.  --glm would ignore it if you loaded it
##   as a phenotype (multinomial logistic regression is not implemented), but
##   it's a valid *covariate* for --glm.
#IID  qt1    bmi    site
1110  2.3    22.22  site2
2202  34.12  18.23  site1
...

PLINK 2 defaults to analyzing all phenotypes in the sample information and --pheno files (as if PLINK 1.x's --all-pheno flag was always in effect). --pheno-name lets you specify a subset of phenotypes to load from the --pheno file (or if no --pheno file was specified, from the sample information file), by column name; separate multiple column names with spaces or commas, and use dashes to designate ranges. (Spaces are not permitted immediately before or after a range-denoting dash.) --pheno-col-nums lets you do the same thing with --pheno file column numbers instead. --no-psam-pheno tells PLINK 2 to ignore all phenotype data in the sample information file and allows .fam files with no phenotype column to be loaded.
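To make the name/range syntax concrete, here is a minimal Python sketch of how such a specification could be expanded against a header line. The function select_columns is hypothetical, and trying every dash as a possible range separator is an assumption made so that column names containing dashes still work; it is not a description of PLINK 2's actual parser.

```python
import re


def select_columns(header, spec):
    """Expand a --pheno-name-style spec against phenotype column names.

    header: phenotype column names in file order.
    spec:   names separated by spaces or commas; 'a-b' denotes a range.
    """
    selected = []
    for token in re.split(r"[,\s]+", spec.strip()):
        if not token:
            continue
        if token in header:  # exact column name
            selected.append(token)
            continue
        # Otherwise, try each dash as the range separator.
        for pos in (i for i, ch in enumerate(token) if ch == "-"):
            first, last = token[:pos], token[pos + 1:]
            if first in header and last in header:
                lo, hi = header.index(first), header.index(last)
                if lo > hi:
                    raise ValueError(f"range runs backwards: {token!r}")
                selected.extend(header[lo:hi + 1])
                break
        else:
            raise ValueError(f"unknown column or range: {token!r}")
    return selected
```

With the header columns qt1, bmi, site, age, the spec "qt1-site,age" selects all four columns, and "bmi age" selects just those two.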

When no primary header line is present, phenotypes are assigned the names 'PHENO1', 'PHENO2', etc., and the first two columns are normally assumed to be FID/IID. Add the 'iid-only' modifier if you want only the first column to be interpreted as ID in this case.
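The header and ID-column rules described in this section can be sketched as a small Python reader. This is an illustration under stated assumptions, not PLINK 2's loader: the function read_pheno is hypothetical, values are kept as raw strings, and a header starting with FID is assumed to have IID as its second column.

```python
def read_pheno(lines, iid_only=False):
    """Return (phenotype names, {(FID, IID): [raw values...]})."""
    names = None
    id_cols = None
    table = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        first = fields[0]
        if names is None and first in ("FID", "#FID", "IID", "#IID"):
            # Primary header line: it decides how many ID columns there are.
            id_cols = 2 if first.lstrip("#") == "FID" else 1
            names = fields[id_cols:]
            continue
        if names is None and first.startswith("#"):
            continue  # additional header line before the primary header
        if names is None:
            # No primary header: default names, FID/IID unless iid_only.
            id_cols = 1 if iid_only else 2
            names = [f"PHENO{i}" for i in range(1, len(fields) - id_cols + 1)]
        if id_cols == 2:
            key = (fields[0], fields[1])
        else:
            key = ("0", fields[0])  # FID is assumed to be 0
        table[key] = fields[id_cols:]
    return names, table
```

Applied to the example file above, this yields the names qt1, bmi and site, with sample 1110 stored under the key ("0", "1110").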

--not-pheno can be used to exclude phenotype(s) by name.

Phenotype encoding

--no-categorical

--input-missing-phenotype <integer>

--no-input-missing-phenotype

Missing case/control or quantitative phenotypes are expected to be encoded as 'NA'/'nan' (any capitalization) or -9. By default, other strings which don't start with a number are now interpreted as categorical phenotype/covariate values; to force them to be interpreted as missing numeric values instead, use --no-categorical.

You can change the numeric missing phenotype code to another integer with --input-missing-phenotype, or just disable -9 with --no-input-missing-phenotype.
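A simplified Python sketch of these encoding rules follows. The function name and parameters are hypothetical, and Python's float() is used as a stand-in for PLINK 2's own number parser, so edge cases (for example, strings that begin with digits but are not valid numbers) may be handled differently by PLINK 2 itself.

```python
def classify_phenotype_token(token, missing_code=-9, no_categorical=False):
    """Return ("missing", None), ("numeric", float) or ("categorical", str).

    missing_code:   integer missing code (--input-missing-phenotype), or
                    None to mimic --no-input-missing-phenotype.
    no_categorical: mimic --no-categorical.
    """
    if token.lower() in ("na", "nan"):  # any capitalization
        return ("missing", None)
    try:
        value = float(token)  # assumption: approximates PLINK 2's parsing
    except ValueError:
        if no_categorical:
            return ("missing", None)
        return ("categorical", token)
    if missing_code is not None and value == missing_code:
        return ("missing", None)
    return ("numeric", value)
```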

--1

Case/control phenotypes are expected to be encoded as 1=unaffected (control), 2=affected (case); 0 is accepted as an alternate missing value encoding. If you use the --1 flag, 0 is interpreted as unaffected status instead, while 1 maps to affected. Note that this only affects interpretation of input files; output files still use 1=control/2=case encoding.
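The two input encodings can be compared with a short Python sketch. The names below are hypothetical, and returning None for any code the text does not mention (such as 2 under --1) is an assumption of this sketch rather than documented behavior.

```python
OUTPUT_CODE = {"control": 1, "case": 2}  # output files always use 1/2


def decode_case_control(code, one_flag=False):
    """Map an integer input code to 'control', 'case', or None (missing)."""
    if one_flag:
        mapping = {0: "control", 1: "case"}  # --1 encoding
    else:
        mapping = {1: "control", 2: "case"}  # default; 0 is missing
    return mapping.get(code)


def recode_for_output(code, one_flag=False):
    """Return the 1/2 output code, or None if the input is missing."""
    status = decode_case_control(code, one_flag)
    return None if status is None else OUTPUT_CODE[status]
```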

(Unlike PLINK 1.x, this does not force all phenotypes to be interpreted as case/control.)

Covariates

--covar ['iid-only'] <filename>
--covar-col-nums <1-based column number(s)/range(s)...>

--covar-name <column ID(s)/range(s)...>

--not-covar <covariate ID(s)...>
  (alias: --covarExcludeList)

--covar designates the file to load covariates from. The file format is the same as for --pheno (see the example file in the Phenotypes section). The main phenotype is no longer set to missing when a covariate value is missing; instead, this only happens to the temporary phenotype copies used by e.g. the linear/logistic regression routine.

Categorical covariates are now directly supported. Any nonnumeric string ('NA' and 'nan' are considered to be numbers for this purpose) is treated as a categorical covariate name.

--covar-name works like --pheno-name. It can now be used without --covar (in which case the --pheno or .psam file is the target).

--covar-col-nums works like --pheno-col-nums, and refers to the --pheno file when --covar is not specified.

When no primary header line is present, covariates are assigned the names 'COVAR1', 'COVAR2', etc. --not-covar can be used to exclude covariate(s) by name.

'Cluster' import

--within <filename> [new phenotype name]
--mwithin <n>
--family [new phenotype name]

--within constructs a PLINK 2 categorical phenotype out of a PLINK 1.x 'cluster' file.

  • The first two columns of this file must be FID/IID. By default, cluster names are loaded from column 3, but you can use --mwithin to change this to column (n+2).
  • If no phenotype name is given, it defaults to 'CATPHENO'.
  • If any category names are numeric, all category names must be numeric. In that case, 'C' is added in front of all category names.
  • 'NA' is treated as a missing value.
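The rules in this list can be sketched in Python as follows. The function read_within is hypothetical; it assumes whitespace-delimited input, an exact-case match for 'NA', and float() as the test for "numeric", none of which is confirmed by the text above.

```python
def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def read_within(lines, mwithin=1):
    """Return {(FID, IID): category name or None} from a cluster file."""
    col = mwithin + 1  # 0-based index of 1-based column (n+2)
    raw = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        raw[(fields[0], fields[1])] = fields[col]
    names = {v for v in raw.values() if v != "NA"}
    numeric = {v for v in names if _is_number(v)}
    if numeric and numeric != names:
        raise ValueError("numeric and non-numeric category names are mixed")
    prefix = "C" if numeric else ""  # 'C' is added in front of numeric names
    return {k: (None if v == "NA" else prefix + v) for k, v in raw.items()}
```

For instance, a file whose third column contains 1, 2 and NA yields the categories 'C1' and 'C2' plus one missing value.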

--family causes FID to be treated as a categorical phenotype. As with --within, if no new phenotype name is given, it defaults to 'CATPHENO'.

--missing-catname <string>
--family-missing-catname <string>

By default, missing categorical phenotype/covariate values are encoded as 'NONE'; you can change this with --missing-catname. (This value is case-sensitive.)

--family-missing-catname can be used to specify another FID for --family to treat as missing.

Reference genome

--fa <FASTA file>

Most PLINK operations don't require the full reference genome sequence behind the coordinates in the .pvar/.bim file. However, there are a few exceptions; use --fa to specify a FASTA-formatted reference genome file when it's needed. This must match the FASTA used for alignment (unless the coordinates have since been lifted over to another reference genome build, anyway). If you're unsure what that is, this blog post by Heng Li lists common choices (and provides suggestions).

hs37d5 and GRCh38_full_analysis_set_plus_decoy_hla FASTA files can be downloaded from the Resources page.
