D: 28 Oct 2018

Standard data input

Most of PLINK's calculations operate on tables of samples and variant calls. The following flags are available for defining the form and location of this input, and associated metadata.

Discrete calls

PLINK 1 binary

--bfile [prefix] <vzs>

The --bfile flag normally causes the binary fileset prefix.bed + prefix.bim + prefix.fam to be referenced. (The structure of these files is described in the PLINK 1.9 file formats appendix.)

PLINK 2 supports Zstandard compression of large text files. You can use the 'vzs' modifier to tell PLINK to look for prefix.bim.zst instead of a plain .bim file (the 'v' indicates that it's the variant info file which has the different file extension).

--bed [filename]
--bim [filename]
--fam [filename]

--bed, --bim, and --fam let you specify the full name of one part of the PLINK 1 binary fileset, taking precedence over --bfile. For example,

plink2 --bfile toy --bed bob.bed --freq

would reference the binary fileset bob.bed + toy.bim + toy.fam.

(.fam files are also present in some other fileset types. The --fam flag has the same function when loading them.)
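
The precedence rule above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not PLINK's own code; the function name resolve_plink1_fileset and its parameters are hypothetical.

```python
from typing import Optional, Tuple


def resolve_plink1_fileset(
    prefix: str,
    bed: Optional[str] = None,
    bim: Optional[str] = None,
    fam: Optional[str] = None,
    vzs: bool = False,
) -> Tuple[str, str, str]:
    """Return the (.bed, .bim, .fam) filenames a --bfile request refers to.

    bed/bim/fam stand for full filenames given via --bed/--bim/--fam;
    each one takes precedence over the name derived from the prefix.
    """
    # 'vzs' only changes the extension of the variant info file.
    default_bim = prefix + (".bim.zst" if vzs else ".bim")
    return (
        bed if bed is not None else prefix + ".bed",
        bim if bim is not None else default_bim,
        fam if fam is not None else prefix + ".fam",
    )


print(resolve_plink1_fileset("toy", bed="bob.bed"))
```

Here only the .bed name is overridden; the .bim and .fam names still come from the 'toy' prefix.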

--no-fid
--no-parents
--no-sex

These allow you to use .fam files which lack family ID, parental ID, and/or sex columns. (See also --no-pheno below.)

PLINK 2 binary

--pfile [prefix] <vzs>
--pgen [filename]
--pvar [filename]
--psam [filename]

The --pfile flag usually causes the binary fileset prefix.pgen + prefix.pvar + prefix.psam to be referenced, while --pgen/--pvar/--psam let you fully name one file at a time. New features supported by these formats include:

  • Reliable tracking of REF vs. ALT alleles.
  • Computationally efficient compression of low-MAF and high-LD variants.
  • Phased genotypes.
  • Dosages.
  • VCF-style header information (including species-specific chromosome info, so you don't have to constantly use --chr-set).
  • Multiallelic variants.
  • Multiple phenotypes.
  • Named categorical phenotypes.

See the PLINK 2.0 file formats appendix for more details.

With --pfile, you can use the 'vzs' modifier to tell PLINK to look for prefix.pvar.zst instead of a plain .pvar file.

--bpfile [prefix] <vzs>

If you only need the first four features, you can also use --bpfile, which references prefix.pgen + prefix.bim + prefix.fam. Note that the .pgen file tracks whether the .bim file's sixth column can be trusted to contain only REF alleles.

You can use the 'vzs' modifier to tell PLINK to look for prefix.bim.zst instead of a plain .bim file.

--keep-autoconv

PLINK 1 binary and PLINK 2 binary are PLINK 2's preferred input formats. Most other formats are automatically converted to PLINK 2 binary before the main loading sequence [1]; as a consequence, if you're performing multiple operations on the same otherwise-formatted files, you may want to keep the autoconversion products and work with them, instead of repeating the conversion on every run. PLINK gives you several ways to handle this situation.

1. If you just want to convert your data, don't use any other flags besides --out. For example:

plink2 --vcf my.vcf --out binary_fileset

This entirely skips the main loading sequence, so filters like --extract, --hwe, and --snps-only are not permitted (you'll get an error if you attempt to use them).

2. You can produce a binary fileset which is a filtered version of your text data. Use --make-pgen (or --make-bed/--make-bpgen) for this.

3. You can directly analyze the text fileset. In this case, the autoconversion products are silently deleted at the end of the run [2], to avoid clogging your drive with unwanted files. For example, the following command writes an allele frequency report to results.afreq, and doesn't leave any other files behind besides results.log:

plink2 --vcf my.vcf --freq --out results

4. You can analyze the text fileset while specifying (with --keep-autoconv) that you also want to keep the autoconversion products. So the following command leaves behind results.pgen, results.pvar and results.psam as well as results.afreq and results.log:

plink2 --vcf my.vcf --freq --keep-autoconv --out results

1: Since binary files are so much smaller than the equivalent text files, we expect that this will not put undue pressure on your available disk space. This architectural choice allows PLINK's core to focus entirely on efficient streaming processing of binary data; we hope the memory usage, development speed, and performance benefits we're able to deliver as a result are worth any slight inconvenience.
2: If you interrupt PLINK with e.g. Ctrl-C, or the program crashes, the files will not be deleted. You can use "rm *-temporary.*" (or "del *-temporary.*" on Windows) to clean up the mess.
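
If you would rather script the cleanup from footnote 2 than type the shell command, something like the following Python sketch should do. The helper name remove_temporary_files is hypothetical; the only detail taken from this page is the "*-temporary.*" pattern.

```python
import glob
import os
from typing import List


def remove_temporary_files(directory: str = ".") -> List[str]:
    """Delete files matching *-temporary.* in `directory`.

    Returns the base names of the files that were removed.
    """
    removed = []
    pattern = os.path.join(directory, "*-temporary.*")
    for path in sorted(glob.glob(pattern)):
        # Skip anything that is not a regular file (e.g. a directory).
        if os.path.isfile(path):
            os.remove(path)
            removed.append(os.path.basename(path))
    return removed
```

Run it only in the directory where the interrupted job was writing, since the pattern is not specific to one run.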

Variant Call Format

--vcf [filename] <dosage=[field]>
--bcf [filename]

--vcf loads a genotype VCF file, extracting information which can be represented by the PLINK 2 binary format and ignoring everything else (after applying the load filters described below). For example, per-call read depths and quality scores are discarded, but you can filter on them first.

You can combine this with --fam/--psam. If you do, PLINK 2 will verify the sample IDs match and appear in the same order in the two files, and the sample information will be loaded. (It may be necessary to use --double-id/--const-fid/--id-delim to get the IDs to match.) Otherwise, an empty sample information file will be generated.

By default, dosage information is not imported. To import the GP field (a posterior probability per possible genotype, not phred scaled), add 'dosage=GP' (or 'dosage=GP-force', see below). To import Minimac4-style DS+HDS phased dosage, add 'dosage=HDS'. 'dosage=DS' (or anything else for now) causes the named field to be interpreted as a Minimac3-style dosage.

Note that, in the dosage=GP case, PLINK 2 collapses the probabilities down to dosages; you cannot use PLINK 2 to losslessly convert VCF FORMAT:GP data to e.g. BGEN format. To make this more obvious, PLINK 2 now errors out when dosage=GP is used without --import-dosage-certainty on a file with a FORMAT:DS header line, since dosage=DS extracts the same information more quickly in this situation. You can suppress this error with 'dosage=GP-force'.

In all of these cases, hardcalls are now regenerated from scratch from the dosages, using the --hard-call-threshold logic described below (this was not true before 14 Apr 2018). As a consequence, variants with no GT field can now be imported; they will be assumed to contain only diploid calls when HDS is also absent.

'Sites-only' VCF files, such as those released by the gnomAD project, can also be loaded with --vcf, as long as you aren't performing any operations which require sample or genotype information.

--vcf-require-gt

By default, when the GT field is absent, the variant is kept and all genotypes are set to missing (unless dosages are present). To skip all variants with no GT field instead, use --vcf-require-gt.

--vcf-min-gq [val]

--vcf-min-dp [val]
--vcf-max-dp [val]

--vcf-min-gq excludes all genotype calls with GQ below the given (nonnegative, decimal values permitted) threshold. Missing GQ values are not treated as being below the threshold.

--vcf-min-dp does the same for a DP threshold, while --vcf-max-dp excludes genotype calls with DP above the given threshold (this often corresponds to an unwanted variant calling artifact).
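
As a rough illustration of how the three thresholds combine for a single genotype call, here is a small Python sketch. It is not PLINK's implementation: keep_genotype_call is a hypothetical name, None stands for a missing GQ/DP value, and letting a missing DP pass the --vcf-max-dp check is an assumption (the text above only states the rule for minimum thresholds).

```python
from typing import Optional


def keep_genotype_call(
    gq: Optional[float],
    dp: Optional[float],
    min_gq: Optional[float] = None,
    min_dp: Optional[float] = None,
    max_dp: Optional[float] = None,
) -> bool:
    """Return True if the call survives the GQ/DP load filters."""
    # Missing values are not treated as being below a minimum threshold.
    if min_gq is not None and gq is not None and gq < min_gq:
        return False
    if min_dp is not None and dp is not None and dp < min_dp:
        return False
    # Assumption: a missing DP also passes the maximum-depth check.
    if max_dp is not None and dp is not None and dp > max_dp:
        return False
    return True
```

A call with GQ exactly equal to the threshold is kept, since only values below it are excluded.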

--vcf-half-call [mode]

The current VCF standard does not specify how '0/.' and similar GT values should be interpreted. By default (mode 'error'/'e'), PLINK errors out and reports the line number of the anomaly. Should the half-call be intentional, though (this can be the case with Complete Genomics data), you can request the following other modes:

  • 'haploid'/'h': Treat half-calls as haploid/homozygous (the PLINK 2 file format does not distinguish between the two). This maximizes similarity between the VCF and BCF2 parsers.
  • 'missing'/'m': Treat half-calls as missing.
  • 'reference'/'r': Treat the missing part as reference.
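
The four modes can be summarized with a short Python sketch that rewrites a single GT string. This only illustrates the rules listed above and is not PLINK's parser; the function name resolve_half_call is hypothetical, and the 'haploid' result is written as a homozygous GT string because the two are not distinguished.

```python
def resolve_half_call(gt: str, mode: str = "error") -> str:
    """Rewrite a half-call such as '0/.' according to the chosen mode."""
    sep = "|" if "|" in gt else "/"
    alleles = gt.split(sep)
    called = [a for a in alleles if a != "."]
    if len(alleles) != 2 or len(called) != 1:
        return gt  # not a half-call; leave it untouched
    # Accept the one-letter abbreviations as well.
    mode = {"e": "error", "h": "haploid", "m": "missing", "r": "reference"}.get(mode, mode)
    if mode == "error":
        raise ValueError("half-call encountered: " + gt)
    if mode == "haploid":
        return sep.join([called[0], called[0]])
    if mode == "missing":
        return sep.join([".", "."])
    if mode == "reference":
        return sep.join("0" if a == "." else a for a in alleles)
    raise ValueError("unknown mode: " + mode)
```

For example, './1' becomes '0/1' in 'reference' mode and '1/1' in 'haploid' mode.
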

Oxford-format genotype

--data [prefix] <ref-first | ref-last> <gzs>

--bgen [filename] <snpid-chr> <ref-first | ref-last>
--gen [filename] <ref-first | ref-last>
--sample [filename]

--oxford-single-chr [chromosome code]

--data normally causes the Oxford-format text genotype fileset prefix.gen + prefix.sample to be imported.

By default, the second allele on each line is treated as reference, and flagged as provisional (so if PLINK 2 merges the fileset with a second fileset with non-provisional reference alleles, the second fileset will always be trusted when there's a discrepancy). You can use the 'ref-first' or 'ref-last' modifier to tell PLINK 2 that the first (resp. last) allele in each line really is from a reference genome.

'gzs' tells PLINK 2 to look for prefix.gen.zst instead of a plain .gen file.

--bgen, --gen, and --sample allow you to specify the filenames separately; --bgen is necessary for BGEN-format files, and --gen is necessary if your genomic data file has a .gen.gz extension.

Additional notes:

  • With .bgen input, use the 'snpid-chr' modifier to specify that chromosome codes should be read from the 'SNP ID' field. (Otherwise, the field is ignored.)
  • If a BGEN v1.2+ file contains sample IDs, it may be imported without a companion .sample file.
  • With .gen input, the first column is normally assumed to contain chromosome codes. To import a single-chromosome .gen file with an ignorable first column, use --oxford-single-chr.
  • If sex information is in the .sample file, it must be in a column titled 'sex' (capital letters ok) of type 'D' (discrete covariate), and be coded in the usual 1=male/2=female/0=unknown manner, to be loaded.
  • Binary phenotypes are converted from 1/0 to 2/1 coding, and 'C' is prepended to categorical phenotype values.

--missing-code {comma-separated list of values}
  (alias: --missing_code)

--missing-code lets you specify the set of strings to interpret as missing phenotype values in a .sample file. For example, '--missing-code -9,0,NA,na' would cause '-9', '0', 'NA', and 'na' to all be interpreted as missing phenotypes. (Note that no spaces are currently permitted between the strings.) By default, only 'NA' is interpreted as missing.
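
In other words, the argument is a plain comma-separated list with no trimming of whitespace. A minimal Python sketch of that interpretation follows; parse_missing_codes is a hypothetical helper, and passing None stands for the flag being absent.

```python
from typing import Optional, Set


def parse_missing_codes(arg: Optional[str] = None) -> Set[str]:
    """Return the set of strings treated as missing phenotype values."""
    if arg is None:
        return {"NA"}  # documented default when the flag is not used
    # No spaces are permitted between values, so a plain split is enough.
    return set(arg.split(","))


codes = parse_missing_codes("-9,0,NA,na")
print(sorted(codes))
```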

Oxford-format phased reference panel

--haps [filename] <ref-first | ref-last>
--legend [filename] [chromosome code]

--haps causes the named Oxford-format phased haplotype file to be imported. If (and only if) the --haps file does not contain header columns, it is also necessary to use --legend to specify the chromosome code and the .legend file with the rest of the variant information.

When this isn't used with --sample, the new sample IDs are of the form 'per#', starting with 'per0'.

PLINK 1 dosage

--import-dosage [allele dosage file] <noheader> <id-delim=[char]> <skip0=[i]> <skip1=[j]> <skip2=[k]> <dose1> <format=[mode]> <ref-first | ref-last> <single-chr=[code]> <chr-col-num=[col #]> <pos-col-num=[col #]>

--import-dosage causes the named PLINK 1 dosage file to be imported.

  • You must also specify a sample information file with --fam/--psam.
  • By default, PLINK assumes that the file contains a header line, which has 'SNP' in (1-based) column i+1, 'A1' in column i+j+2, 'A2' in column i+j+3, and sample FID/IIDs starting from column i+j+k+4. (i/j/k are normally zero, but can be changed with 'skip0', 'skip1', and 'skip2' respectively. FID/IID are normally assumed to be separate tokens, but if they're merged into a single token you can specify the delimiter with 'id-delim='.) If such a header line is not present, use the 'noheader' modifier; samples will then be assumed to appear in the same order as they do in the .psam/.fam file.
  • You may specify a companion .map file. If you do not,
    • 'single-chr=' can be used to specify that all variants are on the named chromosome. Otherwise, you can use 'chr-col-num=' to read chromosome codes from the given (1-based) column number.
    • 'pos-col-num=' causes base-pair coordinates to be read from the given column number.
  • The 'format=' modifier lets you specify the number of values used to represent each dosage. 'format=1' normally indicates a single 0..2 allele 1 expected count; 'dose1' modifies this to a 0..1 frequency. 'format=2' indicates a 0..1 homozygous A1 likelihood followed by a 0..1 het likelihood. 'format=3' indicates 0..1 hom A1, 0..1 het, 0..1 hom A2. 'format=infer' (the default) infers the format from the number of columns in the first nonheader line.
    Note that, for 'format=3', the third value in each triplet is not actually parsed by PLINK; as a result, '0 0 0' is interpreted as homozygous A2 rather than a missing call.
  • By default, the A2 allele is treated as the provisional reference; 'ref-first' and 'ref-last' have the usual effect.
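
The column arithmetic and the 'format=' rules above can be written out as a short Python sketch. It is an illustration under the stated rules rather than PLINK's loader; header_columns and a1_dosage are hypothetical names, and 'format=infer' is not covered.

```python
from typing import Dict, Sequence


def header_columns(skip0: int = 0, skip1: int = 0, skip2: int = 0) -> Dict[str, int]:
    """1-based header column numbers implied by skip0/skip1/skip2 (i/j/k)."""
    i, j, k = skip0, skip1, skip2
    return {"SNP": i + 1, "A1": i + j + 2, "A2": i + j + 3, "first_sample": i + j + k + 4}


def a1_dosage(values: Sequence[float], fmt: int, dose1: bool = False) -> float:
    """Expected A1 allele count (0..2) for one sample's dosage entry."""
    if len(values) != fmt:
        raise ValueError("expected %d value(s) per sample" % fmt)
    if fmt == 1:
        # 'dose1' means the single value is a 0..1 frequency, not a 0..2 count.
        return 2.0 * values[0] if dose1 else float(values[0])
    if fmt in (2, 3):
        hom_a1, het = values[0], values[1]
        # With format=3 the third value (hom A2) is never consulted.
        return 2.0 * hom_a1 + het
    raise ValueError("format must be 1, 2, or 3")


print(a1_dosage([0, 0, 0], 3))  # prints 0.0
```

Since the third value is ignored, '0 0 0' yields an A1 dosage of zero, which is why it reads as homozygous A2 rather than as a missing call.
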

Other formats

--file [prefix]

--ped [filename]
--map [filename]

--tfile [prefix]

--tped [filename]
--tfam [filename]

--lfile [prefix]

--lgen [filename]
--reference [filename]
--allele-count

--23file [filename] {family ID} {individual ID} {sex} {phenotype} {paternal ID} {maternal ID}

These import functions are not yet implemented in PLINK 2.0. For now, use PLINK 1.9 to convert them to PLINK 1 binary.

Sample ID conversion

--double-id
--const-fid {FID}
--id-delim {delimiter} <sid>

VCF and .bgen files just contain sample IDs, instead of the distinct family and individual IDs tracked by PLINK. We offer three ways to convert these IDs:

  • --double-id causes both family and individual IDs to be set to the sample ID.
  • --const-fid converts sample IDs to individual IDs while setting all family IDs to a single value (default '0').
  • --id-delim normally causes single-delimiter sample IDs to be parsed as [FID][delimiter][IID], and double-delimiter IDs as [FID][delim][IID][delim][SID]; the default delimiter is '_'. With the 'sid' modifier, single-delimiter IDs are parsed as [IID][delimiter][SID] instead.
    --id-delim can no longer be used with --double-id/--const-fid; it will error out if any ID lacks the delimiter.

If none of these three flags is present, the loader defaults to "--const-fid 0"; this is a change from PLINK 1.9.
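
To make the three conversion modes concrete, here is a Python sketch that maps one sample ID to an (FID, IID, SID) triple. It is an illustration only: convert_sample_id is a hypothetical name, SID is None when absent, and the FID of '0' used in the 'sid' single-delimiter case is an assumption, as is rejecting IDs with more than two delimiters.

```python
from typing import Optional, Tuple


def convert_sample_id(
    sample_id: str,
    double_id: bool = False,
    const_fid: Optional[str] = None,
    id_delim: Optional[str] = None,
    sid: bool = False,
) -> Tuple[str, str, Optional[str]]:
    """Return (FID, IID, SID) for a VCF/.bgen sample ID."""
    if id_delim is not None:
        if double_id or const_fid is not None:
            raise ValueError("--id-delim cannot be combined with --double-id/--const-fid")
        parts = sample_id.split(id_delim)
        if len(parts) == 2:
            if sid:
                return ("0", parts[0], parts[1])  # FID '0' is an assumption
            return (parts[0], parts[1], None)
        if len(parts) == 3:
            return (parts[0], parts[1], parts[2])
        raise ValueError("sample ID lacks a usable delimiter: " + sample_id)
    if double_id:
        return (sample_id, sample_id, None)
    # Neither flag given: behave like "--const-fid 0".
    return (const_fid if const_fid is not None else "0", sample_id, None)
```

For instance, with the default '_' delimiter, 'FAM1_NA12878' splits into FID 'FAM1' and IID 'NA12878', while the same ID with no flags at all keeps the whole string as the IID and gets FID '0'.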

--idspace-to [character]

Since PLINK sample IDs cannot contain spaces, an error is normally reported when there's a space in a VCF/.bgen sample ID. To work around this, you can use --idspace-to to convert all spaces in sample IDs to another character. This happens before regular parsing, so when the --idspace-to and --id-delim characters are identical, both space and the original --id-delim character are interpreted as FID/IID/SID delimiters.

If you only want the space character to function as a delimiter, use "--id-delim ' '".

Dosage import settings

--hard-call-threshold [max distance from nearest hardcall]

--dosage-erase-threshold [max distance from nearest hardcall]

The PLINK 2 binary file format supports allelic dosages, with ~4 decimal place precision. However, some of PLINK 2's commands do not make use of dosage data. Thus, when importing dosage data, PLINK 2 also saves (possibly missing) hardcalls for those commands to use.

By default, a proper hardcall is saved when the distance from the nearest hardcall, defined as $0.5 \sum_i |x_i - \mathrm{round}(x_i)|$ (where the $x_i$ are 0..2 allele dosages), is not greater than 0.1; i.e. for biallelic variants, the alternate allele dosage needs to be in [0, 0.1], [0.9, 1.1], or [1.9, 2.0]. Otherwise, a missing hardcall is saved. You can change the acceptable distance with --hard-call-threshold.

In some cases, e.g. when your reference allele dosage is 1.999, you may be willing to throw away the raw dosage value and only save a hardcall. --dosage-erase-threshold erases all dosage values which are no further than the specified distance from the nearest hardcall.

--hard-call-threshold and --dosage-erase-threshold can also be used with --make-{b}pgen and --make-bed; it is not necessary to perform a full-blown import to adjust the dosage-to-hardcall mapping.
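
The interaction of the two thresholds can be sketched in Python for the biallelic case. This is a simplified illustration, not PLINK's code: the function names are hypothetical, the default erase threshold of 0 is an assumption, and the small eps tolerance is only there to absorb floating-point rounding at the interval boundaries.

```python
import math
from typing import Optional, Sequence, Tuple


def hardcall_distance(dosages: Sequence[float]) -> float:
    """0.5 * sum over alleles of |x - round(x)|, for 0..2 allele dosages."""
    return 0.5 * sum(abs(x - math.floor(x + 0.5)) for x in dosages)


def import_biallelic_dosage(
    alt_dosage: float,
    hard_call_threshold: float = 0.1,
    dosage_erase_threshold: float = 0.0,  # assumed default for this sketch
    eps: float = 1e-9,
) -> Tuple[Optional[int], Optional[float]]:
    """Return (hardcall, stored dosage); None means missing or erased."""
    dist = hardcall_distance([2.0 - alt_dosage, alt_dosage])
    hardcall = None
    if dist <= hard_call_threshold + eps:
        hardcall = int(math.floor(alt_dosage + 0.5))  # ALT allele count
    # Dosages close enough to the hardcall are dropped entirely.
    stored = None if dist <= dosage_erase_threshold + eps else alt_dosage
    return hardcall, stored


print(import_biallelic_dosage(1.08))
print(import_biallelic_dosage(0.3))
```

The first call keeps both a heterozygous hardcall and the dosage; the second is too far from any hardcall, so only the dosage survives. A reference allele dosage of 1.999 corresponds to alt_dosage=0.001, which an erase threshold of 0.01 would reduce to a bare hardcall.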

--import-dosage-certainty [min certainty]

Some dosage formats include separate probabilities for every possible genotype, e.g. {P(0/0)=0.2, P(0/1)=0.52, P(1/1)=0.28} or {P(0/0)=0.005, P(0/1)=0.91, P(1/1)=0.085}. By default, PLINK 2 treats these two sets of probabilities in the same manner: in both cases, the expected alternate allele dosage is 1.08, which is in [0.9, 1.1] so a heterozygous hardcall is saved.

However, the first call is far less certain than the second; you'll frequently prefer to not save a dosage at all when the highest genotype probability is only 0.52, whereas 0.91 is a different story. During import, you can use --import-dosage-certainty to make this distinction.
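
The arithmetic behind this example is easy to check in Python. The sketch below collapses one biallelic genotype-probability triplet to an expected alternate allele dosage, and optionally applies a certainty cutoff. It is illustrative only: gp_to_alt_dosage is a hypothetical name, and treating the cutoff as a strict "highest probability below the threshold means missing" test is an assumption about the exact comparison.

```python
from typing import Optional


def gp_to_alt_dosage(
    p_hom_ref: float,
    p_het: float,
    p_hom_alt: float,
    min_certainty: Optional[float] = None,
) -> Optional[float]:
    """Expected ALT allele count, or None if the call is too uncertain."""
    if min_certainty is not None and max(p_hom_ref, p_het, p_hom_alt) < min_certainty:
        return None  # no dosage saved for this call
    # A heterozygote carries one ALT allele, a homozygous-ALT carries two.
    return p_het + 2.0 * p_hom_alt


for probs in [(0.2, 0.52, 0.28), (0.005, 0.91, 0.085)]:
    print(probs, gp_to_alt_dosage(*probs), gp_to_alt_dosage(*probs, min_certainty=0.8))
```

Both triplets give a dosage of about 1.08 without a cutoff; with a cutoff of 0.8, only the second one is kept.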

Randomized data

--dummy [sample count] [SNP count] {missing geno/dosage freq} {missing pheno freq} <acgt | 1234 | 12> <pheno-ct=[count]> <scalar-pheno> <dosage-freq=[rate]>

This tells PLINK to generate a simple dataset from scratch (useful for basic software testing), with the specified number of samples and SNPs. All generated samples are females with random genotype and phenotype values, and all SNPs are on chromosome 1 with positions 0, 1, 2, etc.

  • By default, the missing genotype and phenotype frequencies are zero. These can be changed by providing 3rd and 4th numeric parameters.
  • By default, allele codes are As and Bs. The 'acgt' modifier causes alleles to be randomly selected from {A, C, G, T}, while '1234' causes them to be selected from {1, 2, 3, 4}, and '12' makes them 1s and 2s.
  • By default, one binary phenotype is generated. 'pheno-ct=' can be used to change the number of phenotypes, and 'scalar-pheno' causes these phenotypes to be N(0, 1) scalars.
  • By default, only hardcall genotypes are present. To generate some decimal dosages, use 'dosage-freq='. (These dosages are affected by --hard-call-threshold and --dosage-erase-threshold.)

Genotype encoding

--input-missing-genotype [char]

'.' is always interpreted as a missing genotype code in input files. By default, '0' also is; you can change this second missing code with --input-missing-genotype.

Nonstandard chromosome IDs

--allow-extra-chr <0>
  (alias: --aec)

Normally, PLINK reports an error if the input data contains unrecognized chromosome codes (such as hg19 haplotype chromosomes or unplaced contigs). To permit them and use the chromosome names when generating reports/new datasets, use --allow-extra-chr with no modifier. The affected variants will still be ignored by most analyses which skip unplaced regions.

The '0' modifier causes these chromosome codes to be treated as if they had been set to zero. (This is sometimes necessary to produce reports readable by older software.)

--chr-set [autosome ct] <no-x> <no-y> <no-xy> <no-mt>

--cow
--dog
--horse
--mouse
--rice
--sheep
--autosome-num [value]

--chr-set changes the chromosome set. The first parameter specifies the number of diploid autosome pairs if positive, or haploid chromosomes if negative. (Polyploid and aneuploid data are not supported, and there is currently no special handling of sex or mitochondrial chromosomes in all-haploid chromosome sets.)

Given diploid autosomes, the remaining modifiers let you indicate the absence of specific non-autosomal chromosomes, as an extra sanity check on the input data. Note that, when there are n autosome pairs, the X chromosome is assigned numeric code n+1, Y is n+2, XY (pseudo-autosomal region of X) is n+3, and MT (mitochondria) is n+4.

n is currently limited to 95, so if you're working with adder's-tongue fern genomes, you're out of luck [3].

The other flags support PLINK 1.07 and GCTA semantics:

  • --cow = --chr-set 29 no-xy
  • --dog = --chr-set 38
  • --horse = --chr-set 31 no-xy no-mt
  • --mouse = --chr-set 19 no-xy no-mt
  • --rice = --chr-set -12 [4]
  • --sheep = --chr-set 26 no-xy no-mt
  • --autosome-num [value] = --chr-set [value] no-y no-xy no-mt

3: Just kidding. Contact us, and we'll send you a build supporting a higher autosome limit.
4: Rice genomes are actually diploid, but breeding programs frequently work with doubled haploids.
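
The numeric code assignment and the species shortcuts can be summarized in a short Python sketch. It only restates the rules above and is not PLINK's code; special_chromosome_codes and PRESETS are hypothetical names, rejecting a zero autosome count is an assumption, and 22 in the example is the usual human autosome count.

```python
from typing import Dict

# Species shortcuts, written as the equivalent --chr-set arguments.
PRESETS = {
    "cow": dict(autosome_ct=29, no_xy=True),
    "dog": dict(autosome_ct=38),
    "horse": dict(autosome_ct=31, no_xy=True, no_mt=True),
    "mouse": dict(autosome_ct=19, no_xy=True, no_mt=True),
    "rice": dict(autosome_ct=-12),
    "sheep": dict(autosome_ct=26, no_xy=True, no_mt=True),
}


def special_chromosome_codes(
    autosome_ct: int,
    no_x: bool = False,
    no_y: bool = False,
    no_xy: bool = False,
    no_mt: bool = False,
) -> Dict[str, int]:
    """Numeric codes of the non-autosomal chromosomes that are present."""
    if autosome_ct == 0 or autosome_ct > 95:
        raise ValueError("autosome count outside the range covered by this sketch")
    if autosome_ct < 0:
        return {}  # all-haploid set: no special sex/MT handling
    n = autosome_ct
    codes = {"X": n + 1, "Y": n + 2, "XY": n + 3, "MT": n + 4}
    absent = {"X": no_x, "Y": no_y, "XY": no_xy, "MT": no_mt}
    return {name: code for name, code in codes.items() if not absent[name]}


print(special_chromosome_codes(22))
print(special_chromosome_codes(**PRESETS["cow"]))
```

Note that marking a chromosome as absent does not renumber the others: with --cow, MT is still code 33 even though XY (32) is excluded.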

--chr-override <file>
--human

PLINK 2 saves nonhuman chromosome set information to .pvar and VCF files (in a ##chrSet header line). Thus, after you've run --dog + --make-pgen once, it is not necessary to include --dog in subsequent commands involving that dataset.

By default, if a chromosome set was explicitly specified on the command line, and it conflicts with an input file ##chrSet header line, PLINK 2 will error out. --chr-override with no parameter causes the command line to take precedence, while '--chr-override file' defers to the file.

--human can now be used to explicitly specify the human chromosome set. This can be useful as an additional sanity check: if you attempt to use --human on a fileset generated with --dog + --make-pgen (or vice versa), PLINK 2 will error out instead of accepting the incorrect chromosome set.
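
The conflict rule can be expressed as a small decision function. The Python sketch below is illustrative only: effective_chr_set is a hypothetical name, chromosome sets are represented as opaque strings, and the override argument uses None for "flag absent", "cmdline" for --chr-override with no parameter, and "file" for '--chr-override file'.

```python
from typing import Optional


def effective_chr_set(
    cmdline: Optional[str],
    file_header: Optional[str],
    override: Optional[str] = None,
) -> Optional[str]:
    """Pick the chromosome set to use, or raise on an unresolved conflict."""
    if cmdline is None:
        return file_header  # nothing on the command line: trust the file
    if file_header is None or cmdline == file_header:
        return cmdline
    if override is None:
        raise ValueError("command line and ##chrSet header disagree")
    if override not in ("cmdline", "file"):
        raise ValueError("override must be None, 'cmdline', or 'file'")
    return cmdline if override == "cmdline" else file_header
```

So '--human' against a fileset made with '--dog' raises an error unless one of the two override forms is given.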

Phenotypes

--pheno [filename]
--pheno-col-nums [1-based column number(s)/range(s)...]

--pheno-name [column ID(s)/range(s)...]
--no-psam-pheno
  (aliases: --no-pheno, --no-fam-pheno)

--pheno causes (additional) phenotype values to be read from the specified space- or tab-delimited file. The first columns of that file must be either FID/IID or just IID (in which case the FID is assumed to be 0). A primary header line is required when using --pheno-name, and optional without it (if it's present, it should begin with 'FID' or '#FID'). Additional header lines (beginning with '#', not immediately followed by 'FID'/'IID') are permitted before the primary header line.

PLINK 2 defaults to analyzing all phenotypes in the sample information and --pheno files (as if PLINK 1.x's --all-pheno flag was always in effect). --pheno-name lets you specify a subset of phenotypes to load from the --pheno file (or if no --pheno file was specified, from the sample information file), by column name; separate multiple column names with spaces or commas, and use dashes to designate ranges. (Spaces are not permitted immediately before or after a range-denoting dash.) --pheno-col-nums lets you do the same thing with --pheno file column numbers instead. --no-psam-pheno tells PLINK 2 to ignore all phenotype data in the sample information file and allows .fam files with no phenotype column to be loaded.

When no primary header line is present, phenotypes are assigned the names 'PHENO1', 'PHENO2', etc.

Phenotype encoding

--input-missing-phenotype [integer]

Missing case/control or quantitative phenotypes are expected to be encoded as 'NA'/'nan' (any capitalization) or -9. (Other non-numeric strings are now interpreted as categorical phenotype/covariate values.) You can change the numeric missing phenotype code to another integer with --input-missing-phenotype.

--1

Case/control phenotypes are expected to be encoded as 1=unaffected (control), 2=affected (case); 0 is accepted as an alternate missing value encoding. If you use the --1 flag, 0 is interpreted as unaffected status instead, while 1 maps to affected.

(Unlike PLINK 1.x, this does not force all phenotypes to be interpreted as case/control.)
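
For a column that is known to be case/control, the encoding rules above amount to the following Python sketch. It is an illustration, not PLINK's parser: decode_case_control is a hypothetical name, it returns True for a case, False for a control and None for missing, and rejecting any other value is a simplification (PLINK has its own logic for deciding whether a column is case/control in the first place).

```python
from typing import Optional


def decode_case_control(
    token: str,
    one_flag: bool = False,   # True when --1 is in effect
    missing_code: int = -9,   # changed by --input-missing-phenotype
) -> Optional[bool]:
    """Decode one case/control phenotype token."""
    if token.lower() in ("na", "nan"):
        return None
    value = int(token)
    if value == missing_code:
        return None
    if one_flag:
        mapping = {0: False, 1: True}
    else:
        if value == 0:
            return None  # 0 is an alternate missing code without --1
        mapping = {1: False, 2: True}
    if value not in mapping:
        raise ValueError("not a case/control value: " + token)
    return mapping[value]
```

The only token whose meaning flips between missing and non-missing is '0': it is missing by default, and a control under --1.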

Covariates

--covar [filename]
--covar-col-nums [1-based column number(s)/range(s)...]

--covar-name [column ID(s)/range(s)...]

--covar designates the file to load covariates from. The file format is the same as for --pheno. The main phenotype is no longer set to missing when a covariate value is missing; instead, this only happens to the temporary phenotype copies used by e.g. the linear/logistic regression routine.

Categorical covariates are now directly supported. Any nonnumeric string ('NA' and 'nan' are considered to be numbers for this purpose) is treated as a categorical covariate name.

--covar-name works like --pheno-name. It can now be used without --covar (in which case the --pheno or .psam file is the target).

--covar-col-nums works like --pheno-col-nums, and refers to the --pheno file when --covar is not specified.

When no primary header line is present, covariates are assigned the names 'COVAR1', 'COVAR2', etc.

'Cluster' import

--within [filename] {new phenotype name}
--mwithin [n]
--family {new phenotype name}

--within constructs a PLINK 2 categorical phenotype out of a PLINK 1.x 'cluster' file.

  • The first two columns of this file must be FID/IID. By default, cluster names are loaded from column 3, but you can use --mwithin to change this to column (n+2).
  • If no phenotype name is given, it defaults to 'CATPHENO'.
  • If any category names are numeric, all category names must be numeric. In that case, 'C' is added in front of all category names.
  • 'NA' is treated as a missing value.
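
The rules in this list can be sketched as a small Python reader for cluster-file lines. It is an illustration rather than PLINK's loader: load_cluster_categories is a hypothetical name, missing values are returned as None, and using float() to decide whether a name is numeric is an assumption.

```python
from typing import Dict, Iterable, Optional, Tuple


def _is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def load_cluster_categories(
    lines: Iterable[str], mwithin: int = 1
) -> Dict[Tuple[str, str], Optional[str]]:
    """Map (FID, IID) to a category name, reading column (mwithin + 2)."""
    raw: Dict[Tuple[str, str], Optional[str]] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        name = fields[mwithin + 1]  # 1-based column n+2 is 0-based index n+1
        raw[(fields[0], fields[1])] = None if name == "NA" else name
    names = [v for v in raw.values() if v is not None]
    numeric = [_is_number(v) for v in names]
    if any(numeric):
        if not all(numeric):
            raise ValueError("numeric and non-numeric category names are mixed")
        # All-numeric names get a 'C' prefix.
        raw = {k: (None if v is None else "C" + v) for k, v in raw.items()}
    return raw


print(load_cluster_categories(["F1 I1 1", "F2 I2 2", "F3 I3 NA"]))
```

With numeric cluster names 1 and 2, the resulting categories are 'C1' and 'C2', and the 'NA' entry stays missing.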

--family causes FID to be treated as a categorical phenotype.

--missing-catname [string]
--family-missing-catname [string]

By default, missing categorical phenotype/covariate values are encoded as 'NONE'; you can change this with --missing-catname. (This value is case-sensitive.)

--family-missing-catname can be used to specify another FID for --family to treat as missing.

Reference genome

--fa [FASTA file]

Most PLINK operations don't require the full reference genome sequence behind the coordinates in the .pvar/.bim file. However, there are a few exceptions; use --fa to specify a FASTA-formatted reference genome file when it's needed. This must match the FASTA used for alignment (unless the coordinates have since been lifted over to another reference genome build, anyway). If you're unsure what that is, this blog post by Heng Li lists common choices (and provides suggestions).
