D: 28 Oct 2018

Input filtering

The following flags allow you to exclude samples and/or variants from an analysis batch based on a variety of criteria.

Some of these criteria are based on statistics such as estimated MAF that may vary through multiple filtering passes. If variation is problematic, use --freq/--geno-counts to export initial statistics, and then include --read-freq in all filtering passes where you want to refer back to the initial statistics.

ID lists

--keep [filename(s)...]
--remove [filename(s)...]

--keep-fam [filename(s)...]
--remove-fam [filename(s)...]

--keep accepts one or more space/tab-delimited text files with sample IDs, and removes all unlisted samples from the current analysis; --remove instead removes all listed samples. Similarly, --keep-fam and --remove-fam accept text files with family IDs in the first column, and keep or remove entire families.

--keep/--remove now support a wider variety of sample ID file formats:

  • If the first line starts with '#FID' or '#IID', it will be treated as a header line. As long as the first columns are "#FID IID", "#FID IID SID", "#IID", or "#IID SID", PLINK 2 will do the right thing.
  • If there is no header line, one-column lines are treated as IIDs (with FID assumed to be '0', playing well with "--const-fid 0"), and multicolumn lines are treated the same way as in PLINK 1.x (first two columns assumed to be FID/IID).
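The two rules above can be sketched in Python. This is only an illustration of the documented behavior, not PLINK 2's actual parser: the function name parse_sample_ids is hypothetical, SID columns are ignored, and using FID '0' when the header is "#IID" is an assumption carried over from the headerless one-column case.

```python
def parse_sample_ids(lines):
    """Return a list of (FID, IID) pairs from the lines of a sample ID file."""
    rows = [ln.split() for ln in lines if ln.strip()]
    if not rows:
        return []
    fid_col = None
    iid_col = None
    first = rows[0][0]
    if first == "#FID":            # header "#FID IID" or "#FID IID SID"
        fid_col, iid_col = 0, 1
        rows = rows[1:]
    elif first == "#IID":          # header "#IID" or "#IID SID"
        iid_col = 0
        rows = rows[1:]
    ids = []
    for fields in rows:
        if iid_col is not None:    # header line was present
            fid = fields[fid_col] if fid_col is not None else "0"  # assumption
            ids.append((fid, fields[iid_col]))
        elif len(fields) == 1:     # headerless, one column: IID only, FID '0'
            ids.append(("0", fields[0]))
        else:                      # headerless, PLINK 1.x style: FID then IID
            ids.append((fields[0], fields[1]))
    return ids
```

For instance, parse_sample_ids(["s1", "s2"]) yields [("0", "s1"), ("0", "s2")], which is the pairing "--const-fid 0" would produce.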

--extract <ibed0 | ibed1> [filename(s)...]
--exclude <ibed0 | ibed1> [filename(s)...]

--extract normally accepts one or more text file(s) with variant IDs (usually one per line, but it's okay for them to just be separated by spaces), and removes all unlisted variants from the current analysis. With the 'ibed0' or 'ibed1' modifier, the input file should be in 0-based or 1-based interval-BED format instead. For backward compatibility, 'range' is an alias for 'ibed1'.

--exclude accepts the same kinds of input, but removes all listed variants instead.

QUAL, FILTER, INFO

--var-min-qual [value]

--var-min-qual causes all variants with QUAL value smaller than the given number, or with no QUAL value at all, to be skipped.

--var-filter {exception(s)...}

To skip variants which failed one or more filters tracked by the FILTER field, use --var-filter. This can be combined with one or more (space-delimited) filter names to ignore.

##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##contig=<ID=1,length=249250621>
#CHROM POS   ID          REF ALT QUAL FILTER
1      10583 rs58108140  G   A   25   PASS
1      10611 rs189107123 C   G   11   q10
1      13302 rs180734498 C   T   32   s50
1      13327 rs144762171 G   C   30   .
1      13957 rs201747181 TC  T   3    q10;s50

For example, given the .pvar file above:

  • --var-filter with no arguments would keep only rs58108140 and rs144762171;
  • '--var-filter q10' would keep rs58108140, rs189107123, and rs144762171 (PLINK matches against the 'q10' string, instead of checking the QUAL value here; use --var-min-qual to do the latter);
  • '--var-filter LowQual s50' would keep rs58108140, rs180734498, and rs144762171; and
  • '--var-filter q10 s50' would keep all five variants.
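The matching rule can be reproduced with a short Python sketch over the FILTER column of the example .pvar file. The function name var_filter is hypothetical, and treating both 'PASS' and '.' as "no failed filters" is inferred from the first bullet above rather than taken from PLINK 2's source.

```python
# (variant ID, FILTER) pairs copied from the example .pvar file.
VARIANTS = [
    ("rs58108140", "PASS"),
    ("rs189107123", "q10"),
    ("rs180734498", "s50"),
    ("rs144762171", "."),
    ("rs201747181", "q10;s50"),
]

def var_filter(variants, exceptions=()):
    """Keep variants whose failed filters are all in the ignore list."""
    ignored = set(exceptions)
    kept = []
    for var_id, filt in variants:
        if filt in ("PASS", "."):
            kept.append(var_id)          # nothing failed, or nothing recorded
            continue
        failed = set(filt.split(";")) - ignored
        if not failed:                   # every failed filter was ignored
            kept.append(var_id)
    return kept

print(var_filter(VARIANTS))                      # no exceptions
print(var_filter(VARIANTS, ["q10"]))
print(var_filter(VARIANTS, ["LowQual", "s50"]))
print(var_filter(VARIANTS, ["q10", "s50"]))
```

Note that rs201747181 is only kept when both 'q10' and 's50' are ignored, since it failed both filters.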

--extract-if-info [key] [operator] [value]
  (alias: --extract-if)
--exclude-if-info [key] [operator] [value]
  (alias: --exclude-if)
--require-info [key(s)...]
--require-no-info [key(s)...]

--extract-if-info removes all variants which don't satisfy a comparison predicate on an INFO key. For numbers, the supported operators are '!=', '<', '<=', '==' (single '=' also ok), '>=', and '>'; for strings, only '!=' and '='/'==' can be used. If the key or value is missing, the predicate evaluates to true iff the operator is '!='.

Note that the '<' and '>' characters have special meanings in practically all shells; it is necessary to wrap them in quoted expressions. In bash, you can either quote the special characters individually, e.g.

--extract-if-info AFR_AF '>'= 0.05

or put the entire predicate in double-quotes:

--extract-if-info "AFR_AF >= 0.05"

Similarly, --exclude-if-info removes all variants which do satisfy such a comparison predicate.
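As a rough illustration of how such a predicate behaves, the following Python sketch evaluates one comparison against a raw INFO string. The names parse_info and info_predicate are hypothetical, and deciding "number vs. string" by attempting a float conversion is an assumption, not a description of PLINK 2's parser.

```python
import operator

OPS = {"!=": operator.ne, "<": operator.lt, "<=": operator.le,
       "==": operator.eq, "=": operator.eq, ">=": operator.ge, ">": operator.gt}

def parse_info(info_field):
    """Split a VCF-style INFO string into a key -> value dict."""
    info = {}
    if info_field == ".":
        return info
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else None   # flag-style keys carry no value
    return info

def info_predicate(info_field, key, op, value):
    """True iff the comparison holds; missing key/value is true only for '!='."""
    observed = parse_info(info_field).get(key)
    if observed is None or observed == ".":
        return op == "!="
    try:
        lhs, rhs = float(observed), float(value)   # assumed numeric detection
    except ValueError:
        if op not in ("!=", "=", "=="):
            raise ValueError("strings only support '!=' and '='/'=='")
        lhs, rhs = observed, value
    return OPS[op](lhs, rhs)
```

Under this sketch, --extract-if-info keeps a variant exactly when info_predicate returns True, and --exclude-if-info keeps it exactly when the function returns False.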

--require-info removes all variants which don't have all of the listed INFO keys. (A key is treated as if it isn't present when the associated value is '.'.) Similarly, --require-no-info removes all variants which have any of the listed keys.

Chromosomes

--chr [number(s)/range(s)...]
--not-chr [number(s)/range(s)...]

--chr excludes all variants not on the listed chromosome(s). Normally, valid choices for humans are 0 (i.e. unknown), 1-22, X, Y, MT, PAR1/PAR2 (pseudo-autosomal regions of X; see --split-par/--merge-par), and XY (deprecated PLINK 1.x code intended to refer to the pseudo-autosomal region). Separate multiple chromosomes with spaces or commas, and use dashes to specify ranges. Spaces are not permitted immediately before or after a range-denoting dash.

For example, the following are all valid and equivalent:

--chr 1-4, 22, xy
--chr 1-4 22 XY
--chr 1,2,3,4,22,25

You might wonder about the '25'. Several non-autosomal chromosomes can also be identified by numeric code: if there are n autosomes, n+1 is the X chromosome, n+2 is Y, n+3 is XY, and n+4 is MT. (However, no numeric codes are associated with PAR1/PAR2.)
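The numbering rule is easy to check with a few lines of Python. The function name chromosome_code is hypothetical, and the default of 22 autosomes simply reflects the human case used in the examples above.

```python
def chromosome_code(name, autosome_ct=22):
    """Map a chromosome name to its numeric code, or None if it has none."""
    offsets = {"X": 1, "Y": 2, "XY": 3, "MT": 4}
    upper = name.upper()
    if upper in offsets:
        return autosome_ct + offsets[upper]
    if upper in ("PAR1", "PAR2"):
        return None                 # no numeric codes exist for PAR1/PAR2
    return int(upper)               # autosomes (and 0) are already numeric

print(chromosome_code("xy"))        # 25 for human data
```

With a different autosome count the same offsets apply, e.g. chromosome_code("X", autosome_ct=19) gives 20.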

--not-chr is the reverse of --chr: variants on listed chromosome(s) are excluded. So

--not-chr 0 5-21 x y mt par1 par2

is equivalent to the three --chr examples above (assuming human data). (Yes, if your data uses PAR1/PAR2 codes, "--chr xy" will not cause them to be included. If this is problematic, see --autosome-par below.)

If you specified --allow-extra-chr, you can refer to the extra chromosome codes by name. For example,

--allow-extra-chr --not-chr chr1_gl000191_random

--autosome

--autosome-par

--autosome excludes all unplaced and non-autosomal variants, while --autosome-par does not exclude XY/PAR1/PAR2. They can be combined with --not-chr, e.g.

--autosome-par --not-chr 5-21 par1 par2

is also equivalent to the three --chr examples.

Keep only SNPs

--snps-only <just-acgt>

--snps-only excludes all variants with one or more multi-character allele codes. With 'just-acgt', variants with single-character allele codes outside of {'A', 'C', 'G', 'T', 'a', 'c', 'g', 't', [missing code]} are also excluded.

Simple variant window

--from [variant ID]
--to [variant ID]

--from excludes all variants on different chromosomes than the named variant, as well as those with smaller base-pair position values. --to is similar, excluding variants with larger position values instead. If they are used together but the --from variant is after the --to variant, they are automatically swapped.

--snp [variant ID]
--window [total window size, in kb]
--exclude-snp [variant ID]

--snp specifies a single variant to load by name. If it's combined with --window, all variants with physical position no more than half the specified kb distance (decimal permitted) from the named variant are loaded as well.

Similarly, --exclude-snp specifies a single variant to exclude; this can also be combined with --window.
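A minimal Python sketch of the window rule follows. The function name snp_window and the (ID, chromosome, position) tuple layout are hypothetical; the sketch also assumes 1 kb = 1000 bp and that only variants on the named variant's chromosome can fall inside the window.

```python
def snp_window(variants, snp_id, window_kb):
    """Return IDs of variants within half of window_kb of the named variant.

    variants: list of (variant_id, chrom, bp_pos) tuples.
    """
    center = next(v for v in variants if v[0] == snp_id)
    half_bp = window_kb * 1000 / 2          # total window size -> half-width in bp
    return [vid for vid, chrom, pos in variants
            if chrom == center[1] and abs(pos - center[2]) <= half_bp]

variants = [("a", "1", 1000), ("b", "1", 3000), ("c", "1", 6001), ("d", "2", 3000)]
print(snp_window(variants, "b", 6))         # ['a', 'b']
```

The same selection, inverted, describes what --exclude-snp with --window removes.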

--from-bp [pos]
--to-bp [pos]
--from-kb [kb pos]
--to-kb [kb pos]
--from-mb [mb pos]
--to-mb [mb pos]

These flags let you use physical positions to specify a variant range to load. Kilobase and megabase values can include decimals. You are required to specify a single chromosome when using these.

Multiple ranges

--snps [variant ID(s)/range(s)...]
--exclude-snps [variant ID(s)/range(s)...]

--snps accepts a collection of individual variant IDs and variant ranges. For example,

--snps rs1111-rs2222, rs3333, rs4444

tells PLINK to load all variants between rs1111 and rs2222 inclusive, as well as rs3333 and rs4444. (Syntax works the same way as --chr. If your variant IDs contain dashes, you'll want to use the --d flag as well.) If rs1111 and rs2222 are on different chromosomes i < j, then all variants on chromosomes numbered between i and j are loaded, as well as the last variants on chromosome i and the first variants on chromosome j. (You can exclude some intermediate chromosomes by combining --snps with --not-chr.)

--exclude-snps excludes all the specified variants/ranges instead.

--force-intersect

To reduce the potential for confusion, PLINK 2 normally errors out when multiple variant-inclusion filters (--extract, --from/--to, --from-bp/--to-bp, --snp, --snps) are specified, since it may not be obvious whether the intersection or union will be taken. --force-intersect allows the run to proceed; the set intersection will be taken.

Deduplicate variants

--rm-dup {mode} <list>

--rm-dup usually removes all but one instance of each duplicate-ID variant (ignoring the missing ID). With the 'list' modifier, the original duplicated IDs are written to plink2.rmdup.list.

The following modes of operation are supported:

  • 'error' (default): Check each group of duplicate-ID variants for equality. (Alleles are considered unequal even if the codes are the same, just in a different order; FILTER/INFO are considered unequal if the strings don't match exactly, even if they're semantically identical.) If any mismatches are found, this errors out, and writes a list of mismatching variant IDs to plink2.rmdup.mismatch.
  • 'retain-mismatch': When unequal duplicate-ID variants are found, keep every member of the group. The .rmdup.mismatch file is still written.
  • 'exclude-mismatch': When unequal duplicate-ID variants are found, exclude every member of the group.
  • 'exclude-all': Exclude all instances of all duplicate-ID variants.
  • 'force-first': Always keep just the first instance of each duplicate-ID variant.

Arbitrary thinning

--thin [p]
--thin-count [n]
--bp-space [bp count]
--thin-indiv [p]
--thin-indiv-count [n]
  (alias: --max-indv)

--thin removes variants at random by retaining each variant with probability p, --thin-count removes variants at random until only n remain, and --bp-space excludes one variant from each pair closer than the given bp count. (Yes, --bp-space is equivalent to VCFtools --thin; we can't do much about this mixup without breaking backward compatibility.) Note that LD-based pruning also has a variant thinning effect, and is normally more useful than these three commands.

Similarly, --thin-indiv removes samples at random by retaining each sample with probability p, while --thin-indiv-count removes samples at random until only n remain.
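The difference between the three variant-thinning strategies can be sketched in Python. All function names here are hypothetical, the random number handling is not PLINK 2's, and the greedy left-to-right pass in bp_space is an assumption about how "one variant from each pair" is resolved.

```python
import random

def thin(items, p, rng):
    """Retain each item independently with probability p (like --thin)."""
    return [x for x in items if rng.random() < p]

def thin_count(items, n, rng):
    """Retain exactly n items chosen at random (like --thin-count)."""
    if n >= len(items):
        return list(items)
    keep = set(rng.sample(range(len(items)), n))
    return [x for i, x in enumerate(items) if i in keep]

def bp_space(positions, min_bp):
    """Greedy sketch of --bp-space for sorted positions on one chromosome."""
    kept = []
    for pos in positions:
        if not kept or pos - kept[-1] >= min_bp:
            kept.append(pos)        # far enough from the last retained variant
    return kept

rng = random.Random(0)              # fixed seed so the sketch is repeatable
print(len(thin_count(list(range(100)), 10, rng)))   # 10
print(bp_space([100, 150, 400, 450, 900], 100))     # [100, 400, 900]
```

The same thin and thin_count functions apply unchanged to a list of samples, which is the --thin-indiv/--thin-indiv-count case.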

Phenotype/covariate-based

--keep-if [phenotype/covariate name] [operator] [value]
--remove-if [phenotype/covariate name] [operator] [value]

--keep-if removes all samples which don't satisfy a comparison predicate on a phenotype or covariate, while --remove-if does the reverse. Syntax and treatment of missing values is the same as for --extract-if-info.

--require-pheno {phenotype name(s)...}
--require-covar {covariate name(s)...}

When parameters are provided, --require-pheno removes samples missing any of the named phenotypes; otherwise, it removes samples missing any loaded phenotype. --require-covar does the same things for covariates.

--keep-cats [filename]
--keep-cat-names [name(s)...]
--remove-cats [filename]
--remove-cat-names [name(s)...]

--keep-cat-pheno [phenotype/covariate name]
--remove-cat-pheno [phenotype/covariate name]

If exactly one categorical phenotype/covariate is loaded, --keep-cats and --keep-cat-names can be used individually or in combination to define a list of categories to keep; all samples not in one of those categories are then removed from the current analysis. --keep-cats accepts a text file with one category name per line, and --keep-cat-names takes a space-delimited sequence of category names on the command line.

If multiple categorical phenotypes/covariates are loaded, use --keep-cat-pheno to specify which variable --keep-cats/--keep-cat-names should apply to. (This is still safe when only one categorical variable is present.)

Similarly, --remove-cats removes all samples in categories named in a file, --remove-cat-names removes all samples in categories named on the command line, and --remove-cat-pheno specifies which variable --remove-cats/--remove-cat-names should apply to.

String match

--keep-fcol [filename] [string(s) to match...]

--keep-fcol-name [column name]

--keep-fcol-num [n]

--keep-fcol accepts a space/tab-delimited text file with sample IDs in the first columns and a string to filter on in a later column. You can specify this column with either --keep-fcol-num or --keep-fcol-name (the latter requires a header line starting with #FID or #IID); with neither, "--keep-fcol-num 3" is assumed. Every sample that is missing from the file, or whose string value doesn't match any of the strings you provided, is removed from the analysis. The string comparison is case-sensitive, and numbers are not parsed, so '9', '9.0', '9e0', and '9E0' all compare unequal.

(This is a minor extension of PLINK 1.x's --filter flag.)
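The exact-string nature of the comparison can be seen in a small Python sketch. The function name keep_fcol is hypothetical, and the sketch assumes the first two columns are FID and IID and that any line starting with '#' is a header.

```python
def keep_fcol(lines, match_strings, col_num=3):
    """Return (FID, IID) pairs whose col_num-th field matches exactly."""
    wanted = set(match_strings)
    kept = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue                        # skip blank lines and a header line
        # Plain string membership: case-sensitive, no numeric parsing.
        if len(fields) >= col_num and fields[col_num - 1] in wanted:
            kept.append((fields[0], fields[1]))
    return kept

lines = ["f1 s1 9", "f1 s2 9.0", "f2 s3 9e0"]
print(keep_fcol(lines, ["9"]))              # [('f1', 's1')]
```

Samples that never appear in the file are absent from the returned list, which corresponds to their removal from the analysis.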

Missing genotype rates

--geno {maximum per-variant}
--mind {maximum per-sample}

--geno filters out all variants with missing call rates exceeding the provided value (default 0.1), while --mind does the same for samples.

If any samples were removed by --mind, their IDs are written to plink2.mindrem.id.
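The thresholding itself is simple; a Python sketch follows. The function names and the use of None for a missing call are hypothetical, and the sketch applies each filter independently without modeling the order in which PLINK 2 runs them.

```python
def missing_rate(calls):
    """Fraction of calls that are missing (None)."""
    return sum(call is None for call in calls) / len(calls)

def geno_keep(variant_calls, max_rate=0.1):
    """IDs of variants whose missing call rate does not exceed max_rate."""
    return [vid for vid, calls in variant_calls.items()
            if missing_rate(calls) <= max_rate]

def mind_split(sample_calls, max_rate=0.1):
    """Split sample IDs into (kept, removed) by per-sample missing rate."""
    kept, removed = [], []
    for sid, calls in sample_calls.items():
        (removed if missing_rate(calls) > max_rate else kept).append(sid)
    return kept, removed

variant_calls = {
    "v1": [0, 1, 2, 0, 1, 2, 0, 1, 2, None],        # 10% missing: kept
    "v2": [None, None, 0, 1, 2, 0, 1, 2, 0, 1],     # 20% missing: removed
}
print(geno_keep(variant_calls))                     # ['v1']
```

The second list returned by mind_split plays the role of the IDs written to plink2.mindrem.id.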

Number of distinct alleles

--min-alleles [count]
--max-alleles [count]

--min-alleles excludes variants with fewer than the given number of alleles, while --max-alleles excludes variants with more. When a variant has exactly one ALT allele and it's a missing-code, these filters treat it as having only one allele.

Allele frequencies/counts

--maf {minimum freq} {mode}
  (alias: --min-af)
--max-maf [maximum freq] {mode}
  (alias: --max-af)
--mac [minimum count] {mode}
  (alias: --min-ac)
--max-mac [maximum count] {mode}
  (alias: --max-ac)

--maf filters out all variants with allele frequency below the provided threshold (default 0.01), while --max-maf imposes an upper bound. Similarly, --mac and --max-mac impose lower and upper allele count bounds, respectively.

By default, these flags operate on 'nonmajor' (i.e. sum of all but the largest value) allele frequencies/counts. Three other modes are supported: 'nref' (nonreference), 'alt1', and 'minor' (smallest). You can use bcftools-style freq:mode notation for this.

Only founders are normally considered by these filters; use --nonfounders to change this.
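To make the four modes concrete, here is a Python sketch that computes each frequency from a list of allele observation counts. The function name allele_freq and the "REF first, then ALT1, ALT2, ..." ordering of the list are assumptions made for the illustration.

```python
def allele_freq(counts, mode="nonmajor"):
    """Frequency used by the filter, given counts ordered REF, ALT1, ALT2, ..."""
    total = sum(counts)
    if mode == "nonmajor":
        value = total - max(counts)     # sum of all but the largest count
    elif mode == "nref":
        value = total - counts[0]       # everything except REF
    elif mode == "alt1":
        value = counts[1]               # first ALT allele only
    elif mode == "minor":
        value = min(counts)             # smallest count
    else:
        raise ValueError("unknown mode: " + mode)
    return value / total

counts = [20, 70, 10]                   # REF, ALT1, ALT2 observations
for mode in ("nonmajor", "nref", "alt1", "minor"):
    print(mode, allele_freq(counts, mode))
```

For these counts the modes give 0.3, 0.8, 0.7 and 0.1 respectively, so the same threshold can keep or drop a variant depending on the mode chosen.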

--maf-succ

--maf-succ causes allele frequencies to be estimated via the 'rule of succession' employed by EIGENSOFT. I.e.,

   qhat := (1 + [# of observations of current allele]) / ([# of distinct alleles] + [# of observations of any allele])

instead of the usual

   qhat := [# of observations of current allele] / [# of observations of any allele].
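The two estimators can be compared directly in Python. The function name qhat mirrors the formulas above; the list-of-counts input (one observation count per distinct allele) is an assumption made for the sketch.

```python
def qhat(counts, allele_idx, succ=False):
    """Allele frequency estimate, optionally via the rule of succession."""
    total = sum(counts)                 # observations of any allele
    if succ:
        # (1 + observations of this allele) / (distinct alleles + all observations)
        return (1 + counts[allele_idx]) / (len(counts) + total)
    return counts[allele_idx] / total

counts = [8, 0]                         # second allele never observed
print(qhat(counts, 1))                  # 0.0
print(qhat(counts, 1, succ=True))       # 0.1
```

The rule-of-succession estimates still sum to 1 across alleles, but an unobserved allele no longer gets a frequency of exactly zero.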

--read-freq [.afreq/.acount/.gcount/.freq/.frq/.frq.count/.frqx filename]

--read-freq loads allele frequency estimates from a --freq (PLINK 1.x ok), --geno-counts, or PLINK 1.9 --freqx report, instead of imputing them from the immediate dataset. It can be combined with --maf-succ if the file contains observation counts.

When a minor allele code is missing from the main dataset but present in the --read-freq file, it is not filled in by PLINK 2.

Hardy-Weinberg equilibrium tests

--hwe [p-value] <midp> <keep-fewhet>

--hwe filters out all variants which have Hardy-Weinberg equilibrium exact test p-value below the provided threshold. We recommend setting a low threshold—serious genotyping errors often yield extreme p-values like 1e-50 which are detected by any reasonable configuration of this test, while genuine SNP-trait associations can be expected to deviate slightly from Hardy-Weinberg equilibrium (so it's dangerous to choose a threshold that filters out too many variants).

On chrX, p-values are now computed using the method described in Graffelman J, Weir BS (2016) Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome.

--hwe's 'midp' modifier applies the mid-p adjustment described in Graffelman J, Moreno V (2013) The mid p-value in exact tests for Hardy-Weinberg equilibrium. The mid-p adjustment tends to bring the null rejection rate in line with the nominal p-value, and also reduces the filter's tendency to favor retention of variants with missing data. We recommend its use.

Because of the missing data issue, you should not apply a single p-value threshold across a batch of variants with highly variable missing call rates. A warning is given whenever observation counts vary by more than 10%.

For multiallelic variants, a separate biallelic test is performed for every allele, and the variant is filtered out iff any of the tests yields a (mid-)p-value below the threshold.

Only founders are considered by this test; use --nonfounders to change this.

When significant population stratification is present, this test can be expected to fail in the too-few-hets direction on some normal variants. When using --hwe for quality control, you probably want to keep these variants; use the 'keep-fewhet' modifier to do this. (I.e. the test is made one-sided, with a threshold equal to half the given p-value.) On chrX, the ratio between the Graffelman/Weir p-value and the female-only p-value is considered here.

There is currently no special handling of case/control phenotypes;

--keep-if [phenotype name] == control

is frequently a good idea when using --hwe in a genome-wide association analysis (and matches PLINK 1.x's behavior).

Imputation quality

--mach-r2-filter {min} {max}

--mach-r2-filter excludes variants where the MaCH Rsq imputation quality metric (frequently labeled as 'INFO') is outside [0.1, 2.0]; change the bounds by providing parameters. Monomorphic variants, where Rsq == nan, are not excluded by this filter: the problem with them isn't imputation quality.
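A short Python sketch of the bounds check, including the nan case, follows. The function name mach_r2_keep is hypothetical, and treating both bounds as inclusive is read off the closed interval notation above.

```python
import math

def mach_r2_keep(rsq, lo=0.1, hi=2.0):
    """True if a variant with this MaCH Rsq value passes the filter."""
    if math.isnan(rsq):
        return True                 # monomorphic variants are not excluded
    return lo <= rsq <= hi          # bounds assumed inclusive

print(mach_r2_keep(0.05))           # False
print(mach_r2_keep(float("nan")))   # True
```

Passing other lo/hi values corresponds to providing the optional parameters on the command line.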

Sex

--keep-females
--keep-males
--keep-nosex
--remove-females
--remove-males
--remove-nosex

--keep-females excludes all male and unknown-sex samples, --keep-males excludes females and unknown-sex samples, and --keep-nosex excludes all known-sex samples. Conversely, --remove-females only excludes known females, --remove-males only excludes known males, and --remove-nosex only excludes unknown-sex samples.
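The asymmetry between the --keep-* and --remove-* flags is easiest to see as a lookup table. In this Python sketch the single-letter sex codes 'F', 'M' and 'U' (unknown) and the function name apply_sex_filter are hypothetical conventions chosen for the illustration.

```python
# For each flag, the set of sex codes that SURVIVE the filter.
SEX_FILTERS = {
    "--keep-females": {"F"},
    "--keep-males": {"M"},
    "--keep-nosex": {"U"},
    "--remove-females": {"M", "U"},
    "--remove-males": {"F", "U"},
    "--remove-nosex": {"F", "M"},
}

def apply_sex_filter(samples, flag):
    """samples: list of (sample_id, sex_code); returns surviving sample IDs."""
    survivors = SEX_FILTERS[flag]
    return [sid for sid, sex in samples if sex in survivors]

samples = [("s1", "F"), ("s2", "M"), ("s3", "U")]
print(apply_sex_filter(samples, "--keep-females"))    # ['s1']
print(apply_sex_filter(samples, "--remove-males"))    # ['s1', 's3']
```

In particular, --keep-females and --remove-males differ only in how unknown-sex samples are treated.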

Founder status

--keep-founders
--keep-nonfounders

--keep-founders excludes all samples with at least one known parental ID from the current analysis (note that it is not necessary for that parent to be in the current dataset), while --keep-nonfounders does the reverse.

--nonfounders

By default, nonfounders are not counted by --freq or --maf/--max-maf/--hwe. Use the --nonfounders flag to include them.
