D: 5 Feb 2024

Linkage disequilibrium

All of the following calculations only consider founders. If your dataset has a shortage of them, PLINK 1.9 --make-founders may come in handy.

Since two-variant r^2 only makes sense for biallelic variants, these collapse multiallelic variants down to most common allele vs. the rest (unless REF-based statistics are explicitly requested, in which case it's REF vs. all ALTs combined).
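
The collapsing rule above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name collapse_to_biallelic and the allele-index encoding (0 = REF, 1 and up = ALTs) are made up for this example, and ties for "most common" are broken by first appearance, which may not match PLINK 2.

```python
from collections import Counter


def collapse_to_biallelic(allele_codes, ref_based=False):
    """Recode allele indices (0 = REF, 1.. = ALTs) as a 1/0 indicator.

    Hypothetical helper: 1 marks the tracked allele (most common allele by
    default, REF when ref_based is True), 0 marks every other allele.
    """
    if ref_based:
        target = 0
    else:
        # Most common allele in the input; ties go to the first one seen.
        target = Counter(allele_codes).most_common(1)[0][0]
    return [1 if a == target else 0 for a in allele_codes]


# Six haplotypes at a triallelic variant; allele 2 is the most common.
codes = [0, 2, 2, 1, 2, 0]
major_vs_rest = collapse_to_biallelic(codes)
ref_vs_alts = collapse_to_biallelic(codes, ref_based=True)
```

Here major_vs_rest marks the three copies of allele 2, while ref_vs_alts marks the two REF copies.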

Variant pruning

--indep-pairwise <window size>['kb'] [step size (variant ct)]
                 <unphased-hardcall-r^2 threshold>
--indep-pairphase <window size>['kb'] [step size (variant ct)]
                  <phased-hardcall-r^2 threshold>

--indep <window size>['kb'] [step size (variant ct)] <VIF threshold>

--indep-order <mode>

These commands produce a pruned subset of variants that are in approximate linkage equilibrium with each other, writing the IDs to plink2.prune.in (and the IDs of all excluded variants to plink2.prune.out). These files are valid input for --extract/--exclude in a future PLINK run. For backward compatibility, the pruning commands themselves do not affect the set of variants in the current run.

Since the only output of these commands is a pair of variant-ID lists, they now error out when variant IDs are not unique.

--indep-pairwise is the simplest approach, which only considers correlations between unphased-hardcall allele counts. It takes three parameters: a required window size in variant count or kilobase (if the 'kb' modifier is present) units, an optional variant count to shift the window at the end of each step (default 1, and now required to be 1 when a kilobase window is used), and a required r^2 threshold. At each step, pairs of variants in the current window with squared correlation greater than the threshold are noted, and variants are greedily pruned from the window until no such pairs remain.
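
The windowed greedy procedure described above can be sketched in Python. This is a simplified illustration, not PLINK 2's implementation: the names r2_unphased and prune_pairwise are hypothetical, the window is measured in variant count only, missing calls are assumed absent, and the choice to drop the member of each offending pair with the lower nonmajor allele frequency is an assumption made for the example.

```python
def r2_unphased(x, y):
    """Squared Pearson correlation between two allele-count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when a variant is monomorphic
    return sxy * sxy / (sxx * syy)


def prune_pairwise(genos, window, step, r2_threshold):
    """Toy greedy pruning; genos is a list of per-variant 0/1/2 count lists.

    Returns a list of booleans (True = variant kept).
    """
    n_var = len(genos)
    keep = [True] * n_var
    # Nonmajor allele frequency decides which member of an offending pair goes.
    nonmajor = []
    for g in genos:
        af = sum(g) / (2 * len(g))
        nonmajor.append(min(af, 1 - af))
    for start in range(0, n_var, step):
        end = min(start + window, n_var)
        changed = True
        while changed:  # repeat until the window has no offending pair
            changed = False
            live = [i for i in range(start, end) if keep[i]]
            for pos, i in enumerate(live):
                for j in live[pos + 1:]:
                    if not (keep[i] and keep[j]):
                        continue
                    # A nan r^2 never exceeds the threshold, so it is ignored.
                    if r2_unphased(genos[i], genos[j]) > r2_threshold:
                        drop = i if nonmajor[i] < nonmajor[j] else j
                        keep[drop] = False
                        changed = True
    return keep


genos = [
    [0, 1, 2, 1, 0, 2],  # variant 0
    [0, 1, 2, 1, 0, 2],  # variant 1: identical to variant 0
    [1, 0, 1, 2, 1, 1],  # variant 2: uncorrelated with variant 0
]
keep = prune_pairwise(genos, window=3, step=1, r2_threshold=0.5)
```

In this toy input, variants 0 and 1 are perfectly correlated, so one of them is pruned and variant 2 survives.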

--indep-pairphase is similar, except that it requires all genotypes to be phased (this is a change from PLINK 1.9), and looks at haplotype correlations.

Additional notes:

  • --indep-order controls the order in which variant pairs are checked within each window.
    • '1': Imitate PLINK 1.x.
    • '2': Scan backwards within each window (default since 16 May 2023). This is usually faster.
  • This operation can be slow, particularly when "--indep-order 1" is specified. You'll usually want to perform e.g. MAF-based filtering beforehand.
  • On human data, some reasonable parameter settings are, in order of increasing strictness:
        "--indep-pairwise 100kb 0.8"
        "--indep-pairwise 200kb 0.5"
        "--indep-pairwise 500kb 0.2"
  • If you do want to spend the extra compute for dosage-aware pruning, one way to do so is running --clump on the output of --freq.

--indep-preferred <filename>

By default, when given a choice, the variant-pruning commands preferentially keep variants with higher nonmajor allele frequencies. However, if you provide a list of variant IDs to --indep-preferred, all variants in that list are prioritized over all variants outside it. (Allele frequencies will still be used for tiebreaking.)
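
One way to picture this priority rule is as a two-part sort key. The sketch below is an assumption-laden illustration: keep_priority, the variant IDs, and the frequencies are invented for the example, and PLINK 2's internal ordering is not described beyond the paragraph above.

```python
def keep_priority(variant_id, nonmajor_freq, preferred_ids):
    """Hypothetical sort key: larger tuples are kept in preference.

    Membership in the preferred list dominates; nonmajor allele
    frequency only breaks ties within the same membership class.
    """
    return (variant_id in preferred_ids, nonmajor_freq)


# Made-up nonmajor allele frequencies for three variants.
freqs = {"rs1": 0.40, "rs2": 0.05, "rs3": 0.10}
preferred = {"rs2"}

ranked = sorted(
    freqs,
    key=lambda vid: keep_priority(vid, freqs[vid], preferred),
    reverse=True,
)
```

Even though rs2 has the lowest frequency, it ranks first because it is on the preferred list; rs1 and rs3 are then ordered by frequency.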

LD statistic reports

--r[2]-[un]phased [{square | square0 | triangle | inter-chr}] ['yes-really']
                  [{zs | bin | bin4}] ['ref-based']
                  ['allow-ambiguous-allele'] ['cols='<column set descriptor>]
--ld-window <max variant ct + 1>

--ld-window-kb <#kb>
--ld-window-cm <#cm>
--ld-window-r2 <min>

--ld-snp <variant ID>
--ld-snps <variant ID(s)/range(s)...>
--ld-snp-list <filename>

--r2-phased computes the textbook haplotype-frequency-based r^2, and corresponds to PLINK 1.9 --r2's behavior when the 'd', 'dprime', or 'dprime-signed' modifier was present. --r2-unphased computes the simpler squared correlation between (unphased) dosage vectors, and corresponds to how PLINK 1.x --r2 behaved without a D'-related modifier. You are now required to explicitly specify which of these r^2 statistics you want.

--r-phased and --r-unphased report signed (and of course unsquared) values, with positive sign when the two major (or, when 'ref-based' is specified, REF) alleles are positively correlated with each other.
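
To make the difference between the two statistics concrete, here is a minimal Python sketch. It assumes fully phased data with no missing calls, and the function names r_phased and r_unphased are hypothetical. It does not attempt the maximum-likelihood haplotype-frequency estimation that is needed when unphased calls are present.

```python
from math import sqrt


def r_phased(hap_a, hap_b):
    """Signed r from two phased haplotype vectors.

    1 marks the allele being tracked (major, or REF under 'ref-based').
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # textbook disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d / sqrt(denom) if denom > 0 else float("nan")


def r_unphased(dos_a, dos_b):
    """Signed Pearson correlation between per-sample dosage vectors."""
    n = len(dos_a)
    ma, mb = sum(dos_a) / n, sum(dos_b) / n
    sab = sum((a - ma) * (b - mb) for a, b in zip(dos_a, dos_b))
    saa = sum((a - ma) ** 2 for a in dos_a)
    sbb = sum((b - mb) ** 2 for b in dos_b)
    return sab / sqrt(saa * sbb) if saa > 0 and sbb > 0 else float("nan")


# Four diploid samples = eight haplotypes, listed sample by sample.
hap_a = [1, 1, 1, 1, 0, 0, 0, 0]
hap_b = [1, 1, 1, 0, 0, 0, 0, 1]
# Per-sample dosages are the sums of each sample's two haplotypes.
dos_a = [hap_a[i] + hap_a[i + 1] for i in range(0, 8, 2)]
dos_b = [hap_b[i] + hap_b[i + 1] for i in range(0, 8, 2)]

phased_r2 = r_phased(hap_a, hap_b) ** 2
unphased_r2 = r_unphased(dos_a, dos_b) ** 2
```

On this toy input the haplotype-based r^2 is 0.25 while the dosage-based r^2 is 0.5, which shows why the two statistics must be requested explicitly.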

Dosages are now used when present. (In the diploid case, an unphased dosage of x is interpreted as P(0/0) = 1 - x, P(0/1) = x when x is in 0..1.) Note that you can generate a dosage-free copy of your data with "--make-pgen erase-dosage" when this behavior is unwanted.
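
The dosage interpretation above can be written out as a small Python function. The name unphased_dosage_to_probs is hypothetical. The 0..1 branch follows the text directly; the 1..2 branch is an assumed mirror image that the text does not state.

```python
def unphased_dosage_to_probs(x):
    """Return (P(0/0), P(0/1), P(1/1)) for a diploid dosage x in [0, 2]."""
    if not 0.0 <= x <= 2.0:
        raise ValueError("diploid dosage must be in [0, 2]")
    if x <= 1.0:
        # Stated in the text: P(0/0) = 1 - x, P(0/1) = x.
        return (1.0 - x, x, 0.0)
    # Assumption (not from the text): symmetric treatment for x in 1..2.
    return (0.0, 2.0 - x, x - 1.0)


low = unphased_dosage_to_probs(0.25)
high = unphased_dosage_to_probs(1.5)
```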

Phase information is used when both variants are on the same chromosome.

By default, tabular output is written to plink2.vcor[.zst]. The following filters apply to tabular output:

  • --ld-window controls the maximum number of other variants (after generic variant filters) allowed between variant-pairs in the report. All variant-pairs with at least (<--ld-window argument> - 1) variants between them are ineligible. The default setting is infinity (i.e. this filter does nothing unless explicitly specified); this is a change from PLINK 1.x.
  • --ld-window-kb controls the maximum kilobase distance. The default setting is 1000, so if you want to look at all same-chromosome variant-pairs, you need to explicitly disable this default with e.g. "--ld-window-kb 9999999". Or if you want to look at inter-chromosomal variant-pairs as well, use --r[2]-[un]phased's 'inter-chr' modifier.
  • When centimorgan coordinates are present, --ld-window-cm controls the maximum centimorgan distance. The default setting is infinity.
  • --ld-window-r2 controls the minimum r^2 for a variant-pair to be included in the report. With a negative setting, 'nan' values are not filtered out. The default setting is 0.2 (for --r-phased and --r-unphased, this means |r|≥sqrt(0.2)).
  • You can restrict the first variant in each variant-pair to come from a limited set of variant ID(s). --ld-snp specifies a single ID, --ld-snps accepts one or more variant ranges (same syntax as --snps), and --ld-snp-list specifies a file to load variant IDs from. These flags no longer cause self-comparisons or duplicate entries to appear in the report.
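
The count, distance, and r^2 filters above can be combined into a single predicate, sketched here in Python. The function pair_in_report and its parameter names are hypothetical. The sketch covers same-chromosome pairs only, omits the centimorgan and --ld-snp filters, and assumes that the kilobase and r^2 boundaries are inclusive.

```python
import math


def pair_in_report(idx_a, idx_b, bp_a, bp_b, r2,
                   ld_window=None, ld_window_kb=1000.0, ld_window_r2=0.2):
    """Return True if a same-chromosome variant pair passes the filters.

    idx_* are variant positions after generic variant filters; bp_* are
    base-pair coordinates. ld_window=None stands for the default (infinity).
    """
    between = abs(idx_b - idx_a) - 1  # number of variants between the pair
    if ld_window is not None and between >= ld_window - 1:
        return False
    if abs(bp_b - bp_a) > ld_window_kb * 1000:
        return False
    if math.isnan(r2):
        return ld_window_r2 < 0  # nan survives only with a negative minimum
    return r2 >= ld_window_r2


adjacent_ok = pair_in_report(0, 1, 1000, 2000, 0.5)
too_many_between = pair_in_report(0, 5, 1000, 2000, 0.5, ld_window=5)
too_far = pair_in_report(0, 1, 0, 2_000_000, 0.9)
nan_kept = pair_in_report(0, 1, 0, 10, float("nan"), ld_window_r2=-1)
```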

Refer to the file format entry for output details and optional columns (e.g. allele frequency, D').

By default, when multiallelic variants are present, --r2-[un]phased will error out when the 'maj' and 'nonmaj' (or, if 'ref-based' is specified, 'ref' and 'alt') column-sets are absent. Similarly, --r-[un]phased will error out when these column-sets are absent, regardless of whether multiallelic variants are present. The 'allow-ambiguous-allele' modifier overrides this behavior.

To request all-pairs matrix output instead, specify a matrix shape ('square', 'square0', 'triangle') and/or encoding ('bin', 'bin4') modifier; these have the same behavior as with --make-rel.

  • As with --make-rel and similar commands, if a shape is specified without an encoding, encoding defaults to text; and if an encoding is specified without a shape, shape defaults to square.
  • Since there is no header line specifying phased vs. unphased, r vs. r^2, or text vs. binary vs. Zstd-compressed-text, this information is embedded in the main matrix filename's extension. ".[un]phased.vcor{1|2}{|.bin|.zst}" summarizes the twelve possibilities.
  • A file named <matrix filename without .zst>.vars is also written, containing the matrix's variant ID sequence. (Relatedly, variant IDs are now required to be unique.)
  • Since the resulting file can easily be huge, you're required to add the 'yes-really' modifier when requesting an unfiltered (or only-nan-filtered), non-distributed all-pairs computation on more than 400k variants. 'inter-chr' with a nonpositive --ld-window-r2 setting also triggers this sanity check.
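
The twelve matrix filename extensions summarized by the pattern above can be enumerated mechanically. The Python sketch below only expands that pattern; it does not call PLINK 2, and the variable names are made up for the example.

```python
from itertools import product

# Expand ".[un]phased.vcor{1|2}{|.bin|.zst}" into every combination.
extensions = [
    f".{phase}.vcor{power}{suffix}"
    for phase, power, suffix in product(
        ("phased", "unphased"),  # phased vs. unphased statistic
        ("1", "2"),              # the {1|2} part of the pattern
        ("", ".bin", ".zst"),    # text, binary, Zstd-compressed text
    )
]

for ext in extensions:
    print("plink2" + ext)
```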

With either output type, the computation can be subdivided with --parallel.

--ld <variant ID> <variant ID> ['hwe-midp']

To inspect the relation between a single pair of variants in more detail, you can use the --ld flag, which displays observed and expected (based on MAFs) frequencies of each haplotype, as well as haplotype-based r^2 and D'. (The latter two values are calculated in the same manner as they are for --r2-phased.)
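
The quantities in this report follow from textbook formulas, sketched below in Python. The function ld_summary and the haplotype labels (AB, Ab, aB, ab) are hypothetical, the haplotype frequency is taken as already known (no estimation from unphased calls), and D' is computed in its signed textbook form, which may differ in presentation from PLINK 2's output.

```python
def ld_summary(p_ab, p_a, p_b):
    """Observed/expected haplotype frequencies, r^2 and D'.

    p_a, p_b: frequencies of allele A at variant 1 and allele B at variant 2.
    p_ab: observed frequency of the A-B haplotype.
    """
    q_a, q_b = 1 - p_a, 1 - p_b
    observed = {"AB": p_ab, "Ab": p_a - p_ab,
                "aB": p_b - p_ab, "ab": 1 - p_a - p_b + p_ab}
    # Expected frequencies under linkage equilibrium: allele-frequency products.
    expected = {"AB": p_a * p_b, "Ab": p_a * q_b,
                "aB": q_a * p_b, "ab": q_a * q_b}
    d = p_ab - p_a * p_b
    denom = p_a * q_a * p_b * q_b
    r2 = d * d / denom if denom > 0 else float("nan")
    # Largest |D| attainable given the allele frequencies and the sign of D.
    d_max = min(p_a * q_b, q_a * p_b) if d > 0 else min(p_a * p_b, q_a * q_b)
    d_prime = d / d_max if d_max > 0 else float("nan")
    return observed, expected, r2, d_prime


observed, expected, r2, d_prime = ld_summary(p_ab=0.375, p_a=0.5, p_b=0.5)
```

With these inputs, every expected haplotype frequency is 0.25, while the observed A-B frequency of 0.375 yields r^2 = 0.25 and D' = 0.5.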

When unphased calls are present, and there are multiple biologically possible solutions to the haplotype frequency cubic equation, all are displayed (instead of just the maximum likelihood solution identified by --r[2]-phased), along with HWE exact test statistics.

--bad-ld

PLINK 2 cannot estimate LD effectively when very few founders are present, so it normally errors out when there are fewer than 50. If you can't solve the problem with PLINK 1.9 --make-founders, you can use --bad-ld as a last resort to force PLINK 2 to proceed.
