Introduction, downloads

S: 11 Dec 2023 (b7.2)

D: 11 Dec 2023

Linkage disequilibrium

PLINK 1.9 includes much faster implementations of PLINK 1.07's LD-based variant pruner and haplotype block estimator, and commands to explicitly report LD statistics.

All of the following calculations only consider founders. If your dataset has a shortage of them, --make-founders may come in handy.

Variant pruning

--indep <window size>['kb'] <step size (variant ct)> <VIF threshold>
--indep-pairwise <window size>['kb'] <step size (variant ct)> <r^2 threshold>
--indep-pairphase <window size>['kb'] <step size (variant ct)> <r^2 threshold>

These commands produce a pruned subset of markers that are in approximate linkage equilibrium with each other, writing the IDs to plink.prune.in (and the IDs of all excluded variants to plink.prune.out). --indep and --indep-pairwise are based on correlations between genotype allele counts, so phase is not considered; --indep-pairphase uses maximum likelihood phasing instead. (Results may be slightly different from PLINK 1.07, due to a minor bugfix in the r^2 computation when missing data is present, and more systematic handling of multicollinearity.) Output files are valid input for --extract/--exclude in a future PLINK run.

--indep requires three parameters: a window size in variant count or kilobase (if the 'kb' modifier is present) units, a variant count to shift the window at the end of each step, and a variance inflation factor (VIF) threshold. At each step, all variants in the current window with VIF exceeding the threshold are removed. See the PLINK 1.07 documentation for some discussion of parameter choices.
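To make the VIF criterion concrete, here is a minimal Python sketch that computes a variance inflation factor for each variant in a single window. It relies on the standard identity that VIF_i is the i-th diagonal entry of the inverse correlation matrix, which equals 1/(1 - R_i^2) for the regression of variant i on the others. The function names, the toy genotypes, and the threshold of 3.0 are all hypothetical, and this is not PLINK's actual implementation: it ignores missing calls, window sliding, and whatever handling PLINK applies to perfectly collinear variants.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of allele counts."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting; raises on singular input."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def variance_inflation_factors(variants):
    """variants: one list of 0/1/2 allele counts per variant (no missing calls)."""
    corr = [[pearson(a, b) for b in variants] for a in variants]
    inverse = invert(corr)
    return [inverse[i][i] for i in range(len(variants))]

# Toy window: three variants genotyped in six founders (made-up data).
window = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 1, 0, 2],
    [2, 0, 1, 1, 2, 0],
]
vifs = variance_inflation_factors(window)
vif_threshold = 3.0  # hypothetical value, chosen only for this illustration
flagged = [i for i, v in enumerate(vifs) if v > vif_threshold]
print(vifs, flagged)
```

In this toy window only the middle variant exceeds the illustrative threshold, because it is strongly correlated with both of its neighbors.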

--indep-pairwise takes the same first two parameters as --indep. Its third parameter is a pairwise r^2 threshold: at each step, pairs of variants in the current window with squared correlation greater than the threshold are noted, and variants are greedily pruned from the window until no such pairs remain. Since it does not need to keep the entire <window size> x <window size> correlation matrix in memory, it is usually capable of handling 6-digit window sizes well outside --indep's reach.

Finally, --indep-pairphase is just like --indep-pairwise, except that its r^2 values are based on maximum likelihood phasing (like "--r2 dprime" below).
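The greedy pairwise pruning step can be sketched for a single window as follows. This is an illustration under stated assumptions, not PLINK's code: the rule for choosing which variant of an offending pair to drop (here, the variant involved in the most offending pairs, with ties going to the later index) is an assumption made for the example, and the function names and toy data are hypothetical. Window sliding and missing genotypes are left out.

```python
def r_squared(x, y):
    """Squared Pearson correlation of allele counts; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def prune_window(variants, r2_threshold):
    """Return indices kept once no remaining pair has r^2 above the threshold."""
    kept = list(range(len(variants)))
    while True:
        offending = [
            (i, j)
            for a, i in enumerate(kept)
            for j in kept[a + 1:]
            if r_squared(variants[i], variants[j]) > r2_threshold
        ]
        if not offending:
            return kept
        # Assumed tie-breaking rule (not taken from PLINK): remove the variant
        # that appears in the most offending pairs, preferring the later index.
        counts = {}
        for i, j in offending:
            counts[i] = counts.get(i, 0) + 1
            counts[j] = counts.get(j, 0) + 1
        worst = max(counts, key=lambda k: (counts[k], k))
        kept.remove(worst)

# Toy window: three variants genotyped in six founders (made-up data).
window = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 1, 0, 2],
    [2, 0, 1, 1, 2, 0],
]
kept = prune_window(window, r2_threshold=0.5)
print(kept)
```

Only pairwise correlations are needed here, which is why this approach scales to much larger windows than a VIF computation over the full correlation matrix.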

LD statistic reports

--r [{square | square0 | triangle | inter-chr}] [{gz | bin | bin4}] ['spaces'] ['in-phase'] [{d | dprime | dprime-signed}] ['with-freqs'] ['yes-really']
--r2 [{square | square0 | triangle | inter-chr}] [{gz | bin | bin4}] ['spaces'] ['in-phase'] [{d | dprime | dprime-signed}] ['with-freqs'] ['yes-really']

--ld-window <max variant ct + 1>
--ld-window-kb <kbs>
--ld-window-cm <cms>
--ld-window-r2 <val>

--ld-snp <variant ID>
--ld-snps <variant ID(s)/range(s)...>
--ld-snp-list <filename>

By default, --r calculates and reports raw inter-variant allele count correlations, while --r2 reports squared correlations. You can request values for all pairs in matrix format (if you specify 'bin', 'bin4', and/or one of the matrix shape modifiers), all pairs in table format (with 'inter-chr'), or a limited window in table format (this is the default). Results are saved to plink.ld{|.gz|.bin}.

  • The 'gz' modifier causes the output text file to be gzipped.
  • 'bin' causes the output matrix to be written in double-precision binary format, while 'bin4' specifies single-precision binary. The matrix is square if no shape is explicitly specified.
  • By default, text matrices are tab-delimited. To make them space-delimited instead, use the 'spaces' modifier.
  • 'in-phase' adds a column with in-phase allele pairs to table-formatted reports. (This cannot be used with very long allele codes.)
  • 'dprime' adds the absolute value of Lewontin's D-prime statistic to table-formatted reports, and forces both r/r^2 and D-prime to be based on the maximum likelihood solution to the cubic equation discussed in Gaunt T, Rodríguez S, Day I (2007) Cubic exact solutions for the estimation of pairwise haplotype frequencies. 'dprime-signed' keeps the sign, while 'd' skips division by Dmax.
  • 'with-freqs' adds MAF columns to table-formatted reports.
  • Since it is disturbingly easy to request a report that won't fit on your hard drive (given calls at a few million variants, an all pairs report can consume tens of terabytes), you're now required to add the 'yes-really' modifier when requesting an unfiltered, non-distributed all pairs computation on more than 400k variants.
  • These computations can be subdivided with --parallel (even when the 'square' modifier is active).
  • By default, when a limited window report is requested, every pair of variants with at least (10-1) variants between them, or more than 1000 kilobases apart, is ignored. You can change the first threshold with --ld-window, and the second threshold with --ld-window-kb.
  • If centimorgan coordinates are present, you can also impose a maximum centimorgan distance with --ld-window-cm.
  • When a table format report is requested, pairs with r^2 values less than 0.2 are normally filtered out of the report. (This still happens with 'inter-chr'.) Use --ld-window-r2 to adjust this threshold.
  • When a table format report is requested, --ld-snp forces the first variant in each pair to be the one named on the command line, --ld-snps accepts one or more variant ranges (same syntax as --snps), and --ld-snp-list loads variant IDs from the given file. (Pedantic note: --r and --r2 usually avoid reporting the same pair twice since they loop over just the upper or just the lower triangle, but that's no longer true when one of these three flags is active.)
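The default limited-window filters described above (variant count window of 10, 1000 kb, and r^2 of at least 0.2) can be illustrated with a short Python sketch. It assumes a single chromosome with variants sorted by base-pair position, represents missing calls as None, and computes each correlation over samples with both genotypes present; that last choice is an assumption, since the exact missing-data handling is not spelled out here. The function names and toy data are hypothetical.

```python
from math import sqrt

def allele_count_r(x, y):
    """Allele-count correlation over samples where both calls are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a monomorphic variant
    return sxy / sqrt(sxx * syy)

def ld_table(positions, genotypes, ld_window=10, ld_window_kb=1000,
             ld_window_r2=0.2):
    """Return (i, j, r^2) rows for variant pairs that pass all three filters."""
    rows = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            # Pairs with at least (ld_window - 1) variants between them are skipped.
            if j - i >= ld_window:
                break
            # Positions are sorted, so later variants are even further away.
            if positions[j] - positions[i] > ld_window_kb * 1000:
                break
            r = allele_count_r(genotypes[i], genotypes[j])
            if r is not None and r * r >= ld_window_r2:
                rows.append((i, j, r * r))
    return rows

# Toy data: three variants on one chromosome, six founders (made-up values).
positions = [100_000, 150_000, 1_200_000]
genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 1, 0, 2],
    [2, 0, 1, 1, 2, 0],
]
rows = ld_table(positions, genotypes)
print(rows)
```

Here only the first two variants form a reported pair; the third variant is more than 1000 kb from both of the others, so those pairs are never evaluated.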

--ld <variant ID> <variant ID> ['hwe-midp']

To inspect the relation between a single pair of variants in more detail, you can use the --ld flag, which displays observed and expected (based on MAFs) frequencies of each haplotype, as well as haplotype-based r^2 and D'. When there are multiple biologically possible solutions to the haplotype frequency cubic equation, all are displayed (instead of just the maximum likelihood solution identified by --r/--r2), along with HWE exact test statistics.
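As a worked example of the quantities involved, the following Python sketch derives expected haplotype frequencies, D, D' and haplotype-based r^2 from a set of haplotype frequencies that is taken as given. It does not solve the cubic equation that PLINK uses to estimate those frequencies from unphased genotypes; the function name, the "11"/"12"/"21"/"22" haplotype labels, and the input frequencies are hypothetical.

```python
def ld_from_haplotype_freqs(p11, p12, p21, p22):
    """LD statistics for two biallelic variants from four haplotype frequencies.

    p11 is the frequency of the haplotype carrying the first allele at both
    variants, p12 the first allele at variant 1 with the second at variant 2,
    and so on.
    """
    if abs(p11 + p12 + p21 + p22 - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    pa = p11 + p12  # first-allele frequency at variant 1
    pb = p11 + p21  # first-allele frequency at variant 2
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        raise ValueError("LD is undefined when a variant is monomorphic")
    # Expected frequencies under linkage equilibrium: products of allele freqs.
    expected = {
        "11": pa * pb,
        "12": pa * (1 - pb),
        "21": (1 - pa) * pb,
        "22": (1 - pa) * (1 - pb),
    }
    d = p11 - pa * pb
    if d >= 0:
        d_max = min(pa * (1 - pb), (1 - pa) * pb)
    else:
        d_max = min(pa * pb, (1 - pa) * (1 - pb))
    return {
        "D": d,                 # raw disequilibrium, no division by Dmax
        "Dprime": d / d_max,    # signed D'; take abs() for the unsigned form
        "r2": d * d / denom,
        "expected": expected,
    }

stats = ld_from_haplotype_freqs(0.5, 0.1, 0.1, 0.3)
print(stats)
```

With these inputs both allele frequencies are 0.6, so the expected "11" frequency is 0.36, D is 0.14, D' is about 0.583 and r^2 is about 0.340.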

List tagging variants

--show-tags <target variant ID file>
--show-tags all

--list-all
--tag-kb <kbs>
--tag-r2 <val>
--tag-mode2

To help with tag SNP selection, --show-tags determines all variants which have allele count squared correlation ≥ 0.8 with a target variant. This command generates one or two files:

  • if the '--list-all' flag or the --show-tags 'all' modifier is specified, a per-target-variant tag report is written to plink.tags.list. 'all' causes all variants to be target variants, while --list-all restricts the target variant set to those in the file named in the --show-tags parameter.
  • when not in 'all' mode, a single list of tags for the entire target variant set is written to plink.tags. A variant is considered to tag itself here, so this will be a superset of the target variant list unless the latter contains variant IDs missing from the current dataset.
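The tag definition can be illustrated with a minimal Python sketch for one target variant. It uses the defaults stated in this section (squared allele-count correlation of at least 0.8, and a 250 kb search window), treats the target as tagging itself, and assumes a single chromosome with no missing genotypes. The function names and toy data are hypothetical, and this is not PLINK's implementation.

```python
def r_squared(x, y):
    """Squared Pearson correlation of allele counts; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def find_tags(target, positions, genotypes, tag_kb=250, tag_r2=0.8):
    """Return indices of variants that tag the target (the target included)."""
    tags = []
    for j, pos in enumerate(positions):
        if abs(pos - positions[target]) > tag_kb * 1000:
            continue  # outside the search window around the target
        if j == target or r_squared(genotypes[target], genotypes[j]) >= tag_r2:
            tags.append(j)
    return tags

# Toy data: four variants on one chromosome, six founders (made-up values).
positions = [100_000, 200_000, 300_000, 900_000]
genotypes = [
    [0, 1, 2, 0, 1, 2],  # target
    [0, 1, 2, 0, 2, 2],  # r^2 with the target is about 0.83
    [2, 0, 1, 1, 2, 0],  # r^2 with the target is 0.25
    [0, 1, 2, 0, 1, 2],  # identical to the target, but 800 kb away
]
tags = find_tags(0, positions, genotypes)
print(tags)
```

The last variant is perfectly correlated with the target but lies outside the 250 kb window, so it is not listed unless the window is widened.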

In addition,

  • by default, the scan for potential tags is limited to variants within 250 kilobases of the target. This value is adjustable with --tag-kb.
  • --tag-r2 allows the minimum tag r^2 to be adjusted.
  • when not in 'all' mode, --tag-mode2 changes the --show-tags target variant input format to two-column, with IDs in the first column and '0'/'1' values in the second (where target variants are marked by '1's). The .tags output format is also changed to two-column.

X chromosome model

--ld-xchr <mode number>

Handling of the X chromosome by --indep[-pairwise], --r/--r2 (without 'dprime'), --flip-scan, and --show-tags can be adjusted with the --ld-xchr flag. There are currently three modes:

  1. (default) Males are coded 0/1 and females are coded 0/1/2, based on A1 allele dosage. This is PLINK 1.07's behavior.
  2. Males are coded 0/2.
  3. Males are coded 0/2, but females have twice the weight of males in the computation (currently only supported by --indep[-pairwise]).

Haplotype block estimation

--blocks ['no-pheno-req'] ['no-small-max-span']

--blocks-max-kb <kbs>
--blocks-min-maf <cutoff>
--blocks-strong-lowci <x>
--blocks-strong-highci <x>
--blocks-recomb-highci <x>
--blocks-inform-frac <x>

--blocks estimates haplotype blocks, via Haploview's interpretation of the block definition suggested by Gabriel S et al. (2002) The Structure of Haplotype Blocks in the Human Genome. Each block's variant IDs are written to plink.blocks, and a longer report with position information is written to plink.blocks.det.

  • To maintain backwards compatibility, this computation normally does not consider either nonfounders or samples with missing phenotypes. The 'no-pheno-req' modifier lifts the phenotype restriction.
  • Normally, size-2 blocks may not span more than 20kb, and size-3 blocks are limited to 30kb. The 'no-small-max-span' modifier removes these limits.
  • By default, only pairs of variants within 200 kilobases of each other are considered; this bound can, and usually should, be increased with --blocks-max-kb.
  • All variants with MAF < 0.05 are normally ignored by this procedure. Use --blocks-min-maf to adjust this threshold.
  • Two variants are normally considered by this procedure to be in "strong LD" if the bottom of the 90% D-prime confidence interval is greater than 0.70, and the top of the confidence interval is at least 0.98. Use --blocks-strong-lowci and --blocks-strong-highci to change these values.
    Note that Haploview/PLINK 1.07 has a minor quirk in its handling of the low CI bound: a potential haploblock's outermost variant pair is considered to be in "strong LD" even if the bottom of its D-prime confidence interval is exactly equal to the --blocks-strong-lowci value. To maximize backward compatibility, we have replicated this quirk. Use a lower bound that isn't a perfect multiple of 0.01 (e.g. "--blocks-strong-lowci 0.7005") if you don't want outermost pairs to receive special treatment.
  • By default, this procedure treats confidence interval tops smaller than 0.90 as strong evidence for historical recombination; use --blocks-recomb-highci to adjust this threshold.
  • Normally, the number of "strong LD" pairs within a haploblock must be more than 0.95 times the total number of informative pairs (i.e. either "strong LD" or "recombination"). This threshold can be adjusted with --blocks-inform-frac.
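The thresholds above can be summarized in a small Python sketch that classifies one variant pair from its D-prime confidence interval and then applies the informative-fraction test to a candidate block. It covers only these two checks, using the default values stated in this section, and not the full block search or the confidence interval computation. The function names and the 'outermost' argument are hypothetical and exist only to show the low-bound quirk.

```python
def classify_pair(ci_low, ci_high, strong_lowci=0.70, strong_highci=0.98,
                  recomb_highci=0.90, outermost=False):
    """Classify a variant pair from its D-prime confidence interval bounds."""
    # Quirk replicated from Haploview/PLINK 1.07: for a candidate block's
    # outermost pair, a low bound exactly equal to the cutoff still counts.
    if outermost:
        low_ok = ci_low >= strong_lowci
    else:
        low_ok = ci_low > strong_lowci
    if low_ok and ci_high >= strong_highci:
        return "strong LD"
    if ci_high < recomb_highci:
        return "recombination"
    return "uninformative"

def passes_inform_frac(pair_classes, inform_frac=0.95):
    """True if strong-LD pairs exceed inform_frac times the informative pairs."""
    strong = sum(1 for c in pair_classes if c == "strong LD")
    recomb = sum(1 for c in pair_classes if c == "recombination")
    informative = strong + recomb
    return informative > 0 and strong > inform_frac * informative

# A low bound exactly at the cutoff is "strong LD" only for an outermost pair.
print(classify_pair(0.70, 0.99))
print(classify_pair(0.70, 0.99, outermost=True))

# 39 strong-LD pairs and 1 recombination pair: 39 > 0.95 * 40, so this passes.
candidate = ["strong LD"] * 39 + ["recombination"] + ["uninformative"] * 5
print(passes_inform_frac(candidate))
```

Passing a low-bound cutoff such as 0.7005 as strong_lowci makes the two branches agree for confidence bounds reported to two decimal places, which mirrors the workaround suggested above.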

The .blocks file is valid input for PLINK 1.07's --hap command. However, the --hap... family of flags has not been reimplemented in PLINK 1.9 due to poor phasing accuracy (and, consequently, inferior haplotype likelihood/frequency estimates) relative to other software; for now, we recommend using BEAGLE 3.3.2 instead of PLINK for case/control haplotype association analysis. (You can use "--recode beagle" to export data.) We apologize for the inconvenience, and plan to develop variants of the --hap... flags which handle pre-phased data effectively.
