Introduction, downloads

S: 11 Dec 2023 (b7.2)

D: 11 Dec 2023

Distance matrices

Identity-by-state/Hamming

--distance [{square | square0 | triangle}] [{gz | bin | bin4}] ['ibs'] ['1-ibs'] ['allele-ct'] ['flat-missing']
--distance-wts exp=<x>
--distance-wts <filename> ['noheader']

--distance is the primary interface to PLINK 1.9's IBS and Hamming distance calculation engine.

Output formats
By default, --distance causes a lower-triangular tab-delimited text file to be written to plink.dist, and a list of corresponding sample IDs to plink.dist.id. The first five modifiers allow you to change the output format.

  • 'square', 'square0', and 'triangle' affect the shape of the output matrix. 'square' yields a symmetric matrix; 'triangle' (normally the default) yields a lower-triangular matrix where the first row contains only the <genome 1-genome 2> distance, the second row has the <genome 1-genome 3> and <genome 2-genome 3> distances in that order, etc.; and 'square0' yields a square matrix with all cells in the upper right triangle zeroed out.
  • 'gz' causes a gzipped file to be written to plink.dist.gz instead.
  • 'bin' causes the matrix to be written to plink.dist.bin using little-endian IEEE-754 double encoding (suitable for loading from R). When using 'bin', the default output shape is 'square' instead of 'triangle'.
  • 'bin4' uses IEEE-754 single-precision encoding, and is otherwise identical to 'bin'. This saves disk space, but you'll need to specify 4-byte single-precision input for your next analysis step. The following does so in R:

    readBin('<filename>', what="numeric", n=<number of entries>, size=4)

    (Omit "size=4" to load the usual 8-byte encoding.)
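For readers working outside R, the following is a minimal Python sketch of the same loading step. It assumes a 'square' matrix stored row by row, and that the sample count n has already been obtained (e.g. by counting lines in the .id file); the function name read_square_matrix is made up for this illustration.

```python
import struct

def read_square_matrix(path, n, single_precision=False):
    """Read an n x n little-endian matrix in 'bin' (8-byte) or 'bin4' (4-byte) form.

    Assumption: values are stored row by row with no header.
    """
    code, size = ("f", 4) if single_precision else ("d", 8)
    with open(path, "rb") as handle:
        raw = handle.read()
    if len(raw) != n * n * size:
        raise ValueError("file size does not match an n x n matrix")
    flat = struct.unpack("<%d%s" % (n * n, code), raw)
    # Slice the flat tuple back into n rows of n values each.
    return [list(flat[r * n:(r + 1) * n]) for r in range(n)]
```

Pass single_precision=True for a file written with 'bin4'; the size check catches the common mistake of mixing up the two encodings.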

Units
By default, distances are expressed as allele counts. 'ibs' causes an identity-by-state matrix to be written to plink.mibs instead (and the corresponding ID file is written to plink.mibs.id), while '1-ibs' causes distances expressed as genomic proportions (i.e. 1 minus the identity-by-state value) to be written to plink.mdist. You can request multiple units in a single run; e.g. "--distance ibs allele-ct" causes both a .mibs and a .dist file to be written.
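As a rough illustration of how the three units relate, here is a small Python sketch for a single pair of samples. It assumes genotypes coded as 0/1/2 allele counts, no missing calls, and no variant weights; the helper names are hypothetical, and PLINK's own handling of missing data is described under "Missingness correction".

```python
def allele_count_distance(a, b):
    """Sum of per-variant allele count differences between two samples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def distance_units(a, b):
    """Express one pairwise distance in the three units discussed above."""
    variant_ct = len(a)
    allele_ct = allele_count_distance(a, b)
    # Each variant can contribute at most 2 allele differences.
    one_minus_ibs = allele_ct / (2 * variant_ct)
    return {"allele-ct": allele_ct, "1-ibs": one_minus_ibs, "ibs": 1 - one_minus_ibs}

print(distance_units([0, 1, 2, 2], [0, 2, 0, 2]))
```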

Distance weights
The --distance-wts flag allows you to weight the variants in an arbitrary manner.

  • 'exp=<x>' causes a weight of (2q(1-q))^(-x) to be applied to each variant, where q is the loaded or inferred MAF.
  • If a filename is provided instead, variant IDs are loaded from the first column and weights from the second. The first nonempty line of the file is normally skipped; add the 'noheader' modifier to keep it.
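The two weighting modes can be sketched in Python as follows. The first function evaluates 2q(1-q) raised to the power -x; the second mimics the file rules described above (first nonempty line skipped unless 'noheader' is given). Both function names are illustrative only, and whitespace-delimited columns are an assumption.

```python
def exp_weight(q, x):
    """Weight for a variant with MAF q under 'exp=<x>': (2q(1-q)) to the power -x."""
    return (2.0 * q * (1.0 - q)) ** -x

def parse_weights(lines, noheader=False):
    """Map variant ID (column 1) to weight (column 2).

    Assumption: columns are separated by whitespace.
    """
    rows = [line.split() for line in lines if line.strip()]
    if not noheader:
        rows = rows[1:]  # drop the first nonempty line (header)
    return {fields[0]: float(fields[1]) for fields in rows}

print(exp_weight(0.5, 1))
print(parse_weights(["ID WT", "rs1 0.5", "rs2 2"]))
```

Note that larger x values upweight rare variants, since 2q(1-q) shrinks as q approaches 0.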

Missingness correction
When missing calls are present, PLINK 1.9 defaults to dividing each observed genomic distance by (1-<sum of missing variants' average contribution to distance>). If MAF is nearly independent of missingness, this treatment is more accurate than the usual flat (1-<missing call frequency>) denominator. However, if independence is a poor assumption, you can use the 'flat-missing' modifier to force PLINK 1.9 to apply the flat missingness correction.
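The difference between the two denominators can be sketched in Python as below. This is only an illustration for one pair of samples: the 'contribution' list (each variant's average share of the total distance, summing to 1) is a hypothetical input, and how PLINK derives those shares internally is not shown here.

```python
def flat_correction(observed, missing_ct, variant_ct):
    """Divide by (1 - missing call frequency)."""
    return observed / (1.0 - missing_ct / variant_ct)

def weighted_correction(observed, missing_indices, contribution):
    """Divide by (1 - summed average contribution of the missing variants)."""
    missing_share = sum(contribution[i] for i in missing_indices)
    return observed / (1.0 - missing_share)

# Variant 3 is missing; it normally accounts for only 10% of the distance,
# so the weighted correction is milder than the flat 1-in-4 correction.
print(flat_correction(0.3, 1, 4))
print(weighted_correction(0.3, [3], [0.4, 0.3, 0.2, 0.1]))
```

When every variant contributes equally, the two corrections coincide.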

Distributed computation
--distance jobs using the 'square0' or 'triangle' output shapes can be subdivided with the --parallel flag.

Backwards compatibility

--distance-matrix
--ibs-matrix

These deprecated flags generate space-delimited text matrices, and are included for backwards compatibility with scripts relying on the corresponding PLINK 1.07 flags. New scripts should migrate to "--distance 1-ibs flat-missing" and "--distance ibs flat-missing".

Note that you are no longer required to use these flags in conjunction with --cluster.

Reloading

--read-dists <distance file> [ID file]

If you've previously generated a distance matrix using "--distance triangle bin", this lets you reload it for --cluster, --neighbour, and the distance-phenotype analyses below. When no ID file is named, it is assumed that the distance matrix was generated with the same samples in the same order as in the current PLINK run.
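To inspect such a file outside PLINK, something like the following Python sketch may help. It assumes the binary triangle holds n(n-1)/2 little-endian doubles in the same order as the text 'triangle' layout (1-2, then 1-3 and 2-3, and so on), with no diagonal; read_triangle_bin is a made-up name.

```python
import struct

def read_triangle_bin(path, n):
    """Expand a 'triangle bin' distance file into a full symmetric n x n list of lists.

    Assumption: the file stores the strict lower triangle row by row as
    little-endian 8-byte doubles.
    """
    count = n * (n - 1) // 2
    with open(path, "rb") as handle:
        raw = handle.read()
    if len(raw) != count * 8:
        raise ValueError("file size does not match n(n-1)/2 doubles")
    flat = struct.unpack("<%dd" % count, raw)
    full = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(1, n):
        for j in range(i):
            full[i][j] = full[j][i] = flat[k]
            k += 1
    return full
```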

We are likely to extend this flag to support more --distance output formats in the future.

Relationship/covariance

--make-rel [{square | square0 | triangle}] [{gz | bin | bin4}] [{cov | ibc2 | ibc3}]

--make-rel is the primary interface to PLINK 1.9's realized relationship matrix and covariance matrix calculator. (See Yang J, Lee SH, Goddard ME, Visscher PM (2011) GCTA: A Tool for Genome-wide Complex Trait Analysis for discussion of relationship matrix definition and usage. Note that this calculation is not LD-sensitive; if that's a problem, we currently recommend using Doug Speed et al.'s LDAK software instead.)

Output formats
The 'square', 'square0', 'triangle', 'gz', 'bin', and 'bin4' modifiers have essentially the same effects as they do when used with --distance; the only difference is that 'square0' and 'triangle' no longer zero out/exclude the matrix diagonal. Depending on which of these modifiers are present, the output matrix's file extension is .rel, .rel.bin, or .rel.gz.

Variance-standardization
By default, the sample covariance at each SNP is divided by that SNP's variance (calculated from observed, or loaded, MAF). To disable this and calculate a straight covariance matrix, use the 'cov' modifier.

Inbreeding estimates on diagonal
Diagonal elements are set to (1 + Fhat), where Fhat is one of GCTA's inbreeding estimators. The default choice is Fhat1, which is equivalent to not having any special handling of the diagonal at all (this matches all versions of GCTA from 0.93.0 on). If you wish to follow the guidance of the original GCTA paper (implemented by very old GCTA versions), use the 'ibc3' modifier. 'ibc2' causes Fhat2 (PLINK 1.07's inbreeding estimator) to be used.
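A minimal Python sketch of the calculation described in the last two subsections follows, assuming no missing calls and allele frequencies supplied by the caller. Products of centered genotypes are divided by 2p(1-p) unless cov is set, and the diagonal gets no special handling (the default behavior per the text above). The function name and the exact normalization of the 'cov' case are assumptions, not a statement of PLINK's internals.

```python
def relationship_matrix(genotypes, freqs, cov=False):
    """genotypes[v][s]: allele count (0/1/2) of variant v in sample s.
    freqs[v]: frequency of the counted allele at variant v.
    """
    sample_ct = len(genotypes[0])
    variant_ct = len(genotypes)
    rel = [[0.0] * sample_ct for _ in range(sample_ct)]
    for calls, p in zip(genotypes, freqs):
        centered = [g - 2.0 * p for g in calls]
        # Per-variant variance under Hardy-Weinberg; skipped for a plain covariance.
        scale = 1.0 if cov else 2.0 * p * (1.0 - p)
        for j in range(sample_ct):
            for k in range(j + 1):
                rel[j][k] += centered[j] * centered[k] / scale
    for j in range(sample_ct):
        for k in range(j + 1):
            rel[j][k] /= variant_ct  # average over variants
            rel[k][j] = rel[j][k]    # mirror into the upper triangle
    return rel

print(relationship_matrix([[0, 1, 2]], [0.5]))
```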

Distributed computation
--make-rel jobs using the 'square0' or 'triangle' output shapes can be subdivided with the --parallel flag.

Exporting to GCTA

--make-grm-gz ['no-gz'] [{cov | ibc2 | ibc3}]
--make-grm-bin [{cov | ibc2 | ibc3}]

--make-grm-gz and --make-grm-bin perform the same calculation as --make-rel (so the 'cov', 'ibc2', and 'ibc3' modifiers have the same effect), but produce a .grm.gz or .grm.bin-format file for GCTA to process. (--make-grm-gz's 'no-gz' modifier turns off gzipping of the main output file.)

The --make-grm-bin computation was switched from single-precision to double-precision internal arithmetic in Nov 2014; see e.g. this real-world instance of insufficient precision leading to flawed science for motivation. (We don't actually expect any of GCTA's results to be dangerously inaccurate, especially when fewer than ~10 million markers are involved, but we figure a 1.2x-2x speed penalty here is an acceptable price to pay for peace of mind.)

These computations can be subdivided with --parallel.

Relationship-based pruning

--rel-cutoff [maximum]
  (alias: --grm-cutoff)

If used in conjunction with a later calculation (see the order of operations page for details), --rel-cutoff excludes one member of each pair of samples with observed genomic relatedness greater than the given cutoff value (default 0.025) from the analysis. Alternatively, you can invoke this on its own to write a pruned list of sample IDs to plink.rel.id.

PLINK tries to maximize the final sample size, but this maximum independent set problem is NP-hard, so we use a greedy algorithm which does not guarantee an optimal result. In practice, PLINK --rel-cutoff does yield a maximum set whenever there aren't too many intertwined close relations, and it outperforms GCTA --grm-cutoff when there are (we chose our greedy algorithm carefully); but if you want to try to beat both programs, use the --make-rel and --keep/--remove flags and patch your preferred approximation algorithm in between. (We may add one or two levels of backtracking to our --rel-cutoff if its level of imperfection becomes problematic.)
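If you do want to patch in your own approximation, the Python sketch below shows one generic greedy heuristic: repeatedly drop the sample with the most remaining relatives above the cutoff. This is not PLINK's actual algorithm (which is not described here); it is only a starting point, and greedy_prune is a hypothetical name.

```python
def greedy_prune(rel, cutoff=0.025):
    """Return indices of samples kept after removing one member of each related pair.

    rel is a square (or lower-triangular) matrix of relationship values;
    only entries rel[i][j] with j < i are read.
    """
    n = len(rel)
    related = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i):
            if rel[i][j] > cutoff:
                related[i].add(j)
                related[j].add(i)
    kept = set(range(n))
    while True:
        # Sample with the most remaining relatives; ties go to the higher index.
        worst = max(kept, key=lambda s: (len(related[s]), s), default=None)
        if worst is None or not related[worst]:
            break
        kept.remove(worst)
        for other in related.pop(worst):
            related[other].discard(worst)
    return sorted(kept)
```

The kept indices can then be mapped back to sample IDs and passed to --keep.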

Note that, while it is possible to use --rel-cutoff on a previously calculated relationship matrix by combining it with --grm-gz/--grm-bin (like how GCTA --grm-cutoff is used), we do not expect that to be the typical workflow.

Distributed computation

--make-rel and --make-grm-gz/--make-grm-bin jobs can be subdivided with the --parallel flag.

However, --rel-cutoff cannot run concurrently with parallel relationship matrix evaluation; instead, it must act on the final assembled matrix. This is the primary use case for --grm-gz/--grm-bin.

Distance-phenotype analysis

Case/control

--ibs-test [permutation count]

--groupdist [iteration count] [d]

--ibs-test and --groupdist consider three subsets of the distance matrix: pairs of affected samples, affected-unaffected pairs, and pairs of unaffected samples. Each of these subsets has a distribution of pairwise genomic distances; --ibs-test uses permutation to estimate p-values re: which types of pairs are most similar (see here for details), while --groupdist focuses on the differences between the centers of these distributions and estimates standard errors via delete-d jackknife.

To perform this type of analysis with scalar phenotype data, you may combine --ibs-test/--groupdist with the --tail-pheno flag. However, the distance-phenotype regression described next should be more informative.

If --ibs-test is run with no parameters, 100000 permutations are used. If --groupdist is run with fewer than two parameters, d is set to <number of people>^0.6 rounded down; with no parameters, 100000 jackknife iterations are run.

When combining these commands with --read-dists, units must match: "--distance triangle bin ibs" goes with --ibs-test, while "--distance triangle bin" goes with --groupdist.
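The permutation idea behind --ibs-test can be sketched in Python as follows. This toy version tests only one question (are affected-affected pairs more similar than expected by chance?) by shuffling which samples count as affected; the function names, the single test statistic, and the (hits + 1) / (permutations + 1) p-value convention are assumptions for illustration, not a description of PLINK's output.

```python
import random

def mean_pair_value(ibs, members):
    """Average IBS over all distinct pairs drawn from members."""
    total, count = 0.0, 0
    for a in range(len(members)):
        for b in range(a):
            total += ibs[members[a]][members[b]]
            count += 1
    return total / count

def case_pair_permutation_p(ibs, is_case, permutations=1000, seed=0):
    """Empirical p-value for 'affected pairs are unusually similar'."""
    rng = random.Random(seed)
    cases = [i for i, flag in enumerate(is_case) if flag]
    observed = mean_pair_value(ibs, cases)
    indices = list(range(len(is_case)))
    hits = 0
    for _ in range(permutations):
        # Reassign affected status to a random subset of the same size.
        shuffled_cases = rng.sample(indices, len(cases))
        if mean_pair_value(ibs, shuffled_cases) >= observed:
            hits += 1
    return (hits + 1) / (permutations + 1)
```

At least two affected samples are needed for the pairwise mean to be defined.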

Distance-QT regression

--regress-distance [iteration count] [d]
--regress-rel [iteration count] [d]

These flags perform simple linear regressions and evaluate delete-d jackknife standard error estimates. --regress-distance regresses genomic distances on pairwise average phenotypes and vice versa, while --regress-rel regresses genomic relationships on pairwise average phenotypes and vice versa.
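One direction of the regression (genomic distance on pairwise average phenotype) can be sketched in Python as below. It is a plain least-squares fit over all sample pairs; the function name is hypothetical and the jackknife step is omitted.

```python
def regress_distance_on_mean_pheno(dist, pheno):
    """Fit dist[i][j] = intercept + slope * mean(pheno[i], pheno[j]) over all pairs j < i."""
    xs, ys = [], []
    for i in range(len(pheno)):
        for j in range(i):
            xs.append((pheno[i] + pheno[j]) / 2.0)
            ys.append(dist[i][j])
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Swapping the roles of xs and ys gives the "vice versa" regression.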

With fewer than two parameters, d is set to <number of people>^0.6 rounded down. With no parameters, 100000 jackknife iterations are run.
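For intuition, here is a small Python sketch of a delete-d jackknife standard error with the default d described above. The statistic is any function of a list of samples; the (n - d) / d scaling is the textbook delete-d jackknife variance factor, and whether PLINK applies exactly this scaling is an assumption. All names are illustrative.

```python
import math
import random

def default_d(n):
    """Number of people raised to the power 0.6, rounded down."""
    return int(math.floor(n ** 0.6))

def delete_d_jackknife_se(samples, statistic, iterations=1000, d=None, seed=0):
    """Estimate the standard error of statistic(samples) by repeatedly deleting d samples."""
    n = len(samples)
    if d is None:
        d = default_d(n)
    rng = random.Random(seed)
    estimates = []
    for _ in range(iterations):
        dropped = set(rng.sample(range(n), d))
        kept = [s for i, s in enumerate(samples) if i not in dropped]
        estimates.append(statistic(kept))
    center = sum(estimates) / len(estimates)
    spread = sum((e - center) ** 2 for e in estimates) / len(estimates)
    return math.sqrt((n - d) / d * spread)

print(default_d(1000))
```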

A previously calculated triangular binary distance matrix can be loaded as input to --regress-distance using --read-dists. There is currently no similar shortcut for --regress-rel.
