D: 24 Jan 2020

Sample-distance and similarity matrices

Relationship/covariance

--make-rel ['cov'] ['meanimpute'] [{square | square0 | triangle}]
           [{zs | bin | bin4}]

--make-rel is the primary interface to PLINK's realized relationship matrix and covariance matrix calculator. (See Yang J, Lee SH, Goddard ME, Visscher PM (2011) GCTA: A Tool for Genome-wide Complex Trait Analysis for discussion of relationship matrix definition and usage.)

Output format
By default, --make-rel causes a lower-triangular tab-delimited text file to be written to plink2.rel and a list of corresponding sample IDs to plink2.rel.id.

  • The 'square', 'square0', and 'triangle' modifiers affect the shape of the output matrix. 'square' yields a symmetric matrix; 'triangle' (normally the default) yields a lower-triangular matrix where the first row contains only the <sample #1-sample #1> relationship, the second row has the <sample #1-sample #2> and <sample #2-sample #2> relationships in that order, etc.; and 'square0' yields a square matrix with all cells in the upper right triangle zeroed out.
  • The 'bin' modifier causes the matrix to be written to plink2.rel.bin using little-endian IEEE-754 double encoding (suitable for loading from R). When using 'bin', the default output shape is 'square' instead of 'triangle'.
  • 'bin4' uses IEEE-754 single-precision encoding, and is otherwise identical to 'bin'. This saves disk space, but you'll need to specify 4-byte single-precision input for your next analysis step. The following does so in R:
    readBin('<filename>', what="numeric", n=<number of entries>, size=4)
  • As usual, 'zs' requests Zstd compression of the matrix file. Note that it cannot be combined with binary output, since general-purpose compression is much less effective in that context.
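
For readers working in Python rather than R, the following is a minimal sketch of decoding a 'bin' or 'bin4' matrix using only the standard library. It assumes the default 'square' shape, row-by-row storage, and that the sample count is already known from the accompanying .id file. The function names are hypothetical and not part of PLINK.

```python
import struct


def parse_square_matrix(raw, n_samples, single_precision=False):
    """Decode little-endian IEEE-754 values into a list of n_samples rows.

    single_precision=False corresponds to 'bin' (8-byte doubles);
    single_precision=True corresponds to 'bin4' (4-byte floats).
    """
    code, width = ("f", 4) if single_precision else ("d", 8)
    n_values = n_samples * n_samples
    expected = n_values * width
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    flat = struct.unpack(f"<{n_values}{code}", raw)
    # Assumption: values are stored one matrix row after another.
    return [list(flat[i * n_samples:(i + 1) * n_samples])
            for i in range(n_samples)]


def read_square_matrix(path, n_samples, single_precision=False):
    """Read a whole matrix file from disk and decode it."""
    with open(path, "rb") as f:
        return parse_square_matrix(f.read(), n_samples, single_precision)
```

The explicit size check catches the most likely mistake, which is reading a 'bin4' file as if it were 'bin' or vice versa.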

Variance-standardization
By default, the sample covariance for each allele is divided by the variant's variance (calculated from observed, or loaded, allele frequencies). (As a consequence, it is critical to filter out very-low-MAF variants before performing the default computation.) To disable this and calculate a straight covariance matrix, use the 'cov' modifier.
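
To make the role of the variance term concrete, here is a minimal sketch of the biallelic relationship formula discussed in the GCTA paper cited above. It assumes hard-call genotypes coded as 0/1/2 allele counts, no missing values, and allele frequencies estimated from the data itself. It is an illustration only; PLINK 2's actual implementation also handles dosages, multiallelic variants, and missing data. The function name and the treatment of monomorphic variants are choices made for this sketch.

```python
def relationship_matrix(genotypes, standardize=True):
    """genotypes: list of variants, each a list of per-sample allele counts.

    standardize=True divides each variant's contribution by 2p(1-p);
    standardize=False mimics the idea behind the 'cov' modifier.
    """
    n = len(genotypes[0])
    rel = [[0.0] * n for _ in range(n)]
    used = 0
    for g in genotypes:
        p = sum(g) / (2 * n)           # observed allele frequency
        variance = 2 * p * (1 - p)     # expected genotype variance
        if variance == 0:
            continue                   # sketch choice: skip monomorphic variants
        centered = [x - 2 * p for x in g]
        scale = variance if standardize else 1.0
        for j in range(n):
            for k in range(j + 1):
                rel[j][k] += centered[j] * centered[k] / scale
        used += 1
    for j in range(n):
        for k in range(j + 1):
            if used:
                rel[j][k] /= used
            rel[k][j] = rel[j][k]      # fill the upper triangle
    return rel
```

Because each contribution is divided by 2p(1-p), a variant with a very small minor allele frequency can dominate the sum, which is why low-MAF filtering matters for the default computation.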

Distributed computation
--make-rel jobs using the 'square0' or 'triangle' output shapes can be subdivided with the --parallel flag. (This is why the 'square0' mode exists.)
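
One way to picture why the triangular shapes split cleanly: in a lower-triangular matrix, row i holds i cells, so disjoint row ranges can be handed to separate jobs. The sketch below shows a hypothetical balancing scheme that gives each job roughly the same number of cells. It is not PLINK's actual partitioning, and the function name is invented for illustration.

```python
def triangle_row_blocks(n_samples, n_jobs):
    """Split rows of a lower-triangular matrix into contiguous blocks.

    Returns a list of (start_row, end_row) half-open ranges, one per job,
    chosen so that each job covers a similar number of matrix cells.
    """
    total_cells = n_samples * (n_samples + 1) // 2
    blocks = []
    row = 0
    cells_so_far = 0
    for job in range(1, n_jobs + 1):
        start = row
        target = total_cells * job // n_jobs
        # Row index `row` (0-based) contains row + 1 cells.
        while row < n_samples and cells_so_far < target:
            row += 1
            cells_so_far += row
        blocks.append((start, row))
    return blocks
```

Later blocks contain fewer rows than earlier ones, since rows near the bottom of the triangle are longer.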

Other notes:

  • This calculation is not LD-sensitive; if that's a problem, an alternative is Doug Speed et al.'s LDAK software.
  • Dosages are used when available.
  • Multiallelic variants are handled properly; as of 30 Dec 2019, PLINK 2 no longer collapses all minor allele dosages together. (The sum over variants in the numerator of the original biallelic formula is replaced by a sum over alleles, and the denominator is doubled.)
  • By default, mean-imputation is not performed for missing values, and we generally recommend using dedicated imputation software instead. However, "--pca approx" is based on the relationship matrix with mean-imputed values, and in practice this has been good enough for --pca's usual applications when the missingness rate isn't too high. To force mean-imputation here, add the 'meanimpute' modifier.
  • Special handling of the diagonal is no longer supported.

Exporting to GCTA

--make-grm-list ['cov'] ['meanimpute'] ['zs'] [{id-header | iid-only}]
--make-grm-bin ['cov'] ['meanimpute'] [{id-header | iid-only}]

--make-grm-list and --make-grm-bin perform the same calculation as --make-rel (so the 'cov' and 'meanimpute' modifiers have the same effect), but produce a .grm or .grm.bin-format file for GCTA to process.

These computations can be subdivided with --parallel.

KING-robust kinship estimator

The relationship matrix computed by --make-rel/--make-grm-list/--make-grm-bin can be used to reliably identify close relations within a single population, if your MAFs are decent. However, Manichaikul et al.'s KING-robust estimator can also be mostly trusted on mixed-population datasets (with one uncommon exception noted below), and doesn't require MAFs at all. Therefore, we have added this computation to PLINK 2, and the relationship-based pruner is now based on KING-robust.

The exception is that KING-robust underestimates kinship when the parents are from very different populations. You may want to have some special handling of this case; --pca can help detect it.

Note that KING kinship coefficients are scaled such that duplicate samples have kinship 0.5, not 1. First-degree relations (parent-child, full siblings) correspond to ~0.25, second-degree relations correspond to ~0.125, etc. It is conventional to use a cutoff of ~0.354 (the geometric mean of 0.5 and 0.25) to screen for monozygotic twins and duplicate samples, ~0.177 to add first-degree relations, etc.
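
The cutoff arithmetic above can be reproduced with a short sketch. It treats duplicates and monozygotic twins as "degree 0" with expected kinship 0.5, halves the expectation for each additional degree, and takes the geometric mean of adjacent expectations. The function names are hypothetical.

```python
import math


def expected_kinship(degree):
    """Expected KING kinship: 0.5 for duplicates (degree 0), halved per degree."""
    return 0.5 / 2 ** degree


def cutoff_including(degree):
    """Threshold that keeps relations up to `degree` and excludes degree + 1.

    This is the geometric mean of the two adjacent expected values.
    """
    return math.sqrt(expected_kinship(degree) * expected_kinship(degree + 1))


print(round(cutoff_including(0), 3))  # 0.354: duplicates and MZ twins only
print(round(cutoff_including(1), 3))  # 0.177: also first-degree relations
```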

--make-king [{square | square0 | triangle}] [{zs | bin | bin4}]
--make-king-table ['zs'] ['counts'] ['rel-check'] ['cols='<col. set descrip.>]
--king-table-filter <min. kinship coefficient>
--king-table-subset <.kin0 file> [min. kinship coefficient]

--make-king writes KING-robust coefficients in matrix form to plink2.king[.zst] or plink2.king.bin, while --make-king-table writes them in table form to plink2.kin0[.zst]. (See above for matrix-output options.)

  • Only autosomes are included in this computation.
  • Pedigree information is currently ignored; the between-family estimator is used for all pairs.
  • For multiallelic variants, REF allele counts are used.
  • --make-king jobs with the 'square0' or 'triangle' output shapes and all --make-king-table jobs can be subdivided with --parallel.
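
For intuition, here is a minimal sketch of a between-family KING-robust estimate for one pair of samples, assuming the formula published by Manichaikul et al. It takes biallelic hard-call genotypes coded 0/1/2, with None marking a missing call, and skips variants where either sample is missing. This is an illustration under those assumptions, not PLINK 2's implementation, and the function name is hypothetical.

```python
def king_robust(g1, g2):
    """Between-family KING-robust kinship estimate for two samples."""
    hethet = 0  # both samples heterozygous
    ibs0 = 0    # opposite homozygotes
    het1 = 0    # heterozygous calls in sample 1
    het2 = 0    # heterozygous calls in sample 2
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        if a == 1:
            het1 += 1
        if b == 1:
            het2 += 1
        if a == 1 and b == 1:
            hethet += 1
        elif abs(a - b) == 2:
            ibs0 += 1
    smaller = min(het1, het2)
    if smaller == 0:
        return float("nan")  # estimator undefined without heterozygous calls
    return ((hethet - 2 * ibs0) / (2 * smaller)
            + 0.5
            - (het1 + het2) / (4 * smaller))
```

Note that no allele frequencies appear anywhere in the calculation, and that comparing a sample with itself returns 0.5, matching the scaling described above.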

In addition, with --make-king-table,

  • The 'counts' modifier causes counts rather than 0..1 frequencies to be reported in the output columns that support both.
  • The 'rel-check' modifier causes only same-FID pairs to be reported. (The between-family KING estimator is still used.)
  • --king-table-filter causes only kinship coefficients ≥ the given threshold to be reported.
  • --king-table-subset causes only sample-pairs mentioned in the given .kin0 file (and optionally passing a kinship-coefficient threshold) to be processed. This allows you to start with a screening step which considers all sample pairs but only a small number of variants scattered across the genome (try --maf + --bp-space), and follow up with accurate kinship-coefficient computations for just the sample pairs identified as possible relations during the screening step. (This two-step approach remains practical with millions of samples!)
  • Refer to the file format entry for other output details and optional columns. --make-king-table now covers much of PLINK 1.x --genome's functionality.

See also the original KING software package, which has some useful two-step workflows directly built in, along with handy additional features like pedigree inference.

Relationship-based pruning

--king-cutoff [.king.bin + .king.id fileset prefix] <threshold>

If used in conjunction with a later calculation (see the order of operations page for details), --king-cutoff excludes one member of each pair of samples with kinship coefficient greater than the given threshold. (See above for threshold suggestions.) Alternatively, you can invoke this on its own to write a pruned list of sample IDs to plink2.king.cutoff.in.id, and excluded IDs to plink2.king.cutoff.out.id.

PLINK tries to maximize the final sample size, but this maximum independent set problem is NP-hard, so we use a greedy algorithm which does not guarantee an optimal result. In practice, --king-cutoff does yield a maximum set whenever there aren't too many intertwined close relations, but if you want to try to beat it (or optimize a fancier function that takes the exact kinship-coefficient values into account), use the --make-king and --keep/--remove flags and patch your preferred algorithm in between.
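
If you do want to patch in your own algorithm, the following sketch shows one common greedy heuristic: repeatedly drop the sample with the most remaining relatives until no related pair is left. This is not necessarily the heuristic PLINK uses. The input is assumed to be an iterable of (ID, ID, kinship) tuples that you have extracted from --make-king or --make-king-table output yourself; the function name is hypothetical.

```python
def greedy_prune(pairs, threshold):
    """Return the list of sample IDs to remove.

    pairs: iterable of (id1, id2, kinship) tuples with string IDs.
    A pair counts as related when kinship is greater than threshold.
    """
    neighbors = {}
    for id1, id2, kinship in pairs:
        if kinship > threshold:
            neighbors.setdefault(id1, set()).add(id2)
            neighbors.setdefault(id2, set()).add(id1)
    removed = []
    while True:
        related = [s for s, nb in neighbors.items() if nb]
        if not related:
            break
        # Most relatives first; ties broken by ID so the result is deterministic.
        worst = max(related, key=lambda s: (len(neighbors[s]), s))
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)
        removed.append(worst)
    return removed
```

The returned IDs could then be written to a file and passed to --remove. A fancier objective, such as weighting by the exact kinship values, would replace the key used to choose which sample to drop.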

--king-cutoff usually computes kinship coefficients from scratch. However, you can provide a precomputed kinship-coefficient matrix as input (it must be in --make-king binary format with triangular shape; either precision is fine); this is a time-saver when experimenting with different thresholds.
