Introduction, downloads

S: 11 Dec 2023 (b7.2)

D: 11 Dec 2023

Population stratification

Clustering

--cluster ['cc'] [{group-avg | old-tiebreaks}] ['missing'] ['only2']

--cluster uses IBS values calculated via "--distance ibs"/--ibs-matrix/--genome to perform complete linkage clustering. The clustering process can be customized in a variety of ways.

  • By default, everyone starts in their own cluster. However, if the --within or --family flag is present, that cluster assignment is used as the starting point instead.
  • The 'cc' modifier prevents two all-case or two all-control clusters from being merged. For consistency with --mcc, missing-phenotype samples are treated as controls (this is a change from PLINK 1.07).
  • By default, the distance between two clusters is defined as the maximum pairwise distance between a member of the first cluster and a member of the second cluster. The 'group-avg' modifier causes average pairwise distance to be used instead.
  • The 'missing' modifier causes clustering to be based on identity-by-missingness instead of identity-by-state. It also causes an identity-by-missingness matrix to be written to plink.mdist.missing. The ID order in this file is identical to that in the .cluster3.missing file.
  • By default, three files describing the dendrogram are produced. The 'only2' modifier causes only the .cluster2 file to be written; this file describes just the final cluster configuration, and can be read with --within.
  • When equal IBS values are present, PLINK 1.9 normally does not try to break the tie in the same manner as PLINK 1.07, so the final cluster solutions tend to differ. This is generally harmless—there is no a priori reason to prefer one tiebreak scheme to the other. However, for testing purposes, you can use the 'old-tiebreaks' modifier to force PLINK 1.9 to emulate the old algorithm. (Due to floating point imprecision, it would be pointless to use this with 'group-avg'.)

--cluster automatically launches an appropriate IBS calculation when necessary, so you don't have to use it with --distance/--ibs-matrix/--genome unless you want to save the distance matrix to disk.
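The default and 'group-avg' linkage rules described above can be illustrated with a minimal Python sketch. This is not PLINK's implementation: the function names, the toy distance matrix, and the brute-force search are assumptions made for illustration, and none of the tiebreak behavior is modeled.

```python
from itertools import combinations

def cluster_distance(dist, a, b, group_avg=False):
    """Linkage distance between clusters a and b (lists of sample indices)."""
    pairs = [dist[i][j] for i in a for j in b]
    # Default: maximum pairwise distance (complete linkage).
    # group_avg=True: average pairwise distance instead.
    return sum(pairs) / len(pairs) if group_avg else max(pairs)

def agglomerate(dist, k=1, group_avg=False):
    """Repeatedly merge the closest pair of clusters until only k remain."""
    clusters = [[i] for i in range(len(dist))]  # everyone starts alone
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(
                dist, clusters[p[0]], clusters[p[1]], group_avg
            ),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: combinations() guarantees x < y
    return clusters

# Hypothetical distances (e.g. 1 - IBS) for four samples.
dist = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.8, 0.2, 0.0],
]
result = agglomerate(dist, k=2)
```

Here samples 0 and 1 merge first, then samples 2 and 3, leaving two clusters.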

Reusing an IBS/IBD calculation

--read-genome <filename>

--read-genome lets you use the (possibly gzipped) results of a previous --genome run as the basis for clustering, instead of recomputing IBS and PPC test results from scratch. If any pair of samples is missing from the input file, an error is reported.

You can also invoke --read-dists to reuse the results of a "--distance triangle bin ibs" run. If --read-dists and --ppc are present in the same run, PPC test p-values are calculated from scratch (or loaded via --read-genome) while distances are loaded from the .mibs.bin file.

Adding clustering constraints

--ppc <minimum p-value>

--mc <maximum cluster size>
--mcc <maximum cases/cluster> <maximum controls/cluster>
--K <minimum final cluster count>
--ibm <minimum identity-by-missingness>

  • --ppc prevents two clusters from being merged if, for any cross-cluster pair of samples, the --genome PPC test p-value is below the given threshold.
  • --mc prevents two clusters from being merged if the new cluster would be larger than the given size.
  • --mcc prevents two clusters from being merged if the new cluster would contain too many cases or controls. Missing-phenotype samples are treated as controls.
  • --K stops cluster merging once there are no more than the given number of clusters remaining.
  • --ibm prevents two clusters from being merged if any cross-cluster pair of samples has identity-by-missingness smaller than the given value.

If the initial cluster assignment violates any of these constraints, a warning will be printed.
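As a rough illustration of how these constraints veto a candidate merge, the following Python sketch checks each rule in turn. The function name, argument layout, and matrix representation are hypothetical; only the rules themselves come from the flag descriptions above.

```python
def merge_allowed(a, b, is_case, mc=None, mcc=None,
                  ibm=None, min_ibm=None, ppc=None, min_ppc=None):
    """Return True if clusters a and b (lists of sample indices) may merge.

    is_case[i] is True for cases; missing phenotypes count as controls.
    ibm[i][j] and ppc[i][j] hold pairwise identity-by-missingness values
    and PPC test p-values (hypothetical in-memory representation).
    """
    merged = a + b
    # --mc: cap on the size of the merged cluster.
    if mc is not None and len(merged) > mc:
        return False
    # --mcc: separate caps on cases and controls.
    if mcc is not None:
        max_cases, max_controls = mcc
        cases = sum(1 for i in merged if is_case[i])
        if cases > max_cases or len(merged) - cases > max_controls:
            return False
    # --ibm: every cross-cluster pair must reach the minimum IBM.
    if min_ibm is not None:
        if any(ibm[i][j] < min_ibm for i in a for j in b):
            return False
    # --ppc: no cross-cluster pair may have a p-value below the threshold.
    if min_ppc is not None:
        if any(ppc[i][j] < min_ppc for i in a for j in b):
            return False
    return True

# Toy data: samples 0 and 1 are cases, 2 and 3 are controls.
is_case = [True, True, False, False]
ibm = [[1.0] * 4 for _ in range(4)]
ibm[0][3] = ibm[3][0] = 0.8
```

A clustering loop would call a check like this before each merge and skip any pair of clusters for which it returns False.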

--match <filename> [missing value]
--match-type <filename>
--qmatch <filename> [missing value]
--qt <filename>

Given a file where each line has the following fields:

  1. Family ID
  2. Within-family ID
  3. Covariate 1
  4. Covariate 2
  ...
  M+2. Covariate M

--match prevents any pair of samples which differ on at least one covariate from being merged into the same cluster. If you provide a second parameter, all covariates with that value are treated as missing (i.e. they don't induce any merge restrictions).

To instead force members of the same cluster to differ on some or all of these covariates, you can combine --match with --match-type. Its input file should contain a single line with up to M fields, each of which is '0', '1', or '-1' (or equivalently, '-', '+', or '*'); '0'/'-' entries specify "negative matches" (samples with equal covariate values cannot be in the same cluster), '1'/'+' entries specify "positive matches" (samples with differing covariate values cannot be in the same cluster), and '-1'/'*' indicates the covariate should be ignored. Thus, using --match without --match-type is equivalent to loading a --match-type file with M '1's.
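The positive/negative/ignore logic can be sketched as a pairwise check in Python. The function name and the list-based representation of covariates are assumptions for illustration; this is not PLINK's code.

```python
def match_allows(cov_a, cov_b, match_types=None, missing=None):
    """Return True if two samples may share a cluster under --match rules.

    cov_a, cov_b: covariate values (strings) for the two samples.
    match_types:  one entry per covariate: '1'/'+' (positive match),
                  '0'/'-' (negative match), or '-1'/'*' (ignore).
                  None means all-positive, i.e. --match on its own.
    missing:      covariate value treated as missing (no restriction).
    """
    if match_types is None:
        match_types = ["1"] * len(cov_a)
    # zip() stops at the shortest input, so covariates without a
    # match-type entry impose no restriction.
    for x, y, t in zip(cov_a, cov_b, match_types):
        if t in ("-1", "*"):
            continue
        if missing is not None and missing in (x, y):
            continue
        if t in ("1", "+") and x != y:
            return False  # positive match: values must agree
        if t in ("0", "-") and x == y:
            return False  # negative match: values must differ
    return True
```

For example, with match types `["1", "0"]`, two samples with covariates `["EUR", "M"]` and `["EUR", "F"]` may be merged, while two samples that both have `["EUR", "M"]` may not.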

To enforce within-cluster similarity (but not uniformity) on some quantitative trait(s), you can use --qmatch in combination with --qt. In this case, the --qmatch input file has the same structure as a --match input file (with the additional restriction that all covariates must be numeric), while the --qt input file should contain up to M lines with a single nonnegative tolerance per line (or '-1' to specify that the covariate should be ignored). Merges involving any pair of samples which differ by more than the tolerance for any --qmatch covariate will not be permitted.

For backwards compatibility, if no second parameter is provided to --qmatch, the --missing-phenotype value (default '-9') is still treated as missing.
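A similar pairwise check can illustrate the --qmatch/--qt tolerance rule. In this sketch the function name is hypothetical, and it assumes that a missing value on either sample imposes no restriction for that covariate, by analogy with --match.

```python
def qmatch_allows(q_a, q_b, tolerances, missing="-9"):
    """Return True if two samples may share a cluster under --qmatch rules.

    q_a, q_b:   quantitative covariates for the two samples, as strings.
    tolerances: one nonnegative tolerance per covariate, or -1 to ignore.
    missing:    value treated as missing (assumed to impose no restriction).
    """
    for x, y, tol in zip(q_a, q_b, tolerances):
        if tol < 0:
            continue  # '-1' in the --qt file: covariate ignored
        if missing in (x, y):
            continue
        if abs(float(x) - float(y)) > tol:
            return False  # difference exceeds the allowed tolerance
    return True
```

With tolerances `[5.0, 0.05]`, samples with covariates `["30", "1.5"]` and `["34", "1.6"]` are kept apart because the second covariate differs by more than 0.05; changing the second tolerance to -1 removes that restriction.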

If there are fewer than M entries in the --match-type/--qt file, the trailing fields in the --match/--qmatch file are ignored.

--match and --qmatch can be used in the same run (in which case their input files don't have to contain the same number of covariates).

If the initial cluster assignment violates a --match or --qmatch constraint, a warning will be printed.

Dimension reduction

PLINK 1.9 provides two dimension reduction routines: --pca, for principal components analysis (PCA) based on the variance-standardized relationship matrix, and --mds-plot, for multidimensional scaling (MDS) based on raw Hamming distances. Top principal components are generally used as covariates in association analysis regressions to help correct for population stratification, while MDS coordinates help with visualizing genetic distances.

--pca [count] ['header'] ['tabs'] ['var-wts']

--pca-cluster-names <name(s)...>
--pca-clusters <filename>

By default, --pca extracts the top 20 principal components of the variance-standardized relationship matrix; you can change the number by passing a numeric parameter. Eigenvectors are written to plink.eigenvec, and top eigenvalues are written to plink.eigenval. The 'header' modifier adds a header line to the .eigenvec file(s), and the 'tabs' modifier makes the .eigenvec file(s) tab- instead of space-delimited. You can request variant weights with the 'var-wts' modifier, and dump the matrix by using --pca in combination with --make-rel/--make-grm-gz/--make-grm-bin.
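For downstream use, the eigenvector output can be loaded with a few lines of Python. This sketch assumes each row holds a family ID, a within-family ID, and then one value per component, and that a header row (if requested) begins with FID and IID; check the file format documentation before relying on it. The function name and example values are made up.

```python
import io

def read_eigenvec(handle):
    """Parse .eigenvec-style text into {(FID, IID): [PC1, PC2, ...]}."""
    pcs = {}
    for line in handle:
        fields = line.split()  # works for both space- and tab-delimited files
        if not fields:
            continue
        if fields[:2] == ["FID", "IID"]:
            continue  # assumed header line from the 'header' modifier
        pcs[(fields[0], fields[1])] = [float(x) for x in fields[2:]]
    return pcs

# Made-up example content; a real run would open plink.eigenvec instead.
example = io.StringIO(
    "FID IID PC1 PC2\n"
    "fam1 s1 0.0123 -0.0456\n"
    "fam2 s2 -0.0789 0.0101\n"
)
pcs = read_eigenvec(example)
```

Because `str.split()` with no argument splits on any whitespace, the same parser handles output written with or without the 'tabs' modifier.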

This is a simple port of GCTA's --pca flag, which generates the same files from a previously computed relationship matrix. For more full-featured principal component analysis, including automatic outlier removal, high-speed randomized approximation for very large datasets, and LD regression, try EIGENSOFT 6.

If clusters are defined (via --within), you can base the principal components off a subset of samples and then project everyone else onto those PCs with --pca-cluster-names and/or --pca-clusters. --pca-cluster-names accepts a space-delimited sequence of cluster names on the command line, while --pca-clusters takes the name of a file with one cluster name per line. If you also want the MAFs used in the relationship matrix calculation to be based on only samples in those clusters, dump those MAFs in a separate run with --freqx + --keep-cluster-names/--keep-clusters, and then load them during your PCA run with --read-freq.

--mds-plot <dimension count> ['by-cluster'] ['eigendecomp'] ['eigvals']

In combination with --cluster, --mds-plot produces a Haploview-friendly multidimensional scaling report. By default, multidimensional scaling is performed on an inter-sample distance matrix; use the 'by-cluster' modifier to perform it on an inter-cluster distance matrix (calculated by averaging all inter-sample distances for each cluster pair) instead.
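The inter-cluster averaging used by 'by-cluster' can be sketched as follows. The function name and the toy matrix are illustrative assumptions; the sketch covers only the averaging step, not the scaling itself.

```python
def cluster_distance_matrix(dist, clusters):
    """Average all inter-sample distances for each pair of clusters."""
    n = len(clusters)
    out = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
            out[a][b] = out[b][a] = sum(pairs) / len(pairs)
    return out

# Hypothetical inter-sample distances for four samples in two clusters.
sample_dist = [
    [0.0, 0.25, 0.5, 0.75],
    [0.25, 0.0, 0.5, 1.0],
    [0.5, 0.5, 0.0, 0.25],
    [0.75, 1.0, 0.25, 0.0],
]
cluster_dist = cluster_distance_matrix(sample_dist, [[0, 1], [2, 3]])
```

The resulting 2x2 matrix has 0.6875 off the diagonal, the mean of the four cross-cluster distances.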

The default, singular value decomposition-based algorithm is designed to give the same results as PLINK 1.07 and the R cmdscale() function (up to rounding errors and sign flips, anyway). The 'eigendecomp' modifier requests a faster eigendecomposition-based algorithm which yields slightly different results.

The 'eigvals' modifier causes top eigenvalues to be written to plink.mds.eigvals (one per line; first value corresponds to the first dimension in the .mds file, etc.).

Outlier detection diagnostics

--neighbour <n1> <n2>
  (alias: --neighbor)

For each sample, --neighbour looks at genomic distances to the n1th- through n2th-nearest neighbors, and reports how they compare with the same statistics for other samples. See the PLINK 1.07 documentation for discussion of this diagnostic.
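One way to picture this kind of diagnostic is to standardize each sample's distance to its kth-nearest neighbor against the same quantity for all samples. The Python sketch below does that for a single k; the Z-score formulation, the use of the sample standard deviation, and the function name are assumptions for illustration, not a description of PLINK's exact output.

```python
from statistics import mean, stdev

def kth_neighbour_z(dist, k):
    """Z-score of each sample's distance to its k-th nearest neighbour."""
    kth = []
    for i, row in enumerate(dist):
        # Sort distances to all other samples, excluding the sample itself.
        others = sorted(d for j, d in enumerate(row) if j != i)
        kth.append(others[k - 1])
    mu, sd = mean(kth), stdev(kth)
    return [(d - mu) / sd if sd > 0 else 0.0 for d in kth]

# Toy distances: sample 3 is far from everyone else.
toy = [
    [0.0, 0.1, 0.1, 0.9],
    [0.1, 0.0, 0.1, 0.9],
    [0.1, 0.1, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
z = kth_neighbour_z(toy, k=1)
```

A sample whose nearest neighbors are all unusually distant, like sample 3 here, gets a large positive score and is a candidate outlier.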

Note that PLINK 1.9 does not require --neighbour to be used with --cluster.
