Introduction, downloads

D: 4 Oct 2024

Population stratification

--pca [count] [{approx | meanimpute}] ['scols='<col set descrip.>]
--pca [{allele-wts | biallelic-var-wts}] [count] [{approx | meanimpute}]
      ['vzs'] ['scols='<col set descrip.>] ['vcols='<col set descrip.>]

--pca extracts top principal components from the variance-standardized relationship matrix computed by --make-rel/--make-grm-{bin,list}. The main plink2.eigenvec output file can be read by --covar, and can be used to correct for population stratification in --glm regressions...

  • ...assuming that the top principal components in your genomic dataset actually reflect broad population structure, rather than genotyping/sequencing-batch-related error patterns, small-scale family structure or sample duplication, or extreme outliers. The .eigenvec file can be easily loaded and plotted in R; this should help you find significant batch effects and outliers. --king-cutoff removes duplicate samples and close relations.
  • Since this is based on the variance-standardized relationship matrix, which gives rare variants disproportionately large weights, it is critical to remove very-low-MAF variants before performing this computation.
  • LD pruning (using e.g. --indep-pairwise) reduces the risk of getting PCs based on just a few genomic regions, and tends to prevent deflation of --glm test statistics.
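The text above suggests loading the .eigenvec file in R to look for outliers; the same check can be sketched in Python. This is a minimal illustration, not part of PLINK 2. It assumes a tab-delimited file whose header line starts with '#' and whose principal-component columns are named PC1, PC2, and so on. The function names, the sample data, and the standard-deviation threshold are all hypothetical choices made for this example.

```python
import csv
import io
import statistics

# Hypothetical example data in the assumed .eigenvec layout (tab-delimited,
# header line starting with '#', ID columns first, then PC columns).
EXAMPLE_EIGENVEC = (
    "#FID\tIID\tPC1\tPC2\n"
    "F1\tS1\t0.01\t0.1\n"
    "F2\tS2\t0.02\t-0.1\n"
    "F3\tS3\t-0.01\t0.1\n"
    "F4\tS4\t0.0\t-0.1\n"
    "F5\tS5\t-0.02\t0.1\n"
    "F6\tS6\t0.9\t-0.1\n"
)


def read_eigenvec(text):
    """Parse .eigenvec text into (sample_ids, pc_names, scores)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")
    # Assumption: every column whose name starts with "PC" is a score column;
    # all other columns are treated as sample ID columns.
    pc_cols = [i for i, name in enumerate(header) if name.startswith("PC")]
    id_cols = [i for i in range(len(header)) if i not in pc_cols]
    ids, scores = [], []
    for row in reader:
        if not row:
            continue
        ids.append(tuple(row[i] for i in id_cols))
        scores.append([float(row[i]) for i in pc_cols])
    return ids, [header[i] for i in pc_cols], scores


def flag_outliers(ids, pc_names, scores, n_sd=6.0):
    """Return (sample_id, pc_name, value) for scores more than n_sd
    population standard deviations away from that PC's mean."""
    flagged = []
    for j, name in enumerate(pc_names):
        column = [row[j] for row in scores]
        mean = statistics.fmean(column)
        sd = statistics.pstdev(column)
        if sd == 0:
            continue  # constant column; nothing to flag
        for sample_id, value in zip(ids, column):
            if abs(value - mean) > n_sd * sd:
                flagged.append((sample_id, name, value))
    return flagged


ids, pc_names, scores = read_eigenvec(EXAMPLE_EIGENVEC)
# With only six samples a z-score cannot exceed sqrt(5), so the demo uses a
# low threshold; real datasets would normally use a much larger one.
for sample_id, pc_name, value in flag_outliers(ids, pc_names, scores, n_sd=2.0):
    print(sample_id, pc_name, value)
```

A flagged sample is only a candidate for closer inspection. Whether it reflects a batch effect, a duplicate, or real population structure still has to be judged from plots and sample metadata.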

Technical details:

  • By default, 10 PCs are extracted; you can adjust this by passing a numeric parameter.
    • This was reduced from PLINK 1.9's default of 20, since (i) the randomized algorithm would otherwise require ~4x as much memory, and (ii) in practice, 10 PCs has been effective across a wide range of studies.
  • The 'approx' modifier causes the standard deterministic computation to be replaced with the randomized algorithm originally implemented for Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL (2016) Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. This can be a good idea when you have >5000 samples, and is almost required once you have >50000.
    • The primary memory allocations during "--pca approx" add up to
      Nsample * NPC * (NPC+1) * 16 + Nvariant * NPC * (NPC+1) * 16 + <larger of previous two terms> + 5760 * Nsample
      bytes.
    • If substantially more memory and threads are available, PLINK 2 will attempt to use them to speed up the calculation. (The effectiveness of this is highly situational, but it shouldn't hurt.)
  • The randomized algorithm always mean-imputes missing genotype calls. For comparison purposes, you can use the 'meanimpute' modifier to request this behavior for the standard computation.
  • 'scols=' can be used to customize how sample IDs appear in the .eigenvec file. (maybefid, fid, maybesid, and sid column sets are supported; the default is maybefid,maybesid.)
  • The 'allele-wts' modifier requests an additional one-line-per-allele .eigenvec.allele file with PCs expressed as allele weights instead of sample scores. When it's present, 'vzs' causes the .eigenvec.allele file to be Zstd-compressed.
    'vcols=' can be used to customize the .eigenvec.allele report columns; refer to the file format entry for details.
  • If all your variants are biallelic, you can use the 'biallelic-var-wts' modifier to request the old .eigenvec.var format instead.
  • Given an allele-weight or variant-weight file, you can now use --score for PCA projection. This replaces PLINK 1.9's --pca-clusters/--pca-cluster-names projection flags.
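The "--pca approx" memory formula given above can be evaluated directly when planning a run. The following is a small sketch that only transcribes that formula. The function name and the example dataset sizes are hypothetical, and the result covers only the primary allocations the formula describes, not PLINK 2's total memory use.

```python
def pca_approx_primary_bytes(n_sample, n_variant, n_pc=10):
    """Evaluate the documented primary-allocation formula for '--pca approx'.

    n_pc defaults to 10, matching the default PC count described above.
    """
    sample_term = n_sample * n_pc * (n_pc + 1) * 16
    variant_term = n_variant * n_pc * (n_pc + 1) * 16
    return (sample_term + variant_term
            + max(sample_term, variant_term)
            + 5760 * n_sample)


# Hypothetical dataset: 100,000 samples and 500,000 variants.
for n_pc in (10, 20):
    n_bytes = pca_approx_primary_bytes(100_000, 500_000, n_pc)
    print(f"{n_pc} PCs: {n_bytes / 2**30:.2f} GiB")
```

Because the first three terms scale with NPC * (NPC+1), going from 10 to 20 PCs multiplies them by 420/110, or about 3.8. This matches the "~4x as much memory" figure quoted for the old default of 20 PCs; the 5760 * Nsample term does not depend on the PC count.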

You may also want to look at EIGENSOFT 7, which has additional features like automatic outlier removal, LD regression, and Tracy-Widom significance testing of PCs.
