Introduction, downloads

D: 17 Jun 2024

Recent version history

What's new?

Coming next

[Jump to search box]

General usage

Getting started

Flag usage summaries

Column set descriptors

Citation instructions

Standard data input

PLINK 1 binary (.bed)

PLINK 2 binary (.pgen)

Autoconversion behavior

VCF/BCF (.vcf[.gz], .bcf)

Oxford genotype (.bgen)

Oxford haplotype (.haps)

PLINK 1 text (.ped, .tped)

PLINK 1 dosage

Sample ID conversion

Dosage import settings

Generate random

Unusual chromosome IDs

Allele frequencies



'Cluster' import

Reference genome (.fa)

Input filtering

Sample ID file

Variant ID file

Interval-BED file




SNPs only

Simple variant window

Multiple variant ranges

Deduplicate variants

Sample/variant thinning

Pheno./covar. condition


Category subset


Missing genotypes

Number of distinct alleles

Allele frequencies/counts


Imputation quality


Founder status

Main functions

Data management




















Basic statistics










Pairwise diffs



Linkage disequilibrium




Sample-distance matrices





Population stratification


PCA projection

Association analysis


--glm ERRCODE values



Report postprocessing


Linear scoring



Distributed computation

Command-line help


Flag/parameter reuse

System resource usage


.zst decompression

Pseudorandom numbers

Warnings as errors

.pgen validation


1000 Genomes phase 3


FASTA files

Errors and warnings

Output file list

Order of operations

Developer information

GitHub root

Python library

R library


Adding new functionality

Discussion forums


File formats

Quick index search

Population stratification

--pca [count] [{approx | meanimpute}] ['scols='<col set descrip.>]
--pca [{allele-wts | biallelic-var-wts}] [count] [{approx | meanimpute}]
      ['vzs'] ['scols='<col set descrip.>] ['vcols='<col set descrip.>]

--pca extracts top principal components from the variance-standardized relationship matrix computed by --make-rel/--make-grm-{bin,list}. The main plink2.eigenvec output file can be read by --covar, and can be used to correct for population stratification in --glm regressions...

  • ...assuming that the top principal components in your genomic dataset actually reflect broad population structure, rather than genotyping/sequencing-batch-related error patterns, small-scale family structure or sample duplication, crazy outliers... The .eigenvec file can be easily loaded and plotted in R; this should help you find significant batch effects and outliers. --king-cutoff removes duplicate samples and close relations.
  • Since this is based on the relationship matrix, it is critical to remove very-low-MAF variants before performing this computation.
  • LD pruning (using e.g. --indep-pairwise) reduces the risk of getting PCs based on just a few genomic regions, and tends to prevent deflation of --glm test statistics.

Technical details:

  • By default, 10 PCs are extracted; you can adjust this by passing a numeric parameter.
    • This was reduced from PLINK 1.9's default of 20, since (i) the randomized algorithm would otherwise require ~4x as much memory, and (ii) in practice, 10 PCs has been effective across a wide range of studies.
  • The 'approx' modifier causes the standard deterministic computation to be replaced with the randomized algorithm originally implemented for Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL (2016) Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. This can be a good idea when you have >5000 samples, and is almost required once you have >50000.
    • The primary memory allocations during "--pca approx" add up to
      Nsample * NPC * (NPC+1) * 16 + Nvariant * NPC * (NPC+1) * 16 + <larger of previous two terms> + 5760 * Nsample
    • If substantially more memory and threads are available, PLINK 2 will attempt to use them to speed up the calculation. (The effectiveness of this is highly situational, but it shouldn't hurt.)
  • The randomized algorithm always mean-imputes missing genotype calls. For comparison purposes, you can use the 'meanimpute' modifier to request this behavior for the standard computation.
  • 'scols=' can be used to customize how sample IDs appear in the .eigenvec file. (maybefid, fid, maybesid, and sid column sets are supported; the default is maybefid,maybesid.)
  • The 'allele-wts' modifier requests an additional one-line-per-allele .eigenvec.allele file with PCs expressed as allele weights instead of sample scores. When it's present, 'vzs' causes the .eigenvec.allele file to be Zstd-compressed.
    'vcols=' can be used to customize the .eigenvec.allele report columns; refer to the file format entry for details.
  • If all your variants are biallelic, you can instead use the 'biallelic-var-wts' modifier to request the old .eigenvec.var format instead.
  • Given an allele-weight or variant-weight file, you can now use --score for PCA projection. This replaces PLINK 1.9's --pca-clusters/--pca-cluster-names projection flags.

You may also want to look at EIGENSOFT 7, which has additional features like automatic outlier removal, LD regression, and Tracy-Widom significance testing of PCs.

Association analysis >>