Population stratification
--pca [count] [{approx | meanimpute}] ['scols='<col set descrip.>]
--pca [{allele-wts | biallelic-var-wts}] [count] [{approx | meanimpute}]
['vzs'] ['scols='<col set descrip.>] ['vcols='<col set descrip.>]
--pca extracts top principal components from the variance-standardized relationship matrix computed by --make-rel/--make-grm-{bin,list}. The main plink2.eigenvec output file can be read by --covar, and can be used to correct for population stratification in --glm regressions...
- ...assuming that the top principal components in your genomic dataset actually reflect broad population structure, rather than genotyping/sequencing-batch-related error patterns, small-scale family structure or sample duplication, or extreme outliers. The .eigenvec file can be easily loaded and plotted in R; this should help you find significant batch effects and outliers. --king-cutoff removes duplicate samples and close relations.
- Since this is based on the relationship matrix, it is critical to remove very-low-MAF variants before performing this computation.
- LD pruning (using e.g. --indep-pairwise) reduces the risk of getting PCs based on just a few genomic regions, and tends to prevent deflation of --glm test statistics.
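The preparation steps above (low-MAF removal, LD pruning, relatedness filtering) followed by --pca can be chained as separate PLINK 2 runs. The Python sketch below only builds the command lines and does not execute them. The function name, the file prefixes, the 0.01 MAF threshold, the 50/5/0.2 pruning parameters, and the 0.177 KING cutoff are illustrative assumptions rather than recommendations, and --pfile assumes the data is already in PLINK 2 binary format.

```python
import shlex


def pca_workflow(pfile, out, maf=0.01, prune=("50", "5", "0.2"),
                 king_cutoff=0.177, n_pc=10):
    """Build plink2 command lines for a prune -> unrelate -> PCA workflow.

    All thresholds are illustrative placeholders, not recommendations.
    """
    base = ["plink2", "--pfile", pfile]
    # Step 1: drop very-low-MAF variants, then LD-prune what remains.
    # --indep-pairwise writes <out>_pruned.prune.in (variants to keep).
    prune_cmd = base + ["--maf", str(maf),
                        "--indep-pairwise", *prune,
                        "--out", f"{out}_pruned"]
    # Step 2: on the pruned variants, drop duplicates and close relations.
    # --king-cutoff writes <out>_unrel.king.cutoff.in.id (samples to keep).
    king_cmd = base + ["--extract", f"{out}_pruned.prune.in",
                       "--king-cutoff", str(king_cutoff),
                       "--out", f"{out}_unrel"]
    # Step 3: PCA on the pruned variants and the retained samples.
    pca_cmd = base + ["--extract", f"{out}_pruned.prune.in",
                      "--keep", f"{out}_unrel.king.cutoff.in.id",
                      "--pca", str(n_pc),
                      "--out", f"{out}_pca"]
    return [prune_cmd, king_cmd, pca_cmd]


# Print the three commands for a hypothetical dataset prefix.
for cmd in pca_workflow("mydata", "study"):
    print(shlex.join(cmd))
```

With these hypothetical prefixes, the last run would write study_pca.eigenvec, which could then be passed to --covar in a later --glm run.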
Technical details:
- By default, 10 PCs are extracted; you can adjust this by passing a numeric parameter.
- This was reduced from PLINK 1.9's default of 20, since (i) the randomized algorithm would otherwise require ~4x as much memory, and (ii) in practice, 10 PCs have been effective across a wide range of studies.
- The 'approx' modifier causes the standard deterministic computation to be replaced with the randomized algorithm originally implemented for Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, Price AL (2016) Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. This can be a good idea when you have >5000 samples, and is almost required once you have >50000.
- The primary memory allocations during "--pca approx" add up to
Nsample * NPC * (NPC+1) * 16 + Nvariant * NPC * (NPC+1) * 16 + <larger of previous two terms> + 5760 * Nsample
bytes.
- If substantially more memory and threads are available, PLINK 2 will attempt to use them to speed up the calculation. (The effectiveness of this is highly situational, but it shouldn't hurt.)
- The randomized algorithm always mean-imputes missing genotype calls. For comparison purposes, you can use the 'meanimpute' modifier to request this behavior for the standard computation.
- 'scols=' can be used to customize how sample IDs appear in the .eigenvec file. (maybefid, fid, maybesid, and sid column sets are supported; the default is maybefid,maybesid.)
- The 'allele-wts' modifier requests an additional one-line-per-allele .eigenvec.allele file with PCs expressed as allele weights instead of sample scores. When it's present, 'vzs' causes the .eigenvec.allele file to be Zstd-compressed.
'vcols=' can be used to customize the .eigenvec.allele report columns; refer to the file format entry for details.
- If all your variants are biallelic, you can instead use the 'biallelic-var-wts' modifier to request the old .eigenvec.var format.
- Given an allele-weight or variant-weight file, you can now use --score for PCA projection. This replaces PLINK 1.9's --pca-clusters/--pca-cluster-names projection flags.
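The "--pca approx" memory formula listed above can be evaluated before launching a large job. The following Python sketch is a direct transcription of that formula; the function name and argument names are assumptions made for illustration, and the result covers only the primary allocations the formula describes, not PLINK 2's total memory use.

```python
def pca_approx_bytes(n_sample, n_variant, n_pc=10):
    """Estimate the primary "--pca approx" allocations, in bytes."""
    per_unit = n_pc * (n_pc + 1) * 16
    sample_term = n_sample * per_unit
    variant_term = n_variant * per_unit
    # Third term of the formula: the larger of the previous two terms.
    larger_term = max(sample_term, variant_term)
    return sample_term + variant_term + larger_term + 5760 * n_sample


# Example: 100,000 samples, 500,000 variants, default 10 PCs.
estimate = pca_approx_bytes(100_000, 500_000)
print(f"{estimate} bytes (~{estimate / 1e9:.2f} GB)")
```

For this example the formula gives 2,512,000,000 bytes, roughly 2.5 GB. Because the sample and variant terms scale with NPC * (NPC+1), going from 10 to 20 PCs multiplies them by 420/110, which is the source of the "~4x" figure mentioned earlier.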
You may also want to look at EIGENSOFT 7, which has additional features like automatic outlier removal, LD regression, and Tracy-Widom significance testing of PCs.
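For a quick manual check of the .eigenvec output without EIGENSOFT, a crude outlier screen can be sketched in Python as below. This is not EIGENSOFT's outlier-removal procedure. It assumes the .eigenvec file is whitespace-delimited with a header line that starts with '#' and PC columns named PC1, PC2, and so on; the function names and the 6-standard-deviation threshold are arbitrary choices made for illustration.

```python
import statistics


def read_eigenvec(handle):
    """Parse an .eigenvec-style table into (sample_ids, pc_rows).

    Assumes a '#'-prefixed header whose PC columns are named PC1, PC2, ...
    """
    header = handle.readline().lstrip("#").split()
    pc_cols = [i for i, name in enumerate(header) if name.startswith("PC")]
    id_cols = [i for i, name in enumerate(header) if not name.startswith("PC")]
    ids, pcs = [], []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ids.append(tuple(fields[i] for i in id_cols))
        pcs.append([float(fields[i]) for i in pc_cols])
    return ids, pcs


def flag_outliers(ids, pcs, n_sd=6.0):
    """Return IDs lying more than n_sd standard deviations from the mean on any PC."""
    if not pcs:
        return []
    flagged = set()
    for j in range(len(pcs[0])):
        column = [row[j] for row in pcs]
        mean = statistics.fmean(column)
        sd = statistics.pstdev(column)
        if sd == 0:
            continue  # constant column: nothing to flag
        for sample_id, value in zip(ids, column):
            if abs(value - mean) > n_sd * sd:
                flagged.add(sample_id)
    return sorted(flagged)


# Example usage with the default output name:
# with open("plink2.eigenvec") as f:
#     ids, pcs = read_eigenvec(f)
# print(flag_outliers(ids, pcs))
```

Flagged samples are candidates for inspection, not automatic removal; plotting the leading PCs remains the more informative check.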