Introduction, downloads

S: 18 Aug 2024 (b7.4)

D: 18 Aug 2024


Epistasis tests

Fast scan, case/control phenotype

--fast-epistasis [{boost | joint-effects | no-ueki}] ['case-only'] [{set-by-set | set-by-all}] ['nop']

--gap <min kb gap for case-only test>
--epi1 <p-value max for inclusion in main report>
--epi2 <p-value max to be counted by N_SIG>
--je-cellmin <count>

--fast-epistasis starts an imprecise but fast scan for epistasis based on inspection of 3x3 joint genotype count tables. For large datasets, it is reasonable to start with this command (using liberal p-value thresholds) to identify candidate pairs for further investigation, and then follow up with a more rigorous and computationally expensive analysis on those pairs, such as the --epistasis logistic regression below. Results are usually written to plink.epi.cc and .epi.cc.summary.

By default, the original allele-based test (see the PLINK 1.07 documentation for details) is applied to these tables. Two newer tests are also supported. 'boost' invokes an extended version of the likelihood ratio test introduced by Wan X et al. (2010) BOOST: A Fast Approach to Detecting Gene-Gene Interactions in Genome-wide Case-Control Studies; the extension permits missing data, and properly adjusts df when e.g. a variant lacks homozygous minor genotype observations. 'joint-effects' applies the joint effects test introduced in Ueki M, Cordell HJ (2012) Improved statistics for genome-wide interaction analysis.

Results for the original test normally differ slightly from PLINK 1.07 since we apply the variance and empty cell corrections suggested in Ueki and Cordell's paper. To disable these corrections for testing purposes, use the 'no-ueki' modifier.

To perform a case-only test instead of a case/control test, add the 'case-only' modifier. Since this test assumes the two variants are in linkage equilibrium in the general population, pairs closer than 1000 kb are normally skipped; this setting can be adjusted with --gap.
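The distance rule for the case-only test can be pictured with a small Python sketch. This is an illustration only, not PLINK's code: the function name passes_gap is hypothetical, and two details are assumptions rather than documented behavior, namely that pairs on different chromosomes are always kept, and that a pair exactly 1000 kb apart counts as far enough.

```python
def passes_gap(chrom_a, bp_a, chrom_b, bp_b, gap_kb=1000):
    """Return True if a variant pair is far enough apart for the case-only test.

    gap_kb plays the role of the --gap parameter (default 1000 kb).
    """
    if chrom_a != chrom_b:
        # Assumption: no distance is defined across chromosomes, so keep the pair.
        return True
    # "Closer than gap_kb" means strictly less; exact boundary handling is assumed.
    return abs(bp_a - bp_b) >= gap_kb * 1000


# A pair 500 kb apart is skipped by default, but kept with a 250 kb gap.
print(passes_gap("1", 100_000, "1", 600_000))              # False
print(passes_gap("1", 100_000, "1", 600_000, gap_kb=250))  # True
```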

All pairs of polymorphic variants on autosomal diploid chromosomes are normally tested. To just test pairs of variants within a single set, add the 'set-by-set' modifier and load exactly one set with --set/--make-set; with exactly two sets loaded, all variants in one set are tested against all variants in the other. 'set-by-all' tests all variants in one set against the entire genome instead.

--epi1 adjusts the p-value threshold for inclusion of pairs in the main report (for the 'boost' test, this is the screening p-value); if not specified, it defaults to 0.0001 (5e-6 for 'boost'). With small datasets, "--epi1 1" makes sense, but on large ones it may fill up your hard drive for little reason. Usually both raw chi-square statistics and p-values are reported; 'nop' removes the p-values.

--epi2 adjusts the p-value threshold (default 0.01) for qualification as a "significant epistatic test result" counted in the .cc.summary report's third column. For the 'boost' test, --epi2 applies to the screening p-value unless its parameter is no larger than the --epi1 parameter.

The joint-effects test normally skips marker pairs with fewer than 5 observations in any 3x3x2 contingency table cell (cases and controls are considered separately); you can adjust this threshold with --je-cellmin.
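As a rough illustration of the cell-count rule, the following Python sketch builds the 3x3x2 table for one variant pair and checks it against a minimum count. It is not PLINK's implementation; the function names, the 0/1/2 genotype coding, and the use of negative values for missing genotypes are all assumptions made for this example.

```python
def joint_counts(geno_a, geno_b, is_case):
    """Count joint genotypes as table[status][gA][gB].

    Genotypes are allele counts 0/1/2; status is 0 for controls, 1 for cases.
    Negative genotype values mark missing calls (a convention of this sketch).
    """
    table = [[[0] * 3 for _ in range(3)] for _ in range(2)]
    for a, b, case in zip(geno_a, geno_b, is_case):
        if a < 0 or b < 0:
            continue  # skip samples with a missing genotype at either variant
        table[int(case)][a][b] += 1
    return table


def passes_cellmin(table, cellmin=5):
    """True if every one of the 18 cells has at least cellmin observations."""
    return all(cell >= cellmin
               for status in table for row in status for cell in row)
```

With the default of 5, a pair is dropped as soon as any single case or control cell holds 4 or fewer observations; passing a smaller cellmin mirrors lowering the threshold with --je-cellmin.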

Linear/logistic regression-based test

--epistasis [{set-by-set | set-by-all}]

Given a quantitative trait, --epistasis uses linear regression to fit the model

$$Y = \beta_0 + \beta_1 g_A + \beta_2 g_B + \beta_3 g_A g_B$$

for each inspected variant pair (A, B), where $g_A$ and $g_B$ are allele counts; then the $\beta_3$ coefficients are tested for significance, and results are written to plink.epi.qt and .epi.qt.summary. Similarly, given a case/control phenotype, --epistasis uses logistic regression to fit

$$\ln \frac{P(Y = \text{case})}{P(Y = \text{control})} = \beta_0 + \beta_1 g_A + \beta_2 g_B + \beta_3 g_A g_B$$

and writes results to plink.epi.cc and plink.epi.cc.summary.
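To make the quantitative-trait model concrete, here is a minimal Python sketch that fits the four coefficients for one variant pair by ordinary least squares. It is only an illustration under stated assumptions: the function name is hypothetical, it solves the normal equations directly instead of reproducing PLINK's numerical routines, and it stops at the point estimates (the significance test on the interaction coefficient and the logistic case are not covered).

```python
def fit_interaction_model(g_a, g_b, y):
    """Least-squares fit of y = b0 + b1*gA + b2*gB + b3*gA*gB.

    g_a and g_b are per-sample allele counts; returns [b0, b1, b2, b3].
    """
    rows = [[1.0, a, b, a * b] for a, b in zip(g_a, g_b)]
    k = 4
    # Normal equations: (X'X) beta = X'y, stored as an augmented 4x5 matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))]
           for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank-deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        rest = sum(aug[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (aug[i][k] - rest) / aug[i][i]
    return beta
```

The last returned value is the interaction coefficient. A pair where one variant is monomorphic in the analyzed samples makes the design matrix rank-deficient, which this sketch reports as an error.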

--epi1, --epi2, and the 'set-by-set'/'set-by-all' modifiers behave as they do with --fast-epistasis. The linear regression's multicollinearity check can be tuned with --vif.

Distributed computation

--epistasis-summary-merge <common file prefix> <count>

--fast-epistasis and --epistasis jobs can be subdivided with the --parallel flag; however, the variant-based summary files require a specialized merge at the end. --epistasis-summary-merge takes care of this; its first parameter is the common filename prefix up to but not including '.summary.', while the second parameter is the number of files to merge. For example, if you split

plink --bfile main_data --fast-epistasis boost --parallel 1 3 --out epi_part

plink --bfile main_data --fast-epistasis boost --parallel 2 3 --out epi_part

plink --bfile main_data --fast-epistasis boost --parallel 3 3 --out epi_part

across three machines, and then gather the output files (epi_part.epi.cc.{1,2,3}, epi_part.epi.cc.summary.{1,2,3}) in one place, you'd merge the main reports with

cat epi_part.epi.cc.1 epi_part.epi.cc.2 epi_part.epi.cc.3 > epi_final.epi.cc

as usual, and handle the summaries with

plink --epistasis-summary-merge epi_part.epi.cc 3 --out epi_final
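When the job count is larger than three, typing these command lines by hand gets error-prone. The Python sketch below just assembles the same strings shown above for an arbitrary number of jobs; the helper names are hypothetical, it only uses flags already described in this section, and it prints commands without running them.

```python
def parallel_commands(bfile, n_jobs, out_prefix, test="boost"):
    """One --fast-epistasis command per job, numbered 1..n_jobs."""
    return [
        f"plink --bfile {bfile} --fast-epistasis {test} "
        f"--parallel {k} {n_jobs} --out {out_prefix}"
        for k in range(1, n_jobs + 1)
    ]


def merge_commands(out_prefix, n_jobs, final_prefix, ext="epi.cc"):
    """The cat step for the main reports, then the summary merge."""
    parts = " ".join(f"{out_prefix}.{ext}.{k}" for k in range(1, n_jobs + 1))
    return [
        f"cat {parts} > {final_prefix}.{ext}",
        f"plink --epistasis-summary-merge {out_prefix}.{ext} {n_jobs} "
        f"--out {final_prefix}",
    ]


for cmd in parallel_commands("main_data", 3, "epi_part"):
    print(cmd)
for cmd in merge_commands("epi_part", 3, "epi_final"):
    print(cmd)
```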

If these functions are still insufficient for your epistasis scanning needs, and you are sure you want more brute force rather than a different kind of analysis, we recommend trying the GPU-based GBOOST tool.

Single interaction

--twolocus <variant ID> <variant ID>

--twolocus writes tables of joint genotype counts and frequencies between the two specified variants to plink.twolocus. With a case/control phenotype, counts and frequencies are also reported for just cases and just controls.
