Introduction, downloads

S: 18 Aug 2024 (b7.4)

D: 18 Aug 2024

Basic statistics

Allele frequency

--freq [{counts | case-control}] ['gz']

--freqx ['gz']
  (alias: --frqx)

By itself, --freq writes a minor allele frequency report to plink.frq. If you add the 'counts' modifier, an allele count report is written to plink.frq.count instead. Alternatively, you can use --freq with --within/--family to write a cluster-stratified frequency report to plink.frq.strat, or use the 'case-control' modifier to write a case/control phenotype-stratified report to plink.frq.cc.

--freqx writes a more informative genotype count report to plink.frqx.

For both flags, gzipped output can be requested with the 'gz' modifier.

Nonfounders are normally excluded from these counts/frequencies; use --nonfounders to change this.

All of these reports (except for --freq + --within/--family) are valid input for --read-freq; --freqx is the most powerful when used in that capacity, since it preserves deviation from Hardy-Weinberg equilibrium.
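To illustrate why genotype counts carry more information than a single allele frequency, here is a minimal Python sketch. It is not PLINK code, and the function names and the three-count input layout are assumptions made for this example. Two variants with the same A1 frequency can have very different heterozygote counts, and only the genotype counts reveal the difference.

```python
def a1_freq_from_genotype_counts(hom_a1, het, hom_a2):
    """Return the A1 allele frequency implied by diploid genotype counts."""
    n_alleles = 2 * (hom_a1 + het + hom_a2)
    if n_alleles == 0:
        raise ValueError("no nonmissing genotypes")
    return (2 * hom_a1 + het) / n_alleles


def expected_het_under_hwe(hom_a1, het, hom_a2):
    """Heterozygote count expected under Hardy-Weinberg equilibrium (2pq * n)."""
    n = hom_a1 + het + hom_a2
    p = a1_freq_from_genotype_counts(hom_a1, het, hom_a2)
    return 2.0 * p * (1.0 - p) * n


# Hypothetical counts: both variants have an A1 frequency of 0.5,
# but the second has no heterozygotes at all.
balanced = (25, 50, 25)
no_hets = (50, 0, 50)
```

A frequency-only report would describe `balanced` and `no_hets` identically; the observed heterozygote counts (50 versus 0) differ, while the Hardy-Weinberg expectation is the same for both.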

Missing data

--missing ['gz']

--missing produces sample-based and variant-based missing data reports. If run with --within/--family, the variant-based report is stratified by cluster. 'gz' causes the output files to be gzipped.

--test-mishap

--test-mishap tests whether genotype calls at the two adjacent variants can be used to predict missingness status of the current variant, writing results to plink.missing.hap. This can help one judge the safety of assuming missing calls are randomly distributed. Only autosomal diploid variants with at least 5 missing calls are included, and flanking haplotypes with frequency lower than the --maf threshold are ignored. (Nonfounders are no longer ignored.)

The PLINK 1.07 documentation has further discussion of this test. See also --test-missing, which checks for association between missingness and a case/control phenotype.

Hardy-Weinberg equilibrium

--hardy ['midp'] ['gz']

--hardy writes a list of genotype counts and Hardy-Weinberg equilibrium exact test statistics to plink.hwe. With the 'midp' modifier, a mid-p adjustment is applied (see --hwe for discussion). 'gz' causes the output file to be gzipped.

When the samples are case/control, three separate sets of Hardy-Weinberg equilibrium statistics are computed: one considering both cases and controls, one considering only cases, and one considering only controls. These are distinguished by 'ALL', 'AFF', and 'UNAFF' in the TEST column, respectively. If the phenotype is quantitative or nonexistent instead, there is just one line per variant, labeled 'ALL(QT)' or 'ALL(NP)' respectively.

By default, only founders are considered when generating this report, so if you are working with e.g. a sibling-only dataset, you won't get any results. Use --nonfounders to include everyone.

Unlike PLINK 1.07, PLINK 1.9 does not automatically filter out variants with H-W p-value less than 0.001 when --hardy is invoked. Combine --hardy with --hwe if you still want that to happen.
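For readers who want to see what an exact test of this kind computes, the following Python sketch enumerates every heterozygote count compatible with the observed allele counts and sums the probabilities of the configurations that are no more likely than the observed one. It is a simplified illustration rather than PLINK's implementation: the function name is invented, and the mid-p branch uses one common definition (half weight on configurations tied with the observed one), which is assumed here rather than taken from the PLINK source.

```python
from math import exp, lgamma, log


def hwe_exact_p(hom1, het, hom2, midp=False):
    """Exact Hardy-Weinberg p-value from diploid genotype counts (sketch)."""
    n = hom1 + het + hom2
    if n == 0:
        return 1.0
    rare = 2 * min(hom1, hom2) + het   # copies of the rarer allele
    common = 2 * n - rare              # copies of the other allele

    def log_weight(h):
        # Unnormalized log-probability of observing h heterozygotes,
        # conditional on the allele counts.
        n_rr = (rare - h) // 2
        n_cc = (common - h) // 2
        return (h * log(2.0) - lgamma(n_rr + 1) - lgamma(h + 1)
                - lgamma(n_cc + 1))

    # Valid heterozygote counts share the parity of the rare allele count.
    logs = {h: log_weight(h) for h in range(rare % 2, rare + 1, 2)}
    peak = max(logs.values())
    weights = {h: exp(v - peak) for h, v in logs.items()}
    total = sum(weights.values())

    obs = weights[het]
    tol = 1e-9 * obs                   # relative tolerance for ties
    smaller = sum(w for w in weights.values() if w < obs - tol)
    equal = sum(w for w in weights.values() if abs(w - obs) <= tol)
    p = (smaller + (0.5 if midp else 1.0) * equal) / total
    return min(1.0, p)
```

With genotype counts 1/0/1, the only possible heterozygote counts are 0 and 2, with probabilities 1/3 and 2/3, so the function returns 1/3 (or 1/6 with `midp=True`).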

Mendel errors

--mendel ['summaries-only']

--mendel-duos
--mendel-multigen

--mendel scans the dataset for Mendel errors, writing a set of reports to plink{.mendel,.imendel,.fmendel,.lmendel}. Haploid and mitochondrial data are ignored. The errors are classified as follows, where '1' refers to the A1 (usually minor) allele and '2' refers to A2:

Code  Pat. genotype   Mat. genotype   Child genotype  Samples implicated
1     11              11              12              all
2     22              22              12              all
3     22              11/12/missing   11              father, child
4     11/12/missing   22              11              mother, child
5     22              22              11              child
6     11              12/22/missing   22              father, child
7     12/22/missing   11              22              mother, child
8     11              11              22              child
9     (Xchr male)     11              22              mother, child
10    (Xchr male)     22              11              mother, child
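The classification in the table above can be written as a small lookup function. This Python sketch is only an illustration of the table, not PLINK's code; the function name, the string genotype encoding ('11', '12', '22', with None for missing), and the `x_male_child` flag are all assumptions of this example.

```python
def mendel_error_code(pat, mat, child, x_male_child=False):
    """Return the Mendel error code (1-10) for one trio, or None if consistent.

    Genotypes are '11', '12', '22', or None for a missing call.
    """
    if child is None:
        return None
    if x_male_child:
        # Codes 9 and 10: only the mother's X genotype is informative.
        if mat == "11" and child == "22":
            return 9
        if mat == "22" and child == "11":
            return 10
        return None
    if child == "12":
        # A heterozygous child is impossible when both parents are
        # homozygous for the same allele.
        if pat == "11" and mat == "11":
            return 1
        if pat == "22" and mat == "22":
            return 2
        return None
    # Homozygous child: an error occurs when a parent is homozygous
    # for the opposite allele.
    opposite = "22" if child == "11" else "11"
    both, father_only, mother_only = (5, 3, 4) if child == "11" else (8, 6, 7)
    if pat == opposite and mat == opposite:
        return both
    if pat == opposite:
        return father_only
    if mat == opposite:
        return mother_only
    return None
```

For example, `mendel_error_code("22", None, "11")` returns 3 (father and child implicated), while `mendel_error_code("12", "12", "11")` returns None because that trio is consistent.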
10(Xchr male)2211mother, child

By default, samples with only one parent in the dataset are not considered, and when parental genotype data is missing, (great-)grandparental data is not checked; this can now be changed with --mendel-duos and --mendel-multigen, respectively. (Note that --mendel-multigen is best used on data which has not yet been subject to --set-me-missing.)

If you only want summary statistics, use the 'summaries-only' modifier; this causes the .mendel file (which can be very large) to be skipped.

When PLINK 1.07 --mendel was used either with --set-me-missing or without --make-bed/--recode, it would set some Mendel errors to missing before all errors were identified, and as a consequence some other errors were not noticed at all if overlapping trios were present. This no longer happens.

Inbreeding

--het ['small-sample'] ['gz']

--ibc

--het computes observed and expected autosomal homozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates (i.e. (<observed hom. count> - <expected count>) / (<total observations> - <expected count>)) to plink.het. (The 'gz' modifier has the usual effect.)

Expected counts are based on loaded (via --read-freq) or imputed MAFs; if there are very few samples in your immediate fileset, --read-freq is practically mandatory since imputed MAFs are wildly inaccurate in that case. Also, due to the use of allele frequencies, if your dataset has a highly imbalanced ancestry distribution (e.g. >90% EUR but a few samples with ancestry primarily from other continents), you may need to process the rare-ancestry samples separately.

By default, the n/(n-1) multiplier in Nei's expected homozygosity formula is now omitted, since n may be unknown when using --read-freq. The 'small-sample' modifier causes the multiplier to be included, while forcing --het to use imputed MAFs (and known ns) from founders in the immediate dataset. (--maf-succ is not applied here.)
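As a rough sketch of the method-of-moments estimate described above, the following Python function computes F for one sample from its genotypes and a set of per-variant A1 frequencies. It is not PLINK's implementation: the function name and input layout are invented for this example, and the per-variant expected homozygosity is taken as 1 - 2p(1 - p), i.e. without the n/(n-1) multiplier, which corresponds to the default behavior described above.

```python
def het_f_coefficient(a1_counts, a1_freqs):
    """Method-of-moments F estimate for a single sample (sketch).

    a1_counts: per-variant A1 allele counts (0, 1 or 2), None if missing.
    a1_freqs:  per-variant A1 allele frequencies, in the same order.
    """
    observed_hom = 0
    expected_hom = 0.0
    n_obs = 0
    for count, p in zip(a1_counts, a1_freqs):
        if count is None:
            continue  # missing calls contribute nothing
        n_obs += 1
        if count != 1:
            observed_hom += 1
        # Expected homozygosity under random mating, no n/(n-1) multiplier.
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    denom = n_obs - expected_hom
    if denom == 0:
        return float("nan")  # e.g. no usable or only monomorphic variants
    return (observed_hom - expected_hom) / denom
```

With four variants at frequency 0.5, a sample that is homozygous everywhere gets F = 1, a sample that is heterozygous everywhere gets F = -1, and a sample with two homozygous calls gets F = 0.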

--ibc (ported from GCTA) calculates three inbreeding coefficients for each sample, and writes a report to plink.ibc. Briefly, Fhat1 is the usual variance-standardized relationship minus 1, Fhat2 is similar to the --het estimate, and Fhat3 is based on the correlation between uniting gametes.

These calculations do not take LD into account. It is usually a good idea to perform some form of LD-based pruning before invoking them.

Sex imputation

--check-sex [female max F] [male min F]
--impute-sex [female max F] [male min F]

--check-sex ycount [female max F] [male min F] [female max Y obs] [male min Y obs]
--impute-sex ycount [female max F] [male min F] [female max Y obs] [male min Y obs]
--check-sex y-only [female max Y obs] [male min Y obs]
--impute-sex y-only [female max Y obs] [male min Y obs]

--check-sex normally compares sex assignments in the input dataset with those imputed from X chromosome inbreeding coefficients, and writes a report to plink.sexcheck.

  • Make sure that the X chromosome pseudo-autosomal region has been split off (with e.g. --split-x) before using this.
  • By default, F estimates smaller than 0.2 yield female calls, and values larger than 0.8 yield male calls. If you pass numeric parameter(s) to --check-sex (without 'y-only'), the first two control these thresholds.
  • Since this function is based on the same F coefficient as --het/--ibc, it requires reasonable MAF estimates (so it's essential to use --read-freq if there are very few samples in your immediate fileset), and it's best used on marker sets in approximate linkage equilibrium.
    As a concrete example, if you ignore the Y chromosome and don't first perform LD-based pruning, --check-sex makes flat-out incorrect calls on several 1000 Genomes phase 1 female samples: NA19332 has an F estimate just under 0.88, and there are others with F > 0.8. However, if you first run "--indep-pairphase 20000 2000 0.5", the largest female F estimate drops to about 0.66.
    0.66 is, of course, still much larger than 0.2, and in most contexts it still justifies a female call. We suggest running --check-sex once without parameters, eyeballing the distribution of F estimates (there should be a clear gap between a very tight male clump at the right side of the distribution and the females everywhere else), and then rerunning with parameters corresponding to the empirical gap.
  • Due to the use of allele frequencies, if your dataset has a highly imbalanced ancestry distribution, you may need to process the rare-ancestry samples separately.

There are now two modes which consider Y chromosome data.

  • In 'ycount' mode, gender is still imputed from the X chromosome, but female calls are downgraded to ambiguous whenever more than 0 nonmissing Y genotypes are present, and male calls are downgraded when fewer than 0 are present (so, with the default male threshold of 0, male calls are never downgraded). Note that these are counts, not rates. These thresholds are controllable with --check-sex ycount's optional 3rd and 4th numeric parameters.
  • In 'y-only' mode, gender is imputed from nonmissing Y genotype counts, and the X chromosome is ignored. The male minimum threshold defaults to 1 instead of zero in this case. This is intended to recover previously-determined gender, rather than determine it from scratch, since Y chromosome data may be scarce for older males: see e.g. Dumanski JP et al. (2014) Smoking is associated with mosaic loss of chromosome Y.
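The threshold logic of the three modes can be summarized in a short Python sketch. This is an illustration of the rules as described above, not PLINK's code: the function name and argument names are invented, the return values follow PLINK's usual sex coding (1 = male, 2 = female, 0 = ambiguous), and the order in which the two 'y-only' thresholds are checked is an assumption.

```python
MALE, FEMALE, AMBIGUOUS = 1, 2, 0


def call_sex(f=None, y_count=0, mode="x", female_max_f=0.2, male_min_f=0.8,
             female_max_y=0, male_min_y=None):
    """Impute sex from an X-chromosome F estimate and/or a Y genotype count."""
    if mode not in ("x", "ycount", "y-only"):
        raise ValueError("mode must be 'x', 'ycount' or 'y-only'")
    if male_min_y is None:
        # Default male minimum: 1 in 'y-only' mode, 0 otherwise.
        male_min_y = 1 if mode == "y-only" else 0

    if mode == "y-only":
        # X chromosome ignored; assumed check order: female first, then male.
        if y_count <= female_max_y:
            return FEMALE
        if y_count >= male_min_y:
            return MALE
        return AMBIGUOUS

    # X-based call: strictly below/above the two F thresholds.
    if f < female_max_f:
        call = FEMALE
    elif f > male_min_f:
        call = MALE
    else:
        call = AMBIGUOUS

    if mode == "ycount":
        # Downgrade calls that conflict with the nonmissing Y genotype count.
        if call == FEMALE and y_count > female_max_y:
            call = AMBIGUOUS
        elif call == MALE and y_count < male_min_y:
            call = AMBIGUOUS
    return call
```

With the default thresholds, an F estimate of 0.66 is ambiguous, whereas `call_sex(0.66, female_max_f=0.7, male_min_f=0.9)` yields a female call; this mirrors the suggestion to rerun with thresholds chosen from the empirical gap.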

--impute-sex changes sex assignments to the imputed values, while generating the .sexcheck report as well. To minimize surprises, we now force it to be used with --make-bed/--recode/--write-covar and no other commands. In the common case where sexes were known or imputed earlier in the pipeline but didn't make it into the .fam file for whatever reason, all male F estimates should be 1 after --split-x, so something as extreme as "--impute-sex 0.9 0.99" (or "--impute-sex y-only") should work.

Wright's FST

--fst ['case-control']
  (alias: --Fst)

Given a set of subpopulations defined via --within, --fst writes FST estimates for each autosomal diploid variant (computed using the method introduced in Weir BS, Cockerham CC (1984) Estimating F-statistics for the analysis of population structure) to plink.fst, and reports raw and weighted global means to the log.

  • This is a basic port of VCFtools --weir-fst-pop. The VCFtools implementation also provides windowed modes, which we have not ported (--recode vcf may be handy there).
  • Only genotype calls for the specified subpopulations are considered in this computation. --read-freq and founder status are ignored.
  • If you're interested in the global means, it is usually best to perform this calculation on a marker set in approximate linkage equilibrium.
  • If you have only two subpopulations, you can represent them with case/control status instead of clusters, and use the 'case-control' modifier to request FST estimates based on them.
  • This flag is recognized by the PLINK 1.07 command-line parser (it won't error out), but the actual computation was not implemented in that version.
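For orientation, here is a compact Python sketch of the Weir and Cockerham (1984) estimator for a single biallelic variant, together with the two kinds of global mean. It is written from the published formulas as we understand them and is not a transcription of the PLINK or VCFtools code. The function names and the genotype-count input layout are assumptions, and treating the weighted mean as a ratio of summed variance components is likewise an assumption about what "weighted" means here.

```python
def weir_cockerham_components(pops):
    """Variance components (a, b, c) for one biallelic variant.

    pops: one (hom_a1, het, hom_a2) genotype-count tuple per subpopulation;
    needs at least two nonempty subpopulations and a mean size above 1.
    """
    r = len(pops)
    sizes = [sum(counts) for counts in pops]
    n_total = sum(sizes)
    n_bar = n_total / r
    n_c = (n_total - sum(n * n for n in sizes) / n_total) / (r - 1)

    freqs = [(2 * hom1 + het) / (2 * n)
             for (hom1, het, _), n in zip(pops, sizes)]
    het_rates = [het / n for (_, het, _), n in zip(pops, sizes)]

    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / n_total
    s2 = sum(n * (p - p_bar) ** 2
             for n, p in zip(sizes, freqs)) / ((r - 1) * n_bar)
    h_bar = sum(n * h for n, h in zip(sizes, het_rates)) / n_total

    shared = p_bar * (1.0 - p_bar) - (r - 1) / r * s2
    a = n_bar / n_c * (s2 - (shared - h_bar / 4.0) / (n_bar - 1.0))
    b = n_bar / (n_bar - 1.0) * (
        shared - (2.0 * n_bar - 1.0) / (4.0 * n_bar) * h_bar)
    c = h_bar / 2.0
    return a, b, c


def fst_summary(variants):
    """Per-variant estimates plus raw and weighted means (sketch)."""
    comps = [weir_cockerham_components(pops) for pops in variants]
    per_variant = [a / (a + b + c) for a, b, c in comps if a + b + c != 0]
    raw_mean = sum(per_variant) / len(per_variant)
    weighted_mean = (sum(a for a, _, _ in comps)
                     / sum(a + b + c for a, b, c in comps))
    return per_variant, raw_mean, weighted_mean
```

Two subpopulations fixed for opposite alleles give an estimate of 1, while two subpopulations with identical genotype counts give an estimate close to zero (slightly negative, which is normal for this estimator).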
