D: 7 Jul 2025

Basic statistics

Allele frequency

--freq ['zs'] ['counts'] ['cols='<column set descriptor>]
['refbins='<comma-separated bin boundaries> | 'refbins-file='<filename>]
['alt1bins='<comma-separated bin bounds> | 'alt1bins-file='<filename>]
['bins-only']

--freq normally writes an empirical allele frequency report to plink2.afreq[.zst]. With the 'counts' modifier, an allele count/dosage report is written to plink2.acount[.zst] instead.

Allele frequency is defined as <# of observations of current allele> / <# of observations of any allele> (unless a pseudocount is requested with --af-pseudocount). Note that there's only one allele observation per male for chrX variants, and two per female.
Unknown-sex samples are treated as female in the main allele-frequency computation.
When pedigree information is present, and 'counts' is not specified, PLINK 2 defaults to excluding nonfounders from this calculation; this can be changed with --nonfounders. There is no longer an analogous default in 'counts' mode; you now must explicitly specify how you want nonfounders to be handled (with --nonfounders or --ac-founders) in that case.
Phenotype- and category-stratified frequency reports are no longer directly supported. However, you can use --keep-if to filter on a phenotype condition, and --loop-cats to filter on each category in turn. --variant-score can also be employed for these use cases when you have no missing genotypes (or mean-imputation is acceptable).
The output file is valid input for --read-freq. "--freq counts" output contains enough information for perfect reconstruction of allele frequencies (this was not true for dosage data before 22 Nov 2019).
Refer to the file format entry for output details and optional columns.
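
As a rough illustration of the frequency definition above, the following Python sketch counts allele observations for a single variant, with one observation per male on chrX and two per female or unknown-sex sample. The function name, sex codes, and genotype encoding are hypothetical and are not PLINK 2's internal representation. Pseudocounts, dosages, and founder filtering are ignored, and skipping heterozygous male chrX calls is a simplification made for this sketch.

```python
def allele_freq(genotypes, sexes, allele, is_chrx=False):
    """Frequency of `allele` among all observed alleles at one variant.

    genotypes: per-sample tuple of allele codes, or None when missing.
    sexes: per-sample 'M', 'F', or 'U' (unknown is treated as female here).
    Returns None when there are no allele observations.
    """
    allele_obs = 0
    total_obs = 0
    for geno, sex in zip(genotypes, sexes):
        if geno is None:
            continue
        if is_chrx and sex == "M":
            if len(set(geno)) > 1:
                # Heterozygous haploid call: skipped in this sketch (assumption).
                continue
            # A male contributes a single chrX allele observation.
            geno = geno[:1]
        total_obs += len(geno)
        allele_obs += sum(1 for a in geno if a == allele)
    if total_obs == 0:
        return None
    return allele_obs / total_obs
```

For example, two females with genotypes A/A and A/G plus one male carrying G give three A observations out of five on chrX, so the A frequency is 0.6.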

--freq can now report histograms summarizing the allele frequency spectrum. When the 'refbins=' modifier is present, its argument is interpreted as a sequence of comma-separated REF frequency/count bin boundaries, and the corresponding histogram is written to plink2.afreq.ref.bins or plink2.acount.ref.bins. Alternatively, when 'refbins-file=' is present, the named file is interpreted as a sequence of whitespace-separated bin boundaries. 'alt1bins='/'alt1bins-file=' use the same syntax, and report ALT1 frequency/count histograms to plink2.afreq.alt1.bins or plink2.acount.alt1.bins.

Genotype hardcall counts

--geno-counts ['zs'] ['cols='<column set descriptor>]

--geno-counts writes a genotype hardcall count report to plink2.gcount[.zst]; refer to the file format entry for output details and optional columns. (Note that unlike --freq, this report is not restricted to founders, unless you explicitly request that with e.g. --keep-founders.)

Since --geno-counts doesn't support dosages, "--freq counts" is now a better way to generate an input file for --read-freq.

Sample variant-counts

--sample-counts ['zs'] ['cols='<column set descriptor>]

--sample-counts reports the number of observed variants (relative to the reference genome) per sample, subdivided into various classes.

This is a highly optimized implementation of the "Per-sample counts" report added by the -s flag to "bcftools stats". If your variants have been left-normalized and split, and your single-letter allele codes are restricted to {A, C, G, T, a, c, g, t}, the SNP counts reported by PLINK 2 and bcftools should be identical.
Homozygous-ALT genotypes only count as 1 variant, for consistency with bcftools.
To keep non-reference, non-missing counts constant through variant splits and joins, we count heterozygous ALTx/ALTy genotypes as 2 variants. This is an intentional change from bcftools.
Unknown-sex samples are treated as female.
Heterozygous haploid calls (MT included) are treated as missing.
As with other commands, SNPs that have not been left-normalized are counted as non-SNP non-symbolic.
Refer to the file format entry for output details and optional columns.
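
The counting rule for homozygous-ALT and heterozygous ALTx/ALTy genotypes can be sketched as follows. This is a minimal illustration with a hypothetical function name and allele-index encoding (0 = REF, 1 and up = ALT); it is not PLINK 2's implementation and does not reproduce the per-class breakdown of the actual report.

```python
def variant_count(genotype):
    """Number of variants one diploid genotype call contributes.

    genotype: tuple of allele indices (0 = REF, 1+ = ALT), or None if missing.
    """
    if genotype is None:
        return 0
    # Count distinct ALT alleles: 1/1 -> 1, 1/2 -> 2, 0/1 -> 1, 0/0 -> 0.
    return len({a for a in genotype if a != 0})
```

Under this rule, a 1/2 genotype at a multiallelic site counts as 2, which matches the two 0/1 calls it becomes after the site is split into two biallelic records.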

Missing data

--missing ['zs'] [{sample-only | variant-only}] ['scols='<col. set descriptor>]
['vcols='<col. set descriptor>]

--missing produces sample-based and variant-based missing data reports (or just one of these reports, with 'sample-only'/'variant-only').

This report is not restricted to founders.
By default, this summarizes hardcall missingness. There are optional output columns summarizing dosage missingness, as well as heterozygous haploid (including mixed MT) counts; refer to the file format entries for details.

--genotyping-rate ['dosage']

PLINK 1.x almost always computed the overall missing-genotype frequency and reported it to the log, even when no other operation in the run required the entire genotype table to be scanned. As a performance optimization, PLINK 2 no longer defaults to printing it, but you can opt in with --genotyping-rate.

The 'dosage' modifier causes the missing-dosage frequency (which can be smaller than the missing-genotype frequency) to be reported instead.

Hardy-Weinberg equilibrium

--hardy ['zs'] ['midp'] ['log10'] ['redundant'] ['cols='<col set descriptor>]

--hardy writes autosomal Hardy-Weinberg equilibrium exact test statistics to plink2.hardy[.zst], and/or chrX test statistics to plink2.hardy.x[.zst]. The latter report is based on the computation described in Graffelman J, Weir BS (2016) Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome.

By default, only founders are considered; this can be changed with --nonfounders.
For variants with j alleles where j>2, j separate 'biallelic' tests are performed, each reported on its own line. However, biallelic variants are normally reported on a single line, since the counts/frequencies would be mirror-images and the p-values would be the same. You can add the 'redundant' modifier to force biallelic variant results to be reported on two lines for parsing convenience.
With the 'midp' modifier, a mid-p adjustment is applied (see --hwe for discussion).
The 'log10' modifier causes (mid-)p-values to be reported in -log10(p) form. Note that PLINK 2 accurately calculates and reports p-values smaller than DBL_MIN, but other software may not be able to read such values unless they are reported in -log10(p) form.
Since multiple case/control phenotypes can now be loaded simultaneously, this no longer automatically computes separate statistics for just controls or just cases. Call this with e.g. --keep-if to report phenotype-stratified stats.
Refer to the file format entries for output details and optional columns.
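
To see why the -log10(p) form matters for downstream tools, here is a small Python sketch. It only demonstrates standard double-precision behavior; the variable and function names are illustrative, and the value 1e-400 is an arbitrary example rather than real --hardy output.

```python
import math
import sys

# Smallest positive normalized double, i.e. C's DBL_MIN.
dbl_min = sys.float_info.min

# A p-value far below the double range, read back from text, underflows to zero.
tiny_p = float("1e-400")

# The same value expressed as -log10(p) is an ordinary number.
neg_log10_tiny_p = 400.0


def to_neg_log10(p):
    """Convert a representable, positive p-value to -log10(p) form."""
    return -math.log10(p)
```

A reader that parses the plain p-value column with ordinary doubles would therefore see 0 for such a variant, while the 'log10' column stays usable for ranking and thresholding.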

Mendel errors

--mendel ['zs'] ['summaries-only'] ['cols='<column set descriptor>]

--mendel-duos
--mendel-multigen

--mendel scans the dataset for (hardcall) Mendel errors, writing a set of reports to plink2{.mendel,.imendel,.fmendel,.lmendel}.

chrY and chrM are no longer excluded from this analysis.
Heterozygous chrX male and chrY genotypes are treated as missing.
On chrM, when a child has a mixed hardcall, there is only a Mendel error if the mother has a nonmissing hardcall matching neither allele (see code 10 below).

The errors are classified as follows, where 'R' refers to the REF allele, 'A' refers to an ALT allele carried by the child, 'X' refers to a parental allele that matches a child allele, and 'x' refers to a parental allele that matches neither child allele:

Code	Pat. genotype	Mat. genotype	Child genotype	Samples implicated
1	RR, Rx, or R	RR or Rx	RA	all
2	AA, Ax, or A	AA or Ax	RA	all
3	xx or x	XX, Xx, or missing	AA or RA	father, child
4	XX, Xx, X, or missing	xx	AA or RA	mother, child
5	xx or x	xx	AA or RA	child
6	xx or x	RR, Rx, or missing	RR	father, child
7	RR, Rx, R, or missing	xx	RR	mother, child
8	xx or x	xx	RR	child
9	(chrX male, or any chrM)	x or xx	R	mother, child
10	(chrX male, or any chrM)	x or xx	A or RA	mother, child
11	x	(chrY non-female)	R	father, child
12	x	(chrY non-female)	A	father, child

(This generalizes the PLINK 1.9 Mendel error table to multiallelic variants.)
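
The basic idea behind the table, for the simplest case of a diploid autosomal trio with no missing calls, can be sketched in Python as below. The function name and tuple encoding are hypothetical. The sketch only answers whether an error exists; it does not assign the codes above, and it does not cover chrX, chrY, chrM, or missing parental genotypes.

```python
def is_mendel_consistent(father, mother, child):
    """True if the child genotype can be formed from one paternal allele
    and one maternal allele.

    Each genotype is a 2-tuple of allele codes; all three are assumed
    to be non-missing diploid hardcalls.
    """
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)
```

For instance, two RR parents with an RA child fail this check (code 1 in the table), while parents R/B and A/B with an RA child pass it.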

By default, samples with only one parent in the dataset are not considered, and when parental genotype data is missing, (great-)grandparental data is not checked; this can now be changed with --mendel-duos and --mendel-multigen, respectively. (Note that --mendel-multigen is best used on data which has not yet been subject to --set-me-missing.)
If you only want summary statistics, use the 'summaries-only' modifier; this causes the .mendel[.zst] file (which can be very large) to be skipped.

Inbreeding

--het ['zs'] ['small-sample'] ['cols='<col. set descriptor>]

--het computes observed and expected homozygous/heterozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates (i.e. (1 - (<observed het. count> / <expected het. count>))) to plink2.het[.zst].

Multiallelic variants are handled properly.
This function requires decent MAF estimates. If there are very few samples in your immediate fileset, --read-freq is practically mandatory since imputed MAFs are wildly inaccurate in that case. Also, due to the use of allele frequencies, if your dataset has a highly imbalanced ancestry distribution (e.g. >90% EUR but a few samples with ancestry primarily from other continents), you may need to process the rare-ancestry samples separately.
It's usually best to perform this calculation on a variant set in approximate Hardy-Weinberg and linkage equilibrium.
By default, --het omits the n/(n-1) multiplier in Nei's expected homozygosity formula. The 'small-sample' modifier causes the multiplier to be included, while forcing --het to use MAFs imputed from founders in the immediate dataset.
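
The method-of-moments estimate can be sketched as follows. This is a simplified illustration, not PLINK 2's code: it assumes biallelic variants only, a sample with no missing calls, and Hardy-Weinberg expected heterozygosity 2p(1-p) per variant with no n/(n-1) multiplier. The function names are hypothetical.

```python
def expected_het_count(alt_freqs):
    """Expected number of heterozygous calls for one sample, summing the
    Hardy-Weinberg heterozygote probability 2p(1-p) over biallelic variants."""
    return sum(2.0 * p * (1.0 - p) for p in alt_freqs)


def f_coefficient(observed_het, expected_het):
    """Method-of-moments F estimate: 1 - (observed het. / expected het.)."""
    return 1.0 - observed_het / expected_het
```

A sample with half as many heterozygous calls as expected gets F = 0.5, and a sample with more heterozygous calls than expected gets a negative F. Because the expectation depends directly on the allele frequencies, poor MAF estimates shift every sample's F.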

Sex imputation

--check-sex ['max-female-xf='<x>] ['min-male-xf='<y>]
['max-female-ycount='<z>] ['min-male-ycount='<w>]
['max-female-yrate='<v>] ['min-male-yrate='<u>]
['cols='<col. set descriptor>]
--impute-sex ['max-female-xf='<x>] ['min-male-xf='<y>]
['max-female-ycount='<z>] ['min-male-ycount='<w>]
['max-female-yrate='<v>] ['min-male-yrate='<u>]
['cols='<col. set descriptor>]

--check-sex compares sex assignments in the input dataset with those imputed from chrX inbreeding coefficients and/or chrY valid genotype call count/rate (heterozygous genotype calls are invalid on chrY), and writes a report to plink2.sexcheck. Specifically:

If 'max-female-xf=' and/or 'min-male-xf=' are specified, chrX is used if present.
If 'max-female-ycount=', 'min-male-ycount=', 'max-female-yrate=', or 'min-male-yrate=' are specified, chrY is used if present.
If both chrX and chrY are usable, sex is only called if both conditions are satisfied. Similarly, if both count and rate are specified for chrY, the strictest condition must be satisfied.
If no thresholds are specified at all, a warning is printed, and then the run proceeds as if the parameters were "min-male-xf=1 max-female-yrate=0". In this case, unless you're just sanity-checking pre-cleaned data, you should look at the distributions of xf and yrate in the .sexcheck output file, and then rerun --check-sex with data-derived thresholds.

On chrX, male F-statistics should be in a big clump near 1, while female F-statistics should be centered near zero but can be widely dispersed.
On chrY, female valid-genotype rates should be in a big clump near 0, while male valid-genotype rates should be consistently higher but can be dispersed.
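
The threshold logic can be pictured with the Python sketch below. It is an illustration under several assumptions, not PLINK 2's implementation: all four xf/yrate thresholds are supplied, comparisons are inclusive, each female maximum is below the corresponding male minimum, and a sample is called only when chrX and chrY agree. The function names and the threshold values in the usage note are hypothetical.

```python
def call_from_thresholds(value, max_female, min_male):
    """Return 'F', 'M', or None (no call) for one statistic."""
    if value <= max_female:
        return "F"
    if value >= min_male:
        return "M"
    return None


def call_sex(xf, yrate, max_female_xf, min_male_xf,
             max_female_yrate, min_male_yrate):
    """Call sex only when the chrX F-statistic and the chrY valid-genotype
    rate both point to the same sex; otherwise return None."""
    x_call = call_from_thresholds(xf, max_female_xf, min_male_xf)
    y_call = call_from_thresholds(yrate, max_female_yrate, min_male_yrate)
    if x_call is not None and x_call == y_call:
        return x_call
    return None
```

With illustrative thresholds of 0.2/0.8 for xf and 0.1/0.5 for yrate, a sample with xf = 0.95 and yrate = 0.9 is called male, while a sample with xf = 0.95 and yrate = 0.0 is left uncalled because the two chromosomes disagree.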

Other notes:

Make sure that the chrX pseudo-autosomal region has been split off (with e.g. --split-par) before using this.
You also need decent MAF estimates (so, with very few samples in your immediate fileset, use --read-freq), and it's best for your variants to be in approximate Hardy-Weinberg and linkage equilibrium.
For samples which barely fail the max-female-xf threshold, you may want to check autosomal inbreeding coefficients (--het). When that is also high, you're probably dealing with a highly-inbred female.

--impute-sex changes sex assignments to the imputed values, and is otherwise identical to --check-sex. It must be used with --make-[b]pgen/--make-bed/--export/--write-covar and no other commands.

Pairwise fixation index

--fst <categorical or binary phenotype name> ['method='<method name>]
['blocksize='<jackknife block size>] ['cols='<column set descriptor>]
['report-variants'] ['zs'] ['vcols='<column set descriptor>]
['base='<pop. ID> | 'ids='<pop. ID> | 'file='<pop.-ID-pair file>]
[other population ID(s) for base=/ids=...]

Given a categorical or binary phenotype defining a set of subpopulations, --fst computes Wright's F_ST estimates between each pair of populations, writing results to plink2[.x].fst.summary.

Two methods are supported:
- 'hudson': Bhatia G, Patterson N, Sankararaman S, Price AL (2013) Estimating and interpreting F_ST: The impact of rare variants, which elaborates on Hudson RR, Slatkin M, Maddison WP (1992) Estimation of Levels of Gene Flow from DNA Sequence Data. This is now the default.
- 'wc': Weir BS, Cockerham CC (1984) Estimating F-statistics for the analysis of population structure.
In both cases, the final estimate is a ratio-of-averages.
If chrX is present, its results are written to separate file(s) with ".x" in the extension when the Hudson method is used. (chrX is skipped under the Weir-Cockerham method.)
To get block-jackknife-based standard error estimates, provide a 'blocksize=' value.
You can request per-variant F_ST estimates with the 'report-variants' modifier; this generates a separate plink2[.x].<popID1>.<popID2>.fst.var[.zst] file for each population pair. (The 'zs' modifier causes these files to be Zstd-compressed.)
By default, all pairs of populations are compared. If you only want to compare some pairs, there are three ways to do this.
- 'base=' specifies one base population to be compared with all others; or if you specify more population ID(s) afterward, just the other populations you've listed.
- 'ids=' specifies an all-vs.-all comparison within the given set of populations.
- 'file=' specifies a file containing one population pair per line.
Note that 'base='/'ids='/'file=' must be positioned after all other modifiers on the command line.
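
To illustrate what "ratio-of-averages" means, the Python sketch below combines per-variant numerators and denominators in both possible ways. The per-variant Hudson terms follow the form given in Bhatia et al. (2013), with n1 and n2 assumed here to be the number of sampled alleles in each population; the function names are hypothetical and this is not PLINK 2's code.

```python
def hudson_terms(p1, n1, p2, n2):
    """Per-variant numerator and denominator of the Hudson F_ST estimator.

    p1, p2: sample allele frequencies in the two populations.
    n1, n2: sample sizes (assumed to be allele counts in this sketch).
    """
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den


def ratio_of_averages(terms):
    """Sum numerators and denominators separately, then divide once."""
    total_num = sum(num for num, _ in terms)
    total_den = sum(den for _, den in terms)
    return total_num / total_den


def average_of_ratios(terms):
    """Shown for contrast only: average the per-variant ratios."""
    return sum(num / den for num, den in terms) / len(terms)
```

The two combinations generally differ. For per-variant terms (1, 2) and (1, 6), the ratio of averages is 2/8 = 0.25, while the average of ratios is 1/3. The ratio-of-averages form gives more weight to variants with larger denominators, so rare variants with tiny denominators have less influence on the final estimate.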

.pgen header info

--pgen-info

Given an input .pgen file, --pgen-info prints the following information about it:

Number of variants
Number of samples
Are all REF alleles 'known', 'provisional', or a mix?
Maximum allele count for a single variant (exact value may require .pvar input)
Are phased hardcalls present?
Are dosages present? Are any of them explicitly phased?

All values except for "maximum allele count for a single variant" can be determined from a quick scan of the .pgen's header.
