Basic statistics
--freq ['zs'] ['counts'] ['cols='<column set descriptor>]
['refbins='<comma-separated bin boundaries> | 'refbins-file='<filename>]
['alt1bins='<comma-separated bin bounds> | 'alt1bins-file='<filename>]
['bins-only']
--freq normally writes an empirical allele frequency report to plink2.afreq[.zst]. With the 'counts' modifier, an allele count/dosage report is written to plink2.acount[.zst] instead.
- Allele frequency is defined as <# of observations of current allele> / <# of observations of any allele> (unless a pseudocount is requested with --af-pseudocount). Note that there's only one allele observation per male for chrX variants, and two per female.
- Unknown-sex samples are treated as female in the main allele-frequency computation.
- When pedigree information is present, and 'counts' is not specified, PLINK 2 defaults to excluding nonfounders from this calculation; this can be changed with --nonfounders. There is no longer an analogous default in 'counts' mode; you now must explicitly specify how you want nonfounders to be handled (with --nonfounders or --ac-founders) in that case.
- Phenotype- and category-stratified frequency reports are no longer directly supported. However, you can use --keep-if to filter on a phenotype condition, and --loop-cats to filter on each category in turn. --variant-score can also be employed for these use cases when you have no missing genotypes (or mean-imputation is acceptable).
- The resulting report is valid input for --read-freq. "--freq counts" output contains enough information for perfect reconstruction of allele frequencies (this was not true for dosage data before 22 Nov 2019).
- Refer to the file format entry for output details and optional columns.
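The frequency definition above can be sketched in a few lines of Python. This is only an illustration, not PLINK 2's implementation: the genotype encoding (tuples of allele indices, None for a missing call), the sex codes, and the function name are assumptions made here. It ignores dosages, pseudocounts, founder filtering, and pseudoautosomal regions.

```python
from collections import Counter

def allele_freqs(genotypes, sexes=None, chrx=False):
    """Return {allele_index: frequency} from hardcall genotypes.

    genotypes: list of (a1, a2) allele-index tuples, or None when missing.
    sexes: list of 'M', 'F', or None (unknown); only consulted when chrx=True.
    """
    counts = Counter()
    for i, gt in enumerate(genotypes):
        if gt is None:
            continue  # missing calls contribute no observations
        # Unknown-sex samples fall through to the female (two-observation) branch.
        is_male = chrx and sexes is not None and sexes[i] == 'M'
        if is_male:
            if gt[0] != gt[1]:
                continue  # assumption: skip heterozygous haploid calls
            counts[gt[0]] += 1  # one observation per male on chrX
        else:
            counts[gt[0]] += 1
            counts[gt[1]] += 1
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()} if total else {}
```

For example, a chrX variant with one REF male, one homozygous-ALT female, and one heterozygous unknown-sex sample yields 2 REF and 3 ALT observations, so the REF frequency is 2/5.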
--freq can now report histograms summarizing the allele frequency spectrum. When the 'refbins=' modifier is present, its argument is interpreted as a sequence of comma-separated REF frequency/count bin boundaries, and the corresponding histogram is written to plink2.afreq.ref.bins or plink2.acount.ref.bins. Alternatively, when 'refbins-file=' is present, the named file is interpreted as a sequence of whitespace-separated bin boundaries. 'alt1bins='/'alt1bins-file=' use the same syntax, and report ALT1 frequency/count histograms to plink2.afreq.alt1.bins or plink2.acount.alt1.bins.
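To make the bin-boundary idea concrete, here is a small Python sketch that turns a 'refbins='-style comma-separated string into boundaries and counts how many frequencies fall into each bin. The function names are hypothetical, and the interval convention (each bin includes its lower boundary and excludes its upper one, with an extra bin below the first boundary) is an assumption for this example; check the file format entry for the convention PLINK 2 actually uses.

```python
from bisect import bisect_right

def parse_boundaries(text):
    """Parse comma- or whitespace-separated bin boundaries into a sorted list."""
    return sorted(float(tok) for tok in text.replace(',', ' ').split())

def bin_counts(values, boundaries):
    """Count values per bin; index 0 holds values below the first boundary."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        # bisect_right puts a value equal to a boundary into the bin that starts there.
        counts[bisect_right(boundaries, v)] += 1
    return counts

bounds = parse_boundaries("0.01,0.05,0.5")
hist = bin_counts([0.0, 0.01, 0.02, 0.3, 0.5, 0.9], bounds)
```

The same parser also accepts the whitespace-separated layout described for the '-file=' variants, since it splits on both separators.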
--geno-counts ['zs'] ['cols='<column set descriptor>]
--geno-counts writes a genotype hardcall count report to plink2.gcount[.zst]; refer to the file format entry for output details and optional columns. (Note that unlike --freq, this report is not restricted to founders, unless you explicitly request that with e.g. --keep-founders.)
Since --geno-counts does not support dosages, "--freq counts" is now a better way to generate an input file for --read-freq's use.
--sample-counts ['zs'] ['cols='<column set descriptor>]
--sample-counts reports the number of observed variants (relative to the reference genome) per sample, subdivided into various classes.
- This is a highly optimized implementation of the "Per-sample counts" report added by the -s flag to "bcftools stats". If your variants have been left-normalized and split, and your single-letter allele codes are restricted to {A, C, G, T, a, c, g, t}, the SNP counts reported by PLINK 2 and bcftools should be identical.
- Homozygous-ALT genotypes only count as 1 variant, for consistency with bcftools.
- To keep non-reference, non-missing counts constant through variant splits and joins, we count heterozygous ALTx/ALTy genotypes as 2 variants. This is an intentional change from bcftools.
- Unknown-sex samples are treated as female.
- Heterozygous haploid calls (MT included) are treated as missing.
- As with other commands, SNPs that have not been left-normalized are counted as non-SNP non-symbolic.
- Refer to the file format entry for output details and optional columns.
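The counting rules for homozygous-ALT and ALTx/ALTy genotypes can be illustrated with a short Python sketch. The genotype encoding (allele-index tuples with 0 = REF, None = missing) and the function name are assumptions for this example; it covers only the per-genotype variant count, not the class breakdown or the sex and haploid handling.

```python
def variant_count(gt):
    """Number of variants one diploid genotype contributes to a sample's total."""
    if gt is None:
        return 0          # missing call
    a1, a2 = gt
    if a1 == 0 and a2 == 0:
        return 0          # homozygous REF
    if a1 == a2:
        return 1          # homozygous ALT counts once
    if a1 != 0 and a2 != 0:
        return 2          # heterozygous ALTx/ALTy counts twice
    return 1              # heterozygous REF/ALT

# An ALT1/ALT2 call at a multiallelic site, and the same call after splitting
# into two biallelic records, give the same total.
joined = variant_count((1, 2))
split = variant_count((0, 1)) + variant_count((0, 1))
```

Here `joined` and `split` are both 2, which is the invariance the ALTx/ALTy rule is meant to preserve.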
--missing ['zs'] [{sample-only | variant-only}] ['scols='<col. set descriptor>]
['vcols='<col. set descriptor>]
--missing produces sample-based and variant-based missing data reports (or just one of these reports, with 'sample-only'/'variant-only').
- This report is not restricted to founders.
- By default, this summarizes hardcall missingness. There are optional output columns summarizing dosage missingness, as well as heterozygous haploid (including mixed MT) counts; refer to the file format entries for details.
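As a rough sketch of what the two reports summarize, the following Python function computes hardcall missing rates per sample and per variant from a small in-memory table. The table layout (one row per variant, one column per sample, None for a missing call) and the function name are assumptions for this illustration; real reports also apply chromosome- and sex-specific rules that are not modeled here.

```python
def missing_rates(calls):
    """calls[v][s] is the hardcall for variant v and sample s (None = missing).

    Returns (per_sample_rates, per_variant_rates).
    """
    n_var = len(calls)
    n_samp = len(calls[0]) if calls else 0
    if n_var == 0 or n_samp == 0:
        return [], []
    per_variant = [sum(c is None for c in row) / n_samp for row in calls]
    per_sample = [
        sum(calls[v][s] is None for v in range(n_var)) / n_var
        for s in range(n_samp)
    ]
    return per_sample, per_variant

table = [
    [(0, 0), None, (0, 1)],   # variant 1
    [None,   None, (1, 1)],   # variant 2
]
sample_rates, variant_rates = missing_rates(table)
```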
--genotyping-rate ['dosage']
PLINK 1.x almost always computed the overall missing-genotype frequency and reported it to the log, even when no other operation in the run required the entire genotype table to be scanned. As a performance optimization, PLINK 2 no longer prints it by default, but you can opt in with --genotyping-rate.
The 'dosage' modifier causes the missing-dosage frequency (which can be smaller than the missing-genotype frequency) to be reported instead.
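The relationship between the two frequencies can be sketched in Python as follows. The data model is an assumption made for this example: each table entry is a (hardcall, dosage) pair, either of which may be None, and an entry with a hardcall is taken to have a usable dosage even if none is stored explicitly.

```python
def missing_frequencies(entries):
    """Return (missing_genotype_freq, missing_dosage_freq) over all entries."""
    entries = list(entries)
    total = len(entries)
    if total == 0:
        return float('nan'), float('nan')
    miss_gt = sum(h is None for h, _ in entries)
    # A dosage is missing only when neither a hardcall nor a dosage is present.
    miss_ds = sum(h is None and d is None for h, d in entries)
    return miss_gt / total, miss_ds / total

entries = [((0, 0), None), (None, 0.5), (None, None), ((0, 1), 1.0)]
gt_freq, ds_freq = missing_frequencies(entries)
```

In this toy table one entry has a dosage but no hardcall, so the missing-dosage frequency (0.25) comes out smaller than the missing-genotype frequency (0.5).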
--hardy ['zs'] ['midp'] ['log10'] ['redundant'] ['cols='<col set descriptor>]
--hardy writes autosomal Hardy-Weinberg equilibrium exact test statistics to plink2.hardy[.zst], and/or chrX test statistics to plink2.hardy.x[.zst]. The latter report is based on the computation described in Graffelman J, Weir BS (2016) Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome.
- By default, only founders are considered; this can be changed with --nonfounders.
- For variants with k alleles where k>2, k separate 'biallelic' tests are performed, each reported on its own line. However, biallelic variants are normally reported on a single line, since the counts/frequencies would be mirror-images and the p-values would be the same. You can add the 'redundant' modifier to force biallelic variant results to be reported on two lines for parsing convenience.
- With the 'midp' modifier, a mid-p adjustment is applied (see --hwe for discussion).
- The 'log10' modifier causes (mid-)p-values to be reported in -log10(p) form.
- Since multiple case/control phenotypes can now be loaded simultaneously, this no longer automatically computes separate statistics for just controls or just cases. Call this with e.g. --keep-if to report phenotype-stratified stats.
- Refer to the file format entries for output details and optional columns.
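The per-allele 'biallelic' view of a multiallelic variant can be sketched in Python. For each allele, genotypes are reduced to three counts: homozygous for that allele, heterozygous for it, and carrying no copy of it. The encoding, the function names, and the three-count layout are assumptions for this example; the exact test itself is not implemented here.

```python
import math

def biallelic_collapse(genotypes, k):
    """For each allele a in 0..k-1, return (hom_a, het_a, other) genotype counts."""
    out = []
    for a in range(k):
        hom = het = other = 0
        for gt in genotypes:
            if gt is None:
                continue  # missing calls are skipped
            copies = (gt[0] == a) + (gt[1] == a)
            if copies == 2:
                hom += 1
            elif copies == 1:
                het += 1
            else:
                other += 1
        out.append((hom, het, other))
    return out

def neg_log10(p):
    """Report a p-value in -log10(p) form, as the 'log10' modifier does."""
    return -math.log10(p)

bi = biallelic_collapse([(0, 0), (0, 1), (1, 1), (1, 1)], 2)
tri = biallelic_collapse([(0, 1), (1, 2), (2, 2), (0, 0)], 3)
```

For the biallelic input the two count triples are mirror images of each other, which is why a single output line normally suffices; for the triallelic input the three triples differ.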
--het ['zs'] ['small-sample'] ['cols='<col. set descriptor>]
--het computes observed and expected homozygous/heterozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates (i.e. 1 - (<observed het. count> / <expected het. count>)) to plink2.het[.zst].
- Multiallelic variants are handled properly.
- This function requires decent MAF estimates. If there are very few samples in your immediate fileset, --read-freq is practically mandatory since imputed MAFs are wildly inaccurate in that case. Also, due to the use of allele frequencies, if your dataset has a highly imbalanced ancestry distribution (e.g. >90% EUR but a few samples with ancestry primarily from other continents), you may need to process the rare-ancestry samples separately.
- It's usually best to perform this calculation on a variant set in approximate linkage equilibrium.
- By default, --het omits the n/(n-1) multiplier in Nei's expected homozygosity formula. The 'small-sample' modifier causes the multiplier to be included, while forcing --het to use MAFs imputed from founders in the immediate dataset.
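The method-of-moments estimate can be sketched in Python for a single sample. This is an illustration under stated assumptions rather than PLINK 2's code: the expected heterozygosity of each variant is taken as 1 minus the sum of squared allele frequencies (which also covers multiallelic variants), the small-sample multiplier is omitted, and the input layout and function name are invented for the example.

```python
def f_coefficient(sample_gts, allele_freqs):
    """Estimate F = 1 - (observed het. count / expected het. count) for one sample.

    sample_gts: one (a1, a2) allele-index tuple per variant, None when missing.
    allele_freqs: one list of allele frequencies per variant, summing to 1.
    """
    obs_het = 0
    exp_het = 0.0
    for gt, freqs in zip(sample_gts, allele_freqs):
        if gt is None:
            continue  # skip variants where this sample has no call
        obs_het += gt[0] != gt[1]
        exp_het += 1.0 - sum(p * p for p in freqs)
    if exp_het == 0.0:
        return float('nan')  # no informative variants
    return 1.0 - obs_het / exp_het

f = f_coefficient(
    [(0, 1), (0, 0), (1, 1), (0, 0)],
    [[0.5, 0.5]] * 4,
)
```

With four variants at frequency 0.5 the expected heterozygous count is 2, and the sample has 1 observed heterozygous call, so `f` is 0.5.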
--fst <categorical or binary phenotype name> ['method='<method name>]
['blocksize='<jackknife block size>] ['cols='<column set descriptor>]
['report-variants'] ['zs'] ['vcols='<column set descriptor>]
['base='<pop. ID> | 'ids='<pop. ID> | 'file='<pop.-ID-pair file>]
[other population ID(s) for base=/ids=...]
Given a categorical or binary phenotype defining a set of subpopulations, --fst computes Wright's FST estimates between each pair of populations, writing results to plink2[.x].fst.summary.
- Two methods are supported: Hudson and Weir-Cockerham. In both cases, the final estimate is a ratio-of-averages.
- If chrX is present, its results are written to separate file(s) with ".x" in the extension when the Hudson method is used. (chrX is skipped under the Weir-Cockerham method.)
- To get block-jackknife-based standard error estimates, provide a 'blocksize=' value.
- You can request per-variant FST estimates with the 'report-variants' modifier; this generates a separate plink2[.x].<popID1>.<popID2>.fst.var[.zst] file for each population pair. (The 'zs' modifier causes these files to be Zstd-compressed.)
- By default, all pairs of populations are compared. If you only want to compare some pairs, there are three ways to do this.
- 'base=' specifies one base population to be compared with all others; or if you specify more population ID(s) afterward, just the other populations you've listed.
- 'ids=' specifies an all-vs.-all comparison within the given set of populations.
- 'file=' specifies a file containing one population pair per line.
Note that 'base='/'ids='/'file=' must be positioned after all other modifiers on the command line.
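To show what a ratio-of-averages estimate looks like, here is a Python sketch of the Hudson estimator in the form given by Bhatia et al. (2013) for biallelic variants. It is a sketch only: whether PLINK 2's implementation matches it term for term is an assumption, as are the function name and the input layout (per-variant allele frequencies and allele-observation counts for each of two populations).

```python
def hudson_fst(p1, n1, p2, n2):
    """Ratio-of-averages Hudson FST across variants.

    p1, p2: per-variant frequencies of one allele in populations 1 and 2.
    n1, n2: per-variant numbers of allele observations (each must exceed 1).
    """
    num = 0.0
    den = 0.0
    for a, na, b, nb in zip(p1, n1, p2, n2):
        # Per-variant numerator, with the finite-sample correction terms.
        num += (a - b) ** 2 - a * (1 - a) / (na - 1) - b * (1 - b) / (nb - 1)
        # Per-variant denominator.
        den += a * (1 - b) + b * (1 - a)
    # Sum numerators and denominators separately, then divide once,
    # instead of averaging per-variant ratios.
    return num / den if den else float('nan')

fixed = hudson_fst([1.0], [10], [0.0], [10])
same = hudson_fst([0.5], [11], [0.5], [11])
```

A fixed difference between the two populations gives an estimate of 1, while identical frequencies give a slightly negative value because of the finite-sample correction.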
--pgen-info
Given an input .pgen file, --pgen-info prints the following information about it:
- Number of variants
- Number of samples
- Are all REF alleles 'known', 'provisional', or a mix?
- Maximum allele count for a single variant (exact value may require .pvar input)
- Are phased hardcalls present?
- Are dosages present? Are any of them explicitly phased?
All values except for "maximum allele count for a single variant" can be determined from a quick scan of the .pgen's header.