Basic statistics
--freq ['zs'] ['counts'] ['cols='<column set descriptor>]
['refbins='<comma-separated bin boundaries> | 'refbins-file='<filename>]
['alt1bins='<comma-separated bin bounds> | 'alt1bins-file='<filename>]
['bins-only']
--freq normally writes an empirical allele frequency report to plink2.afreq[.zst]. With the 'counts' modifier, an allele count/dosage report is written to plink2.acount[.zst] instead.
- Allele frequency is defined as <# of observations of current allele> / <# of observations of any allele> (unless a pseudocount is requested with --af-pseudocount). Note that there's only one allele observation per male for chrX variants, and two per female.
- Unknown-sex samples are treated as female in the main allele-frequency computation.
- When pedigree information is present, and 'counts' is not specified, PLINK 2 defaults to excluding nonfounders from this calculation; this can be changed with --nonfounders. There is no longer an analogous default in 'counts' mode; you now must explicitly specify how you want nonfounders to be handled (with --nonfounders or --ac-founders) in that case.
- Phenotype- and category-stratified frequency reports are no longer directly supported. However, you can use --keep-if to filter on a phenotype condition, and --loop-cats to filter on each category in turn. --variant-score can also be employed for these use cases when you have no missing genotypes (or mean-imputation is acceptable).
- The resulting report is valid input for --read-freq. "--freq counts" output contains enough information for perfect reconstruction of allele frequencies (this was not true for dosage data before 22 Nov 2019).
- Refer to the file format entry for output details and optional columns.
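The frequency definition above can be illustrated with a minimal Python sketch. It is not PLINK 2's implementation: the function names, the tuple genotype encoding, and the sex labels are hypothetical, and pseudocounts and founder filtering are left out.

```python
from collections import Counter

def count_allele_observations(genotypes, sexes, chrom):
    """Tally allele observations for one variant.

    genotypes: list of (a, b) allele-index tuples, or None when missing.
    sexes: list of "male", "female", or "unknown" (hypothetical encoding).
    """
    counts = Counter()
    for gt, sex in zip(genotypes, sexes):
        if gt is None:
            continue
        if chrom == "X" and sex == "male":
            # One observation per male on chrX. Assumption: heterozygous
            # male calls are simply skipped in this sketch.
            if gt[0] == gt[1]:
                counts[gt[0]] += 1
        else:
            # Females and unknown-sex samples contribute two observations.
            counts[gt[0]] += 1
            counts[gt[1]] += 1
    return counts

def allele_frequencies(counts):
    """<# observations of current allele> / <# observations of any allele>."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}

x_counts = count_allele_observations(
    [(0, 1), (1, 1), (0, 0)], ["female", "male", "unknown"], "X")
x_freqs = allele_frequencies(x_counts)
```

Here the male contributes a single observation of allele 1, while the unknown-sex sample is counted like a female and contributes two observations of allele 0.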
--freq can now report histograms summarizing the allele frequency spectrum. When the 'refbins=' modifier is present, its argument is interpreted as a sequence of comma-separated REF frequency/count bin boundaries, and the corresponding histogram is written to plink2.afreq.ref.bins or plink2.acount.ref.bins. Alternatively, when 'refbins-file=' is present, the named file is interpreted as a sequence of whitespace-separated bin boundaries. 'alt1bins='/'alt1bins-file=' use the same syntax, and report ALT1 frequency/count histograms to plink2.afreq.alt1.bins or plink2.acount.alt1.bins.
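As a rough illustration of the two bin-boundary syntaxes and the histogram they imply, here is a Python sketch. The helper names are hypothetical, and the bin-edge convention (a value equal to a boundary goes into the bin that starts at that boundary, with one extra bin below the first boundary) is an assumption, not something stated above; check the file format entry for the convention PLINK 2 actually uses.

```python
from bisect import bisect_right

def parse_bin_list(arg):
    """Parse a comma-separated boundary list, as passed to 'refbins='."""
    return [float(tok) for tok in arg.split(",")]

def parse_bin_text(text):
    """Parse whitespace-separated boundaries, as read from a 'refbins-file='."""
    return [float(tok) for tok in text.split()]

def histogram(values, boundaries):
    """Count values per bin; returns len(boundaries) + 1 counts."""
    bounds = sorted(boundaries)
    counts = [0] * (len(bounds) + 1)
    for v in values:
        # Assumed convention: bin i covers [bounds[i-1], bounds[i]).
        counts[bisect_right(bounds, v)] += 1
    return counts

bounds = parse_bin_list("0.01,0.05,0.5")
hist = histogram([0.005, 0.01, 0.03, 0.2, 0.5, 0.99], bounds)
```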
--geno-counts ['zs'] ['cols='<column set descriptor>]
--geno-counts writes a genotype hardcall count report to plink2.gcount[.zst]; refer to the file format entry for output details and optional columns. (Note that unlike --freq, this report is not restricted to founders, unless you explicitly request that with e.g. --keep-founders.)
Since this report doesn't support dosages, "--freq counts" is now a better way to generate an input file for --read-freq.
--sample-counts ['zs'] ['cols='<column set descriptor>]
--sample-counts reports the number of observed variants (relative to the reference genome) per sample, subdivided into various classes.
- This is a highly optimized implementation of the "Per-sample counts" report added by the -s flag to "bcftools stats". If your variants have been left-normalized and split, and your single-letter allele codes are restricted to {A, C, G, T, a, c, g, t}, the SNP counts reported by PLINK 2 and bcftools should be identical.
- Homozygous-ALT genotypes only count as 1 variant, for consistency with bcftools.
- To keep non-reference, non-missing counts constant through variant splits and joins, we count heterozygous ALTx/ALTy genotypes as 2 variants. This is an intentional change from bcftools.
- Unknown-sex samples are treated as female.
- Heterozygous haploid calls (MT included) are treated as missing.
- As with other commands, SNPs that have not been left-normalized are counted as non-SNP non-symbolic.
- Refer to the file format entry for output details and optional columns.
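The per-genotype counting rules above can be sketched in Python as follows. This only illustrates the stated rules for diploid calls; the function name and the tuple encoding (0 = REF, None = missing) are hypothetical, and the subdivision into variant classes is omitted.

```python
def variant_count(gt):
    """Number of variants a single diploid genotype contributes.

    gt: (a, b) tuple of allele indices with 0 = REF, or None when missing.
    """
    if gt is None:
        return 0
    a, b = gt
    if a == 0 and b == 0:
        return 0  # homozygous REF: no variant observed
    if a != 0 and b != 0 and a != b:
        return 2  # heterozygous ALTx/ALTy counts as 2 variants
    return 1      # REF/ALT heterozygous, or homozygous ALT (counted once)

# A 1/2 genotype at a multiallelic site becomes 0/1 and 0/1 after a split,
# so the total stays at 2 either way.
joined_total = variant_count((1, 2))
split_total = variant_count((0, 1)) + variant_count((0, 1))
```

The last two lines show why the ALTx/ALTy rule keeps the count constant through splits and joins.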
--missing ['zs'] [{sample-only | variant-only}] ['scols='<col. set descriptor>]
['vcols='<col. set descriptor>]
--missing produces sample-based and variant-based missing data reports (or just one of these reports, with 'sample-only'/'variant-only').
- This report is not restricted to founders.
- By default, this summarizes hardcall missingness. There are optional output columns summarizing dosage missingness, as well as heterozygous haploid (including mixed MT) counts; refer to the file format entries for details.
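For intuition, the two hardcall missingness summaries can be sketched in Python as row-wise and column-wise rates over a small call matrix. The function name and the list-of-lists layout are hypothetical, and chromosome- or sex-specific handling is ignored.

```python
def missing_rates(calls):
    """Return (per-sample rates, per-variant rates).

    calls[s][v] is the hardcall for sample s at variant v, or None if missing.
    """
    if not calls or not calls[0]:
        return [], []
    n_samples = len(calls)
    n_variants = len(calls[0])
    # Sample-based report: fraction of variants missing in each row.
    sample_rates = [sum(c is None for c in row) / n_variants for row in calls]
    # Variant-based report: fraction of samples missing in each column.
    variant_rates = [
        sum(row[v] is None for row in calls) / n_samples
        for v in range(n_variants)
    ]
    return sample_rates, variant_rates

sample_rates, variant_rates = missing_rates([[0, None, 2], [None, None, 1]])
```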
--genotyping-rate ['dosage']
PLINK 1.x almost always computed the overall missing-genotype frequency and reported it to the log, even when no other operation in the run required the entire genotype table to be scanned. As a performance optimization, PLINK 2 no longer defaults to printing it, but you can opt in with --genotyping-rate.
The 'dosage' modifier causes the missing-dosage frequency (which can be smaller than the missing-genotype frequency) to be reported instead.
--hardy ['zs'] ['midp'] ['log10'] ['redundant'] ['cols='<col set descriptor>]
--hardy writes autosomal Hardy-Weinberg equilibrium exact test statistics to plink2.hardy[.zst], and/or chrX test statistics to plink2.hardy.x[.zst]. The latter report is based on the computation described in Graffelman J, Weir BS (2016) Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome.
- By default, only founders are considered; this can be changed with --nonfounders.
- For variants with j alleles where j>2, j separate 'biallelic' tests are performed, each reported on its own line. However, biallelic variants are normally reported on a single line, since the counts/frequencies would be mirror images and the p-values would be the same. You can add the 'redundant' modifier to force biallelic variant results to be reported on two lines for parsing convenience.
- With the 'midp' modifier, a mid-p adjustment is applied (see --hwe for discussion).
- The 'log10' modifier causes (mid-)p-values to be reported in -log10(p) form.
- Since multiple case/control phenotypes can now be loaded simultaneously, this no longer automatically computes separate statistics for just controls or just cases. Call this with e.g. --keep-if to report phenotype-stratified stats.
- Refer to the file format entries for output details and optional columns.
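The per-allele 'biallelic' tests for multiallelic variants can be pictured with a short Python sketch. It assumes that each test treats one allele as the target and lumps all other alleles together, which is one reading of the description above rather than a confirmed detail. The function names and the dict layout are hypothetical, and the exact test itself is not implemented here.

```python
import math

def biallelic_counts(genotype_counts, target):
    """Collapse genotype counts to (hom. target, het. target, no target).

    genotype_counts: dict mapping (a, b) allele-index pairs to sample counts.
    """
    hom_target = het = other = 0
    for (a, b), n in genotype_counts.items():
        copies = (a == target) + (b == target)
        if copies == 2:
            hom_target += n
        elif copies == 1:
            het += n
        else:
            other += n
    return hom_target, het, other

def neg_log10(p):
    """Report a p-value in -log10(p) form, as the 'log10' modifier does."""
    return -math.log10(p)

triallelic = {(0, 0): 50, (0, 1): 30, (1, 1): 5, (0, 2): 10, (1, 2): 4, (2, 2): 1}
per_allele = [biallelic_counts(triallelic, allele) for allele in range(3)]

biallelic = {(0, 0): 60, (0, 1): 30, (1, 1): 10}
mirror = (biallelic_counts(biallelic, 0), biallelic_counts(biallelic, 1))
```

For the biallelic case, the two tuples in `mirror` are reverses of each other, which is why a single output line suffices unless 'redundant' is given.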
--het ['zs'] ['small-sample'] ['cols='<col. set descriptor>]
--het computes observed and expected homozygous/heterozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates (i.e. 1 - <observed het. count> / <expected het. count>) to plink2.het[.zst].
- Multiallelic variants are handled properly.
- This function requires decent MAF estimates. If there are very few samples in your immediate fileset, --read-freq is practically mandatory since imputed MAFs are wildly inaccurate in that case. Also, due to the use of allele frequencies, if your dataset has a highly imbalanced ancestry distribution (e.g. >90% EUR but a few samples with ancestry primarily from other continents), you may need to process the rare-ancestry samples separately.
- It's usually best to perform this calculation on a variant set in approximate Hardy-Weinberg and linkage equilibrium.
- By default, --het omits the n/(n-1) multiplier in Nei's expected homozygosity formula. The 'small-sample' modifier causes the multiplier to be included, while forcing --het to use MAFs imputed from founders in the immediate dataset.
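A minimal Python sketch of the method-of-moments estimate, F = 1 - <observed het. count> / <expected het. count>, follows. It is restricted to biallelic variants, takes the expected heterozygous count as the sum of 2p(1-p) over the sample's non-missing variants, and leaves out the 'small-sample' multiplier; those simplifications and all names are assumptions made for illustration.

```python
def f_coefficient(genotypes, alt_freqs):
    """Method-of-moments F for one sample across biallelic variants.

    genotypes: list of (a, b) allele-index tuples, or None when missing.
    alt_freqs: ALT allele frequency per variant (e.g. loaded externally).
    """
    observed_het = 0
    expected_het = 0.0
    for gt, p in zip(genotypes, alt_freqs):
        if gt is None:
            continue  # missing calls contribute to neither count
        observed_het += gt[0] != gt[1]
        expected_het += 2.0 * p * (1.0 - p)
    if expected_het == 0.0:
        raise ValueError("no informative variants for this sample")
    return 1.0 - observed_het / expected_het

freqs = [0.5, 0.5, 0.5, 0.5]
f_typical = f_coefficient([(0, 1), (0, 0), (1, 1), (0, 1)], freqs)
f_all_hom = f_coefficient([(0, 0), (1, 1), (0, 0), (1, 1)], freqs)
```

A sample with as many heterozygous calls as expected gets F near 0, while a sample with none gets F = 1, which is why poor frequency estimates distort the result so badly.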
--check-sex ['max-female-xf='<x>] ['min-male-xf='<y>]
['max-female-ycount='<z>] ['min-male-ycount='<w>]
['max-female-yrate='<v>] ['min-male-yrate='<u>]
['cols='<col. set descriptor>]
--impute-sex ['max-female-xf='<x>] ['min-male-xf='<y>]
['max-female-ycount='<z>] ['min-male-ycount='<w>]
['max-female-yrate='<v>] ['min-male-yrate='<u>]
['cols='<col. set descriptor>]
--check-sex compares sex assignments in the input dataset with those imputed from chrX inbreeding coefficients and/or chrY valid genotype call count/rate (heterozygous genotype calls are invalid on chrY), and writes a report to plink2.sexcheck. Specifically:
- If 'max-female-xf=' and/or 'min-male-xf=' are specified, chrX is used if present.
- If 'max-female-ycount=', 'min-male-ycount=', 'max-female-yrate=', or 'min-male-yrate=' are specified, chrY is used if present.
- If both chrX and chrY are usable, sex is only called if both conditions are satisfied. Similarly, if both count and rate are specified for chrY, the strictest condition must be satisfied.
- If no thresholds are specified at all, a warning is printed, and then the run proceeds as if the parameters were "min-male-xf=1 max-female-yrate=0". In this case, unless you're just sanity-checking pre-cleaned data, you should look at the distributions of xf and yrate in the .sexcheck output file, and then rerun --check-sex with data-derived thresholds.
- On chrX, male F-statistics should be in a big clump near 1, while female F-statistics should be centered near zero but can be widely dispersed.
- On chrY, female valid-genotype rates should be in a big clump near 0, while male valid-genotype rates should be consistently higher but can be dispersed.
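The threshold logic above can be sketched in Python as follows. This is an interpretation, not PLINK 2's code: it assumes thresholds are inclusive, that an unspecified threshold imposes no constraint, and that a sex is called only when every specified condition for it holds. The count-based chrY thresholds are omitted, and the numeric thresholds in the example are hypothetical placeholders, not recommended values.

```python
def all_specified_hold(checks):
    """True when at least one check was specified and every one holds."""
    specified = [ok for ok in checks if ok is not None]
    return bool(specified) and all(specified)

def call_sex(xf, yrate, max_female_xf=None, min_male_xf=None,
             max_female_yrate=None, min_male_yrate=None):
    """Return "male", "female", or "unknown" for one sample."""
    male = all_specified_hold([
        None if min_male_xf is None else xf >= min_male_xf,
        None if min_male_yrate is None else yrate >= min_male_yrate,
    ])
    female = all_specified_hold([
        None if max_female_xf is None else xf <= max_female_xf,
        None if max_female_yrate is None else yrate <= max_female_yrate,
    ])
    if male and not female:
        return "male"
    if female and not male:
        return "female"
    return "unknown"

# Hypothetical thresholds; in practice derive them from the xf and yrate
# distributions in the .sexcheck output.
thresholds = dict(max_female_xf=0.3, min_male_xf=0.8,
                  max_female_yrate=0.1, min_male_yrate=0.5)
calls = [call_sex(0.95, 0.9, **thresholds),
         call_sex(0.05, 0.0, **thresholds),
         call_sex(0.95, 0.0, **thresholds)]
```

The third sample has a male-like chrX F-statistic but a female-like chrY rate, so no sex is called under this reading.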
Other notes:
- Make sure that the chrX pseudo-autosomal region has been split off (with e.g. --split-par) before using this.
- You also need decent MAF estimates (so, with very few samples in your immediate fileset, use --read-freq), and it's best for your variants to be in approximate Hardy-Weinberg and linkage equilibrium.
- For samples which barely fail the max-female-xf threshold, you may want to check autosomal inbreeding coefficients (--het). When that is also high, you're probably dealing with a highly-inbred female.

--impute-sex changes sex assignments to the imputed values, and is otherwise identical to --check-sex. It must be used with --make-[b]pgen/--make-bed/--export/--write-covar and no other commands.
--fst <categorical or binary phenotype name> ['method='<method name>]
['blocksize='<jackknife block size>] ['cols='<column set descriptor>]
['report-variants'] ['zs'] ['vcols='<column set descriptor>]
['base='<pop. ID> | 'ids='<pop. ID> | 'file='<pop.-ID-pair file>]
[other population ID(s) for base=/ids=...]
Given a categorical or binary phenotype defining a set of subpopulations, --fst computes Wright's FST estimates between each pair of populations, writing results to plink2[.x].fst.summary.
- Two methods are supported: Hudson and Weir-Cockerham. In both cases, the final estimate is a ratio-of-averages.
- If chrX is present, its results are written to separate file(s) with ".x" in the extension when the Hudson method is used. (chrX is skipped under the Weir-Cockerham method.)
- To get block-jackknife-based standard error estimates, provide a 'blocksize=' value.
- You can request per-variant FST estimates with the 'report-variants' modifier; this generates a separate plink2[.x].<popID1>.<popID2>.fst.var[.zst] file for each population pair. (The 'zs' modifier causes these files to be Zstd-compressed.)
- By default, all pairs of populations are compared. If you only want to compare some pairs, there are three ways to do this.
- 'base=' specifies one base population to be compared with all others; or if you specify more population ID(s) afterward, just the other populations you've listed.
- 'ids=' specifies an all-vs.-all comparison within the given set of populations.
- 'file=' specifies a file containing one population pair per line.
Note that 'base='/'ids='/'file=' must be positioned after all other modifiers on the command line.
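To make "ratio-of-averages" concrete, here is a Python sketch for one population pair. The per-variant numerator and denominator follow the Hudson estimator as commonly written (Bhatia et al. 2013); whether PLINK 2 computes exactly these terms is an assumption, as is the reading of n1 and n2 as allele observation counts. All function names are hypothetical.

```python
def hudson_terms(p1, p2, n1, n2):
    """Per-variant (numerator, denominator) for one population pair.

    p1, p2: sample allele frequencies; n1, n2: allele observation counts
    (assumed meaning of the sample-size terms).
    """
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den

def ratio_of_averages(terms):
    """Sum numerators and denominators across variants, then divide once."""
    terms = list(terms)
    return sum(n for n, _ in terms) / sum(d for _, d in terms)

def average_of_ratios(terms):
    """Shown only for contrast: divide per variant, then average."""
    terms = list(terms)
    return sum(n / d for n, d in terms) / len(terms)

terms = [hudson_terms(0.2, 0.8, 101, 101), hudson_terms(0.5, 0.5, 101, 101)]
fst_roa = ratio_of_averages(terms)
fst_aor = average_of_ratios(terms)
```

The two summaries generally differ; a ratio-of-averages gives each variant weight proportional to its denominator instead of weighting all variants equally.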
--pgen-info
Given an input .pgen file, --pgen-info prints the following information about it:
- Number of variants
- Number of samples
- Are all REF alleles 'known', 'provisional', or a mix?
- Maximum allele count for a single variant (exact value may require .pvar input)
- Are phased hardcalls present?
- Are dosages present? Are any of them explicitly phased?
All values except for "maximum allele count for a single variant" can be determined from a quick scan of the .pgen's header.