D: 3 Dec 2024

Basic statistics

Allele frequency

--freq ['zs'] ['counts'] ['cols='<column set descriptor>]
       ['refbins='<comma-separated bin boundaries> | 'refbins-file='<filename>]
       ['alt1bins='<comma-separated bin bounds> | 'alt1bins-file='<filename>]
       ['bins-only']

--freq normally writes an empirical allele frequency report to plink2.afreq[.zst]. With the 'counts' modifier, an allele count/dosage report is written to plink2.acount[.zst] instead.

  • Allele frequency is defined as <# of observations of current allele> / <# of observations of any allele> (unless a pseudocount is requested with --af-pseudocount). Note that there's only one allele observation per male for chrX variants, and two per female.
  • Unknown-sex samples are treated as female in the main allele-frequency computation.
  • When pedigree information is present, and 'counts' is not specified, PLINK 2 defaults to excluding nonfounders from this calculation; this can be changed with --nonfounders. There is no longer an analogous default in 'counts' mode; you now must explicitly specify how you want nonfounders to be handled (with --nonfounders or --ac-founders) in that case.
  • Phenotype- and category-stratified frequency reports are no longer directly supported. However, you can use --keep-if to filter on a phenotype condition, and --loop-cats to filter on each category in turn. --variant-score can also be employed for these use cases when you have no missing genotypes (or mean-imputation is acceptable).
  • This file is valid input for --read-freq. "--freq counts" output contains enough information for perfect reconstruction of allele frequencies (this was not true for dosage data before 22 Nov 2019).
  • Refer to the file format entry for output details and optional columns.
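The counting rule in the first two bullets can be illustrated with a minimal Python sketch. This is not PLINK 2's implementation: the function name, the 0/1/2 copy-count encoding, the "M"/"F"/"U" sex codes, and the choice to skip heterozygous male chrX calls are all assumptions made for illustration, and dosages, pseudocounts, and founder filtering are ignored.

```python
def allele_freq(copies, sexes, chrx=False):
    """Frequency of one allele from hardcall copy counts.

    copies: per-sample copies of the allele (0, 1, 2, or None if missing).
    sexes:  per-sample "M", "F", or "U" (hypothetical codes for this sketch).
    """
    allele_obs = 0
    total_obs = 0
    for g, sex in zip(copies, sexes):
        if g is None:
            continue  # missing calls contribute no observations
        if chrx and sex == "M":
            if g == 1:
                continue  # assumption: a het haploid call is skipped
            allele_obs += g // 2  # one observation per male on chrX
            total_obs += 1
        else:
            # Females and unknown-sex samples: two observations each.
            allele_obs += g
            total_obs += 2
    return allele_obs / total_obs if total_obs else float("nan")
```

For example, with copy counts `[2, 1, 0, 2]` and sexes `["M", "F", "F", "U"]`, the chrX frequency is 4/7, while the same calls on an autosome give 5/8.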

--freq can now report histograms summarizing the allele frequency spectrum. When the 'refbins=' modifier is present, its argument is interpreted as a sequence of comma-separated REF frequency/count bin boundaries, and the corresponding histogram is written to plink2.afreq.ref.bins or plink2.acount.ref.bins. Alternatively, when 'refbins-file=' is present, the named file is interpreted as a sequence of whitespace-separated bin boundaries. 'alt1bins='/'alt1bins-file=' use the same syntax, and report ALT1 frequency/count histograms to plink2.afreq.alt1.bins or plink2.acount.alt1.bins.

Genotype hardcall counts

--geno-counts ['zs'] ['cols='<column set descriptor>]

--geno-counts writes a genotype hardcall count report to plink2.gcount[.zst]; refer to the file format entry for output details and optional columns. (Note that unlike --freq, this report is not restricted to founders, unless you explicitly request that with e.g. --keep-founders.)

Since this report doesn't support dosages, "--freq counts" is now a better way to generate an input file for --read-freq.

Sample variant-counts

--sample-counts ['zs'] ['cols='<column set descriptor>]

--sample-counts reports the number of observed variants (relative to the reference genome) per sample, subdivided into various classes.

  • This is a highly optimized implementation of the "Per-sample counts" report added by the -s flag to "bcftools stats". If your variants have been left-normalized and split, and your single-letter allele codes are restricted to {A, C, G, T, a, c, g, t}, the SNP counts reported by PLINK 2 and bcftools should be identical.
  • Homozygous-ALT genotypes only count as 1 variant, for consistency with bcftools.
  • To keep non-reference, non-missing counts constant through variant splits and joins, we count heterozygous ALTx/ALTy genotypes as 2 variants. This is an intentional change from bcftools.
  • Unknown-sex samples are treated as female.
  • Heterozygous haploid calls (MT included) are treated as missing.
  • As with other commands, SNPs that have not been left-normalized are counted as non-SNP non-symbolic.
  • Refer to the file format entry for output details and optional columns.
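The per-genotype counting rule described above (homozygous-ALT counts as 1, heterozygous ALTx/ALTy counts as 2) can be sketched as follows. The allele-index encoding (0 for REF, 1 and up for ALT alleles, None for missing) and the function names are assumptions for illustration; sex handling and the subdivision into variant classes are omitted.

```python
def variant_count(genotype):
    """Number of variants one diploid genotype contributes.

    genotype: pair of allele indexes, 0 = REF, 1+ = ALT, None = missing.
    """
    a1, a2 = genotype
    if a1 is None or a2 is None:
        return 0  # assumption: missing calls contribute nothing
    # Count distinct non-REF alleles: hom-ALT -> 1, het ALTx/ALTy -> 2.
    return len({a for a in (a1, a2) if a != 0})


def sample_variant_total(genotypes):
    """Total variant count for one sample across all of its genotypes."""
    return sum(variant_count(g) for g in genotypes)
```

Counting distinct non-REF alleles keeps the total unchanged when a multiallelic variant is split into biallelic records, which is the property the bullet above is after.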

Missing data

--missing ['zs'] [{sample-only | variant-only}] ['scols='<col. set descriptor>]
          ['vcols='<col. set descriptor>]

--missing produces sample-based and variant-based missing data reports (or just one of these reports, with 'sample-only'/'variant-only').

  • This report is not restricted to founders.
  • By default, this summarizes hardcall missingness. There are optional output columns summarizing dosage missingness, as well as heterozygous haploid (including mixed MT) counts; refer to the file format entries for details.
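As a rough illustration of what the two reports summarize, here is a minimal Python sketch of hardcall missingness rates. The matrix layout (one row per variant, one column per sample, None for a missing call) and the function name are assumptions; it assumes a non-empty rectangular matrix and ignores chromosome- and sex-specific handling.

```python
def missing_rates(calls):
    """Per-sample and per-variant missing-call rates.

    calls[v][s] is the hardcall of sample s at variant v, or None if missing.
    """
    n_var = len(calls)
    n_samp = len(calls[0]) if calls else 0
    # Fraction of samples missing at each variant.
    variant_rates = [sum(c is None for c in row) / n_samp for row in calls]
    # Fraction of variants missing for each sample.
    sample_rates = [
        sum(row[s] is None for row in calls) / n_var for s in range(n_samp)
    ]
    return sample_rates, variant_rates
```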

--genotyping-rate ['dosage']

PLINK 1.x almost always computed the overall missing-genotype frequency and reported it to the log, even when no other operation in the run required the entire genotype table to be scanned. As a performance optimization, PLINK 2 no longer prints it by default, but you can opt in with --genotyping-rate.

The 'dosage' modifier causes the missing-dosage frequency (which can be smaller than the missing-genotype frequency) to be reported instead.

Hardy-Weinberg equilibrium

--hardy ['zs'] ['midp'] ['log10'] ['redundant'] ['cols='<col set descriptor>]

--hardy writes autosomal Hardy-Weinberg equilibrium exact test statistics to plink2.hardy[.zst], and/or chrX test statistics to plink2.hardy.x[.zst]. The latter report is based on the computation described in Graffelman J, Weir BS (2016) Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome.

  • By default, only founders are considered; this can be changed with --nonfounders.
  • For variants with j alleles where j>2, j separate 'biallelic' tests are performed, each reported on its own line. However, biallelic variants are normally reported on a single line, since the counts/frequencies would be mirror-images and the p-values would be the same. You can add the 'redundant' modifier to force biallelic variant results to be reported on two lines for parsing convenience.
  • With the 'midp' modifier, a mid-p adjustment is applied (see --hwe for discussion).
  • The 'log10' modifier causes (mid-)p-values to be reported in -log10(p) form.
  • Since multiple case/control phenotypes can now be loaded simultaneously, this no longer automatically computes separate statistics for just controls or just cases. Call this with e.g. --keep-if to report phenotype-stratified stats.
  • Refer to the file format entries for output details and optional columns.
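For readers unfamiliar with the exact test, the following Python sketch computes an autosomal, biallelic Hardy-Weinberg exact p-value by enumerating every heterozygote count consistent with the observed allele counts. It is an illustration under stated assumptions, not PLINK 2's optimized implementation: the function name is hypothetical, and the mid-p branch uses one common definition (subtract half the probability of the observed table), which may treat ties differently from PLINK 2.

```python
from math import exp, lgamma, log


def hwe_exact_p(hets, hom1, hom2, midp=False):
    """Exact HWE p-value from biallelic genotype counts."""
    n = hets + hom1 + hom2
    if n == 0:
        return 1.0
    n_a = 2 * hom1 + hets  # copies of allele A
    n_b = 2 * hom2 + hets  # copies of allele B

    def log_prob(h):
        # Log-probability of h heterozygotes given n_a and n_b.
        aa = (n_a - h) // 2
        bb = (n_b - h) // 2
        return (lgamma(n + 1) - lgamma(aa + 1) - lgamma(h + 1) - lgamma(bb + 1)
                + h * log(2.0)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    # The het count must have the same parity as n_a.
    support = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = {h: exp(log_prob(h)) for h in support}
    p_obs = probs[hets]
    # Sum over outcomes no more likely than the observed one.
    p = sum(q for q in probs.values() if q <= p_obs * (1 + 1e-9))
    if midp:
        p -= 0.5 * p_obs
    return min(p, 1.0)
```

With two samples, one homozygous for each allele, the only possible heterozygote counts are 0 (probability 1/3) and 2 (probability 2/3), so `hwe_exact_p(0, 1, 1)` is 1/3 and the mid-p version is 1/6.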

Inbreeding

--het ['zs'] ['small-sample'] ['cols='<col. set descriptor>]

--het computes observed and expected homozygous/heterozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates, i.e. 1 - (<observed het. count> / <expected het. count>), to plink2.het[.zst].

  • Multiallelic variants are handled properly.
  • This function requires decent MAF estimates. If there are very few samples in your immediate fileset, --read-freq is practically mandatory since imputed MAFs are wildly inaccurate in that case. Also, due to the use of allele frequencies, if your dataset has a highly imbalanced ancestry distribution (e.g. >90% EUR but a few samples with ancestry primarily from other continents), you may need to process the rare-ancestry samples separately.
  • It's usually best to perform this calculation on a variant set in approximate Hardy-Weinberg and linkage equilibrium.
  • By default, --het omits the n/(n-1) multiplier in Nei's expected homozygosity formula. The 'small-sample' modifier causes the multiplier to be included, while forcing --het to use MAFs imputed from founders in the immediate dataset.
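The method-of-moments estimate can be sketched for a single sample as below. This is a simplified illustration, not PLINK 2's code: it assumes biallelic variants, externally supplied ALT allele frequencies, a 0/1/2 ALT-copy encoding with None for missing calls, and no n/(n-1) small-sample multiplier; the function name is hypothetical.

```python
def het_f(genotypes, alt_freqs):
    """F = 1 - (observed het count / expected het count) for one sample.

    genotypes: ALT copy counts per variant (0, 1, 2, or None if missing).
    alt_freqs: ALT allele frequency per variant, in the same order.
    """
    obs_het = 0
    exp_het = 0.0
    for g, p in zip(genotypes, alt_freqs):
        if g is None:
            continue  # skip variants where this sample has no call
        obs_het += (g == 1)
        # Expected heterozygosity under Hardy-Weinberg proportions.
        exp_het += 2.0 * p * (1.0 - p)
    return 1.0 - obs_het / exp_het if exp_het > 0 else float("nan")
```

A sample with no heterozygous calls gets F = 1, one whose het count matches expectation gets F = 0, and an excess of heterozygous calls gives a negative value.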

Sex imputation

--check-sex ['max-female-xf='<x>] ['min-male-xf='<y>]
            ['max-female-ycount='<z>] ['min-male-ycount='<w>]
            ['max-female-yrate='<v>] ['min-male-yrate='<u>]
            ['cols='<col. set descriptor>]
--impute-sex ['max-female-xf='<x>] ['min-male-xf='<y>]
             ['max-female-ycount='<z>] ['min-male-ycount='<w>]
             ['max-female-yrate='<v>] ['min-male-yrate='<u>]
             ['cols='<col. set descriptor>]

--check-sex compares sex assignments in the input dataset with those imputed from chrX inbreeding coefficients and/or chrY valid genotype call count/rate (heterozygous genotype calls are invalid on chrY), and writes a report to plink2.sexcheck. Specifically:

  • If 'max-female-xf=' and/or 'min-male-xf=' are specified, chrX is used if present.
  • If 'max-female-ycount=', 'min-male-ycount=', 'max-female-yrate=', or 'min-male-yrate=' are specified, chrY is used if present.
  • If both chrX and chrY are usable, sex is only called if both conditions are satisfied. Similarly, if both count and rate are specified for chrY, the strictest condition must be satisfied.
  • If no thresholds are specified at all, a warning is printed, and then the run proceeds as if the parameters were "min-male-xf=1 max-female-yrate=0". In this case, unless you're just sanity-checking pre-cleaned data, you should look at the distributions of xf and yrate in the .sexcheck output file, and then rerun --check-sex with data-derived thresholds.
    • On chrX, male F-statistics should be in a big clump near 1, while female F-statistics should be centered near zero but can be widely dispersed.
    • On chrY, female valid-genotype rates should be in a big clump near 0, while male valid-genotype rates should be consistently higher but can be dispersed.

Other notes:

  • Make sure that the chrX pseudo-autosomal region has been split off (with e.g. --split-par) before using this.
  • You also need decent MAF estimates (so, with very few samples in your immediate fileset, use --read-freq), and it's best for your variants to be in approximate Hardy-Weinberg and linkage equilibrium.
  • For samples which barely fail the max-female-xf threshold, you may want to check autosomal inbreeding coefficients (--het). When that is also high, you're probably dealing with a highly-inbred female.

--impute-sex changes sex assignments to the imputed values, and is otherwise identical to --check-sex. It must be used with --make-[b]pgen/--make-bed/--export/--write-covar and no other commands.

Pairwise fixation index

--fst <categorical or binary phenotype name> ['method='<method name>]
      ['blocksize='<jackknife block size>] ['cols='<column set descriptor>]
      ['report-variants'] ['zs'] ['vcols='<column set descriptor>]
      ['base='<pop. ID> | 'ids='<pop. ID> | 'file='<pop.-ID-pair file>]
      [other population ID(s) for base=/ids=...]

Given a categorical or binary phenotype defining a set of subpopulations, --fst computes Wright's FST estimates between each pair of populations, writing results to plink2[.x].fst.summary.

.pgen header info

--pgen-info

Given an input .pgen file, --pgen-info prints the following information about it:

  • Number of variants
  • Number of samples
  • Are all REF alleles 'known', 'provisional', or a mix?
  • Maximum allele count for a single variant (exact value may require .pvar input)
  • Are phased hardcalls present?
  • Are dosages present? Are any of them explicitly phased?

All values except for "maximum allele count for a single variant" can be determined from a quick scan of the .pgen's header.
