D: 14 Aug 2022

Basic statistics

Allele frequency

--freq ['zs'] ['counts'] ['cols='<column set descriptor>]
       ['refbins='<comma-separated bin boundaries> | 'refbins-file='<filename>]
       ['alt1bins='<comma-separated bin bounds> | 'alt1bins-file='<filename>]
       ['bins-only']

--freq normally writes an empirical allele frequency report to plink2.afreq[.zst]. With the 'counts' modifier, an allele count/dosage report is written to plink2.acount[.zst] instead.

  • Allele frequency is defined as <# of observations of current allele> / <# of observations of any allele> (unless a pseudocount is requested with --af-pseudocount). Note that there's only one allele observation per male for chrX variants, and two per female.
  • Unknown-sex samples are treated as female in the main allele-frequency computation.
  • By default, only founders are considered; this can be changed with --nonfounders.
  • Phenotype- and category-stratified frequency reports are no longer directly supported. However, you can use --keep-if to filter on a phenotype condition, and --loop-cats to filter on each category in turn. --variant-score can also be employed for these use cases when you have no missing genotypes (or mean-imputation is acceptable).
  • The output file is valid input for --read-freq. "--freq counts" output contains enough information for perfect reconstruction of allele frequencies (this was not true for dosage data before 22 Nov 2019).
  • Refer to the file format entry for output details and optional columns.
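
As a minimal sketch of the frequency definition above (not PLINK 2's actual implementation), the Python below tallies allele observations and divides each allele's count by the total. The function name allele_frequencies and the input encoding (one tuple of allele indices per sample, of length 1 for a male chrX call, None for a missing call) are assumptions made for illustration; founder filtering and --af-pseudocount are not modeled.

```python
from collections import Counter

def allele_frequencies(calls):
    """Return {allele index: frequency} from per-sample allele observations.

    calls: iterable of tuples of allele indices, e.g. (0, 1) for a diploid
    REF/ALT1 call or (1,) for a haploid ALT1 call (male chrX); None = missing.
    This encoding is hypothetical and chosen only for this example.
    """
    counts = Counter()
    for call in calls:
        if call is None:
            continue  # missing calls contribute no observations
        counts.update(call)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in sorted(counts.items())}

# chrX variant: two females (two observations each), two males (one each).
chrx_calls = [(0, 1), (1, 1), (0,), (1,)]
freqs = allele_frequencies(chrx_calls)
```

Here the REF allele (index 0) is observed 2 times out of 6 and ALT1 4 times out of 6, because each male contributes a single observation.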

--freq can now report histograms summarizing the allele frequency spectrum. When the 'refbins=' modifier is present, its argument is interpreted as a sequence of comma-separated REF frequency/count bin boundaries, and the corresponding histogram is written to plink2.afreq.ref.bins or plink2.acount.ref.bins. Alternatively, when 'refbins-file=' is present, the named file is interpreted as a sequence of whitespace-separated bin boundaries. 'alt1bins='/'alt1bins-file=' use the same syntax, and report ALT1 frequency/count histograms to plink2.afreq.alt1.bins or plink2.acount.alt1.bins.
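
To illustrate how a list of bin boundaries partitions a set of frequencies, here is a rough Python sketch. It assumes half-open bins in which a value equal to a boundary falls into the upper bin; the text above does not specify how PLINK 2 treats values that land exactly on a boundary, so treat that choice, and the helper name bin_counts, as assumptions.

```python
from bisect import bisect_right

def bin_counts(values, boundaries):
    """Count values per bin; k boundaries define k + 1 bins.

    Assumption for this sketch: a value equal to a boundary is placed in
    the bin above that boundary.
    """
    bounds = sorted(boundaries)
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[bisect_right(bounds, v)] += 1
    return counts

# Hypothetical REF frequencies for five variants, with boundaries
# corresponding to something like refbins=0.01,0.05,0.5.
ref_freqs = [0.0, 0.01, 0.03, 0.2, 0.9]
hist = bin_counts(ref_freqs, [0.01, 0.05, 0.5])
```

With these inputs, hist holds one count per bin, from the lowest-frequency bin to the highest.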

Genotype hardcall counts

--geno-counts ['zs'] ['cols='<column set descriptor>]

--geno-counts writes a genotype hardcall count report to plink2.gcount[.zst]; refer to the file format entry for output details and optional columns. (Note that unlike --freq, this report is not restricted to founders, unless you explicitly request that with e.g. --keep-founders.)

Since --geno-counts doesn't support dosages, "--freq counts" is now a better way to generate an input file for --read-freq.

Sample variant-counts

--sample-counts ['zs'] ['cols='<column set descriptor>]

--sample-counts reports the number of observed variants (relative to the reference genome) per sample, subdivided into various classes.

  • This is a highly optimized implementation of the "Per-sample counts" report added by the -s flag to "bcftools stats". If your variants have been left-normalized and split, and your single-letter allele codes are restricted to {A, C, G, T, a, c, g, t}, the SNP counts reported by PLINK 2.0 and bcftools should be identical.
  • Homozygous-ALT genotypes only count as 1 variant, for consistency with bcftools.
  • To keep non-reference, non-missing counts constant through variant splits and joins, we count heterozygous ALTx/ALTy genotypes as 2 variants. This is an intentional change from bcftools.
  • Unknown-sex samples are treated as female.
  • Heterozygous haploid calls (MT included) are treated as missing.
  • As with other commands, SNPs that have not been left-normalized are counted as non-SNP non-symbolic.
  • Refer to the file format entry for output details and optional columns.
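
The counting rules in the list above can be sketched in Python as follows. This is only an illustration of the stated rules for diploid calls, not the optimized implementation; the function name nonref_variant_count and the genotype encoding (a pair of allele indices with 0 = REF, None = missing) are assumptions.

```python
def nonref_variant_count(genotypes):
    """Count observed variants for one sample under the rules listed above.

    genotypes: list of (a, b) allele-index pairs, 0 = REF; None = missing.
    Haploid calls and the per-class breakdown are not modeled here.
    """
    total = 0
    for gt in genotypes:
        if gt is None:
            continue  # missing calls are not counted
        a, b = gt
        if a == b:
            total += 1 if a != 0 else 0  # hom-ALT counts once, hom-REF never
        elif a == 0 or b == 0:
            total += 1                   # REF/ALT heterozygote
        else:
            total += 2                   # ALTx/ALTy heterozygote counts twice
    return total

sample = [(0, 0), (0, 1), (1, 1), (1, 2), None]
n_variants = nonref_variant_count(sample)

# An ALT1/ALT2 call at one multiallelic variant, versus the same call after
# the variant has been split into two biallelic records.
joined = nonref_variant_count([(1, 2)])
split = nonref_variant_count([(0, 1), (0, 1)])
```

The joined and split counts agree, which is the invariance that motivates counting ALTx/ALTy heterozygotes as 2.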

Missing data

--missing ['zs'] [{sample-only | variant-only}] ['scols='<col. set descriptor>]
          ['vcols='<col. set descriptor>]

--missing produces sample-based and variant-based missing data reports (or just one of these reports, with 'sample-only'/'variant-only').

  • This report is not restricted to founders.
  • This command can be used to view heterozygous haploid (including mixed MT) counts; refer to the file format entries for more details.
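
For intuition, the two reports correspond to row-wise and column-wise missing rates over the genotype table. The Python below is a simplified sketch under that reading; the helper name missing_rates and the matrix layout (one row per variant, one column per sample, None for a missing call) are assumptions, and chromosome-specific handling and heterozygous haploid counts are not modeled.

```python
def missing_rates(matrix):
    """Return (per_sample, per_variant) missing-call rates.

    matrix[v][s] is the call for variant v and sample s; None = missing.
    """
    if not matrix or not matrix[0]:
        return [], []
    n_var, n_samp = len(matrix), len(matrix[0])
    per_variant = [sum(g is None for g in row) / n_samp for row in matrix]
    per_sample = [
        sum(matrix[v][s] is None for v in range(n_var)) / n_var
        for s in range(n_samp)
    ]
    return per_sample, per_variant

calls = [
    [0, None, 2],     # variant 1
    [None, None, 1],  # variant 2
]
sample_miss, variant_miss = missing_rates(calls)
```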

--genotyping-rate ['dosage']

PLINK 1.x almost always computed the overall missing-genotype frequency and reported it to the log, even when no other operation in the run required the entire genotype table to be scanned. As a performance optimization, PLINK 2 no longer defaults to printing it, but you can opt in with --genotyping-rate.

The 'dosage' modifier causes the missing-dosage frequency (which can be smaller than the missing-genotype frequency) to be reported instead.
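
A small Python sketch of why the two frequencies can differ: a dosage can be present while the corresponding hardcall is missing because the dosage is too far from any integer to be called. The 0.1 threshold and both helper names below are assumptions used for illustration only.

```python
def hardcall_from_dosage(dosage, threshold=0.1):
    """Derive a hardcall (0, 1 or 2) from a dosage, or None if uncallable.

    Assumption for this sketch: the call is missing when the dosage is
    farther than `threshold` from the nearest integer.
    """
    if dosage is None:
        return None
    nearest = round(dosage)
    return nearest if abs(dosage - nearest) <= threshold else None

def missing_frequency(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

dosages = [0.02, 0.5, None, 1.95]
hardcalls = [hardcall_from_dosage(d) for d in dosages]

missing_dosage_freq = missing_frequency(dosages)
missing_genotype_freq = missing_frequency(hardcalls)
```

In this toy table one of four dosages is missing, but two of four hardcalls are, since the 0.5 dosage cannot be called.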

Hardy-Weinberg equilibrium

--hardy ['zs'] ['midp'] ['redundant'] ['cols='<col. set descriptor>]

--hardy writes autosomal Hardy-Weinberg equilibrium exact test statistics to plink2.hardy[.zst], and/or chrX test statistics to plink2.hardy.x[.zst]. The latter report is based on the computation described in Graffelman J, Weir BS (2016) Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome.

  • By default, only founders are considered; this can be changed with --nonfounders.
  • For variants with k alleles where k>2, k separate 'biallelic' tests are performed, each reported on its own line. However, biallelic variants are normally reported on a single line, since the counts/frequencies would be mirror-images and the p-values would be the same. You can add the 'redundant' modifier to force biallelic variant results to be reported on two lines for parsing convenience.
  • With the 'midp' modifier, a mid-p adjustment is applied (see --hwe for discussion).
  • Since multiple case/control phenotypes can now be loaded simultaneously, this no longer automatically computes separate statistics for just controls or just cases. Call this with e.g. --keep-if to report phenotype-stratified stats.
  • Refer to the file format entries for output details and optional columns.
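
For readers unfamiliar with the autosomal exact test, the following Python sketch enumerates every heterozygote count compatible with the observed allele counts and sums the probabilities of configurations no more likely than the observed one. It is a plain textbook-style version, not PLINK 2's implementation: the name hwe_exact is hypothetical, the chrX test is not covered, and the mid-p handling (subtracting half the observed configuration's probability) is a simplifying assumption.

```python
from math import exp, lgamma, log

def hwe_exact(hom_a, het, hom_b, midp=False):
    """Two-sided exact HWE p-value for one biallelic autosomal variant."""
    n = hom_a + het + hom_b
    n_a = 2 * hom_a + het  # copies of allele A
    n_b = 2 * hom_b + het  # copies of allele B

    def log_prob(k):
        # log P(het count = k | n samples, n_a copies of A)
        aa, bb = (n_a - k) // 2, (n_b - k) // 2
        return (k * log(2) + lgamma(n + 1)
                - lgamma(aa + 1) - lgamma(k + 1) - lgamma(bb + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    # Valid het counts share the parity of n_a and cannot exceed min(n_a, n_b).
    probs = {k: exp(log_prob(k))
             for k in range(n_a % 2, min(n_a, n_b) + 1, 2)}
    p_obs = probs[het]
    # Sum configurations at most as likely as the observed one
    # (small relative tolerance for floating-point ties).
    p = sum(q for q in probs.values() if q <= p_obs * (1 + 1e-9))
    if midp:
        p -= 0.5 * p_obs
    return min(1.0, p)

# Three samples carrying two copies of allele A in total: either one A/A
# homozygote and no heterozygotes, or two heterozygotes.
p_no_het = hwe_exact(1, 0, 2)
p_two_het = hwe_exact(0, 2, 1)
```

In this tiny example the two possible configurations have probabilities 0.2 and 0.8, so the less likely one (no heterozygotes) gets p = 0.2, or 0.1 with the mid-p adjustment.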

Inbreeding

--het ['zs'] ['small-sample'] ['cols='<col. set descriptor>]

--het computes observed and expected homozygous/heterozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates (i.e. 1 - (<observed het. count> / <expected het. count>)) to plink2.het[.zst].

  • Multiallelic variants are handled properly.
  • This function requires decent MAF estimates. If there are very few samples in your immediate fileset, --read-freq is practically mandatory since imputed MAFs are wildly inaccurate in that case. Also, due to the use of allele frequencies, if your dataset has a highly imbalanced ancestry distribution (e.g. >90% EUR but a few samples with ancestry primarily from other continents), you may need to process the rare-ancestry samples separately.
  • It's usually best to perform this calculation on a variant set in approximate linkage equilibrium.
  • By default, --het omits the n/(n-1) multiplier in Nei's expected homozygosity formula. The 'small-sample' modifier causes the multiplier to be included, while forcing --het to use MAFs imputed from founders in the immediate dataset.
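
The method-of-moments estimate can be sketched for a single sample as below. This is a simplified illustration rather than the actual --het code: the function name f_coefficient and the input layout are assumptions, the expected heterozygosity at each variant is taken as 1 minus the sum of squared allele frequencies, and the 'small-sample' multiplier is not applied.

```python
def f_coefficient(genotypes, allele_freqs):
    """Method-of-moments F estimate for one sample.

    genotypes: list of (a, b) allele-index pairs; None = missing call.
    allele_freqs: per-variant list of allele frequencies (same order).
    Returns 1 - observed_het / expected_het over non-missing variants.
    """
    observed_het = 0
    expected_het = 0.0
    for gt, freqs in zip(genotypes, allele_freqs):
        if gt is None:
            continue  # skip variants where this sample has no call
        a, b = gt
        observed_het += a != b
        expected_het += 1.0 - sum(f * f for f in freqs)
    return 1.0 - observed_het / expected_het

sample_gts = [(0, 1), (0, 0), (1, 1), None]
freqs = [[0.5, 0.5]] * 4
f_est = f_coefficient(sample_gts, freqs)
```

With one observed heterozygote against 1.5 expected across the three called variants, the estimate is 1 - 1/1.5, i.e. about 0.33.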

Pairwise fixation index

--fst <categorical or binary phenotype name> ['method='<method name>]
      ['blocksize='<jackknife block size>] ['cols='<column set descriptor>]
      ['report-variants'] ['zs'] ['vcols='<column set descriptor>]
      ['base='<pop. ID> | 'ids='<pop. ID> | 'file='<pop.-ID-pair file>]
      [other population ID(s) for base=/ids=...]

Given a categorical or binary phenotype defining a set of subpopulations, --fst computes Wright's F_ST estimates between each pair of populations, writing results to plink2[.x].fst.summary.
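
As a rough illustration of what a pairwise estimate looks like, the Python sketch below implements one widely used estimator (Hudson's, combined across variants as a ratio of sums). The text above does not state which estimator --fst uses, so this should not be read as a description of PLINK 2's computation; the function name, the inputs, and the convention that n1 and n2 are per-population sample sizes shared by all variants are all assumptions.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson-style F_ST between two populations (ratio-of-sums form).

    p1, p2: per-variant frequencies of the same allele in each population.
    n1, n2: sample sizes used for the small-sample correction terms
            (assumed identical across variants in this sketch).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        numerator += ((a - b) ** 2
                      - a * (1 - a) / (n1 - 1)
                      - b * (1 - b) / (n2 - 1))
        denominator += a * (1 - b) + b * (1 - a)
    return numerator / denominator

# Fixed difference between populations at a single variant.
fst_fixed = hudson_fst([1.0], [0.0], n1=10, n2=10)
# Identical frequencies: the correction terms push the estimate below zero.
fst_same = hudson_fst([0.5], [0.5], n1=11, n2=11)
```

Slightly negative values, as in the second example, are expected for populations with no real differentiation.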

.pgen header info

--pgen-info

Given an input .pgen file, --pgen-info prints the following information about it:

  • Number of variants
  • Number of samples
  • Are all REF alleles 'known', 'provisional', or a mix?
  • Maximum allele count for a single variant (exact value may require .pvar input)
  • Are phased hardcalls present?
  • Are dosages present? Are any of them explicitly phased?

All values except for "maximum allele count for a single variant" can be determined from a quick scan of the .pgen's header.
