D: 20 Oct 2020

Basic statistics

Allele frequency

--freq ['zs'] ['counts'] ['cols='<column set descriptor>]
       ['refbins='<comma-separated bin boundaries> | 'refbins-file='<filename>]
       ['alt1bins='<comma-separated bin bounds> | 'alt1bins-file='<filename>]
       ['bins-only']

--freq normally writes an empirical allele frequency report to plink2.afreq[.zst]. With the 'counts' modifier, an allele count/dosage report is written to plink2.acount[.zst] instead.

  • Allele frequency is defined as <# of observations of current allele> / <# of observations of any allele> (unless a pseudocount is requested with --af-pseudocount). Note that there's only one allele observation per male for chrX variants, and two per female.
  • Unknown-sex samples are treated as female in the main allele-frequency computation.
  • By default, only founders are considered; this can be changed with --nonfounders.
  • Phenotype- and category-stratified frequency reports are no longer directly supported. However, you can use --keep-if to filter on a phenotype condition, and --loop-cats to filter on each category in turn. --variant-score can also be employed for these use cases when you have no missing genotypes (or mean-imputation is acceptable).
  • The output file is valid input for --read-freq. "--freq counts" output contains enough information for perfect reconstruction of allele frequencies (this was not true for dosage data before 22 Nov 2019).
  • Refer to the file format entry for output details and optional columns.
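
As a minimal illustration of the frequency definition above (not PLINK 2's implementation), the following Python sketch counts allele observations and divides each count by the total. The input layout, in which each sample contributes a tuple of allele indices (one entry for a male on chrX, two otherwise, None for a missing call), and the function name allele_frequencies are assumptions made for this example. No pseudocount is applied.

```python
from collections import Counter


def allele_frequencies(observations):
    """Return {allele index: frequency} from per-sample allele observations.

    observations: iterable of tuples of allele indices, or None when the
    call is missing. A male on chrX contributes a 1-tuple, everyone else
    a 2-tuple (hypothetical input layout for this sketch).
    """
    counts = Counter()
    for obs in observations:
        if obs is None:  # missing calls contribute no observations
            continue
        counts.update(obs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


# chrX example: two females (two observations each), two males (one each).
chrx = [(0, 1), (0, 0), (1,), (0,)]
freqs = allele_frequencies(chrx)
print(freqs)
```

Here allele 0 is observed 4 times out of 6 and allele 1 twice out of 6, so the males' single observations carry half the weight of a female genotype.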

--freq can now report histograms summarizing the allele frequency spectrum. When the 'refbins=' modifier is present, its argument is interpreted as a sequence of comma-separated REF frequency/count bin boundaries, and the corresponding histogram is written to plink2.afreq.ref.bins or plink2.acount.ref.bins. Alternatively, when 'refbins-file=' is present, the named file is interpreted as a sequence of whitespace-separated bin boundaries. 'alt1bins='/'alt1bins-file=' use the same syntax, and report ALT1 frequency/count histograms to plink2.afreq.alt1.bins or plink2.acount.alt1.bins.
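
The sketch below shows one way a comma-separated boundary list could partition frequencies into histogram bins. It is only an illustration: this section does not say whether a value equal to a boundary falls into the lower or the upper bin, so the half-open convention used here is an assumption, as are the function names.

```python
from bisect import bisect_right


def parse_boundaries(arg):
    """Parse a comma-separated boundary string such as '0.01,0.05,0.5'."""
    bounds = [float(tok) for tok in arg.split(",") if tok]
    if bounds != sorted(bounds):
        # This sketch only handles boundaries given in increasing order.
        raise ValueError("boundaries must be sorted for this sketch")
    return bounds


def histogram(values, bounds):
    """Count values per bin; k boundaries define k + 1 bins.

    Assumed convention: bin i holds values v with
    bounds[i - 1] <= v < bounds[i].
    """
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[bisect_right(bounds, v)] += 1
    return counts


bounds = parse_boundaries("0.01,0.05,0.5")
ref_freqs = [0.0, 0.02, 0.05, 0.3, 0.99]
bins = histogram(ref_freqs, bounds)
print(bins)
```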

Genotype hardcall counts

--geno-counts ['zs'] ['cols='<column set descriptor>]

--geno-counts writes a genotype hardcall count report to plink2.gcount[.zst]; refer to the file format entry for output details and optional columns. (Note that unlike --freq, this report is not restricted to founders, unless you explicitly request that with e.g. --keep-founders.)

Since --geno-counts doesn't support dosages, "--freq counts" is now a better way to generate an input file for --read-freq.

Sample variant-counts

--sample-counts ['zs'] ['cols='<column set descriptor>]

--sample-counts reports the number of observed variants (relative to the reference genome) per sample, subdivided into various classes.

  • This is a highly optimized implementation of the "Per-sample counts" report added by the -s flag to "bcftools stats". If your variants have been left-normalized and split, and your single-letter allele codes are restricted to {A, C, G, T, a, c, g, t}, the SNP counts reported by PLINK 2.0 and bcftools should be identical.
  • Homozygous-ALT genotypes only count as 1 variant, for consistency with bcftools.
  • To keep non-reference, non-missing counts constant through variant splits and joins, we count heterozygous ALTx/ALTy genotypes as 2 variants. This is an intentional change from bcftools.
  • Unknown-sex samples are treated as female.
  • Heterozygous haploid calls (MT included) are treated as missing.
  • As with other commands, SNPs that have not been left-normalized are counted as non-SNP non-symbolic.
  • Refer to the file format entry for output details and optional columns.
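
The counting rules in the list above can be sketched per genotype as follows. This is an illustration rather than PLINK 2's code: the function name and the (a, b) allele-index encoding with 0 = REF are assumptions, and sex-specific and haploid handling are left out.

```python
def variants_counted(genotype):
    """Number of variants one diploid hardcall contributes to a sample's total.

    genotype: (a, b) allele indices with 0 = REF, or None if missing.
    """
    if genotype is None:
        return 0
    a, b = genotype
    if a == b:
        return 0 if a == 0 else 1  # hom-REF: 0, hom-ALT: 1
    if a != 0 and b != 0:
        return 2                   # het ALTx/ALTy counts as 2
    return 1                       # het REF/ALT


# One multiallelic record carrying an ALT1/ALT2 call...
joined = [(1, 2)]
# ...and the same call after a split into two biallelic records, assuming
# the split represents it as REF/ALT in each record.
split = [(0, 1), (0, 1)]

total_joined = sum(variants_counted(g) for g in joined)
total_split = sum(variants_counted(g) for g in split)
print(total_joined, total_split)
```

Both totals come out as 2, which is the invariance across splits and joins that motivates counting ALTx/ALTy heterozygotes twice.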

Missing data

--missing ['zs'] [{sample-only | variant-only}] ['scols='<col. set descriptor>]
          ['vcols='<col. set descriptor>]

--missing produces sample-based and variant-based missing data reports (or just one of these reports, with 'sample-only'/'variant-only').

  • This report is not restricted to founders.
  • This command can be used to view heterozygous haploid (including mixed MT) counts; refer to the file format entries for more details.
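
To make the sample-based versus variant-based distinction concrete, here is a small Python sketch that derives both sets of missing rates from one genotype table. The variant-major list layout and the function name are assumptions for this example, and the sex-chromosome and heterozygous-haploid special cases covered in the file format entries are not modeled.

```python
def missing_rates(genotypes):
    """Return (per_sample_rates, per_variant_rates).

    genotypes: list of variants, each a list with one call per sample;
    None marks a missing call (hypothetical layout for this sketch).
    """
    if not genotypes or not genotypes[0]:
        raise ValueError("need at least one variant and one sample")
    n_variants = len(genotypes)
    n_samples = len(genotypes[0])
    sample_missing = [0] * n_samples
    variant_rates = []
    for calls in genotypes:
        missing_here = 0
        for i, call in enumerate(calls):
            if call is None:
                missing_here += 1
                sample_missing[i] += 1
        variant_rates.append(missing_here / n_samples)
    sample_rates = [m / n_variants for m in sample_missing]
    return sample_rates, variant_rates


table = [
    [(0, 0), None],
    [(0, 1), (1, 1)],
    [None, None],
]
sample_rates, variant_rates = missing_rates(table)
print(sample_rates, variant_rates)
```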

--genotyping-rate ['dosage']

PLINK 1.x almost always computed the overall missing-genotype frequency and reported it to the log, even when no other operation in the run required the entire genotype table to be scanned. As a performance optimization, PLINK 2 no longer prints it by default, but you can opt in with --genotyping-rate.

The 'dosage' modifier causes the missing-dosage frequency (which can be smaller than the missing-genotype frequency) to be reported instead.
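
The following Python sketch illustrates why the missing-dosage frequency can be smaller than the missing-genotype frequency. It assumes, for illustration only, that each entry is a (hardcall, dosage) pair, that an entry can carry a dosage without a hardcall, and that any entry with a hardcall is never dosage-missing; the function name is hypothetical.

```python
def missing_frequency(calls, use_dosage=False):
    """Fraction of entries that are missing.

    calls: list of (hardcall, dosage) pairs; either element may be None.
    With use_dosage=True, an entry counts as missing only if it has
    neither a hardcall nor a dosage (assumption of this sketch).
    """
    if not calls:
        raise ValueError("no calls")
    if use_dosage:
        missing = sum(1 for hardcall, dosage in calls
                      if hardcall is None and dosage is None)
    else:
        missing = sum(1 for hardcall, _ in calls if hardcall is None)
    return missing / len(calls)


calls = [("0/1", None), (None, 0.5), (None, None), ("0/0", 0.1)]
geno_missing = missing_frequency(calls)
dosage_missing = missing_frequency(calls, use_dosage=True)
print(geno_missing, dosage_missing)
```

The second entry has no hardcall but does have a dosage, so it counts as missing only in the genotype-based figure.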

Hardy-Weinberg equilibrium

--hardy ['zs'] ['midp'] ['redundant'] ['cols='<col. set descriptor>]

--hardy writes autosomal Hardy-Weinberg equilibrium exact test statistics to plink2.hardy[.zst], and/or chrX test statistics to plink2.hardy.x[.zst]. The latter report is based on the computation described in Graffelman J, Weir BS (2016) Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome.

  • By default, only founders are considered; this can be changed with --nonfounders.
  • For variants with k alleles where k>2, k separate 'biallelic' tests are performed, each reported on its own line. However, biallelic variants are normally reported on a single line, since the counts/frequencies would be mirror-images and the p-values would be the same. You can add the 'redundant' modifier to force biallelic variant results to be reported on two lines for parsing convenience.
  • With the 'midp' modifier, a mid-p adjustment is applied (see --hwe for discussion).
  • Since multiple case/control phenotypes can now be loaded simultaneously, this no longer automatically computes separate statistics for just controls or just cases. Call this with e.g. --keep-if to report phenotype-stratified stats.
  • Refer to the file format entries for output details and optional columns.
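
As a hedged sketch of the autosomal case, the Python code below collapses a multiallelic variant into one 'biallelic' count triple per allele and then runs a standard exact test over the possible heterozygote counts (the recurrence popularized by Wigginton et al., 2005), with an optional mid-p adjustment. It is not PLINK 2's implementation: the function names, the genotype encoding, and the rule that two non-target alleles are pooled into a single class are assumptions, and the chrX test is not covered.

```python
def collapse_to_biallelic(genotypes, allele):
    """Return (hom, het, other) counts for one allele versus all others."""
    hom = het = other = 0
    for gt in genotypes:
        if gt is None:  # skip missing calls
            continue
        copies = (gt[0] == allele) + (gt[1] == allele)
        if copies == 2:
            hom += 1
        elif copies == 1:
            het += 1
        else:
            other += 1  # two non-target alleles, pooled (assumption)
    return hom, het, other


def hwe_exact_p(hom1, het, hom2, midp=False):
    """Exact HWE p-value from biallelic genotype counts."""
    n = hom1 + het + hom2
    if n == 0:
        return 1.0
    rare = 2 * min(hom1, hom2) + het  # copies of the rarer allele
    probs = [0.0] * (rare + 1)        # unnormalized P(het count = index)
    # Start near the most likely het count, matching the parity of `rare`.
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:
        mid += 1
    probs[mid] = 1.0
    # Walk down to fewer heterozygotes.
    hets, homr = mid, (rare - mid) // 2
    homc = n - hets - homr
    while hets >= 2:
        probs[hets - 2] = (probs[hets] * hets * (hets - 1)
                           / (4.0 * (homr + 1) * (homc + 1)))
        hets -= 2
        homr += 1
        homc += 1
    # Walk up to more heterozygotes.
    hets, homr = mid, (rare - mid) // 2
    homc = n - hets - homr
    while hets <= rare - 2:
        probs[hets + 2] = (probs[hets] * 4.0 * homr * homc
                           / ((hets + 2) * (hets + 1)))
        hets += 2
        homr -= 1
        homc -= 1
    total = sum(probs)
    p_obs = probs[het]
    # Sum outcomes no more likely than the observed one (small tolerance
    # guards against floating-point ties).
    tail = sum(p for p in probs if p <= p_obs * (1 + 1e-9))
    if midp:
        tail -= 0.5 * p_obs  # mid-p: count the observed outcome half
    return min(1.0, tail / total)


genos = [(0, 1), (1, 2), (2, 2), (0, 0), None]
per_allele = {a: collapse_to_biallelic(genos, a) for a in (0, 1, 2)}
print(per_allele, hwe_exact_p(2, 0, 2), hwe_exact_p(2, 0, 2, midp=True))
```

For a biallelic variant, swapping the two homozygote counts leaves the p-value unchanged, which is why the second line is redundant unless 'redundant' is requested.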

Inbreeding

--het ['zs'] ['small-sample'] ['cols='<col. set descriptor>]

--het computes observed and expected homozygous/heterozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates (i.e. 1 - <observed het. count> / <expected het. count>) to plink2.het[.zst].

  • Multiallelic variants are handled properly.
  • This function requires decent MAF estimates. If there are very few samples in your immediate fileset, --read-freq is practically mandatory since imputed MAFs are wildly inaccurate in that case.
  • It's usually best to perform this calculation on a variant set in approximate linkage equilibrium.
  • By default, --het omits the n/(n-1) multiplier in Nei's expected homozygosity formula. The 'small-sample' modifier causes the multiplier to be included, while forcing --het to use MAFs imputed from founders in the immediate dataset.
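
A minimal Python sketch of the method-of-moments estimate for a single sample follows. It assumes allele frequencies are supplied by the caller (for example, values that would come from --read-freq), takes the expected heterozygosity of a variant as 1 minus the sum of squared allele frequencies so that multiallelic variants are covered, and omits the n/(n-1) multiplier as in the default mode. The function name and data layout are hypothetical.

```python
def f_coefficient(genotypes, allele_freqs):
    """Method-of-moments F = 1 - observed het count / expected het count.

    genotypes: one (a, b) allele-index pair per variant, None if missing.
    allele_freqs: one list of allele frequencies per variant, summing to 1.
    """
    obs_het = 0
    exp_het = 0.0
    for gt, freqs in zip(genotypes, allele_freqs):
        if gt is None:  # missing calls add to neither count
            continue
        if gt[0] != gt[1]:
            obs_het += 1
        exp_het += 1.0 - sum(p * p for p in freqs)
    if exp_het == 0.0:
        raise ValueError("expected het count is zero; F is undefined")
    return 1.0 - obs_het / exp_het


sample = [(0, 1), (0, 0), (0, 0), (1, 1)]
freqs = [[0.5, 0.5]] * 4
f_est = f_coefficient(sample, freqs)
print(f_est)
```

With one observed heterozygote against two expected, the estimate is 0.5, indicating an excess of homozygosity.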

Pairwise fixation index

--fst <categorical or binary phenotype name> ['method='<method name>]
      ['blocksize='<jackknife block size>] ['cols='<column set descriptor>]
      ['report-variants'] ['zs'] ['vcols='<column set descriptor>]
      ['base='<pop. ID> | 'ids='<pop. ID> | 'file='<pop.-ID-pair file>]
      [other population ID(s) for base=/ids=...]

Given a categorical or binary phenotype defining a set of subpopulations, --fst computes Wright's FST estimates between each pair of populations, writing results to plink2[.x].fst.summary.
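
For orientation, here is a Python sketch of one widely used pairwise estimator, Hudson's, in its ratio-of-sums form for biallelic variants. This section does not state which methods 'method=' accepts or how they are implemented, so this should be read as a generic illustration rather than as what --fst computes. The function name, the input tuples, and the choice to treat n as the number of allele observations per population are all assumptions.

```python
def hudson_fst(variants):
    """Ratio-of-sums Hudson FST estimate between two populations.

    variants: iterable of (p1, n1, p2, n2), where p is one allele's
    frequency and n the number of allele observations (n >= 2) in
    each population.
    """
    num = 0.0
    den = 0.0
    for p1, n1, p2, n2 in variants:
        # Between-population difference, corrected for sampling noise.
        num += ((p1 - p2) ** 2
                - p1 * (1 - p1) / (n1 - 1)
                - p2 * (1 - p2) / (n2 - 1))
        # Probability that one allele drawn from each population differs.
        den += p1 * (1 - p2) + p2 * (1 - p1)
    if den == 0.0:
        raise ValueError("no variant differs between populations")
    return num / den


fixed_difference = hudson_fst([(1.0, 20, 0.0, 20)])
same_frequency = hudson_fst([(0.5, 20, 0.5, 20)])
print(fixed_difference, same_frequency)
```

A fixed difference gives an estimate of 1, while identical sample frequencies give a slightly negative value because of the sampling correction.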

.pgen header info

--pgen-info

Given an input .pgen file, --pgen-info prints the following information about it:

  • Number of variants
  • Number of samples
  • Are all REF alleles 'known', 'provisional', or a mix?
  • Maximum allele count for a single variant (exact value may require .pvar input)
  • Are phased hardcalls present?
  • Are dosages present? Are any of them explicitly phased?

All values except for "maximum allele count for a single variant" can be determined from a quick scan of the .pgen's header.
