S: 11 Dec 2023 (b7.2)

D: 11 Dec 2023

Report postprocessing

Variant annotation

--annotate <PLINK report> ['attrib='<file>] ['ranges='<file>] ['filter='<file>] ['snps='<file>] [{NA | prune}] ['block'] ['subset='<file>] ['minimal'] ['distance']

--border <kbs>

--annotate-snp-field <field name>

--annotate reads a variant-based PLINK report, and writes an annotated version to plink.annot.

This requires an annotation source:

  • 'attrib=<filename>' specifies a (possibly gzipped) file with lines of the form "<variant ID> <attribute names...>", where the attribute names are space-separated. See snp129.attrib.gz on the resources page for an example.
  • 'ranges=<filename>' specifies a gene/range list file (e.g. glist-hg19).

(Both source types can be specified simultaneously.) The following options are also supported:

  • 'filter=<filename>' causes only variants within one of the ranges in the given range list file to be included in the new report.
  • 'snps=<filename>' causes only variants named in the given file to be included in the new report.
  • The 'NA' modifier causes unannotated variants to have 'NA' instead of '.' in the new report's ANNOT column, while the 'prune' modifier excludes them entirely.
  • Normally, --annotate appends a single ANNOT column to each line. The 'block' modifier replaces this single column with a distinct 0/1-coded column for each possible annotation.
  • With 'attrib' and 'snps', variant IDs are normally read from the PLINK report's 'SNP' column; this can be changed with --annotate-snp-field.
  • With 'ranges',
    • the PLINK report must contain 'CHR' and 'BP' columns.
    • 'subset=<filename>' causes only intervals named in the subset file to be loaded from the ranges file.
    • interval annotations normally come with a parenthesized signed distance to the interval boundary (0 if the variant is located inside the interval; this is always the case without --border). These distances can be omitted with the 'minimal' modifier.
    • the 'distance' modifier adds 'DIST' and 'SGN' columns describing the smallest (in absolute value) signed distance among the interval annotations.
  • If the PLINK report contains a 'P' column, you can use --pfilter to filter out lines with high p-values.
  • --border extends 'ranges' and 'filter' interval bounds out by the given number of kilobases.
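
As a rough illustration of the attribute-based path described above, the following Python sketch reads 'attrib='-style lines and fills in an ANNOT value for each variant. It is not PLINK's implementation: the function names, the in-memory inputs, and the '|' separator used to join multiple attributes are assumptions made for illustration. Only the '.'/'NA'/prune handling of unannotated variants follows the text.

```python
def parse_attrib(lines):
    """Parse '<variant ID> <attribute names...>' lines into {variant ID: [attributes]}."""
    attribs = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        attribs.setdefault(fields[0], []).extend(fields[1:])
    return attribs


def annotate(variant_ids, attribs, unannotated="."):
    """Return (variant ID, ANNOT) pairs.

    unannotated: '.' (default), 'NA' (like the 'NA' modifier), or None to
    drop unannotated variants (like the 'prune' modifier).
    """
    out = []
    for vid in variant_ids:
        names = attribs.get(vid)
        if names:
            out.append((vid, "|".join(names)))  # '|' separator is an assumption
        elif unannotated is not None:
            out.append((vid, unannotated))
    return out


# Hypothetical inputs: two annotated variants and one without any attribute.
attribs = parse_attrib(["rs1 missense", "rs2 intron utr3"])
report_ids = ["rs1", "rs2", "rs3"]
default_rows = annotate(report_ids, attribs)
pruned_rows = annotate(report_ids, attribs, unannotated=None)
```

Here default_rows keeps rs3 with '.' in its ANNOT value, while pruned_rows omits it, mirroring the default and 'prune' behaviors.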

LD-based result clumping

--clump <PLINK report filename(s)...>

--clump-p1 <index variant p-value threshold>
--clump-p2 <SP2 column p-value threshold>
--clump-r2 <r^2 threshold>
--clump-kb <clump kb radius>

--clump-snp-field <field name(s)...>

--clump-field <field name(s)...>
--clump-allow-overlap
--clump-verbose

--clump-annotate <header(s)...>
--clump-range <bp range file>

--clump-range-border <kbs>
--clump-index-first
--clump-replicate

--clump-best

When there are multiple significant association p-values in the same region, LD should be taken into account when interpreting the results. The --clump command is designed to help with this.

--clump loads the named PLINK-format association report(s) (text files with a header line, a column containing variant IDs, and another column containing p-values) and groups results into LD-based clumps, writing a new report to plink.clumped. Gzipped reports are permitted. Multiple filenames can be separated by spaces or commas.

Warning (added 22 Aug 2023): When using this command with --linear/--logistic, all p-values are considered, including those of covariates. Thus, you probably want to run --linear/--logistic with the 'hide-covar' modifier. (This footgun will be removed in PLINK 2.0.)

  • Clumps are formed around central "index variants" which, by default, must have p-value no larger than 0.0001; change this threshold with --clump-p1. Index variants are chosen greedily starting with the lowest p-value. Variants which meet the --clump-p1 threshold, but have already been assigned to another clump, do not start their own clumps unless --clump-best was specified.
  • Sites which are less than 250 kb away from an index variant and have r^2 larger than 0.5 with it are assigned to that index variant's clump (unless they have previously been assigned to another clump, and --clump-allow-overlap is not in effect). These two thresholds can be changed with --clump-kb and --clump-r2, respectively.
  • The r^2 values computed by --clump are based on maximum likelihood haplotype frequency estimates; you can use '--r2 dprime' to dump them all.
  • As usual, only founders are considered in the r^2 computation. If your dataset has a shortage of them, --make-founders may come in handy.
  • Sites within the clump which have association p-value smaller than 0.01 are listed in the 'SP2' column of the main report. This threshold can be adjusted with --clump-p2.
  • By default, variant IDs are expected to be in the 'SNP' column. You can change this with the --clump-snp-field flag, which takes a space-delimited sequence of field names to search for. With multiple field names, earlier names take precedence over later ones.
  • By default, p-values are expected to be in the 'P' column; change this with --clump-field. This has the same semantics as --clump-snp-field.
  • By default, no variant may belong to more than one clump; remove this restriction with --clump-allow-overlap.
  • --clump-verbose requests an extended report (see the file format appendix for details).
  • With --clump-verbose and/or --clump-best, --clump-annotate takes a space/comma-separated sequence of additional fields to copy into the final report. They will appear in the order you specified on the command line (this differs slightly from PLINK 1.07 when the fields appear in a different order in the file).
  • Given a gene region file, --clump-range causes overlaps between regions and clumps to be reported. --clump-range-border extends each --clump-range region's bounds by the given number of kilobases.
  • With multiple --clump files, --clump-index-first forces all index variants to be drawn from the first file, while --clump-replicate excludes clumps which contain secondary results from only one file from the report.
  • With exactly one --clump file, --clump-best generates an additional .clumped.best report describing just the best proxy for each index variant (in the sense of having the highest r^2 with it). With exactly two --clump files and --clump-index-first, proxies are not drawn from the first file; this imitates the old special-case behavior of --clump-replicate in the presence of this flag combination. (If --clump-replicate is actually specified, it is ignored for backwards compatibility.)
    We have provisionally retired --clump-best's other functionality; contact us if this is a problem.

The PLINK 1.07 documentation has more discussion of these flags, including a few detailed examples.
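
The greedy procedure described in the bullets above can be sketched in a few lines of Python. This is an illustrative approximation rather than PLINK's code: it assumes a single chromosome, takes precomputed r^2 values from a hypothetical lookup table (PLINK estimates them from the genotype data), and ignores --clump-allow-overlap, --clump-best, and multi-file input. The function and parameter names are invented for this example; the default thresholds are the ones stated in the text.

```python
def clump(pvals, pos_bp, r2, p1=1e-4, p2=0.01, r2_min=0.5, kb=250):
    """Greedy LD clumping sketch for variants on one chromosome.

    pvals:  {variant ID: p-value}
    pos_bp: {variant ID: base-pair position}
    r2:     {frozenset({a, b}): r^2}; missing pairs are treated as 0
    Returns a list of (index variant, clump members, SP2 members).
    """
    assigned = set()
    clumps = []
    # Consider candidate index variants from lowest to highest p-value.
    for idx in sorted(pvals, key=pvals.get):
        if pvals[idx] > p1:
            break  # every remaining variant also fails the p1 threshold
        if idx in assigned:
            continue  # already in another clump, so it starts no clump
        assigned.add(idx)
        members = []
        for v in pvals:
            if v in assigned:
                continue  # the index itself, or a member of an earlier clump
            near = abs(pos_bp[v] - pos_bp[idx]) < kb * 1000
            linked = r2.get(frozenset((idx, v)), 0.0) > r2_min
            if near and linked:
                members.append(v)
        assigned.update(members)
        sp2 = [v for v in members if pvals[v] < p2]  # listed in the SP2 column
        clumps.append((idx, members, sp2))
    return clumps


# Hypothetical toy data: rs2 and rs3 are in LD with rs1; rs4 is far away.
pvals = {"rs1": 1e-8, "rs2": 5e-5, "rs3": 0.2, "rs4": 2e-5}
pos_bp = {"rs1": 100_000, "rs2": 150_000, "rs3": 200_000, "rs4": 900_000}
r2 = {frozenset(("rs1", "rs2")): 0.9, frozenset(("rs1", "rs3")): 0.7}
result = clump(pvals, pos_bp, r2)
```

In this toy input, rs2 meets the index threshold but is absorbed into the rs1 clump first, so it does not start its own clump; rs3 joins the rs1 clump but is too weakly associated to be listed under SP2.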

Gene-based reporting

--gene-report <PLINK report> <gene region file>

--gene-list-border <kbs>
--gene-subset <filename>

--gene-report-snp-field <field name>

Given a gene region file and a PLINK report with CHR and BP columns, --gene-report filters out lines where the coordinate is not contained in at least one gene, reorganizes the remainder by gene name, and writes the result to plink.range.report.

  • --gene-list-border extends each gene's bounds out by the given number of kilobases.
  • --gene-subset causes only genes named in the given file to be loaded from the gene region file.
  • When --extract (without 'range') is present, PLINK report lines with variant IDs not contained in the --extract file are filtered out. By default, variant IDs are assumed to be in the 'SNP' column; you can change this with --gene-report-snp-field.
  • If the PLINK report contains a 'P' column, you can use --pfilter to filter out lines with high p-values.
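
The filter-and-regroup step can be pictured with the following Python sketch. It is a simplified stand-in, not PLINK's implementation: the function name, the tuple layout of the gene list, the inclusive treatment of range bounds, and the alphabetical ordering of genes in the result are all assumptions made for illustration.

```python
from collections import defaultdict


def gene_report(report_rows, genes, border_kb=0):
    """Group report rows by every gene whose (extended) range contains them.

    report_rows: dicts with at least 'CHR' and 'BP' keys
    genes:       (chrom, start_bp, end_bp, name) tuples
    border_kb:   extension of each gene's bounds, like --gene-list-border
    """
    pad = border_kb * 1000
    by_gene = defaultdict(list)
    for row in report_rows:
        for chrom, start, end, name in genes:
            # Inclusive bounds are an assumption of this sketch.
            if row["CHR"] == chrom and start - pad <= row["BP"] <= end + pad:
                by_gene[name].append(row)
    # Rows that fall in no gene never enter by_gene, so they are filtered out.
    return dict(sorted(by_gene.items()))


# Hypothetical report rows and gene ranges.
rows = [
    {"CHR": "1", "BP": 1500, "SNP": "rs1"},
    {"CHR": "1", "BP": 9000, "SNP": "rs2"},
    {"CHR": "2", "BP": 1500, "SNP": "rs3"},
]
genes = [("1", 1000, 2000, "GENE_A"), ("1", 10000, 12000, "GENE_B")]
no_border = gene_report(rows, genes)
with_border = gene_report(rows, genes, border_kb=1)
```

Without a border only rs1 is reported (under GENE_A); with a 1 kb border, rs2 also falls inside the extended GENE_B range.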

Meta-analysis

--meta-analysis <PLINK report filenames...>
--meta-analysis <PLINK report filenames...> + [{logscale | qt}] [{no-map | no-allele}] ['study'] ['report-all'] ['weighted-z']

--meta-analysis-snp-field <field name(s)...>
--meta-analysis-a1-field <field name(s)...>
--meta-analysis-a2-field <field name(s)...>
--meta-analysis-p-field <field name(s)...>
--meta-analysis-chr-field <field name(s)...>
--meta-analysis-bp-field <field name(s)...>
--meta-analysis-se-field <field name(s)...>
--meta-analysis-ess-field <field name(s)...>
--meta-analysis-report-dups

Given multiple PLINK-format association reports, --meta-analysis performs basic fixed-effects and random-effects meta-analysis of the data, writing results to plink.meta.

All input files must contain variant ID and standard error. Note that the original association analyses might not report standard errors except when --ci is specified.

  • Normally, an 'OR' odds ratio field must also be present in each input file. With 'logscale', 'BETA' log-odds values/regression coefficients are expected instead, but the generated report will still contain odds ratio estimates. With 'qt' both input and output values are regression betas.
  • 'CHR', 'BP', and A1 allele fields are also normally required. 'no-map' causes them to all be ignored, while 'no-allele' causes just A1 to be ignored.
  • Odds ratios in both input and output are with respect to the A1 allele. That is, an odds ratio greater than 1 implies that the A1 allele increases risk relative to A2.
  • When --extract (without 'range') is present, only variants named in the --extract file are considered.
  • Unless 'no-map' is specified, chromosome filters are also respected.
  • If A2 allele fields are present, and neither 'no-map' nor 'no-allele' was specified, A1/A2 allele flips are handled properly. Otherwise, A1 mismatches are thrown out.
  • CHR/BP values are permitted to differ. When they do, the .meta report will contain CHR/BP values from the first input file containing the variant.
  • Problematic line(s) in the input files are reported to plink.prob, and otherwise skipped.
  • If a variant appears more than once in the same file (e.g. --linear/--logistic output), only the first appearance is considered. Add the --meta-analysis-report-dups flag if you want the later appearances to be logged in the .prob file and included in the "problematic line" count.
  • 'study' causes study-specific effect estimates to be collated in the meta-analysis report.
  • 'report-all' causes variants present in only a single input file to be included in the meta-analysis report.
  • When using optional modifier(s) with --meta-analysis, you must include a '+' in the parameter list to separate input filenames from modifier names. Note that there must be spaces both before and after the '+'.
  • By default, variant IDs are expected to be in the 'SNP' column. You can change this with the --meta-analysis-snp-field flag, which takes a space-delimited sequence of field names to search for. With multiple field names, earlier names take precedence over later ones.
  • The name search orders for the chromosome, position, standard error, A1 allele, and A2 allele fields can be changed in the same manner, with --meta-analysis-chr-field, --meta-analysis-bp-field, --meta-analysis-se-field, --meta-analysis-a1-field, and --meta-analysis-a2-field respectively (defaults are 'CHR', 'BP', 'SE', 'A1', and 'A2').
  • 'weighted-z' requests weighted Z-score-based p-values (as computed by the Abecasis Lab's METAL software) in addition to PLINK's usual inverse variance-based analysis. This requires P and effective sample size fields (default column names are 'P' and 'NMISS'; these can be changed with --meta-analysis-p-field and --meta-analysis-ess-field). Note that, if the numbers of cases and controls are unequal, effective sample size should be 4 / (1/<# of cases> + 1/<# of controls>).
  • Nonstandard chromosome names are currently not supported; contact us if you want us to allow them.
