D: 2 Apr 2019

General usage

Getting started

First, if plink and/or plink2 are not installed on your system, download and unzip the appropriate binaries (v1.9, v2.0). (Or clone from GitHub and recompile.) As alpha and beta testing continue, plink2 will become increasingly usable on its own, but for now it's better to think of it as a supplement to rather than a replacement for v1.9.

Then you can verify that both programs are functional with the following pair of commands:

./plink --dummy 2 2 --freq --make-bed --out toy_data

./plink2 --bfile toy_data --freq --out test2

You should see something like:

PLINK v1.90b6.9 64-bit (4 Mar 2019)            www.cog-genomics.org/plink/1.9/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to toy_data.log.
Options in effect:
  --dummy 2 2
  --freq
  --make-bed
  --out toy_data

16384 MB RAM detected; reserving 8192 MB for main workspace.
Dummy data (2 people, 2 SNPs) written to toy_data-temporary.bed +
toy_data-temporary.bim + toy_data-temporary.fam .
2 variants loaded from .bim file.
2 people (0 males, 2 females) loaded from .fam.
2 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2 founders and 0 nonfounders present.
Calculating allele frequencies... done.
--freq: Allele frequencies (founders only) written to toy_data.frq .
2 variants and 2 people pass filters and QC.
Among remaining phenotypes, 1 is a case and 1 is a control.
--make-bed to toy_data.bed + toy_data.bim + toy_data.fam ... done.

PLINK v2.00a2 AVX2 (6 Mar 2019)                www.cog-genomics.org/plink/2.0/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test2.log.
Options in effect:
  --bfile toy_data
  --freq
  --out test2

Start time: Wed Mar  6 09:47:56 2019
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
2 samples (2 females, 0 males; 2 founders) loaded from toy_data.fam.
2 variants loaded from toy_data.bim.
1 binary phenotype loaded (1 case, 1 control).
Calculating allele frequencies... done.
--freq: Allele frequencies (founders only) written to test2.afreq .
End time: Wed Mar  6 09:47:56 2019

(Remove the './' prefix if the program was installed earlier, or if you've added it to the system PATH.) If either command fails, verify that you downloaded the correct binaries for your machine, and consult the plink2-users Google group if you're still stuck.

Okay, what did these commands mean? And what just happened?

PLINK parses each command line as a collection of flags (each of which starts with two dashes[1]), plus parameters for those flags (which immediately follow a flag, and never start with a dash unless that dash is immediately followed by a digit). The first command included four flags: --dummy, --freq, --make-bed, and --out. Together they specify the following three things, which are part of almost every PLINK run:

  • Input data: '--dummy 2 2' tells PLINK 1.9 to generate a new random dataset with 2 samples and 2 variants. You'll see several other ways to specify input data on the next page.
  • Operation(s) to perform: --freq tells PLINK to generate an allele frequency report, and --make-bed tells PLINK to save the data in PLINK 1 binary format. The full range of supported operations is summarized under 'Main functions' in the sidebar, and the formats of all reports are described in the file format appendices (v1.9, v2.0).
  • An output file prefix: We'll elaborate on this in a moment.

So this particular combination makes PLINK 1.9 generate a new 2x2 dataset, write an allele frequency report to toy_data.frq, and save the dataset to toy_data.bed + .bim + .fam. Similarly, the second command makes PLINK 2.0 load that dataset and write its own allele frequency report to test2.afreq.

1: Actually, that was a lie. With the exceptions of --1 and --23file, PLINK 1.9 and 2.0 allow you to use a single dash in front of each flag. In exchange for saving you some keystrokes, please do yourself a favor and avoid filenames that begin with a dash.
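The parsing rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not PLINK's actual parser: the helper names is_flag and group_flags are made up for this example, and the sketch accepts the single-dash form mentioned in the footnote.

```python
def is_flag(token):
    """A token is a flag if it starts with a dash that is NOT immediately
    followed by a digit (so '-0.5' is treated as a parameter)."""
    return len(token) >= 2 and token[0] == "-" and not token[1].isdigit()


def group_flags(tokens):
    """Group command-line tokens into (flag name, [parameters]) pairs.

    Hypothetical helper for illustration; a real parser does more
    (e.g. validating the parameter count of each flag).
    """
    groups = []
    for token in tokens:
        if is_flag(token):
            # Strip the leading dash(es); one or two are both accepted here.
            groups.append((token.lstrip("-"), []))
        elif groups:
            # Parameters attach to the most recent flag.
            groups[-1][1].append(token)
        else:
            raise ValueError(f"parameter {token!r} appears before any flag")
    return groups


first_command = "--dummy 2 2 --freq --make-bed --out toy_data".split()
for name, params in group_flags(first_command):
    print(name, params)
```

For the first command this yields four groups: dummy with parameters 2 and 2, freq and make-bed with no parameters, and out with toy_data. Under this rule, '-1' and '-23file' would be classified as parameters rather than flags, which is consistent with the two exceptions named in the footnote.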

The allele frequency reports are different?...

You may have noticed that the file extensions of the v1.9 and v2.0 allele frequency reports aren't the same, and there are several formatting differences between the two files, though they clearly contain the same information. This is true for many commands; PLINK 2.0 cannot generally be used as a drop-in replacement for previous PLINK versions. We realize this can be a major annoyance, and will continue maintaining v1.9 for a long time to come for those who need full backward compatibility. However, v2.0's reports are better-standardized (header lines preceded by '#', tab-delimited, column headers are consistent with VCF, etc.) and more flexible (lots of optional column sets); hopefully, this'll make your life easier and be worth some minor transitional headaches.
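Because v2.0 reports are tab-delimited with a '#'-prefixed header line, they can be loaded with very little code. The following Python sketch assumes a single header line; read_report is a hypothetical helper, and the sample text (column names and values alike) is made up for illustration rather than copied from a real run.

```python
def read_report(text):
    """Parse a tab-delimited report whose first line is a '#'-prefixed header.

    Returns a list of dicts, one per data row, keyed by column name.
    """
    lines = [line for line in text.splitlines() if line]
    if not lines or not lines[0].startswith("#"):
        raise ValueError("expected a header line starting with '#'")
    # Drop the leading '#' so the first column name is clean.
    header = lines[0][1:].split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


# Made-up sample content; column names and values are assumptions.
sample = (
    "#CHROM\tID\tREF\tALT\tALT_FREQS\tOBS_CT\n"
    "1\tsnp0\tA\tB\t0.25\t4\n"
    "1\tsnp1\tA\tB\t0.5\t4\n"
)
rows = read_report(sample)
print(rows[0]["ID"], rows[0]["ALT_FREQS"])
```

All values are kept as strings here; converting numeric columns is left to the caller, since which columns appear depends on the command and its optional column sets.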

Interpreting our flag usage summaries

The rest of this documentation has many one-line summaries describing the parameter sets accepted by particular flags, followed by discussions of flag functionality and the effects of optional parameters. We use the following conventions in our one-line usage summaries (these were adjusted in March 2019 to be more consistent with community norms):

  • <angle brackets> denote a required parameter, where the text between the brackets describes its nature.
  • ['square brackets + single-quotes'] denotes an optional modifier. Use the EXACT text in the quotes; e.g. '--freq counts' is valid given the summary

--freq ['counts'] ...

  • [{bar|separated|braced|bracketed|values}] denotes a collection of mutually exclusive optional modifiers (again, the exact text must be used). When there are no outer square brackets, one of the choices must be selected.
  • ['quoted_text='<description of value>] denotes an optional modifier that must begin with the quoted text, and be followed by a value with no whitespace in between. '|' may also be used here to indicate mutually exclusive options. E.g. '--glm perm' and '--glm mperm=10000' are both valid, while '--glm perm mperm=10000' is invalid, given the summary

--glm ['perm' | 'mperm='<value>] ...

  • [square brackets without quotes or braces] denote an optional parameter, where the text between the brackets describes its nature.
  • An ellipsis (...) indicates that you can enter multiple parameters of the specified type.
  • Many PLINK 2.0 commands accept a "column set descriptor". For example, the help text for --make-king-table is

    --make-king-table ['zs'] ['counts'] ['cols='<column set descriptor>]
      Similar to --make-king, except results are reported in the original .kin0
      text table format (with minor changes, e.g. row order is more friendly to
      incremental addition of samples), and --king-table-filter can be used to
      restrict the report to high kinship values.
      Supported column sets are:
        maybefid: FID1/FID2, if that column was in the input.   Requires 'id'.
        fid: Force FID1/FID2 even when FID was absent in the input.
        id: IID1/IID2 (column headers are actually 'ID1'/'ID2' to match KING).
        maybesid: SID1/SID2, if that column was in the input. Requires 'id'.
        sid: Force SID1/SID2 even when SID was absent in the input.
        nsnp: Number of variants considered (autosomal, neither call missing).
        hethet: Proportion/count of considered call pairs which are het-het.
        ibs0: Proportion/count of considered call pairs which are opposite homs.
        ibs1: HET1_HOM2 and HET2_HOM1 proportions/counts.
        kinship: KING-robust between-family kinship estimator.
      The default is maybefid,id,maybesid,nsnp,hethet,ibs0,kinship.
      hethet/ibs0/ibs1 values are proportions unless the 'counts' modifier is
      present.  If id is omitted, a .kin0.id file is also written.

    A valid descriptor is either
    • a comma-separated sequence of column set names (e.g. 'cols=maybefid,id,nsnp,hethet,ibs0,ibs1,kinship' would add HET1_HOM2 and HET2_HOM1 columns, while ensuring that SID columns do not appear), or
    • a comma-separated sequence of column set names where every name is preceded by a plus or minus (in which case the column sets are added/subtracted from the default, e.g. 'cols=+ibs1,-maybesid' is a shorter way to add HET1_HOM2/HET2_HOM1 and exclude SID1/SID2).
  • Background color summarizes degree of similarity to PLINK 1.9. Green signals maximal compatibility: there will usually be a minor difference in output file formats, but all information in the PLINK 1.9 output file will also be present in the PLINK 2.0 output file when the same flag and modifiers are used. (Note that green does not guarantee the absence of additional options.) Yellow signals slightly different functionality and/or command-line usage, and blue signals that the flag is new to PLINK 2.0.
  • If parts of our current implementation are known or strongly suspected to be incomplete, that is signaled with red text. So red text on a green background indicates that we plan to provide perfect compatibility, but we have more coding and/or testing to do before we get there.
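To make the two column set descriptor forms concrete, here is a minimal Python sketch that resolves a 'cols=' value for --make-king-table against the supported and default sets quoted in the help text above. The function name resolve_cols is hypothetical, the output ordering is an assumption, and the sketch ignores the dependency notes in the help text (such as "Requires 'id'"); it is not how PLINK itself is implemented.

```python
# Column sets and default copied from the --make-king-table help text.
SUPPORTED = ["maybefid", "fid", "id", "maybesid", "sid",
             "nsnp", "hethet", "ibs0", "ibs1", "kinship"]
DEFAULT = ["maybefid", "id", "maybesid", "nsnp", "hethet", "ibs0", "kinship"]


def resolve_cols(descriptor, supported=SUPPORTED, default=DEFAULT):
    """Return the selected column sets for a descriptor such as
    'maybefid,id,nsnp' (explicit list) or '+ibs1,-maybesid' (relative to default)."""
    names = descriptor.split(",")
    if any(name in ("", "+", "-") for name in names):
        raise ValueError("empty column set name")
    signed = [name[0] in "+-" for name in names]
    if any(signed) and not all(signed):
        raise ValueError("either every name or no name may have a +/- prefix")

    if all(signed):
        # Relative form: start from the default, then add/subtract.
        selected = set(default)
        for name in names:
            if name[1:] not in supported:
                raise ValueError(f"unknown column set: {name[1:]}")
            if name[0] == "+":
                selected.add(name[1:])
            else:
                selected.discard(name[1:])
    else:
        # Explicit form: exactly the listed column sets.
        for name in names:
            if name not in supported:
                raise ValueError(f"unknown column set: {name}")
        selected = set(names)

    # Assumption: report columns follow the order of the help text.
    return [name for name in supported if name in selected]


print(resolve_cols("+ibs1,-maybesid"))
print(resolve_cols("maybefid,id,nsnp,hethet,ibs0,ibs1,kinship"))
```

Both calls select the same column sets, matching the statement above that the plus/minus form is a shorter way to express the same request.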

If you're already familiar with PLINK, this should help you skim over stuff you already know. If there are just one or two flags you need to look up, you can quickly find what you need in the sidebar; try the search box if the correct page isn't immediately apparent.

For the newer bioinformaticians out there, here's our first full flag description.

Setting the output file prefix

--out <prefix>

By default, the output files generated by PLINK 2.0 all have names of the form 'plink2.<one of these extensions>'. This is fine for a single run, but once you run PLINK more than once in the same directory, later runs will overwrite same-named results from earlier runs.

Therefore, you usually want to choose a different output file prefix for each run. --out causes 'plink2' to be replaced with the prefix you provide. E.g. in the example above, '--out test2' caused PLINK 2 to create a file named test2.afreq instead of plink2.afreq.
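The naming scheme can be summarized in a short Python sketch. Both helpers are hypothetical and exist only to illustrate the '<prefix>.<extension>' pattern and why reusing a prefix across runs is risky; they do not invoke PLINK.

```python
from collections import Counter


def output_name(extension, prefix="plink2"):
    """Name of an output file: '<prefix>.<extension>'.
    'plink2' is the prefix used when --out is not given."""
    return f"{prefix}.{extension}"


def reused_prefixes(planned_prefixes):
    """Return (sorted) the prefixes that more than one planned run would use,
    i.e. the runs whose same-named output files would overwrite each other."""
    counts = Counter(planned_prefixes)
    return sorted(prefix for prefix, n in counts.items() if n > 1)


print(output_name("afreq"))                   # no --out given
print(output_name("afreq", prefix="test2"))   # with '--out test2'
print(reused_prefixes(["plink2", "test2", "plink2"]))
```

A check like reused_prefixes can be handy when scripting many runs, since it flags collisions before anything is overwritten.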

Since the prefix is a required parameter, invoking --out without it will cause PLINK 2 to quit during command line parsing:

[chrchang:~/plink-ng]$ ./plink2 --bfile toy_data --freq --out
PLINK v2.00a2 AVX2 (6 Mar 2019)                www.cog-genomics.org/plink/2.0/
(C) 2005-2019 Shaun Purcell, Christopher Chang   GNU General Public License v3
Error: Missing --out parameter.
For more info, try 'plink2 --help <flag name>' or 'plink2 --help | more'.

In the rest of this documentation, we will continue highlighting full command lines in purple, default parameter values in orange, and sample parameter values you can freely change in green.

Citation instructions

If you use PLINK 2.0 in any published work, please cite both the software (as an electronic resource/URL):

Package : PLINK [version]
Authors : Shaun Purcell, Christopher Chang
URL     : www.cog-genomics.org/plink/2.0/

and the manuscript(s) describing the methods you used. Our primary methods paper is:

Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4.

PLINK 2.0 includes implementations of many analyses that were developed by other teams. The original sources are summarized below.
