D: 20 Jun 2018

PLINK 2.00 alpha

PLINK 2.0 alpha was developed by Christopher Chang, with support from GRAIL, Inc. and Human Longevity, Inc., and substantial input from Stanford's Department of Biomedical Data Science. (More detailed credits.)

Binary downloads

                        Build
Operating system        Development (20 Jun)    Alpha 1 final (11 Feb)
Linux AVX2 Intel [1]    download                download
Linux 64-bit Intel [1]  download                download
Linux 32-bit            download                download
OS X AVX2               download                download
OS X 64-bit             download                download
Windows AVX2            download                download
Windows 64-bit          download                download
Windows 32-bit          download                download

1: These builds can still run on AMD processors, but they're statically linked to Intel MKL, so some linear algebra operations will be slow. We will try to provide an AMD Zen-optimized build as soon as supporting libraries are available.

Source code and build instructions are available on GitHub. (Here's another copy of the source code.)

Recent version history

20 Jun 2018: Fix GRM/PCA/score-computation bug introduced on 30 May. If you used the 30 May or an early June build for GRM/--pca/--score, you should repeat the operation(s) with this build; apologies for the error.

19 Jun: Fixed rare --ref-allele/--alt1-allele corner case which could occur when a missing allele was replaced with a very long allele.

5 Jun: VCF import uninitialized-variable bugfix. --score 'ignore-dup-ids' modifier added.

30 May: "--export haps{legend}" bugfixes and bgzip support. "--export vcf vcf-dosage=DS" no longer exports undeclared HDS values when phase information is present. Unbreak --import-dosage + --map, for real this time.

21 May: --pgen-info command added (displays basic information about a .pgen file, such as whether it has any phase or dosage data).

20 May: Unbreak --import-dosage + --map.

17 May: --import-dosage and .gen import were broken for the last several weeks; this should be fixed now. A1 column added to --adjust output in preparation for multiallelic variants. --glm 'a0-ref' modifier renamed to 'omit-ref'.

15 May: Fixed chrX allele frequency computation bug when dosages are present. --ld modified to be based on major instead of reference alleles, to play better with multiallelic variants. --hardy header line and allele columns changed in preparation for multiallelic variant support.

8 May: --vcf dosage=HDS should now handle files with no DS field properly.

7 May: Fixed rare I/O deadlock. Improved VCF-import parallelism.

4 May: Fixed --bgen import/export when dosage precision bits isn't a multiple of 8 (previously misinterpreted the spec for those cases, sorry about that).

3 May: --bgen can now import variant records with up to 28 bits of dosage precision (though only ~15 bits will survive). "--export vcf-dosage=HDS-force" bugfix.

2 May: --vcf dosage= import no longer requires GT field to be present. Fixed potential --vcf dosage=HDS buffer overflow.

28 Apr: Fixed a --glm bug which occurred when autosomes and sex chromosome(s) were both present, or both chrX and chrY were present. If you performed a whole-genome --glm run with the 9 Feb 2018 build or later, you should rerun with the latest build. However, single-chromosome and autosome-only --glm runs were unaffected by the bug.

24 Apr: VCF phased-dosage import ("--vcf dosage=HDS") and export ("--export vcf vcf-dosage=HDS"). --pca and GRM computation now use correct variance for all-haploid genomes.

22 Apr: --export bgen-1.2/bgen-1.3 should now work for chrX/chrY/chrM; also fixed import bugs for those chromosomes.

16 Apr: --ref-from-fa contig line parsing bugfix.

14 Apr: --export bgen-1.2/bgen-1.3 implemented for autosomal diploid data. Operations like --pca which require decent allele frequencies now error out when frequencies are being estimated from fewer than 50 samples, unless you add the --bad-freqs flag. Phased dosage support implemented. Sample missingness rate in exported .sample files is now based on dosages rather than hardcalls. Non-AVX2 phase subsetting bugfix. --vcf + --psam bugfix. --vcf dosage= now ignores the hardcall when a dosage is present; instead, it's regenerated under --hard-call-threshold 0.1 (unless you specified a different threshold). --bgen 'ref-second' modifier renamed to 'ref-last', to generalize properly to multiallelic variants.

31 Mar: --export haps{legend} should now work properly when --ref-allele/--ref-from-fa/etc. flips some alleles in the same run.

29 Mar: --set-{missing,all}-var-ids non-AVX2 bugfix. --pheno/--covar autonaming bugfix.

28 Mar: --bgen 1-bit phased haplotype import implemented.

26 Mar: --make-bed + --indiv-sort bugfix.

23 Mar: Windows builds should work properly again (the 20-21 Mar Windows builds were badly broken). --glm now supports log-pvalue output (add the 'log10' modifier), and these remain accurate below the double-precision floating point limit of p=5e-324.
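
To illustrate why log-scale output matters: a two-sided p-value for a large test statistic underflows to zero in double precision, while its base-10 logarithm is still easy to represent. The sketch below is not PLINK's implementation; the function name and the choice of a normal-tail series expansion are assumptions made for illustration only.

```python
import math


def two_sided_log10_p(z: float) -> float:
    """Return log10 of the two-sided normal p-value for a Z-statistic.

    Illustrative only; this is not how plink2 computes its 'log10' output.
    """
    z = abs(z)
    if z < 10.0:
        # The p-value is still comfortably representable here
        # (roughly 1.5e-23 at z = 10), so compute it directly.
        return math.log10(math.erfc(z / math.sqrt(2.0)))
    # Tail expansion p ~ 2 * phi(z) / z * (1 - 1/z^2 + 3/z^4),
    # evaluated entirely in log space so nothing underflows.
    ln_phi = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    series = 1.0 - 1.0 / z**2 + 3.0 / z**4
    ln_p = math.log(2.0) + ln_phi - math.log(z) + math.log(series)
    return ln_p / math.log(10.0)


# The direct computation underflows to 0.0 at z = 40 ...
print(math.erfc(40.0 / math.sqrt(2.0)))
# ... but the log10 p-value is still finite (about -349.1).
print(two_sided_log10_p(40.0))
```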

21 Mar: 3-column .sample file loading works properly again. Fixed a file-reading race condition.

20 Mar: Fix possible deadlock in recent builds when loading very long lines.

19 Mar: Fix --sample segfault in recent builds. .bgen import/export speed improvement. --oxford-single-chr wasn't extended correctly in the 4 Mar build; this should be fixed now.

11 Mar: Fix --pheno segfault in last week's builds that could occur when the file didn't have a header line.

9 Mar: Fix "File write failure" bug that occurred when a single write operation was larger than 2 GB (this could occur when running --make-bed with more than 128k samples). Reduced --make-bed memory requirement.

7 Mar: Fixed potential file-reading deadlock in recent builds (23 Feb or later).

5 Mar: --glm local-covar= should work properly again.

4 Mar: --oxford-single-chr can now be used on .bgen files. --make-pgen partially-phased data handling bugfix.

26 Feb: --keep/--remove/etc. should work properly now on IID-only files with no header line.

23 Feb: Fixed alpha 2 --vcf + --id-delim bug. Improved parsing speed for compressed VCF and .pvar files.

20 Feb: "--xchr-model 1" should work properly now.

16 Feb 2018 (alpha 2): This makes the following potentially compatibility-breaking changes:

  • FID is now an optional field: if it isn't in the input .psam file, it's omitted from several output files by default (these now have 'maybefid' and 'fid' column sets, where the default set includes 'maybefid'), and treated as always-'0' by any operation which requires FID values (such as --make-bed). When exporting genomic data files, 'maybefid' also treats the column as missing if all remaining values are '0'.
  • Relatedly, when importing sample IDs from a VCF or .bgen file, the default mode is now "--const-fid 0", and no FID column will be written to disk at all. --keep, --remove, and similar commands also now have "--const-fid 0" semantics when an input line contains only one token. You can now act as if IID is the only sample ID component, if that's what makes the most sense for your workflow. Conversely, it is now necessary to explicitly use --id-delim when you want to split the VCF/.bgen sample IDs into multiple components.
  • MT is treated as a haploid chromosome again. In PLINK 1.9 and earlier plink2 builds, MT was treated as diploid-ish to avoid throwing away information about heteroplasmic mutations; as a consequence, the --glm(/--linear/--logistic) genotype column and commands like "--freq counts" used a 0..2 scale. Now that plink2 has proper support for dosages, this kludge is no longer necessary.
  • --glm's 't' column set has been renamed to 'tz', to reflect it being a T-statistic for linear regression but a Wald Z-score for logistic/Firth. The corresponding column in .glm.logistic{.hybrid} and .glm.firth files now has 'Z_STAT' in the header line.

Also, --glm now defaults to regressing on minor instead of ALT allele dosages (this can be overridden with 'a0-ref').

The final alpha 1 build has been tagged in GitHub, and will remain downloadable from here for the next few months.
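
The sample-ID rules described in the alpha 2 notes above can be sketched roughly as follows. This is a simplified illustration, not plink2's parser: the function names are hypothetical, and the two-component FID/IID split is an assumption (the notes only say that --id-delim splits IDs into "multiple components").

```python
from typing import Optional, Tuple


def parse_id_line(line: str) -> Tuple[str, str]:
    """Parse one line of a headerless --keep/--remove style ID file.

    A single token is treated as an IID with FID '0', mirroring the
    "--const-fid 0" semantics described above.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    if len(tokens) == 1:
        return ("0", tokens[0])
    return (tokens[0], tokens[1])


def split_import_sample_id(sample_id: str,
                           delim: Optional[str] = None) -> Tuple[str, str]:
    """Turn a VCF/.bgen sample ID into an (FID, IID) pair.

    Without an explicit delimiter the whole string is the IID and FID is '0'.
    With a delimiter, this sketch only accepts exactly two components.
    """
    if delim is None:
        return ("0", sample_id)
    parts = sample_id.split(delim)
    if len(parts) != 2:
        raise ValueError(f"expected 2 components in {sample_id!r}")
    return (parts[0], parts[1])


print(parse_id_line("NA12878"))                     # ('0', 'NA12878')
print(split_import_sample_id("FAM1_NA12878"))       # ('0', 'FAM1_NA12878')
print(split_import_sample_id("FAM1_NA12878", "_"))  # ('FAM1', 'NA12878')
```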

11 Feb: .king.cutoff.in/.king.cutoff.out files now end in .id, for consistency with other output files with sample IDs and no other information. Similarly, --mind's output file now has the extension .mindrem.id and defaults to having a header line. You can now use --no-id-header to suppress the header line (and force the columns to be FID/IID) in all .id output files.

10 Feb: --update-sex 'male0' option added, and custom column selection interface changed (now 'col-num='). --glm 'gcountcc' column names updated (now 'CASE_NON_A1_CT', 'CASE_HET_A1_CT', etc.) in preparation for switch to A1=major allele. --make-just-pvar + --ref-allele/--ref-from-fa no longer treats all initial reference alleles as provisional when the input .pvar has a header line.

9 Feb: Forcing .pvar QUAL/FILTER output when no such values are loaded no longer causes a segfault.

5 Feb: AVX2 phase-subsetting bugfix.

3 Feb: --score 'dominant' and 'recessive' modifiers added.

30 Jan: Fix .pgen writing bug which occurred when the number of variants was a multiple of 64 and the number of samples was large.

24 Jan: "--export oxford" now supports bgzipped output.

21 Jan: --glm now always reports an additional 'A1' column, indicating which allele(s) correspond to positive genotype column values. --glm column sets have been changed to revolve around A1 instead of ALT, so minor script modifications may be necessary when switching to this build.
In this build, A1 and ALT are still synonymous. This will change in alpha 2: A1 will default to the minor allele(s) to reduce multicollinearity (imitating PLINK 1.x's behavior in the absence of --keep-allele-order), though you will still have the option of forcing A1=ALT.

12 Jan: Fixed "--glm interaction" bug that occurred when multiple consecutive variants had no missing calls. We recommend redoing all --glm runs with the 'interaction' modifier which were performed with a build produced between 27 Nov 2017 and 10 Jan 2018 inclusive.

10 Jan: --adjust-file implemented (performs --adjust's multiple-testing correction on any association analysis file).

9 Jan: Added 'no-idheader' modifiers to a few commands, and made that the default for --make-grm-bin/--make-grm-list to avoid breaking interoperability.

7 Jan: --vcf can now be given a sites-only VCF when the run doesn't require genotype data. Sample ID files, such as those produced by --write-samples, now include a header line by default; this will be necessary to distinguish between FID-IID and IID-SID output in the future. (With --write-samples, you can suppress the header line by adding the 'noheader' modifier.)

5 Jan: --pheno-col-nums/--covar-col-nums implemented.

2 Jan 2018: --keep-fcol (equivalent to PLINK 1.x --filter) implemented.

19 Dec 2017: --adjust implemented. --zst-level implemented (lets you control Zstd compression level). Un-broke --rerun.

18 Dec: --extract/--exclude can now be used directly on UCSC interval-BED files (ok for coordinates to be 0-based or for no 4th column to be present). "--output-chr 26" now causes PAR1/PAR2 to be rendered as '25' (for humans), to restore interoperability with programs like ADMIXTURE which can't handle alphabetic chromosome codes. --merge-x implemented (usually needs to be combined with --sort-vars now). --pvar can usually handle 'sites-only' VCF files (e.g. those released by the gnomAD project) now. --thin, --thin-count, --thin-indiv, and --thin-indiv-count implemented.
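
As a reminder of what 0-based interval-BED coordinates mean for the interval-BED support above, here is a minimal sketch. It assumes the standard UCSC convention (0-based start, exclusive end) and 1-based variant positions; the function names are hypothetical, and this is not plink2's code.

```python
from typing import Tuple

Interval = Tuple[str, int, int]


def parse_bed_line(line: str) -> Interval:
    """Parse a UCSC BED line; only the first three columns are required."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start, end")
    return (fields[0], int(fields[1]), int(fields[2]))


def contains(interval: Interval, chrom: str, pos: int) -> bool:
    """Check whether a 1-based position falls inside a 0-based, half-open interval."""
    bed_chrom, start0, end = interval
    # 0-based [start0, end) covers the 1-based positions start0+1 .. end.
    return chrom == bed_chrom and start0 < pos <= end


region = parse_bed_line("chr1\t100\t200")
print(contains(region, "chr1", 100))  # False
print(contains(region, "chr1", 101))  # True
print(contains(region, "chr1", 200))  # True
```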

16 Dec: Multithreaded zstd compression implemented (on Linux and OS X). --make-grm-gz renamed to --make-grm-list, and gzip mode removed.

15 Dec: Fixed --extract-if-info and --exclude-if-info's behavior for non-numeric values which start with a number. Existence-checking flags renamed to --require-info and --require-no-info for naming consistency.

13 Dec: --extract-if-info and --exclude-if-info flags added, for simple filtering on INFO key/value pairs or key existence.

11 Dec: --king-table-subset flag added. This makes it straightforward to perform two-stage relationship/duplicate detection: start with --make-king-table on a small number of higher-MAF variants scattered across the genome, and then rerun it with --king-table-subset on an appropriate subset of candidate sample pairs from the first stage. --bp-space implemented (useful for the first stage above).
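
A rough sketch of how the two stages might be scripted is shown below. It only assembles argument lists and does not run anything; the file prefixes, the MAF cutoff, and the spacing value are illustrative assumptions, and selecting the candidate pairs from the first-stage output is left out.

```python
from typing import List


def king_stage1_args(pfile: str, out: str,
                     min_maf: float, bp_space: int) -> List[str]:
    """Stage 1: KING table on a sparse set of higher-MAF variants."""
    return [
        "plink2", "--pfile", pfile,
        "--maf", str(min_maf),         # keep higher-MAF variants only
        "--bp-space", str(bp_space),   # scatter the variants across the genome
        "--make-king-table",
        "--out", out,
    ]


def king_stage2_args(pfile: str, candidate_pairs: str, out: str) -> List[str]:
    """Stage 2: rerun on all variants, restricted to candidate sample pairs."""
    return [
        "plink2", "--pfile", pfile,
        "--make-king-table",
        "--king-table-subset", candidate_pairs,
        "--out", out,
    ]


# Hypothetical file names and thresholds, for illustration only.
print(king_stage1_args("cohort", "stage1", 0.2, 50000))
print(king_stage2_args("cohort", "candidate_pairs.kin0", "stage2"))
```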
The two-stage workflow was first implemented by Wei-Min Chen in a recent version of KING; contact him for citation information.

7 Dec: Fixed bug which could occur when filtering samples from a phased dataset. Windows AVX2 build now available.

28 Nov: --import-dosage 'format=infer' (this is now the default) and 'id-delim=' (needed for reimport of "--export A-transpose" data) options added. Fixed --import-dosage bug that caused it to error out on missing genotypes under format=1. --no-psam-pheno (or --no-pheno/--no-fam-pheno) can now be used to ignore all phenotypes in the sample file, while keeping the phenotype(s) in the --pheno file if one was specified.

27 Nov: Implemented fast path for --glm no-missing-genotype case (mainly affects linear regression). --make-king{-table} can now automatically handle matrices too large to fit in memory without explicit use of --parallel. AVX2 sample filtering performance improvement. --validate bugfix.

19 Nov: Fix VCF FORMAT:GT header line parsing bug introduced in 14 Nov build.

18 Nov: --make-king{-table} performance improvements.

16 Nov: Fixed bug in 14 Nov build that broke ##chrSet header line parsing.

14 Nov: Fixed bug that caused --export A{D} to hang when the number of variants was between 65 and about a thousand.

4 Nov: Linux and OS X prebuilt AVX2 binaries now available; these should work well on most machines built within the last 4 years. Fixed another Firth regression spurious NA bug. Fixed --score bug that occurred when sample filter(s) were applied simultaneously. Fixed a --ld phased-hardcall handling bug. Array-popcount upgrade in progress (thanks to recent work by Wojciech Muła, Nathan Kurz, Daniel Lemire, and Kim Walisch).

3 Nov: Fixed multipass --export A{D} bug. --dummy dosage-freq= now fills in hardcalls with the default --hard-call-threshold cutoff of 0.1 when --hard-call-threshold is not explicitly specified.

2 Nov: --export A{D} implemented (with dosage support). --dummy dosage-freq= modifier now works properly for dosage frequencies above 0.75.

16 Oct: --ref-from-fa flag implemented, to set reference alleles from a FASTA file. (Note that this may be unable to determine which allele is reference when length changes are involved, but it should always work for SNPs and multi-nucleotide polymorphisms.) --update-name implemented. Fixed column-set parsing bug in 13 Oct build.

13 Oct: Fixed --glm logistic/Firth regression bug which could produce spurious NA results.

9 Oct: Fixed --ld's handling of some dosage and haploid cases. Fixed bug which could cause --make-pgen to discard phase/dosage information when extracting a small variant subset. --geno-counts no longer double-reports chrY counts.

8 Oct: --ld implemented, with supported for phased genotypes and dosages (try "--ld [var1] [var2] dosage"). Fixed tiny bgen-1.1 import bug that triggered when the number of threads exceeded the number of variants. Allele frequency computation no longer crashes on chrX when dosages are present but only hardcalls are needed.

1 Oct: Fixed GRM computation bug which sometimes caused segfaults when both dosages and missing values were present. --glm is now a bit faster when many covariates are present.

20 Sep: Firth regression Hessian matrix inversion step raised to double-precision, after last week's builds revealed that single-precision inversion could be unreliable.

15 Sep: --vif/--max-corr per-variant checks are now working. These are no longer skipped during logistic regression.

8 Sep: Alternative VCF INFO:PR fields are now tolerated. Removed debug code that slowed down yesterday's --make-pgen.

7 Sep: --score uninitialized memory bugfix. Partially-phased data handling bugfix.

6 Sep: Fix OS X stack size issue (could cause --pca and some other commands to crash in recent builds; 1 Sep build had an incomplete workaround).

4 Sep: --{covar-}variance-standardize missing value handling bugfix. --ref-allele/--alt1-allele implemented (--a2-allele and --a1-allele are treated as aliases).

1 Sep: --{pheno/covar}-quantile-normalize missing-phenotype handling bugfix.

29 Aug: --glm 'gcountcc' column set option added (reports genotype hardcall counts, stratified by case/control status). --write-samples command added (analogous to --write-snplist).

2 Aug: --sort-vars implemented.

25 Jul: --loop-cats now works properly with genotype-based variant filters.

24 Jul: Fixed '--pca approx' allele frequency handling bug introduced in 4 Jun build; we recommend redoing any '--pca approx' runs performed with an affected build. (Regular --pca was not affected.) --loop-cats implemented (similar to PLINK 1.x --loop-assoc, except it's not restricted to association tests). VCF export now supports 'vcf-dosage=DS-force' mode. --dummy multithread + dosage bugfix.

17 Jul: BGEN v1.2/1.3 importer memory allocation bugfix. Size of failed allocation is now logged on most out-of-memory errors.

2 Jul: Improved multithreading in BGEN v1.2/1.3 importer. Python writer can now be called with multiple variants at a time.

25 Jun: Basic BGEN v1.2/1.3 import (unphased biallelic dosages; suffices for main UK Biobank data release). --warning-errcode flag added (causes an error code to be returned to the OS on exit when at least one warning is printed).

20 Jun: --condition-list + variant filter bugfix.

5 Jun: --make-pgen memory requirement greatly reduced. End time now printed to console in most situations.

4 Jun: --hwe no longer causes a segfault when chrX is present and no gender information is available. Fixed --dummy bug.

29 May: --import-dosage format=1 bugfix.

26 May: --glm 'standard-beta' modifier replaced with --variance-standardize flag. --quantile-normalize function added. Fixed a missing-sex allele counting bug.

25 May: --hardy/--hwe works properly again when chrX is present but not at the beginning of the dataset.

22 May: Fixed major dosage data + sample-filter bug; we recommend rerunning any operations involving both dosage data and sample filtering performed with earlier plink2 builds. --score 'list-variants' modifier added.

19 May: Fixed a bug with allele frequency computation on dosage data when sample filter(s) are applied.

18 May: Many categorical phenotype-handling flags (--within, --keep-cats, --split-cat-pheno, ...) implemented. Basic phenotype-based filtering implemented (e.g. "--remove-if PHENO1 '>' 2.5"; note that unnamed phenotypes are assigned the names 'PHENO1', 'PHENO2', etc., and that the '<' and '>' characters must be quoted in most shells). --write-covar implemented. --mach-r2-filter implemented, and raw MaCH r2 values can be dumped with "--freq cols=+machr2".
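
When commands like the --remove-if example above are generated from a script, the quoting can be delegated to the standard library. The sketch below uses Python's shlex module; the 'data' fileset prefix and the helper name are hypothetical.

```python
import shlex
from typing import Sequence


def plink2_command(args: Sequence[str]) -> str:
    """Join arguments into one shell-safe command string."""
    return shlex.join(["plink2", *args])


cmd = plink2_command(
    ["--pfile", "data", "--remove-if", "PHENO1", ">", "2.5", "--make-pgen"]
)
print(cmd)  # plink2 --pfile data --remove-if PHENO1 '>' 2.5 --make-pgen
```

Passing the argument list directly to subprocess.run without a shell avoids the quoting problem altogether, since no shell ever interprets the '>' character.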

11 May: --condition{-list} + --covar bugfix.

8 May: Fix quantitative phenotype/covariate loading bug introduced in 6 May build.

7 May: --import-dosage implemented.

6 May: Fixed bug which caused '0' to be treated as control instead of missing for binary phenotypes. Minor change to --glm's column headers, in preparation for multiallelic data.

2 May: --score bugfix. --maj-ref bugfix. --vcf-min-dp and '--export A-transpose' implemented.

1 May: VCF dosage import/export, --vcf-min-gq, and --read-freq implemented. --score can now work with standard errors. --autosome{-par} now works properly. SNPHWE2 and SNPHWEX functions relicensed as GPL-2+, to enable inclusion in the HardyWeinberg R package.

20 April: .sample export bugfix (didn't work if file was over 256 KB and no phenotypes were present). --dummy implemented (can now generate dosages).

19 April: --hardy/--hwe chrX bugfix (thanks to Jan Graffelman for catching the problem and validating the fix). --new-id-max-allele-len now has three modes ('error', 'missing', and 'truncate'), and the default mode is now 'error' (i.e. --set-missing-var-ids and --set-all-var-ids now error out when an allele code longer than 23 characters is encountered, instead of silently truncating). --score implemented, and extended to support variance-normalization and multiple score columns (these two features provide a simple way to project new samples onto previously computed principal components).

11 April: --pca var-wts bugfix, and --pca eigenvalue ordering bugfix. --glm linear regression and --condition{-list} support added. --geno/--mind/--missing/--genotyping-rate can now refer to missing dosages instead of just missing hardcalls (note that, when importing dosage data, dosages in (0.1, 0.9) and (1.1, 1.9) are saved but there usually won't be associated hardcalls).
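
The relationship between dosages and hardcalls noted above can be sketched as follows, assuming the default threshold of 0.1. This is an illustration rather than plink2's code: the function name is hypothetical, and the handling of values that sit exactly on a boundary is an assumption.

```python
from typing import Optional


def hardcall_from_dosage(dosage: float,
                         threshold: float = 0.1) -> Optional[int]:
    """Return the hardcall (0, 1, or 2) for a dosage, or None if missing.

    A dosage gets a hardcall only when it lies within `threshold` of an
    integer; with the default 0.1, values well inside (0.1, 0.9) or
    (1.1, 1.9) have no hardcall even though the dosage itself is kept.
    """
    if not 0.0 <= dosage <= 2.0:
        raise ValueError("dosage must be in [0, 2]")
    nearest = round(dosage)
    if abs(dosage - nearest) <= threshold:
        return int(nearest)
    return None


print([hardcall_from_dosage(d) for d in (0.05, 0.5, 0.95, 1.5, 1.97)])
# [0, None, 1, None, 2]
```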

20 March 2017: Initial public release.


What's new?

  • Preservation of reference alleles (without requiring constant use of --keep-allele-order), phase information, and the VCF QUAL, FILTER, and INFO fields. Use --make-pgen instead of --make-bed when importing a VCF; the fileset can then be referenced with --pfile. We will provide 1000 Genomes phase 3 downloads in the new fileset format as soon as multiallelic variants are also supported.
  • The new .pgen file format incorporates SNPack-style genotype compression, frequently reducing file sizes by 80+% with negligible computational cost. To allow users to take advantage of genotype compression without sacrificing compatibility with scripts expecting old-style .bim and .fam text files, PLINK 2.0 supports a hybrid .pgen + .bim + .fam usage mode (--make-bpgen/--bpfile). We've also provided a Python library for reading and writing .pgen files, to simplify migration to the new format. (PLINK 1 .bed files are valid .pgen files, so code written on top of the library is backward-compatible.)
  • Firth regression ('--glm firth-fallback', '--glm firth'). Standard logistic regression fails to converge, yielding 'NA' or nonsense results, when the 2x2 allele/phenotype contingency table has an empty cell ("quasi-complete separation"); this is common, and especially likely to happen with the strongest associations. Firth regression can prevent you from missing these associations. The fast 'firth-fallback' mode (only use Firth regression when there's either an empty contingency table cell or regular-logistic-regression convergence failure) gets you most of the benefit for a fraction of the computational cost.
  • '--pca approx' (equivalent to EIGENSOFT 6 fastmode with default parameters). If you have more than ten thousand samples, only need the top principal components, and can tolerate ~0.1% error in the last PC, this can save you a ton of compute time.
  • The 64-bit Linux build can handle linear algebra on matrices with more than 2^31 elements (so regular --pca is no longer limited to ~46000 samples), as long as your system has enough memory.
  • KING-robust kinship coefficients (--make-king, --make-king-table, --king-cutoff). These remain accurate when good population allele frequency estimates are unavailable. We have found --king-cutoff to be much more reliable than the PLINK 1.9 --rel-cutoff flag for removal of close relations.
  • Proper support for dosages (decimal allele count expected values). When .gen/.bgen files are imported, hardcalls and dosages are saved to the .pgen. Operations which naturally extend to decimals (e.g. --pca, --glm, --freq, --maf/--mac) use the dosage information when it's present, while methods that can only make use of hardcalls (e.g. KING-robust, Hardy-Weinberg exact test) simply ignore the dosages. --hard-call-threshold can now be used to change the saved hardcalls without changing the dosages.
  • Much more multithreaded code.
  • Most commands let you control which columns appear in the main output file(s).
  • Broad support for both gzipped and Zstd-compressed text input files.
  • Graffelman and Weir's extended chrX Hardy-Weinberg exact test, which takes male allele frequencies into account. We've found that this tends to identify quite a few obviously miscalled chrX variants which were not caught by the usual QC filters.
  • Oxford-style haplotype filesets can now be imported and exported (--haps, '--export haps'/'--export hapslegend').
  • Sample-major PLINK binary files can now be efficiently exported ('--export ind-major-bed'). This is close to 3 orders of magnitude faster than the previous implementation (PLINK 1.07 --make-bed + --ind-major).
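
The ~46000-sample figure in the linear-algebra item above can be checked with a little arithmetic, assuming the old limit was 2^31 (2,147,483,648) matrix elements and that the limiting matrix is square, with one row and one column per sample. The function name is hypothetical.

```python
import math

MAX_ELEMENTS = 2**31  # assumed element limit of the older builds


def max_square_dimension(max_elements: int = MAX_ELEMENTS) -> int:
    """Largest n such that an n-by-n matrix has at most max_elements entries."""
    return math.isqrt(max_elements)


print(max_square_dimension())  # 46340
```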

Coming next

  1. Multiallelic variant support.
  2. BCF2 import/export.
  3. Merge. (Once this is operational, a stable version of the .pgen specification will be provided, and PLINK 2.0 beta testing will begin.)
