Introduction, downloads

D: 20 Mar 2017

PLINK 2.00 alpha

PLINK 2.0 alpha was developed by Christopher Chang and the Human Longevity, Inc. Data Science team, with substantial input from Stanford's Department of Biomedical Data Science. (More detailed credits.)

Binary downloads

Operating system          Build: Development (20 Mar)
Linux 64-bit Intel [1]    download
Linux 32-bit              download
OS X (64-bit)             download
Windows 64-bit            download
Windows 32-bit            download

1: This build can still run on AMD processors, but it's statically linked to Intel MKL, so some linear algebra operations will be slow. We will try to provide an AMD Zen-optimized build as soon as supporting libraries are available.

Source code and build instructions are available on GitHub.

Recent version history

20 March 2017: Initial public release.


What's new?

  • Preservation of reference alleles (without requiring constant use of --keep-allele-order), phase information, and the VCF QUAL, FILTER, and INFO fields. Use --make-pgen instead of --make-bed when importing a VCF; the fileset can then be referenced with --pfile. We will provide 1000 Genomes phase 3 downloads in the new fileset format as soon as multiallelic variants are also supported.
  • The new .pgen file format incorporates SNPack-style genotype compression, frequently reducing file sizes by 80+% with negligible computational cost. To allow users to take advantage of genotype compression without sacrificing compatibility with scripts expecting old-style .bim and .fam text files, PLINK 2.0 supports a hybrid .pgen + .bim + .fam usage mode (--make-bpgen/--bpfile). We've also provided a Python library for reading and writing .pgen files, to simplify migration to the new format. (PLINK 1 .bed files are valid .pgen files, so code written on top of the library is backward-compatible.)
  • Firth regression ('--glm firth-fallback', '--glm firth'). Standard logistic regression fails to converge, yielding 'NA' or nonsense results, when the 2x2 allele/phenotype contingency table has an empty cell ("quasi-complete separation"); this is common, and especially likely to happen with the strongest associations. Firth regression can prevent you from missing these associations. The fast 'firth-fallback' mode (which uses Firth regression only when there's an empty contingency table cell or regular logistic regression fails to converge) gets you most of the benefit for a fraction of the computational cost.
  • '--pca approx' (equivalent to EIGENSOFT 6 fastmode with default parameters). If you have more than ten thousand samples, only need the top principal components, and can tolerate ~0.1% error in the last PC, this can save you a ton of compute time.
  • The 64-bit Linux build can handle linear algebra on matrices with more than 2^31 elements (so regular --pca is no longer limited to ~46000 samples), as long as your system has enough memory.
  • KING-robust kinship coefficients (--make-king, --make-king-table, --king-cutoff). These remain accurate when good population allele frequency estimates are unavailable. We have found --king-cutoff to be much more reliable than the PLINK 1.9 --rel-cutoff flag for removal of close relations.
  • Proper support for dosages (decimal allele count expected values). When .gen/.bgen files are imported, hardcalls and dosages are saved to the .pgen. Operations which naturally extend to decimals (e.g. --pca, --glm, --freq, --maf/--mac) use the dosage information when it's present, while methods that can only make use of hardcalls (e.g. KING-robust, Hardy-Weinberg exact test) simply ignore the dosages. --hard-call-threshold can now be used to change the saved hardcalls without changing the dosages.
  • Much more multithreaded code.
  • Most commands let you control which columns appear in the main output file(s). For example, the help text for --make-king-table is

    --make-king-table <zs> <counts> <cols=[column set descriptor]>
      Similar to --make-king, except results are reported in the original .kin0
      text table format (with minor changes, e.g. row order is more friendly to
      incremental addition of samples), and --king-table-filter can be used to
      restrict the report to high kinship values.
      Supported column sets are:
        id: FID1/ID1/FID2/ID2.
        maybesid: SID1/SID2, if at least one value is nonmissing.  Must be used
                  with 'id'.
        sid: Force SID1/SID2 even when all values are missing.
        nsnp: Number of variants considered (autosomal, neither call missing).
        hethet: Proportion/count of considered call pairs which are het-het.
        ibs0: Proportion/count of considered call pairs which are opposite homs.
        ibs1: HET1_HOM2 and HET2_HOM1 proportions/counts.
        kinship: KING-robust between-family kinship estimator.
      The default is id,maybesid,nsnp,hethet,ibs0,kinship.  hethet/ibs0/ibs1
      values are proportions unless the 'counts' modifier is present.  If id is
      omitted, a .kin0.id file is also written.

    A "column set descriptor" is either a comma-separated sequence of column set names (e.g. 'cols=id,nsnp,hethet,ibs0,ibs1,kinship' would add HET1_HOM2 and HET2_HOM1 columns, while ensuring that SID columns do not appear), or a comma-separated sequence of column set names where every name is preceded by a plus or minus (in which case the column sets are added/subtracted from the default, e.g. 'cols=+ibs1,-maybesid' is a shorter way to add HET1_HOM2/HET2_HOM1 and exclude SID1/SID2).
    • What's SID, you ask? It's an optional third sample ID component which can be used to distinguish samples from the same individual. (A cautionary note: SID support isn't well-tested yet. But this should change soon.)
    • And what's 'zs'? That requests Zstandard compression of the main output file. All PLINK 2.0 input text files are permitted to be gzip- or Zstd-compressed. When working disk space is limited and you still need to generate gzipped output, you can start with Zstd-compressed output and then e.g. pipe the output of --zst-decompress to pigz.
  • Graffelman and Weir's extended chrX Hardy-Weinberg exact test, which takes male allele frequencies into account. We've found that this tends to identify quite a few obviously miscalled chrX variants which were not caught by the usual QC filters.
  • Oxford-style haplotype filesets can now be imported and exported (--haps, '--export haps'/'--export hapslegend').
  • Sample-major PLINK binary files can now be efficiently exported ('--export ind-major-bed'). This is close to 3 orders of magnitude faster than the previous implementation (PLINK 1.07 --make-bed + --ind-major).
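
The "column set descriptor" rules described above can be sketched in a few lines of Python. This is only an illustration of the two descriptor forms, not PLINK's actual parser: the function name resolve_column_sets is hypothetical, the set names and default are copied from the --make-king-table help text, and reporting the result in help-text order is an assumption. The "maybesid must be used with id" rule is not enforced here.

```python
# Hypothetical helper, not part of PLINK: it only mirrors the descriptor
# rules quoted in the --make-king-table help text above.
KING_TABLE_SETS = ["id", "maybesid", "sid", "nsnp",
                   "hethet", "ibs0", "ibs1", "kinship"]
KING_TABLE_DEFAULT = ["id", "maybesid", "nsnp", "hethet", "ibs0", "kinship"]


def resolve_column_sets(descriptor, supported=KING_TABLE_SETS,
                        default=KING_TABLE_DEFAULT):
    """Return the selected column sets, listed in `supported` order."""
    names = descriptor.split(",")
    if not all(names):
        raise ValueError("empty column set name")

    signed = [name[0] in "+-" for name in names]
    if any(signed) and not all(signed):
        raise ValueError("either every name has a +/- prefix, or none does")

    if all(signed):
        # Form 2: start from the default and add/subtract column sets.
        selected = set(default)
        for name in names:
            column_set = name[1:]
            if column_set not in supported:
                raise ValueError(f"unknown column set: {column_set!r}")
            if name[0] == "+":
                selected.add(column_set)
            else:
                selected.discard(column_set)
    else:
        # Form 1: the listed column sets replace the default entirely.
        for name in names:
            if name not in supported:
                raise ValueError(f"unknown column set: {name!r}")
        selected = set(names)

    # Output order follows the help text (an assumption for this sketch).
    return [s for s in supported if s in selected]


print(resolve_column_sets("id,nsnp,hethet,ibs0,ibs1,kinship"))
print(resolve_column_sets("+ibs1,-maybesid"))
```

Both calls select the same column sets, matching the statement that 'cols=+ibs1,-maybesid' is a shorter way to write the explicit list.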

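The ~46000-sample figure quoted above for regular --pca is consistent with a square sample-by-sample matrix reaching 2^31 elements. The short check below works through that arithmetic; the function names are hypothetical, and the 8-bytes-per-element (double precision) figure used for the memory estimate is an assumption, not something stated above.

```python
import math

ELEMENT_LIMIT = 2**31  # element-count ceiling mentioned in the text


def largest_square_dim(limit):
    """Largest n such that an n x n matrix has at most `limit` elements."""
    return math.isqrt(limit)


def square_matrix_gib(n, bytes_per_element=8):
    """Memory for an n x n matrix in GiB; 8-byte doubles are an assumption."""
    return n * n * bytes_per_element / 2**30


n = largest_square_dim(ELEMENT_LIMIT)
print(n)                                    # 46340
print((n + 1) ** 2 > ELEMENT_LIMIT)         # True
print(round(square_matrix_gib(n + 1), 1))   # 16.0
```

Under these assumptions, a sample-by-sample matrix crosses the 2^31-element mark at about 46341 samples and already needs roughly 16 GiB, which is why the "enough memory" caveat matters for larger datasets.
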
Coming next

  1. Linear regression.
  2. BGEN v1.2 and v1.3 import/export.
  3. Multiallelic variant support.
  4. BCF2 import/export.
  5. Merge. (Once this is operational, a stable version of the .pgen specification will be provided, and PLINK 2.0 beta testing will begin.)

Where's the rest of the online documentation?

It should be available by early April. Meanwhile, "plink2 --help [flag name]" should provide most of the information you need; feel free to ask for further clarification in plink2-users.
