Introduction, downloads

S: 13 Oct 2017 (b4.9)

D: 13 Oct 2017

Recent version history

What's new?

Future development

Limitations

Note to testers

[Jump to search box]

General usage

Citation instructions

Standard data input

PLINK 1 binary (.bed)

Autoconversion behavior

PLINK text (.ped, .tped...)

VCF (.vcf{.gz}, .bcf)

Oxford (.gen{.gz}, .bgen)

23andMe text

Generate random

Unusual chromosome IDs

Recombination map

Phenotypes

Covariates

Clusters of samples

Variant sets

Binary distance matrix

IBD report (.genome)

Input filtering

Sample ID file

Variant ID file

Cluster membership

Set membership

Attribute-based

Chromosomes

SNPs only

Simple variant window

Multiple variant ranges

Sample/variant thinning

Covariates (--filter)

Missing genotypes

Missing phenotypes

Minor allele frequencies

Hardy-Weinberg

Mendel errors

Quality scores

Relationships

Main functions

Data management

--make-bed

--recode

--output-chr

--zero-cluster

--split-x/--merge-x

--set-me-missing

--fill-missing-a2

--set-missing-var-ids

--update-map...

--update-ids...

--flip

--flip-scan

--keep-allele-order...

--indiv-sort

--write-covar...

--{b}merge...

Merge failures

VCF reference merge

--merge-list

--write-snplist

--list-duplicate-vars

Basic statistics

--freq{x}

--missing

--test-mishap

--hardy

--mendel

--het/--ibc

--check-sex/--impute-sex

--fst

Linkage disequilibrium

--indep...

--r/--r2

--show-tags

--blocks

Distance matrices

Identity-by-state/Hamming

  (--distance...)

Relationship/covariance

  (--make-grm-bin...)

--rel-cutoff

Distance-pheno. analysis

  (--ibs-test...)

Identity-by-descent

--genome

--homozyg...

Population stratification

--cluster

--pca

--mds-plot

--neighbour

Association analysis

Basic case/control

  (--assoc, --model)

Stratified case/control

  (--mh, --mh2, --homog)

Quantitative trait

  (--assoc, --gxe)

Regression w/ covariates

  (--linear, --logistic)

--dosage

--lasso

--test-missing

Monte Carlo permutation

Set-based tests

REML additive heritability

Family-based association

--tdt

--dfam

--qfam...

--tucc

Report postprocessing

--annotate

--clump

--gene-report

--meta-analysis

Epistasis

--fast-epistasis

--epistasis

--twolocus

Allelic scoring (--score)

R plugins (--R)

Secondary input

GCTA matrix (.grm.bin...)

Distributed computation

Command-line help

Miscellaneous

Tabs vs. spaces

Flag/parameter reuse

System resource usage

Pseudorandom numbers

Resources

1000 Genomes phase 1

Teaching materials

Gene range lists

Functional SNP attributes

Errors and warnings

Output file list

Order of operations

For developers

GitHub repository

Compilation

Core algorithms

Partial sum lookup

Bit population count

Ternary dot product

Vertical population count

Exact statistical tests

Multithreaded gzip

Adding new functionality

Google groups

plink2-users

plink2-dev

Credits

File formats

Quick index search

Developer information

Source code

Our public GitHub repository is at https://github.com/chrchang/plink-ng. Here's the source code snapshot for our latest posted build, and here's a mirror of PLINK 1.07's source.

PLINK 1.9 is GPLv3+ software: you are free to modify and rerelease it, as long as you do not restrict others' freedom to do the same, and include proper attribution.

Compilation instructions

We haven't tested this across a wide range of machine configurations. If you are running into problems, or if you have experience setting up portable installation scripts, let us know.

(Note that the plink_first_compile script currently puts zlib-1.2.11 under the parent directory; you may want to change this configuration.)

First time on Amazon Linux:

sudo yum install -y gcc gcc-c++ libstdc++ gcc-gfortran glibc glibc-devel make blas-devel lapack lapack-devel atlas-devel

./plink_first_compile

(Ubuntu is similar, but you'll need both libatlas-dev and libatlas-base-dev.)

First time on OS X:

./plink_first_compile

Subsequent compiles on both platforms:

make plink

Due to the need for MinGW-w64, our Windows build process is not as easy to replicate. However, if you have experience with that compiler, you should be able to adapt the 'plinkw' Makefile build target to your machine. We'll respond to requests for assistance as time allows.

Core algorithms

Partial sum lookup
Each entry of a distance/relationship matrix is a sum of per-marker terms. For any specific marker, there are at most seven distinct cases:

  1. Both genotypes are homozygous for the major allele.
  2. One is homozygous major, and the other is heterozygous.
  3. One is homozygous major, and the other is homozygous minor.
  4. Both are heterozygous.
  5. One is heterozygous, and the other is homozygous minor.
  6. Both are homozygous minor.
  7. At least one genotype is missing.

For example, the GCTA genomic relationship matrix is defined by the following per-marker increments (where q is the MAF):

  1. (2-2q)(2-2q) / (2q(1-q))
  2. (2-2q)(1-2q) / (2q(1-q))
  3. (2-2q)(0-2q) / (2q(1-q))
  4. (1-2q)(1-2q) / (2q(1-q))
  5. (1-2q)(0-2q) / (2q(1-q))
  6. (0-2q)(0-2q) / (2q(1-q))
  7. 0 (subtract 1 from the final denominator instead, in another loop)

This suggests the following matrix calculation algorithm, as a first draft:

  1. Initialize all distance/relationship partial sums to zero.
  2. For each marker, calculate and save the seven possible increments in a lookup table, and then refer to the table when updating partial sums. This replaces a floating point divide and several adds/multiplies in the inner loop with a single addition operation.

We can substantially improve on this by handling multiple markers at a time. Since seven cases can be distinguished by three bits, we can compose a sequence of bitwise operations which maps a pair of padded 2-bit PLINK genotypes to seven different 3-bit values in the appropriate manner. On 64-bit machines, 20 3-bit values can be packed into a machine word (for example, let bits 0-2 describe the relation at marker #0, bits 3-5 describe the relation at marker #1, etc., all the way up to bits 57-59 describing the relation at marker #19), so this representation lets us instruct the processor to act on 20 markers simultaneously.

Then, we need to perform the update

  Ajk := Ajk + f0(x0) + f1(x1) + ... + f19(x19)

where the xi's are bit trios, and the fi's map them to increments. This could be done with 20 table lookups and floating point addition operations. Or, the update could be restructured as

  Ajk := Ajk + f{0-4}(x{0-4}) + ... + f{15-19}(x{15-19})

where x{0-4} denotes the lowest-order 15 bits, and f{0-4} maps them directly to f0(x0) + f1(x1) + f2(x2) + f3(x3) + f4(x4); similarly for f{5-9}, f{10-14}, and f{15-19}. In exchange for a bit more precomputation (four tables of size 215 each; total size 1 MB, which isn't too hard on today's L2/L3 caches), this restructuring licenses the use of four table lookups and adds per update instead of twenty.

(Given probabilistic calls, where there's no seven-cases-per-marker restriction, we recommend expressing the result as the sum of a few large matrix multiplications; you can then rely on a machine-optimized dgemm/sgemm to do the heavy lifting.)

Bit population count
If we represent each sample as a packed array of 2-bit genotypes using PLINK's standard binary encoding (00 = zero copies of second allele, 10 = one copy, 11 = two copies), the unweighted genomic distance between two samples is simply the bit population count (Hamming weight) of the result of XORing their arrays. (Missing genotypes can be handled with an additional bitmasking step.)

Most new (post-2008) processors offer a specialized POPCNT instruction to directly evaluate this in hardware. But it turns out that the best portable vector algorithms, refined over nearly 50 years of hacking, can come within 10 percent of the speed of algorithms exploiting the hardware instruction. And unfortunately, reliance on the GCC __builtin_popcountll() intrinsic leads to a ~300% slowdown when POPCNT is not present.

Therefore, 64-bit PLINK relies on a vector algorithm. Building on work and discussion by Andrew Dalke, Robert Harley, Cédric Lauradoux, Terje Mathisen, and Kim Walisch (cf. Dalke's testing, and this comp.arch.arithmetic thread tracing the algorithm's origin to Harley), we developed an implementation that achieves near-optimal performance with only SSE2 instructions, which are supported by all Intel and AMD x86-64 processors dating back to 2003. Even older processors are supported by 32-bit PLINK, which evaluates bit population counts with just integer operations. (See popcount_vecs() and popcount_longs() in plink_common.c.) Either way, when evaluating unweighted distances, this approach is even more efficient than using partial sum lookups.

Many of PLINK's other calculations also have a bit population count, or a minor variant of it, in the innermost loop. Here's an example.

Ternary dot product
Evaluation of the variant correlations required by e.g. --indep mostly boils down to taking the dot product of two ternary vectors. Specifically, we can associate the value -1 with homozygous minor genotypes, 0 with heterozygous genotypes, and +1 with homozygous major; then (assuming no missing values) we can express the observed correlation between two variants in terms of the dot product of the two {-1, 0, 1} vectors, along with a few other terms that are cheaper to compute.

Encoding homozygous minor genotypes as the 2-bit value 00, heterozygous genotypes as 01, and homozygous major as 10, we evaluate the dot product of two encoded genotype vectors x and y as follows:

  1. Set z := (x | y) & 0x5555... (z then notes which variants have at least one heterozygous genotype)
  2. Set w := ((x ^ y) & (0xaaaa... - z)) | z (one way to 'multiply' all variants in parallel)
  3. Sum the components of w (via a slightly modified bit population count)

Handling of missing values is discussed in our upcoming paper. It slows things down by a factor of three or so, but this approach still beats partial sum lookup.

The same basic idea can be used to calculate covariance matrix terms (without variance standardization). However, in this case we have not found a good way to handle missing calls, so partial sum lookups are still employed.

Vertical population count
Consider a 2-dimensional array of 0s and 1s, where columns represent samples, rows represent permutations, every row is guaranteed to have the same number of 1s (representing cases), and each column can be one of four colors (representing homozygote minor, heterozygote, homozygote major, or missing genotype status). The primary operation in a case/control max(T) permutation test involves determining the sum of all columns of a given color, for each color.

Brian Browning noted that it was not necessary to explicitly evaluate all four sums; instead, given three sums, the fourth could be determined via subtraction. This is a key optimization introduced by his PRESTO software, which produces a large speedup on the low-MAF variants that will be increasingly common in tomorrow's datasets. However, it is not compatible with the 'horizontal' bit population count described above (which sums one full row at a time, instead of one column at a time). Therefore, we decided to search for an efficient 'vertical population count' which could be used with Browning's optimization.

One obvious approach involves ordinary parallel adds. Unfortunately, it is necessary to pad the 0/1 matrix entries out to 32 bits in order to avoid integer overflow, meaning only 4 terms can be updated simultaneously by SSE2 instructions; the padding also has a horrible effect on memory bus traffic. Now, you could buy an additional factor of 2 by limiting the number of cases to 65535 and using 16-bit integers, but that's a crude hack which limits scalability... unless...

...unless you have an inner loop which runs to 65535, and an outer loop which uses the 16-bit counts to update an array of 32-bit totals before overflow would actually happen.

PLINK uses a 4-bit inner loop, an 8-bit middle loop, and a 32-bit outer loop instead of just a 16-bit and a 32-bit loop, but the basic idea is the same. In addition, instead of spending even 4 bits per permutation matrix entry, we use bit shifts and masks in the inner loop to efficiently unpack our 1-bit-per-entry storage format. We use a specially reordered representation of the permutation matrix to streamline the inner loop logic.

More precisely, we use an inner loop like this:

  1. Set v := [reordered bit representation of next permutation matrix entries]
  2. a4[0] += v & 0x1111...
  3. a4[1] += (v >> 1) & 0x1111...
  4. a4[2] += (v >> 2) & 0x1111...
  5. a4[3] += (v >> 3) & 0x1111...
  6. Move a4 pointer forward 4 spaces

and a middle loop like so:

  1. Zero out a4[] array
  2. Run inner loop 15 times
  3. a8[0] += a4[0] & 0x0f0f...
  4. a8[1] += (a4[0] >> 4) & 0x0f0f...
  5. Move a8 pointer forward 2 spaces, a4 forward 1
  6. Return to step 3 unless at end of a4 array

This has a lot in common with Edel and Klein's vertical population count.

Hardy-Weinberg equilibrium and Fisher exact tests
The Abecasis Lab's SNP-HWE routines for calculating Hardy-Weinberg equilibrium exact test p-values exploit the fact that the multinomial likelihoods of adjacent contingency tables are related by simple ratios: it is cheap to evaluate the relative likelihoods of many tables at once, as long as the tables are next to each other. However, the original implementation summed tail likelihoods beyond the point of futility: instead of exiting the main loop once all remaining terms were too small to affect the machine representation of the tail total, it kept multiplying and dividing to the very end. PLINK's implementation avoids this, and as a result the time complexity of our algorithm is O(sqrt(n)) instead of O(n) (where n is e.g. the average of the numbers in the contingency table)—there are a lot of vanishingly small terms when n is large. In addition, we have removed the O(n) space requirement by reordering the calculation.

Using the probability distribution's super-geometric decay to put upper bounds on tail sums, we have also developed an even faster routine for comparing a contingency table against a specific p-value.

The Fisher 2x2 exact test has a nearly identical form, which we have applied all the same optimizations to. Our permutation test routines go one step further: comparisons against a known p-value are usually performed by checking whether the upper-left contingency table value is within a precomputed range. This almost eliminates the need to use chi-square approximations, which can be a significant win when rare alleles are involved.

The code for our Fisher-Freeman-Halton 2x3 exact test is an extension of the SNP-HWE strategy to two dimensions, with time complexity O(n), constant space complexity, and small enough coefficients that even tables with entries in the millions can be evaluated within seconds in a web browser on a commodity PC (see the demo link below). This is a very large improvement over Mehta and Patel's network algorithm and its descendants (including e.g. Requena's work), which have been the standard approach for the last 30 years.

Extension to larger table sizes is conceptually straightforward (determine relative likelihoods of the df-ball of tables near enough to the most probable one to not cause a machine precision underflow); the general-case time complexity is O(ndf/2). We expect that the network algorithm is still the best basic approach once there are enough degrees of freedom, but if you're writing any serious software for df ≤ 6 we strongly recommend trying our algorithm.

For your convenience, we have provided GPLv3 interactive JavaScript calculators and C/C++ standalone code utilizing these ideas. (That page also includes an O(sqrt(n)) exact binomial test applying the same idea.)

Multithreaded gzip
For many purposes, compressed text files strike a nice balance between ease of interpretation, loading speed, and resource consumption. However, the computational cost of generating them is fairly high; it's not uncommon for data compression to take longer than all other operations combined. To make a dent in this bottleneck, we have written a simple multithreaded1 compression library function based on Mark Adler's excellent pigz program (which really should be better-known...), and routed all of our gzipping through it. See compressed_pzwrite() in pigz.c for details, and feel free to use it in your own programs.

For multithreaded block-gzip, we use code from htslib 1.2.1.

1: Well, mostly. Since pigz doesn't support Windows yet, we just use single-threaded compression there.

Guidelines for adding new functionality

It's not too hard to add a new flag to PLINK if you're experienced with C programming. Define one or two more variables in main() and extend the argument list of the appropriate downstream function (usually plink()), add command-line parsing logic for your flag in the appropriate alphabetic branch of the big switch() block, and then insert your actual calculation. Some technical notes:

  • Most memory allocation is made off a giant 'stack' set aside when the program starts, using wrappers of the internal wkspace_alloc() function (which guarantees cacheline alignment). It is common for a function to note the initial 'stack' base, allocate some temporary buffers, then use wkspace_reset() to 'free' all the buffers simultaneously on exit.
    Note that this sometimes requires careful sequencing of memory allocations; e.g. any buffers that need to stay around after the function returns should be allocated first. PLINK currently ensures 64MB of regular heap space is available for out-of-order allocations.
  • While PLINK is technically a C++ program, the only C++ dependencies currently present are std::sort() and std::nth_element(). Everything else was written in valid C99, because other C++-specific idioms like I/O streams have a nasty habit of being 5-10x as slow as their C counterparts when you aren't sufficiently careful and knowledgeable. As a result, the program can actually be compiled with gcc, rather than just g++ (it didn't take long to throw in the necessary #ifdefs). This compiler agnosticism may not last much longer—we believe C++11 has enough advantages over C99 to be worth it, and the only thing still in the way of a full conversion to C++11 is poor compiler support on Scientific Linux. But for now, you're free to combine PLINK with either C-specific or C++-specific code.
  • The program works in both 32- and 64-bit environments, mostly due to careful use of the [u]int32_t, [u]intptr_t, and [u]int64_t datatypes, and specification of integer constant datatypes when necessary (e.g. 1LU or 1LLU instead of 1). There's also some vectorized 64-bit code which is shadowed by simpler 32-bit code; while this may seem like a waste of effort, in practice having two implementations of a complex function has been useful for debugging.
  • The code makes heavy use of bit shifts and bitwise operators, since the primary data element has a size of 2 bits. Yes, such code is harder to read and maintain, but the memory efficiency and speed advantages are worth it in most contexts where you'd be motivated to edit PLINK's source. (Presumably there's a reason why you aren't writing an R or Python script instead.)
  • Since the actual genotype table is not kept in memory, most analysis functions include a loop which loads a small data window, processes it, then loads the next data window, etc. Try to use the same strategy. If you can't, you may find the mmap() function handy (especially if you don't actually need your code to run on 32-bit systems).
  • Most other relevant information (including genomic matrix dimensions, bit arrays tracking which variants and samples are excluded from the current analysis, major/minor alleles, and MAFs) is kept in memory.
  • There are goto statements in the code, but in the vast majority of cases they are just there to standardize error-handling. Don't worry, they won't bite.
  • If you want to learn more, you should join our plink2-dev Google group.

Google groups >>