S: 13 Jan 2017 (b3.45) D: 13 Jan 2017
Developer information

Source code

Our public GitHub repository is at https://github.com/chrchang/plink-ng. Here's the source code snapshot for our latest posted build, and here's a mirror of PLINK 1.07's source.

PLINK 1.9 is GPLv3+ software: you are free to modify and rerelease it, as long as you do not restrict others' freedom to do the same, and include proper attribution.

Compilation instructions

We haven't tested this across a wide range of machine configurations. If you are running into problems, or if you have experience setting up portable installation scripts, let us know.

First time on Amazon Linux:

  sudo yum install -y gcc gcc-c++ libstdc++ gcc-gfortran glibc glibc-devel make blas-devel lapack lapack-devel atlas-devel
  ./plink_first_compile

(Ubuntu is similar, but you'll need both libatlas-dev and libatlas-base-dev.)

First time on OS X:

  ./plink_first_compile

Subsequent compiles on both platforms:

  make plink

Due to the need for MinGW-w64, our Windows build process is not as easy to replicate. However, if you have experience with that compiler, you should be able to adapt the 'plinkw' Makefile build target to your machine. We'll respond to requests for assistance as time allows.

Core algorithms

Partial sum lookup
For example, the GCTA genomic relationship matrix is defined by the following per-marker increments (where q is the MAF):
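The increment table itself is not reproduced in this copy. As a reminder, and assuming the standard GCTA definition, the off-diagonal per-marker increment for a pair of distinct samples j and k can be written as below. The symbols x_j and x_k (minor allele counts in {0, 1, 2}) and N_{jk} (number of markers where both calls are present) are notation introduced here for illustration.

```latex
\Delta A_{jk} = \frac{(x_j - 2q)\,(x_k - 2q)}{2q\,(1 - q)},
\qquad
A_{jk} = \frac{1}{N_{jk}} \sum_{\text{markers}} \Delta A_{jk}
```

Under this form, an unordered pair of non-missing genotypes falls into one of six cases, and a pair with a missing call contributes nothing, which appears consistent with the seven cases per marker discussed below.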
This suggests the following matrix calculation algorithm, as a first draft:
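The original listing is not reproduced in this copy. A minimal sketch of what such a first-draft loop might look like follows, assuming the standard GCTA off-diagonal increment (x_j - 2q)(x_k - 2q) / (2q(1 - q)). The function name, the one-byte-per-genotype layout, and the use of 3 as a missing code are all hypothetical choices made for illustration; this is not PLINK's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: geno[m * n_samples + j] is the minor allele count
   (0, 1, or 2) of sample j at marker m, or 3 for a missing call.
   maf[m] is the minor allele frequency q of marker m, with 0 < q < 1.
   grm is an n_samples x n_samples row-major matrix; only the strict
   lower triangle (j > k) is written. */
void grm_first_draft(const unsigned char *geno, const double *maf,
                     size_t n_markers, size_t n_samples, double *grm) {
    assert(n_samples == 0 || (geno != NULL && maf != NULL && grm != NULL));
    for (size_t j = 1; j < n_samples; j++) {
        for (size_t k = 0; k < j; k++) {
            double sum = 0.0;
            size_t n_obs = 0;  /* markers where both calls are present */
            for (size_t m = 0; m < n_markers; m++) {
                unsigned char xj = geno[m * n_samples + j];
                unsigned char xk = geno[m * n_samples + k];
                if (xj > 2 || xk > 2) {
                    continue;  /* skip pairs with a missing call */
                }
                double q = maf[m];
                sum += (xj - 2.0 * q) * (xk - 2.0 * q) / (2.0 * q * (1.0 - q));
                n_obs++;
            }
            grm[j * n_samples + k] = n_obs ? sum / (double)n_obs : 0.0;
        }
    }
}
```

The innermost loop performs one floating point multiply-add chain per marker per sample pair, which is the cost the partial sum lookup is designed to cut.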
We can substantially improve on this by handling multiple markers at a time. Since seven cases can be distinguished by three bits, we can compose a sequence of bitwise operations which maps a pair of padded 2-bit PLINK genotypes to seven different 3-bit values in the appropriate manner. On 64-bit machines, 20 3-bit values can be packed into a machine word (for example, let bits 0-2 describe the relation at marker #0, bits 3-5 describe the relation at marker #1, etc., all the way up to bits 57-59 describing the relation at marker #19), so this representation lets us instruct the processor to act on 20 markers simultaneously.

Then, we need to perform the update

  A_{jk} := A_{jk} + f_{0}(x_{0}) + f_{1}(x_{1}) + ... + f_{19}(x_{19})

where the x_{i}'s are bit trios, and the f_{i}'s map them to increments. This could be done with 20 table lookups and floating point addition operations. Or, the update could be restructured as

  A_{jk} := A_{jk} + f_{{0-4}}(x_{{0-4}}) + ... + f_{{15-19}}(x_{{15-19}})

where x_{{0-4}} denotes the lowest-order 15 bits, and f_{{0-4}} maps them directly to f_{0}(x_{0}) + f_{1}(x_{1}) + f_{2}(x_{2}) + f_{3}(x_{3}) + f_{4}(x_{4}); similarly for f_{{5-9}}, f_{{10-14}}, and f_{{15-19}}.

In exchange for a bit more precomputation (four tables of size 2^{15} each; total size 1 MB, which isn't too hard on today's L2/L3 caches), this restructuring licenses the use of four table lookups and adds per update instead of twenty.

(Given probabilistic calls, where there's no seven-cases-per-marker restriction, we recommend expressing the result as the sum of a few large matrix multiplications; you can then rely on a machine-optimized dgemm/sgemm to do the heavy lifting.)

Bit population count

Most new (post-2008) processors offer a specialized POPCNT instruction to directly evaluate this in hardware.
But it turns out that the best portable vector algorithms, refined over nearly 50 years of hacking, can come within 10 percent of the speed of algorithms exploiting the hardware instruction. And unfortunately, reliance on the GCC __builtin_popcountll() intrinsic leads to a ~300% slowdown when POPCNT is not present. Therefore, 64-bit PLINK relies on a vector algorithm. Building on work and discussion by Andrew Dalke, Robert Harley, Cédric Lauradoux, Terje Mathisen, and Kim Walisch (cf. Dalke's testing, and this comp.arch.arithmetic thread tracing the algorithm's origin to Harley), we developed an implementation that achieves near-optimal performance with only SSE2 instructions, which are supported by all Intel and AMD x86-64 processors dating back to 2003. Even older processors are supported by 32-bit PLINK, which evaluates bit population counts with just integer operations. (See popcount_vecs() and popcount_longs() in plink_common.c.) Either way, when evaluating unweighted distances, this approach is even more efficient than using partial sum lookups.

Many of PLINK's other calculations also have a bit population count, or a minor variant of it, in the innermost loop. Here's an example.

Ternary dot product

Encoding homozygous minor genotypes as the 2-bit value 00, heterozygous genotypes as 01, and homozygous major as 10, we evaluate the dot product of two encoded genotype vectors x and y as follows:
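The original listing is not reproduced in this copy. One way to realize such a dot product with integer operations only is sketched below. It assumes the three codes stand for -1, 0, and +1 respectively, that no missing (11) codes are present, and that unused trailing slots in the last word are padded with 01 so they contribute zero. The function names are hypothetical, and this is a sketch rather than PLINK's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the 32 two-bit fields of w (each 0..3); the result is at most 96. */
static unsigned popcount2(uint64_t w) {
    /* Pairs of 2-bit fields -> 4-bit sums (each at most 6). */
    w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
    /* Pairs of 4-bit sums -> one sum per byte (each at most 12). */
    w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* The multiply accumulates all eight bytes into the top byte. */
    return (unsigned)((w * 0x0101010101010101ULL) >> 56);
}

/* Dot product of two genotype vectors packed 32 per word, with
   00 = -1, 01 = 0, 10 = +1 (assumed interpretation). */
int64_t ternary_dot(const uint64_t *x, const uint64_t *y, size_t n_words) {
    const uint64_t lo = 0x5555555555555555ULL;  /* low bit of each field */
    const uint64_t hi = 0xaaaaaaaaaaaaaaaaULL;  /* high bit of each field */
    int64_t sum = 0;
    assert(n_words == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n_words; i++) {
        /* z marks fields where at least one genotype is heterozygous. */
        uint64_t z = (x[i] | y[i]) & lo;
        /* Each field becomes 1 - (product): 0 if equal homozygotes,
           2 if opposite homozygotes, 1 if either is heterozygous. */
        sum += popcount2(((x[i] ^ y[i]) & (hi - z)) | z);
    }
    /* The sum of (1 - product) over all slots is (slot count) - dot. */
    return 32 * (int64_t)n_words - sum;
}
```

The subtraction hi - z never borrows across fields, because each 2-bit field of hi (10) is at least as large as the matching field of z (00 or 01).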
Handling of missing values is discussed in our upcoming paper. It slows things down by a factor of three or so, but this approach still beats partial sum lookup.

The same basic idea can be used to calculate covariance matrix terms (without variance standardization). However, in this case we have not found a good way to handle missing calls, so partial sum lookups are still employed.

Vertical population count

Brian Browning noted that it was not necessary to explicitly evaluate all four sums; instead, given three sums, the fourth could be determined via subtraction. This is a key optimization introduced by his PRESTO software, which produces a large speedup on the low-MAF variants that will be increasingly common in tomorrow's datasets. However, it is not compatible with the 'horizontal' bit population count described above (which sums one full row at a time, instead of one column at a time).

Therefore, we decided to search for an efficient 'vertical population count' which could be used with Browning's optimization. One obvious approach involves ordinary parallel adds. Unfortunately, it is necessary to pad the 0/1 matrix entries out to 32 bits in order to avoid integer overflow, meaning only 4 terms can be updated simultaneously by SSE2 instructions; the padding also has a horrible effect on memory bus traffic.

Now, you could buy an additional factor of 2 by limiting the number of cases to 65535 and using 16-bit integers, but that's a crude hack which limits scalability... unless...

...unless you have an inner loop which runs to 65535, and an outer loop which uses the 16-bit counts to update an array of 32-bit totals before overflow would actually happen. PLINK uses a 4-bit inner loop, an 8-bit middle loop, and a 32-bit outer loop instead of just a 16-bit and a 32-bit loop, but the basic idea is the same.
In addition, instead of spending even 4 bits per permutation matrix entry, we use bit shifts and masks in the inner loop to efficiently unpack our 1-bit-per-entry storage format. We use a specially reordered representation of the permutation matrix to streamline the inner loop logic. More precisely, we use an inner loop like this:
and a middle loop like so:
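The original loop listings are not reproduced in this copy. The sketch below illustrates the general 4-bit / 8-bit / 32-bit widening scheme on plain 64-bit words, assuming each input word holds one row of 64 one-bit entries in natural bit order. It does not use SSE2 or the reordered matrix layout mentioned above, and the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MASK_1_IN_4 0x1111111111111111ULL
#define MASK_4_IN_8 0x0f0f0f0f0f0f0f0fULL

/* Middle step: widen the sixteen 4-bit counters held in each acc4[s]
   into 8-bit counters, then clear the 4-bit counters. */
static void flush4(uint64_t acc4[4], uint64_t acc8[4][2]) {
    for (int s = 0; s < 4; s++) {
        acc8[s][0] += acc4[s] & MASK_4_IN_8;
        acc8[s][1] += (acc4[s] >> 4) & MASK_4_IN_8;
        acc4[s] = 0;
    }
}

/* Outer step: add the 8-bit counters into the 32-bit totals.
   Byte b of acc8[s][p] counts column 8*b + 4*p + s. */
static void flush8(uint64_t acc8[4][2], uint32_t totals[64]) {
    for (int s = 0; s < 4; s++) {
        for (int p = 0; p < 2; p++) {
            for (int b = 0; b < 8; b++) {
                totals[8 * b + 4 * p + s] +=
                    (uint32_t)((acc8[s][p] >> (8 * b)) & 0xff);
            }
            acc8[s][p] = 0;
        }
    }
}

/* totals[c] = number of rows whose bit c is set. */
void vertical_popcount(const uint64_t *rows, size_t n_rows,
                       uint32_t totals[64]) {
    uint64_t acc4[4] = {0, 0, 0, 0};
    uint64_t acc8[4][2] = {{0, 0}, {0, 0}, {0, 0}, {0, 0}};
    unsigned in4 = 0, in8 = 0;
    assert(n_rows == 0 || rows != NULL);
    memset(totals, 0, 64 * sizeof(uint32_t));
    for (size_t i = 0; i < n_rows; i++) {
        /* Inner step: shift and mask so each column feeds one nibble. */
        for (int s = 0; s < 4; s++) {
            acc4[s] += (rows[i] >> s) & MASK_1_IN_4;
        }
        if (++in4 == 15) {       /* a nibble can hold at most 15 */
            flush4(acc4, acc8);
            in4 = 0;
            if (++in8 == 17) {   /* 17 * 15 = 255, the 8-bit limit */
                flush8(acc8, totals);
                in8 = 0;
            }
        }
    }
    flush4(acc4, acc8);          /* at most 16 * 15 + 14 = 254 per byte */
    flush8(acc8, totals);
}
```

The expensive per-column widening in flush8 runs only once per 255 rows, so nearly all of the work is the four shift, mask, and add operations of the inner step.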
This has a lot in common with Edel and Klein's vertical population count.

Hardy-Weinberg equilibrium and Fisher exact tests

Using the probability distribution's super-geometric decay to put upper bounds on tail sums, we have also developed an even faster routine for comparing a contingency table against a specific p-value.

The Fisher 2x2 exact test has a nearly identical form, which we have applied all the same optimizations to. Our permutation test routines go one step further: comparisons against a known p-value are usually performed by checking whether the upper-left contingency table value is within a precomputed range. This almost eliminates the need to use chi-square approximations, which can be a significant win when rare alleles are involved.

The code for our Fisher-Freeman-Halton 2x3 exact test is an extension of the SNP-HWE strategy to two dimensions, with time complexity O(n), constant space complexity, and small enough coefficients that even tables with entries in the millions can be evaluated within seconds in a web browser on a commodity PC (see the demo link below). This is a very large improvement over Mehta and Patel's network algorithm and its descendants (including e.g. Requena's work), which have been the standard approach for the last 30 years. Extension to larger table sizes is conceptually straightforward (determine relative likelihoods of the df-ball of tables near enough to the most probable one to not cause a machine precision underflow); the general-case time complexity is O(n^{df/2}). We expect that the network algorithm is still the best basic approach once there are enough degrees of freedom, but if you're writing any serious software for df ≤ 6 we strongly recommend trying our algorithm.

For your convenience, we have provided GPLv3 interactive JavaScript calculators and C/C++ standalone code utilizing these ideas. (That page also includes an O(sqrt(n)) exact binomial test applying the same idea.)
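To illustrate the relative-likelihood idea in its simplest setting, here is a minimal sketch of a two-sided Fisher 2x2 exact test that never evaluates a factorial: it walks outward from the observed table using the ratio between neighboring table probabilities. The function name and the tie tolerance are assumptions made for illustration. Unlike the routines described above, this sketch starts from the observed table rather than the most probable one and has no tail-bound early exit, so it runs in O(n) time and can lose precision on very large, lopsided tables.

```c
#include <stdint.h>

/* Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
   the total probability of all tables with the same margins that are
   no more likely than the observed one. */
double fisher22_sketch(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    const double tol = 1.0 + 1e-9;  /* treat near-ties as ties */
    double total = 1.0;  /* observed table has relative likelihood 1 */
    double tail = 1.0;
    double lik = 1.0;
    double aa = a, bb = b, cc = c, dd = d;
    /* Increase the upper-left cell: P(a+1)/P(a) = b*c / ((a+1)*(d+1)). */
    while (bb > 0.5 && cc > 0.5) {
        lik = lik * bb * cc / ((aa + 1.0) * (dd + 1.0));
        aa += 1.0; bb -= 1.0; cc -= 1.0; dd += 1.0;
        total += lik;
        if (lik <= tol) {
            tail += lik;
        }
    }
    /* Decrease the upper-left cell: P(a-1)/P(a) = a*d / ((b+1)*(c+1)). */
    lik = 1.0;
    aa = a; bb = b; cc = c; dd = d;
    while (aa > 0.5 && dd > 0.5) {
        lik = lik * aa * dd / ((bb + 1.0) * (cc + 1.0));
        aa -= 1.0; bb += 1.0; cc += 1.0; dd -= 1.0;
        total += lik;
        if (lik <= tol) {
            tail += lik;
        }
    }
    return tail / total;
}
```

For the table [[3, 1], [1, 3]], the five possible tables have probabilities proportional to 1, 16, 36, 16, 1, so the sketch returns 34/70. Only the running ratio and two accumulators are kept, which is what gives the constant space complexity.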
Multithreaded gzip

For multithreaded block-gzip, we use code from htslib 1.2.1.

1: Well, mostly. Since pigz doesn't support Windows yet, we just use single-threaded compression there.

Guidelines for adding new functionality

It's not too hard to add a new flag to PLINK if you're experienced with C programming. Define one or two more variables in main() and extend the argument list of the appropriate downstream function (usually plink()), add command-line parsing logic for your flag in the appropriate alphabetic branch of the big switch() block, and then insert your actual calculation. Some technical notes:
