S: 4 Aug 2024 (b7.3) D: 4 Aug 2024 Main functions (--distance...) (--make-grm-bin...) (--ibs-test...) (--assoc, --model) (--mh, --mh2, --homog) (--assoc, --gxe) (--linear, --logistic)
Quick index search |
## Developer information## Source codeOur public GitHub repository is at https://github.com/chrchang/plink-ng. Here's the source code snapshot for our latest posted build, and here's a mirror of PLINK 1.07's source. PLINK 1.9 is GPLv3+ software: you are free to modify and rerelease it, as long as you do not restrict others' freedom to do the same, and include proper attribution. ## Compilation instructionsWe haven't tested this across a wide range of machine configurations. If you are running into problems, or if you have experience setting up portable installation scripts, let us know. (Note that the plink_first_compile script currently puts zlib-1.3 under the parent directory; you may want to change this configuration.) First time on Amazon Linux: sudo yum install -y gcc gcc-c++ libstdc++ gcc-gfortran glibc glibc-devel make blas-devel lapack lapack-devel atlas-devel ./plink_first_compile (Ubuntu is similar, but you'll need both libatlas-dev and libatlas-base-dev.) First time on macOS: ./plink_first_compile Subsequent compiles on both platforms: make plink Due to the need for MinGW-w64, our Windows build process is not as easy to replicate. However, if you have experience with that compiler, you should be able to adapt the 'plinkw' Makefile build target to your machine. We'll respond to requests for assistance as time allows. ## Core algorithms
- Both genotypes are homozygous for the major allele.
- One is homozygous major, and the other is heterozygous.
- One is homozygous major, and the other is homozygous minor.
- Both are heterozygous.
- One is heterozygous, and the other is homozygous minor.
- Both are homozygous minor.
- At least one genotype is missing.
For example, the GCTA genomic relationship matrix is defined by the following per-marker increments (where - (2-2
**q**)(2-2**q**) / (2**q**(1-**q**)) - (2-2
**q**)(1-2**q**) / (2**q**(1-**q**)) - (2-2
**q**)(0-2**q**) / (2**q**(1-**q**)) - (1-2
**q**)(1-2**q**) / (2**q**(1-**q**)) - (1-2
**q**)(0-2**q**) / (2**q**(1-**q**)) - (0-2
**q**)(0-2**q**) / (2**q**(1-**q**)) - 0 (subtract 1 from the final denominator instead, in another loop)
This suggests the following matrix calculation algorithm, as a first draft: - Initialize all distance/relationship partial sums to zero.
- For each marker, calculate and save the seven possible increments in a lookup table, and then refer to the table when updating partial sums. This replaces a floating point divide and several adds/multiplies in the inner loop with a single addition operation.
We can substantially improve on this by handling multiple markers at a time. Since seven cases can be distinguished by three bits, we can compose a sequence of bitwise operations which maps a pair of padded 2-bit PLINK genotypes to seven different 3-bit values in the appropriate manner. On 64-bit machines, 20 3-bit values can be packed into a machine word (for example, let bits 0-2 describe the relation at marker #0, bits 3-5 describe the relation at marker #1, etc., all the way up to bits 57-59 describing the relation at marker #19), so this representation lets us instruct the processor to act on 20 markers simultaneously. Then, we need to perform the update A where the x A where x (Given probabilistic calls, where there's no seven-cases-per-marker restriction, we recommend expressing the result as the sum of a few large matrix multiplications; you can then rely on a machine-optimized dgemm/sgemm to do the heavy lifting.)
Most new (post-2008) processors offer a specialized POPCNT instruction to directly evaluate this in hardware. But it turns out that the best portable vector algorithms, refined over nearly 50 years of hacking, can come within 10 percent of the speed of algorithms exploiting the hardware instruction. And unfortunately, reliance on the GCC __builtin_popcountll() intrinsic leads to a ~300% slowdown when POPCNT is not present. Therefore, 64-bit PLINK relies on a vector algorithm. Building on work and discussion by Andrew Dalke, Robert Harley, Cédric Lauradoux, Terje Mathisen, and Kim Walisch (cf. Dalke's testing, and this comp.arch.arithmetic thread tracing the algorithm's origin to Harley), we developed an implementation that achieves near-optimal performance with only SSE2 instructions, which are supported by all Intel and AMD x86-64 processors dating back to 2003. Even older processors are supported by 32-bit PLINK, which evaluates bit population counts with just integer operations. (See popcount_vecs() and popcount_longs() in plink_common.c.) Either way, when evaluating unweighted distances, this approach is even more efficient than using partial sum lookups. Many of PLINK's other calculations also have a bit population count, or a minor variant of it, in the innermost loop. Here's an example.
Encoding homozygous minor genotypes as the 2-bit value 00, heterozygous genotypes as 01, and homozygous major as 10, we evaluate the dot product of two encoded genotype vectors - Set
**z**:= (**x**|**y**) & 0x5555... (**z**then notes which variants have at least one heterozygous genotype) - Set
**w**:= ((**x**^**y**) & (0xaaaa... -**z**)) |**z**(one way to 'multiply' all variants in parallel) - Sum the components of
**w**(via a slightly modified bit population count)
Handling of missing values is discussed in our upcoming paper. It slows things down by a factor of three or so, but this approach still beats partial sum lookup. The same basic idea can be used to calculate covariance matrix terms (without variance standardization). However, in this case we have not found a good way to handle missing calls, so partial sum lookups are still employed.
Brian Browning noted that it was not necessary to explicitly evaluate all four sums; instead, given three sums, the fourth could be determined via subtraction. This is a key optimization introduced by his PRESTO software, which produces a large speedup on the low-MAF variants that will be increasingly common in tomorrow's datasets. However, it is not compatible with the 'horizontal' bit population count described above (which sums one full row at a time, instead of one column at a time). Therefore, we decided to search for an efficient "vertical population count" which could be used with Browning's optimization. One obvious approach involves ordinary parallel adds. Unfortunately, it is necessary to pad the 0/1 matrix entries out to 32 bits in order to avoid integer overflow, meaning only 4 terms can be updated simultaneously by SSE2 instructions; the padding also has a horrible effect on memory bus traffic. Now, you could buy an additional factor of 2 by limiting the number of cases to 65535 and using 16-bit integers, but that's a crude hack which limits scalability... unless... ...unless you have an inner loop which runs to 65535, and an outer loop which uses the 16-bit counts to update an array of 32-bit totals before overflow would actually happen. PLINK uses a 4-bit inner loop, an 8-bit middle loop, and a 32-bit outer loop instead of just a 16-bit and a 32-bit loop, but the basic idea is the same. In addition, instead of spending even 4 bits per permutation matrix entry, we use bit shifts and masks in the inner loop to efficiently unpack our 1-bit-per-entry storage format. We use a specially reordered representation of the permutation matrix to streamline the inner loop logic. More precisely, we use an inner loop like this: - Set
**v**:= [reordered bit representation of next permutation matrix entries] - a4[0] +=
**v**& 0x1111... - a4[1] += (
**v**>> 1) & 0x1111... - a4[2] += (
**v**>> 2) & 0x1111... - a4[3] += (
**v**>> 3) & 0x1111... - Move a4 pointer forward 4 spaces
and a middle loop like so: - Zero out a4[] array
- Run inner loop 15 times
- a8[0] += a4[0] & 0x0f0f...
- a8[1] += (a4[0] >> 4) & 0x0f0f...
- Move a8 pointer forward 2 spaces, a4 forward 1
- Return to step 3 unless at end of a4 array
This has a lot in common with Edel and Klein's vertical population count.
Using the probability distribution's super-geometric decay to put upper bounds on tail sums, we have also developed an even faster routine for comparing a contingency table against a specific p-value. The Fisher 2x2 exact test has a nearly identical form, which we have applied all the same optimizations to. Our permutation test routines go one step further: comparisons against a known p-value are usually performed by checking whether the upper-left contingency table value is within a precomputed range. This almost eliminates the need to use chi-square approximations, which can be a significant win when rare alleles are involved. The code for our Fisher-Freeman-Halton 2x3 exact test is an extension of the SNP-HWE strategy to two dimensions, with time complexity O( Extension to larger table sizes is conceptually straightforward (determine relative likelihoods of the df-ball of tables near enough to the most probable one to not cause a machine precision underflow); the general-case time complexity is O( For your convenience, we have provided GPLv3 interactive JavaScript calculators and C/C++ standalone code utilizing these ideas. (That page also includes an O(sqrt(
For multithreaded block-gzip, we use code from htslib 1.2.1. 1: Well, mostly. Since pigz doesn't support Windows yet, we just use single-threaded compression there. ## Guidelines for adding new functionalityIt's not too hard to add a new flag to PLINK if you're experienced with C programming. Define one or two more variables in main() and extend the argument list of the appropriate downstream function (usually plink()), add command-line parsing logic for your flag in the appropriate alphabetic branch of the big switch() block, and then insert your actual calculation. Some technical notes: - Most memory allocation is made off a giant 'stack' set aside when the program starts, using wrappers of the internal bigstack_alloc() function (which guarantees cacheline alignment). It is common for a function to note the initial 'stack' base, allocate some temporary buffers, then use bigstack_reset() to 'free' all the buffers simultaneously on exit.
Note that this sometimes requires careful sequencing of memory allocations; e.g. any buffers that need to stay around after the function returns should be allocated first. PLINK currently ensures 64 MiB of regular heap address space is available for out-of-order allocations. - While PLINK is technically a C++ program, the only C++ dependencies currently present are std::sort() and std::nth_element(). Everything else was written in valid C99, because other C++-specific idioms like I/O streams have a nasty habit of being 5-10x as slow as their C counterparts when you aren't sufficiently careful and knowledgeable. As a result, the program can actually be compiled with gcc, rather than just g++ (it didn't take long to throw in the necessary #ifdefs). This compiler agnosticism may not last much longer—we believe C++11 has enough advantages over C99 to be worth it, and the only thing still in the way of a full conversion to C++11 is poor compiler support on Scientific Linux. But for now, you're free to combine PLINK with either C-specific or C++-specific code.
- The program works in both 32- and 64-bit environments, mostly due to careful use of the [u]int32_t, [u]intptr_t, and [u]int64_t datatypes, and specification of integer constant datatypes when necessary (e.g. 1LU or 1LLU instead of 1). There's also some vectorized 64-bit code which is shadowed by simpler 32-bit code; while this may seem like a waste of effort, in practice having two implementations of a complex function has been useful for debugging.
- The code makes heavy use of bit shifts and bitwise operators, since the primary data element has a size of 2 bits. Yes, such code is harder to read and maintain, but the memory efficiency and speed advantages are worth it in most contexts where you'd be motivated to edit PLINK's source. (Presumably there's a reason why you aren't writing an R or Python script instead.)
- Since the actual genotype table is not kept in memory, most analysis functions include a loop which loads a small data window, processes it, then loads the next data window, etc. Try to use the same strategy. If you can't, you may find the mmap() function handy (especially if you don't actually need your code to run on 32-bit systems).
- Most other relevant information (including genomic matrix dimensions, bit arrays tracking which variants and samples are excluded from the current analysis, major/minor alleles, and MAFs) is kept in memory.
- There are goto statements in the code, but in the vast majority of cases they are just there to standardize error-handling. Don't worry, they won't bite.
If you want to learn more, you are encouraged to watch the GitHub repository. Feel free to open discussions about the code on its issues page. |