
Developer information

Source code

The PLINK 2.0 codebase is at https://github.com/chrchang/plink-ng/tree/master/2.0/; a source code snapshot of the latest posted build is also available.

PLINK 2.0's main application code is GPLv3+ software: you are free to modify and rerelease it, as long as you do not restrict others' freedom to do the same, and include proper attribution.

The codebase also includes slightly modified copies of several third-party libraries, such as the compression libraries behind the .gz/.zst/BGZF support described below; see the repository for the full list and their licenses.

The include/ subdirectory is LGPLv3+-licensed. The following components in that subdirectory may be of interest:

  • plink2_text provides a pair of classes designed to replace std::getline(), fgets(), and similar ways of iterating over text lines. They have the following properties:
    • Instead of copying every line to your buffer, one at a time, these classes just return a pointer to the beginning of each line in the underlying binary stream, and give you access to a pointer to the end. The catch is that the line is invalidated when you iterate to the next one; it's like being forced to pass the same string to std::getline(), or the same buffer to fgets(), on every call. But whenever that's problematic, you can always copy the line before iterating to the next; on all systems I've seen, that still exhibits better throughput than getline/fgets. And in the many situations where there's no need to copy, you get a fundamentally lower-latency abstraction.
    • They automatically detect and decompress gzipped and Zstd-compressed files, in a manner that works with pipe file descriptors.
    • The primary TextStream class automatically reads and decompresses ahead for you. Decompression is even multithreaded by default when the file is BGZF-compressed. (And the textFILE class covers the setting where you don't want to launch any more threads.)
    • They do not support network input as of this writing, but that would not be difficult to add. The existing code uses FILE* in a very straightforward manner.
    • As for text parsing, the ScanadvDouble() utility function in plink2_string is a very efficient string-to-double converter. While it does not support perfect string<->double round-trips (that's what C++17 std::from_chars is for; GCC 11+ and Abseil have working implementations while we wait for clang...), or long-tail features like locale-specific decimal separators or hex floats, it has been incredibly useful for speeding up the basic job of scanning standard-locale printf("%g")-formatted and similar output. (Note that you lose roughly a billion times as much accuracy to %g's default 6-digit limit as you do to imperfect string->double conversion in that setting.)
  • pgenlib supports reading and writing of PLINK 2.x genotype files (".pgen"). A draft specification for this format is under https://github.com/chrchang/plink-ng/tree/master/pgen_spec/; here are some key properties:
    • A PLINK 1 .bed is a valid .pgen.
    • In addition, .pgen can represent multiallelic, phased, and/or dosage information. As of this writing, software support for multiallelic dosages does not exist yet, but it does for the other attribute pairs (multiallelic+phased, phased+dosage).
    • .pgen CANNOT represent genotype probability triplets. It also cannot store read depths, per-call quality scores, etc. While PLINK 2 can filter on the aforementioned BGEN/VCF fields during import, it cannot re-export or do anything else with them. Use other software, such as bcftools or qctool v2, when you must retain any of these fields.
    • .pgen is compressed, but in a domain-specific manner that supports very fast compression and decompression. It is even practical to perform several key computations (e.g. allele frequency) directly on the compressed representation, and this capability is exposed by the pgenlib library.
  • plink2_stats includes a function for computing the 2x2 Fisher's exact test p-value in approximately O(sqrt(n)) time—much faster than the O(n) algorithms employed by other libraries as of this writing—as well as several log-p-value computations (Z-score/chi-square, T-test, F-test) that remain accurate well beyond the limits of analogous functions in many other libraries. (No, you don't want to take a 10^-1000000 p-value literally, but it can be useful to distinguish it from 10^-325, and both of these numbers can naturally arise when analyzing biobank-scale data.)

The Python/ subdirectory includes a basic Python .pgen read/write library (also available on PyPI), and the pgenlibr/ subdirectory includes a basic R .pgen reader.

Compilation instructions

Either clang or most versions of gcc can be used to compile PLINK 2. There is an important exception: gcc 8.3 has a known miscompilation bug that affects PLINK 2, so it is necessary to either upgrade to gcc 8.4+ or downgrade to gcc 7.x to work around it.

The build_dynamic/ subdirectory contains a Makefile suitable for Linux and macOS dynamic builds. Zlib is assumed to be installed. On Linux, if Intel MKL is installed using the instructions at e.g. https://www.intel.com/content/www/us/en/developer/articles/guide/installing-free-libraries-and-python-apt-repo.html, you can dynamically link to it; otherwise, LAPACK and ATLAS (liblapack-dev and libatlas-base-dev packages on Ubuntu) should be installed first.

The build_win/ subdirectory contains a Makefile for producing a static Windows build. This requires MinGW[-w64] and zlib; a prebuilt OpenBLAS package from https://sourceforge.net/projects/openblas/files/ is also strongly recommended.

The build_cuda/ subdirectory contains a Makefile for producing an Nvidia-GPU-enabled build on Linux. Note that there is almost no GPU code for now; this is really just a proof of concept.

Guidelines for adding new functionality

It's not too hard to add a new command to PLINK 2 if you're experienced with C programming. Here are the main things to take care of along the way:

  • Most command-line parsing happens in a big switch() statement in plink2.cc's main(), with flag names grouped by first letter but otherwise not in a rigid order. This code performs initial validation of flag arguments, saves them to member(s) of the main Plink2CmdlineStruct variable, sets the command's bit in the command_flags1 member, and updates the filter_flags and/or dependency_flags members of that variable when necessary.
    • You must define a new Command1Flags entry for your command.
    • You'll usually want to add a member or two to Plink2CmdlineStruct. Make sure to initialize them properly, and add cleanup code if appropriate.
  • There are several intermediate computations, mostly related to allele/genotype frequencies, which are only conditionally executed. See DecentAlleleFreqsAreNeeded(), MajAllelesAreNeeded(), etc. for the condition checks, and add your flag to them as necessary.
  • Add a call to the function actually implementing the command toward the end of Plink2Core(), and add an #include statement to the top of plink2.cc if necessary.
    • Update Makefile.src when adding new source code file(s).
    • The only significant nonconstant global variables are:
      • g_bigstack_base and g_bigstack_end, the bounds of the currently-free region in the middle of the main memory arena.
      • g_logfile and g_logbuf, for logging.
      • g_textbuf, a ~256 KiB buffer intended for small-scale I/O.
      Everything else, including input dataset contents, is explicitly passed between functions; this is a bit verbose, but makes data dependencies clearer.

Some technical notes about how existing flags are implemented:

  • Most memory allocation is done through a "double-ended stack allocator".
    It is common for a simple function to note the initial g_bigstack_base value, perform allocations with typed wrappers of bigstack_alloc() (which guarantees cacheline alignment), and then call BigstackReset() at the end to free the allocations.
    More complex memory usage patterns, where some allocations must be freed mid-computation or returned to the caller, can often be handled efficiently by carefully ordering the allocations and taking advantage of the other end of the memory arena (see g_bigstack_end, bigstack_end_alloc(), and BigstackEndReset()). This does take a bit of getting used to, and doesn't quite cover everything, so PLINK 2 also ensures 64 MiB of regular heap address space is available for out-of-order allocations.
  • PLINK 2 is a C++ program, but it only uses C++ as a "better C", to the point that every use of a C++-specific feature has a valid (though occasionally less efficient) C99 fallback, and you can compile the program with either gcc or g++. This has the minor practical benefit of simplifying FFI development, but the main advantage of this style is the performance footguns it gets rid of. There are many C++ features (e.g. I/O streams) that have a nasty habit of being 5-10x as slow as their C counterparts when you aren't sufficiently careful and knowledgeable.
  • The program works in both 32- and 64-bit environments, mostly due to careful use of the [u]int32_t, [u]intptr_t, and [u]int64_t datatypes, and specification of integer constant datatypes when necessary (e.g. 1LU or 1LLU instead of 1). There's also some vectorized 64-bit code which is shadowed by simpler 32-bit code; while this may seem like a waste of effort, in practice having two implementations of a complex function has been useful for debugging.
  • The code makes heavy use of bit shifts and bitwise operators, since the primary data element has a size of 2 bits. Yes, such code is harder to read and maintain, but the memory efficiency and speed advantages are worth it in most contexts where you'd be motivated to edit PLINK 2's source. (Presumably there's a reason why you aren't writing an R or Python script instead.)
  • Since the actual genotype table is not kept in memory, most analysis functions include a loop which loads a small data window, processes it, then loads the next data window, etc. Try to use the same strategy.
  • Most other relevant information (including genomic matrix dimensions, bit arrays tracking which variants and samples are excluded from the current analysis, major/minor alleles, and MAFs) is kept in memory.
  • There are goto statements in the code, but in the vast majority of cases they are just there to standardize error-handling. Don't worry, they won't bite.
  • If you want to learn more, you should join our plink2-dev Google group.
