Introduction, downloads

D: 18 Aug 2024

PLINK 2.00 alpha

PLINK 2.0 alpha was developed by Christopher Chang, with support from Human Longevity, Inc. in 2016-17, and substantial input from Stanford's Department of Biomedical Data Science. (More detailed credits.) (Usage questions should be sent to the plink2-users Google group, not Christopher's email.)

Binary downloads

                        Build
Operating system        Development (18 Aug)    Alpha 5.14 final (20 Aug)
Linux AVX2 Intel [1]    download                download
Linux AVX2 AMD [1]      download                download
Linux 64-bit Intel [1]  download                download
Linux 32-bit            download                download
macOS M1                download                download
macOS AVX2              download                download
macOS 64-bit            download                download
Windows AVX2            download                download
Windows 64-bit          download                download
Windows 32-bit          download                download

1: Intel-labeled builds can still run on AMD processors, and vice versa, but they're statically linked to linear algebra libraries optimized for the labeled vendor.

Source code, compilation instructions, and the like are on the developer page.

Recent version history

20 Aug 2024: Backported Nov 2023 --glm no-firth covariate-preprocessing bugfix to alpha 5.

18 Aug: --vcf now prints a warning (to be upgraded to an error in a future build) when importing a single-sample VCF that appears to have all hom-REF genotypes filtered out; this can be disabled with --vcf-allow-no-nonvar. This behavior has been backported to alpha 5.
--r[2]-[un]phased now explicitly reports active filters when generating tabular output.

6 Aug: --export 'bgen-omit-sample-id-block' modifier implemented.

4 Aug: .ped and .map lines starting with '#' are now consistently treated as comments, following hint #2 in the original PED documentation. We have updated the respective PLINK 1.9 and 2.0 "File formats" entries, and apologize for overlooking this earlier.

4 Jul: Fixed multiallelic-variant REF/ALT permute bug in previous build. INFO Number=A and Number=R entries are now updated in an appropriate manner when REF/ALT alleles are permuted. --alt-allele flag added, controlling ordering of all rather than just the first ALT allele. --maj-ref now sorts all alleles by frequency, rather than just setting REF=highest-frequency and ALT1=next-highest. --maj-ref + --read-freq combination now prohibited, to reduce potential for confusion (--maj-ref just looks at the current dataset, --read-freq has never affected it).
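
As a rough illustration of the frequency-based allele reordering described in this entry, the following Python sketch sorts all alleles of one variant by descending count and applies the same permutation to per-allele (Number=R style) annotation values. It is not PLINK 2's implementation: the function name, the hypothetical INFO key "XR", and the tie-breaking rule (ties keep their original order) are assumptions made for the example, and Number=A handling is deliberately left out.

```python
from typing import Dict, List, Tuple


def sort_alleles_by_count(
    alleles: List[str],
    counts: List[int],
    info_r: Dict[str, List[str]],
) -> Tuple[List[str], List[int], Dict[str, List[str]]]:
    """Reorder alleles by descending count and permute per-allele values to match.

    info_r maps a (hypothetical) Number=R key to one value per allele,
    with the REF allele's value first.
    """
    if len(alleles) != len(counts):
        raise ValueError("alleles and counts must have the same length")
    for key, values in info_r.items():
        if len(values) != len(alleles):
            raise ValueError(f"INFO key {key!r} needs one value per allele")
    # sorted() is stable, so alleles with equal counts keep their original order
    # (an assumption of this sketch, not a documented PLINK 2 rule).
    order = sorted(range(len(alleles)), key=lambda i: -counts[i])
    new_alleles = [alleles[i] for i in order]
    new_counts = [counts[i] for i in order]
    new_info = {key: [values[i] for i in order] for key, values in info_r.items()}
    return new_alleles, new_counts, new_info


# Original order: REF=A, ALT1=G, ALT2=T.
example_alleles, example_counts, example_info = sort_alleles_by_count(
    ["A", "G", "T"], [10, 70, 20], {"XR": ["a", "g", "t"]}
)
print(example_alleles, example_counts, example_info)
```

After the call, the most common allele sits in the REF slot and the remaining alleles follow in frequency order, with the per-allele values still lined up with their alleles.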

25 Jun: --ref-allele/--alt1-allele/--ref-from-fa/--maj-ref can now be used with multiallelic variants. --maj-ref without 'force' now behaves as documented (only applies to variants with provisional REF alleles; previous implementation was backwards); this bugfix has been backported to alpha 5.

24 Jun: --king-cutoff-table and --king-table-require-xor implemented. --merge-max-alleles accepted as an alias for --merge-max-allele-ct, for naming consistency with --max-alleles and --import-max-alleles.

22 Jun: After looking at the original sequence reads, we have set the sex of sample HG03511 in the 1000 Genomes pedigree-corrected .psam files to NA, partially reversing the adjustment on 9 Jan 2023. The sequenced cell line appears to have one copy of chrX and no copies of chrY. This can be a consequence of either mosaic loss of chrX in a female or mosaic loss of chrY in a male. The pedigree provided by 1000 Genomes labels this sample as female, so that is almost certainly the true sex, but the chrX variant calls are much more representative of male data than female data.

17 Jun: --import-max-alleles implemented.

9 Jun: Fixed bug in --sort-vars handling of mixed known/provisional REF alleles when no variants were filtered out in the same run. Fixed --pgen-info bug in 26 May build.

26 May: --update-sex no longer exits without printing an error message when file-open fails. --pgen-info now prints to stdout instead of stderr. These fixes and last month's --no-categorical bugfix have been backported to alpha 5.

16 May: Fixed --r[2]-[un]phased table-output crash that could occur when the number of variants in the .bim/.pvar was an exact multiple of 64 (or 32 for 32-bit builds).

4 May: Fixed --r[2]-[un]phased 'ref-based' table-output segfault bugs. --r-[un]phased table-output signs should now be correct when the REF allele is not major.

18 Apr: --glm 'qt-residualize' implemented, along with optimizations for the no-covariate linear regression code it exercises. --pheno-svd command added. Fixed --ld-snp-list / 'inter-chr' bug that could occur when dataset was too large to fit in memory; we recommend rerunning large-dataset --ld-snp-list or 'inter-chr' jobs with this build. Fixed --no-categorical bug that caused a segfault or assertion failure when the first nonheader line of a phenotype or covariate file had a category name.

18 Mar: --glm 'cc-residualize'/'firth-residualize' can now be used without 'single-prec-cc'.

2 Mar: --hardy and --hwe now compute accurate chrX p-values below the double-precision floating point limit.

5 Feb: Fixed "--hardy midp" bug in 3 Feb build which could come up when processing more than 2.6 million samples.

3 Feb: Fixed 10+ ALT allele "--vcf-half-call missing" bug introduced in alpha 6. --hardy 'log10' modifier implemented. Except on chrX, --hardy/--hwe now compute accurate p-values below the double-precision floating point limit of 5e-324, so e.g. "--hwe 1e-400" now works properly.
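
To see why thresholds such as 1e-400 need special handling, here is a small Python sketch. It shows that the literal underflows to zero as a double, and that comparing in log10 space avoids the problem. The helper names are hypothetical and this is only one possible approach; it is not a description of how PLINK 2 represents these p-values internally.

```python
from decimal import Decimal


def threshold_log10(threshold_text: str) -> float:
    """Return log10 of a threshold such as '1e-400' without going through float."""
    value = Decimal(threshold_text)
    if value <= 0:
        raise ValueError("threshold must be positive")
    return float(value.log10())


def fails_threshold(log10_p: float, threshold_text: str) -> bool:
    """True when a p-value, given as log10(p), is below the threshold."""
    return log10_p < threshold_log10(threshold_text)


# As a double, 1e-400 underflows to exactly zero, so a direct comparison
# "p < 1e-400" could never succeed for any representable p.
underflowed = float("1e-400")
print(underflowed)

# In log10 space the same comparison is well-defined.
print(fails_threshold(-450.0, "1e-400"))
print(fails_threshold(-350.0, "1e-400"))
```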

5 Jan: Fixed multithreaded --make-bed data corruption bug that could occur when writing >260k samples while filtering or reordering samples, or writing >520k samples without filtering/reordering. This bugfix has been backported to alpha 5. Data corruption from this bug should be pretty obvious, but if you were in the habit of using --bfile/--make-bed instead of --pfile/--make-pgen with UK Biobank data, you may want to rerun large --make-bed operations and downstream steps to be safe.
Remaining builds and GitHub releases affected by this bug will be deleted in February 2024.

4 Jan: --r[2]-unphased low-MAF-optimization / ref-based bugfixes. --r-[un]phased now defaults to including ref instead of maj column-set when 'ref-based' is specified.

3 Jan: --r[2]-[un]phased table-mode now errors out when column-set doesn't seem to contain enough information to define the calculation (e.g. multiallelic variants are present and it may not be clear which allele is being counted); this can be overridden with the 'allow-ambiguous-allele' modifier. --r[2]-[un]phased 'ref-based' modifier implemented.

2 Jan: More r2 optimizations.

1 Jan 2024: Fixed --r[2]-[un]phased segfault that occurred when chrom column-set was omitted. Unphased-r2 calculation speedup on low-MAF (<0.8%) .pgen hardcall variants.

12 Dec 2023: --r[2]-[un]phased implemented. Fixed error in phased-r2 log-likelihood calculation that could cause a slightly suboptimal solution to be selected (occasionally changes --clump/--ld result); this fix has been backported to alpha 5.

23 Nov: Fixed categorical-covariate preprocessing bug that could occur in --glm no-firth mode. --normalize + variant-split in the same command now causes a warning (which will be upgraded to an error in a future build) to be printed, since those operations should almost always be split across two PLINK 2 runs.

21 Nov: Fixed serious --glm double-precision logistic regression bug; if you performed double-precision logistic regression on a binary phenotype with a Jan-Oct 2023 plink2 build, we recommend rerunning the regression with this fix. --vcf-require-gt should work properly with VCF files again. These bugfixes have been backported to alpha 5.
Affected builds and GitHub releases will be deleted in 2024.

21 Oct: Fixed segmentation fault that could occur when processing multiallelic variants with 4+ alleles. This bugfix has been backported to alpha 5.

29 Oct: Fixed --clump crash that could occur if the total number of alleles in the dataset was a multiple of 64 (or 32 for 32-bit builds); this bugfix has been backported to alpha 5. Fixed --clump chrX missing-data bug.

27 Oct: Phased-r2 (computed by --clump when --clump-unphased is absent, as well as the upcoming --r[2]-phased command) formula has been changed to remove an inappropriate independence assumption that caused the r2 of a variant with itself to be less than 1 when dosages were present. (Results are unchanged when dosages are not present.) Fixed --clump-unphased chrX bug. Alpha 5 has been updated with the --clump-unphased bugfix; dosages + --clump phased-r2 was already disabled.

24 Oct: Fixed --clump no-males chrX segfault, and --clump-unphased bug that could occur when dosages and fully-missing values were simultaneously present. These bugfixes have been backported to alpha 5.

18 Oct: Most computations now consider unknown-sex samples on chrY, unless explicitly stated otherwise. (Yes, this is a potentially compatibility-breaking change.) The main exception is missingness-rate; in that case, use the new --y-nosex-missing-stats flag to include them. A related exception is --score and --variant-score when they perform mean-imputation; these now block you from processing chrY variants when unknown-sex samples are present.
Fixed a --variant-score chrY bug, and a --ld chrX unknown-sex handling bug. These bugfixes have been backported to alpha 5.
The --set-hh-missing flag has been renamed to --set-invalid-haploid-missing, to make the name more accurate. (The old flag name will still work in all PLINK 2.0 builds.)
The Byrska-Bishop et al. 1000 Genomes hg38 callset on the Resources page now has dbSNP 156 rsIDs if you want them.

11 Oct: Fixed a --clump/--ld phased-dosage bug. Affected code paths are now disabled in alpha 5.

5 Oct: --score-list implemented.

3 Oct: --sort-vars + --make-just-pvar now works properly instead of segfaulting. --sort-vars can handle larger INFO columns than it could yesterday.

2 Oct: Fixed --sort-vars bug that could occur when INFO column didn't fit in memory, and improved performance in that case (though a bit of extra disk space is now required). Fixed a --glm chrX/chrY --parameters + --tests bug. Fixed a --dummy phased-dosage bug. These bugfixes have been backported to alpha 5.

27 Sep 2023 (alpha 6): This makes the following potentially compatibility-breaking changes:

  • --allow-extra-chr is never necessary anymore unless --strict-extra-chr is specified.
  • --set-all-var-ids and related flags are now applied to all --pmerge[-list] inputs.
  • Polyploid genotypes in VCF/BCF files now default to triggering an error; this behavior is now controlled by the --polyploid-mode flag. Note that x/y/. genotypes were previously silently imported as x/y, but there is currently no --polyploid-mode setting corresponding to that behavior; contact us if this is a problem for your workflow.
  • -log10(x) column names now consistently start with "NEG_LOG10_", instead of "LOG10_".
  • --adjust[-file] QQ column is no longer affected by the 'log10' modifier.
  • --ld no longer has a 'dosage' modifier. Dosage is now always taken into account when present.

(Update: see also the 18 Oct 2023 change to handling of unknown-sex samples on chrY.)
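
Downstream scripts that read both pre- and post-alpha-6 output may want to normalize the renamed -log10(x) headers. The Python sketch below maps any header starting with "LOG10_" to the "NEG_LOG10_" form. The function name is hypothetical and the example header names are illustrative only; check the actual column names in your own output files.

```python
from typing import List


def normalize_log10_header(name: str) -> str:
    """Map an old-style 'LOG10_...' header to the 'NEG_LOG10_...' form."""
    if name.startswith("LOG10_"):
        return "NEG_" + name
    # Already-renamed and unrelated headers pass through unchanged.
    return name


def normalize_header_row(columns: List[str]) -> List[str]:
    """Apply the rename to every column name in a header row."""
    return [normalize_log10_header(col) for col in columns]


# Illustrative header row; the column names are examples, not a guaranteed layout.
normalized = normalize_header_row(["ID", "LOG10_P", "NEG_LOG10_P", "BETA"])
print(normalized)
```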

23 Sep: "--export phylip[-phased]" implemented.
This is the final alpha 5(.0) build.

22 Sep: Fixed uninitialized-memory bug that could cause .sample loading to fail. This bugfix has been backported to alpha 4.

21 Sep: --clump implemented. Fixed an issue with --ld's handling of chrX heterozygous-haploid males.

15 Sep: Fixed pgenlib phased-dosage sample-subsetting bug. (The bug only occurred when the caller actually requested the phased-dosage data track; affected plink2 commands were "--export {b,v}cf vcf-dosage=HDS", "--export bgen-1.{2,3}", and --ld with the 'dosage' modifier.) This bugfix has been backported to alpha 4.

13 Sep: Fixed --ld bug introduced in previous build.

11 Sep: Large-dataset (>92k sample) minimac3-r2 calculation bugfix. Backported to alpha 4 on 12 Sep.

8 Sep: --update-chr implemented; it now must be used with --sort-vars. To reduce potential for confusion, --sort-vars can no longer be used with output commands other than --make-[b]pgen/--make-bed and --pmerge[-list].

29 Aug: Fixed .pvar-writing bug that could occur when the INFO/PR flag was present in the input. This and the --indep-pairwise/--indep-pairphase bugfix have been backported to alpha 4.

25 Aug: Fixed --indep-pairwise/--indep-pairphase chrX/chrY bug that occurred when some nonfounders were present.

18 Aug: --indep-pairphase implemented. (This command now requires all genotype data to be phased.)

13 Aug: Fixed --recover-var-ids bug that occurred when a chromosome ended with multiple variants at the same position. This and other recent bugfixes have been backported to alpha 4.

11 Aug: PAR1/PAR2 chromosome codes are now usable in nonhuman species. 'chm13' build code added for --split-par. Fixed GRM/PCA bug in 32-bit builds.

4 Aug: Fixed bug in --pmerge-list parsing of 3-column input, when second and third columns weren't of equal length.

7 Jul: Fixed --missing-code for binary and quantitative phenotypes.

21 Jun: Fixed --king-table-require's behavior when FID isn't constant. This bugfix has been backported to alpha 4.

7 Jun: Fixed --update-name bug that could cause the updated variant names to be corrupted in some cases, or trigger an assertion error. (Update, 10 Jun: This bugfix has been backported to alpha 4.) --strict-extra-chr flag added; it currently has no effect, but in the future, we will remove the current requirement of including the --allow-extra-chr flag whenever the dataset contains nonstandard chromosome code(s), except when --strict-extra-chr is present.

31 May: Fixed --read-freq bug that occurred when some alleles in the loaded variant were not present in the --read-freq file. This bugfix has been backported to alpha 4.

28 May: Fixed GRM/--pca bug affecting handling of missing data in multiallelic variants. This bugfix has been backported to alpha 4.

16 May 2023 (alpha 5): This makes the following potentially compatibility-breaking changes:

  • For --glm, 'provref', 'omitted' and 'a1freq' are now included in the default column set, to improve GWAS Catalog compatibility. The 'ax' column set has been retired.
  • 'maybeprovref' is now included in the default column set of all other commands that support it, to better distinguish trusted VCF-derived REF alleles from generic .bim A2 alleles.
  • --indep-pairwise now defaults to "--indep-order 2", which tends to be much faster but slightly changes results from PLINK 1.x. Use "--indep-order 1" to request the old behavior.
  • When it looks like chrX is being mis-imported (no sex information, or missing --split-par), PLINK 2 now errors out instead of just printing a warning. You can suppress this with --lax-chrx-import, but this is strongly discouraged.
  • When nonfounders are present for --mac/--max-mac/"--freq counts", you now must explicitly specify whether they should be included (--nonfounders) or excluded (--ac-founders) in the computation.
  • --adjust now reports genomic inflation factors smaller than 1. (The factor is still rounded up to 1 for the purpose of reporting inflation-corrected p-values.)
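
The --adjust change in the last item separates the reported inflation factor from the one applied, which the following Python sketch illustrates. It assumes the common median-based estimator (median 1-df chi-square statistic divided by the chi-square median); that choice, and all function names, are assumptions of this example rather than a statement of PLINK 2's exact computation. Input p-values are expected to lie in (0, 1], and precision degrades for very small p-values in this simple form.

```python
from statistics import NormalDist, median
from typing import List, Tuple

_STD_NORMAL = NormalDist()
# Median of a 1-df chi-square distribution: square of the 75th normal percentile.
_CHISQ1_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chisq(p: float) -> float:
    """Convert a two-sided p-value in (0, 1] to a 1-df chi-square statistic."""
    z = _STD_NORMAL.inv_cdf(1.0 - p / 2.0)
    return z * z


def chisq_to_p(stat: float) -> float:
    """Convert a 1-df chi-square statistic back to a two-sided p-value."""
    return 2.0 * (1.0 - _STD_NORMAL.cdf(stat ** 0.5))


def genomic_control(pvals: List[float]) -> Tuple[float, float, List[float]]:
    """Return (reported factor, applied factor, inflation-corrected p-values)."""
    stats = [p_to_chisq(p) for p in pvals]
    reported = median(stats) / _CHISQ1_MEDIAN
    # The reported factor may be below 1; the correction never uses less than 1.
    applied = max(reported, 1.0)
    corrected = [chisq_to_p(s / applied) for s in stats]
    return reported, applied, corrected


deflated_pvals = [0.9, 0.8, 0.7, 0.6, 0.5]
reported, applied, corrected = genomic_control(deflated_pvals)
print(reported, applied)
```

With a deflated set of p-values like the one above, the reported factor falls below 1 while the corrected p-values stay equal to the inputs, because the applied factor is clamped at 1.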

26 Apr: When numeric chromosome codes 27 or 28 appear in an input .bim/.pvar when the default human chromosome set is in effect, an appropriate error message is now printed.
This is the final alpha 4(.0) build.

17 Apr: Fixed a bug with loading multiallelic variants in a mixed-phased/unphased dataset.

11 Apr: --gwas-ssf command added, which (re)formats --glm output for submission to the GWAS Catalog. QUAL values are now allowed to be infinity. --adjust-file should now work properly on files with no TEST column. 'maybeprovref' and 'provref' optional column sets added to many commands. Built-in zstd version updated to 1.5.5, which fixes a very rare corruption bug for --zst-level 16+.

29 Mar: Fixed a rare multiallelic-variant .pgen writing bug that could result in an obviously-corrupted .pgen (--validate reports unexpected file size) when REF and ALT1 frequencies are both low.

25 Mar: Fixed --indep-pairwise chrX bug that occurred when some nonfounders were present. --indep-pairwise speed improvement. --indep-order flag added; "--indep-order 2" greatly speeds up --indep-pairwise while slightly changing results, and will become the default in the future. --glm 'provref' optional column set added, distinguishing provisional from known reference alleles; this will be included in the default column set in the future.

3 Mar: Fixed --hardy/--hwe bug that could cause a segfault when chrX and multiallelic variants are present, and sex information is absent. --glm 'omitted' optional column set added, listing only the omitted allele; this (and 'a1freq') will be included in the default column set in the future. --glm 'ax' column set deprecated. Linux builds now avoid requesting more memory than reported by /proc/meminfo's 'MemAvailable' entry (when it exists). --pca now prints a more informative error message when there is a nan in the GRM.

9 Jan: --sample-counts multiallelic-variant heterozygous-haploid bugfix. --ac-founders flag added to confirm intentional exclusion of nonfounders from --mac/--max-mac/"--freq counts". (Why? Because we overlooked this detail when processing the preliminary 1000 Genomes hg38 callset...) A warning, which will be upgraded to an error in the future, is now printed when nonfounders remain during these operations, and neither --nonfounders nor --ac-founders is specified.
The sex of sample HG03511 in the 1000 Genomes pedigree-corrected .psam files has been changed to male, due to having >30x fewer heterozygous calls on chrX (outside the pseudoautosomal regions) than average for females in the hg38 callset, no autosomal evidence of inbreeding, the largest fraction of non-SNPs among its chrX heterozygous calls, and the largest percentage decline in chrX heterozygous call count from build 37 to 38.

8 Jan: BCF import/export now permits non-Flag INFO keys to have no value, as allowed by the VCFv4.3 specification, and handles empty-String values properly. --extract-if-info/--exclude-if-info now support comparison against the empty-string value (specify ';' as the value). --no-categorical flag added (forces all non-numeric phenotype/covariate strings to be interpreted as missing values, not category names).
The final Byrska-Bishop et al. 1000 Genomes hg38 callset is now available on the Resources page. (Due to a header line and an INFO annotation quirk, PLINK 2 builds older than 8 Jan 2023 are unable to convert this dataset to or from BCF.)

7 Jan: VCF header line parser no longer requires key/value pairs to be in ID, Number, Type order. Fixed uninitialized-free() crash that occurred when e.g. the --bcf filename was mistyped.

1 Jan 2023 (alpha 4): Primary changes:

  • --glm logistic/Firth regression now defaults to double-precision (float64) instead of single-precision (float32) arithmetic, the main logistic regression convergence criterion now matches R's, and the Firth regression logic has been updated to line up with the latest (1.24.1) version of the R logistf package. This improves the reliability of results (and the consistency of PLINK 2's interface), but at a significant computational cost. When the latter is a problem, you can still request single-precision arithmetic with the 'single-prec-cc' modifier (and 'cc-residualize' may also be worthwhile).
  • --allow-misleading-out-arg is now required when using an --out argument ending with a common filename extension.

This also fixes a --glm degenerate-covariate-removal bug that could result in a spurious "genotype/covariate scales vary too widely for numerical stability" error when at least one such covariate was present.

26 Nov 2022: An Apple Silicon native build is now available.

24 Oct: Fixed a bug that prevented VCF/.bgen import from erroring out properly when sample ID import failed. Added --king-table-require flag supporting some-vs.-all --make-king-table runs. Linux builds running on shared HPC nodes now try to detect the number of cores available to the user, rather than the total number of cores on the system. A Linux build statically linked to AMD AOCL is now available.

14 Aug: Fixed AVX2 --set-[all,missing]-var-ids bug that caused spurious "Variant IDs are limited to 16000 characters" errors when the POS value for a variant was 0.

9 Aug: Fixed --pmerge[-list] bug that could trigger when duplicate variants (identical ID, position, and alleles) were present in an input fileset.

1 Aug: Out-of-bounds --pheno-col-nums/--covar-col-nums arguments now produce an appropriate error message, instead of a segfault. A warning is now printed when the --out argument ends with a common file extension (.bcf, .bed, .bgen, .gz, .ped, .pgen, .vcf); this will become an error in the future, and the warning/error can be suppressed with --allow-misleading-out-arg. A warning is now printed when importing non-PLINK-format data containing chrX, and expected sex information and/or --split-par is missing; this will also become an error in the future, which can be suppressed with --lax-chrx-import. HIGH_AUTOSOME_NUM_BUILD #define added to plink2_common.h.
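
Pipelines that build --out arguments programmatically can apply a similar check before launching a run. This Python sketch tests an output prefix against the extensions listed in this entry. The function name is hypothetical, and case-sensitive matching is an assumption of the sketch, not a documented detail of PLINK 2's check.

```python
# Extensions named in the 1 Aug entry above.
MISLEADING_SUFFIXES = (".bcf", ".bed", ".bgen", ".gz", ".ped", ".pgen", ".vcf")


def out_prefix_looks_misleading(out_prefix: str) -> bool:
    """True when the output prefix ends with a common genotype-file extension."""
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return out_prefix.endswith(MISLEADING_SUFFIXES)


print(out_prefix_looks_misleading("results/cohort.vcf"))
print(out_prefix_looks_misleading("results/cohort"))
```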

3 Jun: Fixed --score dosage-processing bug in yesterday's build that could cause a segfault, and implemented more speed improvements. plink2_common.h now has a HIGH_CONTIG_BUILD #define that can be uncommented to extend the contig-count limit from ~65k to ~980k; the "Too many distinct nonstandard chromosome/contig names" error message now mentions this.

2 Jun: Fixed .bed-transpose bug which resulted in incorrect imports of .ped files with fewer than ~21k variants; if you used --pedmap on a small .ped, you should rerun with this build. Fixed bug that could cause a --pmerge[-list] assertion failure when multiallelic variants are present in the input. --score parallelism improvement.

31 May: HGDP-CEPH whole-genome-sequencing genotype calls have been posted to the Resources page.

19 May: Fixed AVX2-build-only bug that could result in miscounting of the genotypes at the very end of a row when a sample filter was being applied. --geno-counts, --hardy, and --hwe were affected; if you used one of these flags in the same run as a sample filter (--keep/--remove, etc.) with an AVX2 build, we recommend that you rerun with this build.

14 May: --make-grm-bin bugfix (.grm.N.bin file contained too few entries when no missing genotypes were present).
This is the final alpha 3(.0) build. S3 download links for this and subsequent alpha 3.x binaries will continue working indefinitely.

3 May: --error-on-freq-calc flag added, to help with pipeline performance optimization; thanks to Arthur Gilly for the suggestion. Documentation now includes a list of flags which may invoke the allele frequency calculation.

26 Apr: 'shrink-overlapping-deletions' modifier renamed to 'adjust-overlapping-deletions', since it may move rather than (or in addition to) shrinking an overlapping deletion.

25 Apr: --normalize no longer defaults to trying to shrink variants with a '*' overlapping-deletion allele; instead this must be enabled with the 'shrink-overlapping-deletions' modifier.

24 Apr: Fixed a bug that could result in a "Malformed .pgen" error or corrupt a small amount of data when simultaneously using --indiv-sort and taking a <6% subset of a dataset with dosages or multiallelic variants. --dummy now explicitly simulates some LD, and uses a uniform allele frequency distribution (instead of just AF=0.5) to generate variants. --dummy now accepts multiple (comma-separated) missing dosage frequencies; one of the provided frequencies is selected for each variant. --dummy phase-freq= now works properly for phase-frequencies in (0.5, 1).

19 Apr: GRCh38 1000 Genomes phase 3 chrY/chrM/contigs data has been added to the Resources page.

15 Apr: '*' is now allowed in the middle of contig names when exporting a VCFv4.3 file. (This was disallowed by a few earlier VCFv4.3 drafts, but it's now allowed, and actually necessary to work with some contigs in the GRCh38 reference that the latest 1000 Genomes callset was based on.)
Filenames of the form all_hg38* have been changed to main_hg38* on the Resources page, in preparation for posting chrY/chrM/contig data.

13 Apr: The Resources page has been reorganized, and now has both GRCh37 and GRCh38 1000 Genomes phase 3 datasets. The corresponding FASTA files have also been added.

10 Apr: VCF/BCF export no longer places usually-incorrect contig lengths (based on the latest variant position) in the header when the lengths are absent from the .pvar input. When these lengths are needed, they can now be scraped with --fa.

28 Mar: --geno-counts bugfix for chrY multiallelic variants. --delete-pmerge-result and --indv flags added.

22 Mar: --pmerge[-list] now consistently errors out when a duplicate sample ID is present, instead of usually crashing in this scenario. (This case is likely to be supported in the future; PLINK 1.9 can handle it.)

20 Mar: "--export sample-v2" bugfix for haps[legend], .gen, and bgen-1.1 formats. Improved GRM computation's load-balancing in some circumstances (built-from-source without linking Intel MKL, OpenBLAS, or Apple Accelerate; or --parallel); thanks to Lorién López Villellas for identifying the issue.

17 Mar: --export's vcf-dosage=DS-only mode now errors out when chrX (excepting the pseudoautosomal regions), chrM, and/or fully-haploid chromosomes are present, since readers may not be able to infer ploidy without the GT field; relatedly, --vcf/--bcf dosage=DS now requires the GT field to also be present in these scenarios. Fixed --bcf bug that triggered when genotype/dosage data was present, but not of the expected form (e.g. a DS-only file when the user did not specify dosage=DS).

16 Mar: Fixed a --glm local-covar= bug that could occur when same-position variants were present and a variant filter was applied. C/C++ pgenlib writer no longer requires the number of variants to be known in advance (instead, you just need to provide an upper bound).

15 Mar: Fixed a bug that could cause PLINK 2 to enter an infinite loop instead of erroring out if the user attempted to export a .bgen with multiallelic variant(s), or tried to process a corrupted .pgen.

14 Mar: .pgen specification now defines .pgen.pgi index files, and PLINK 2 (and the pgenlib C/C++ library) can now read these files when necessary. pgenlib can now write .pgen + .pgen.pgi file pairs in a sequential manner, and the pgen_compress sample program has been updated to be able to use this functionality.
This is technically a backward-incompatible change to the C/C++ library: a few function signatures have been adjusted. (The Python and R APIs have not changed.)

13 Mar: --tfile speed improvement.

12 Mar: --pedmap and "--export ped" implemented. Chromosome filter and CM column now work properly with --tfile. --not-pheno and --not-covar flags added.

10 Mar: Fixed --q-score-range bug that manifested on chrY, chrM, and all-haploid chromosome sets.

8 Mar: --tfile and "--export tped" implemented. --export '01' and '12' modifiers implemented.

5 Mar: --export 'A-transpose' format renamed to 'Av'; documentation now explains that exporting it does not involve transpose, while exporting 'A'/'AD' does. (The 'A-transpose' name will continue to work.) --file now generates an appropriate error message.

2 Mar: --within now has a clearer error message when no sample IDs match up.

23 Feb: Non-dosage --ld should now correctly mask out chrX male heterozygous genotypes.

17 Feb: --pmerge[-list] should now work properly with --allow-extra-chr.

29 Jan: "DS-only" mode added for VCF/BCF dosage export.

21 Jan: --indep-preferred flag added; this helps you control which variants --indep-pairwise tries to keep.

20 Jan 2022: When monomorphic autosomal variants are present, --het now prints a warning indicating that they are skipped.

17 Dec 2021: Fixed a bug that caused incorrect sample IDs to be printed in some --sample-diff error messages.

25 Nov: --glm firth-fallback mode now works properly when a categorical covariate is present; previously it was incorrectly using Firth regression all the time.

11 Oct: Filled in missing --adjust-file error message when file couldn't be opened. Python writer all_phased=True bugfix.

20 Sep: --glm 'hetonly' mode added. --glm diploid-only modes ('dominant', 'recessive', 'hetonly', 'genotypic', 'hethom') no longer exclude chrX when all samples are female.

8 Sep: --merge-max-allele-ct should work properly now. (In particular, --pmerge[-list] should no longer throw a "'split' multiallelic variant" error when "--merge-max-allele-ct 2" is specified.)

5 Sep: --glm 'cc-residualize' and 'firth-residualize' modes should now work properly with multiallelic variants and genotypic/hethom joint tests.

26 Aug: --pmerge-list concatenation-job detector no longer misses concatenation jobs where adjacent input files have same-position variants but those variants are always sorted by ID. --dummy can now generate phased data.

16 Aug: --parameters should now behave as described in the documentation when --glm is run with both the 'sex' and 'interaction' modifiers.

4 Aug: --update-sex now accepts 'U'/'u' as unknown-sex codes.

1 Jul: Fixed --bcf bug that could result in a spurious out-of-memory error when the file has no INFO entries.

8 Jun: Fixed multiallelic-variant-writing bug (typically manifesting as a segmentation fault or assertion failure) that could occur with --sort-vars or under low-memory conditions.

25 May: .fa loader now tolerates blank lines. gzip files containing multiple streams or trailing garbage should be accepted again.

23 May: Fixed FID+IID+SID loading (recent builds were giving an incorrect "SID column does not immediately follow IID column" error).

5 May: Fixed --within bug introduced on 16 Jan.

20 Apr: --het cols= should now work properly.

16 Apr: --data/--gen now supports .gen files with 6 leading columns. This format can be exported with "--export oxford-v2".

14 Apr: --pmerge-list should no longer be limited by the system's #-of-open-files cap.

13 Apr: --glm local-covariate-handling bugfix. Fixed --pmerge[-list] bug that could cause the generated .pgen header to be invalid when multiallelic variants were present.

6 Apr: --data/--sample now recognizes column type 'C' as a synonym for 'P' (continuous phenotype). (This build has an incorrect "6 Mar" datestamp; sorry about that.)

28 Mar: --sample-counts chrX no-known-males bugfix.

25 Mar: --pmerge[-list] .bim-handling bugfix.

23 Mar: Unbreak --make-pgen + --sort-vars (this was broken by the 28 Feb build).

2 Mar: --pmerge[-list] bugfixes (no longer segfaults when all variants are at different positions; if the output .pvar file already exists, it's deleted first instead of appended to; if an input file covers multiple chromosomes, there is no longer a likely assert failure; fixed some issues with merging of same-position same-ID variants).

1 Mar: --pmerge-list-dir flag implemented (specifies a common directory prefix for all --pmerge-list entries).

28 Feb: --pmerge[-list] can now be used for concatenation-like jobs.
Note that it doesn't necessarily perform pure concatenation on a chromosome-split dataset: if two variants in a file have the same position and ID, they will be merged, in a way that's not compatible with 'split' multiallelic variants sharing a single ID (those must be merged with a dedicated 'join' operation, such as "bcftools norm -m +"). As a consequence, --pmerge[-list] defaults to erroring out when it detects such a split variant. One workaround is to use --set-all-var-ids to assign distinct IDs to each piece of the split variant.
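
As a rough illustration of that workaround, here is a minimal Python sketch (not PLINK code) that gives each piece of a split multiallelic variant its own ID. The chrom:pos:ref:alt naming scheme, the `make_variant_id` helper, and the example records are all assumptions made for illustration; any scheme that yields distinct IDs would serve the same purpose.

```python
def make_variant_id(chrom, pos, ref, alt):
    """Build a chrom:pos:ref:alt ID (hypothetical naming scheme)."""
    return f"{chrom}:{pos}:{ref}:{alt}"


# Two pieces of one 'split' multiallelic variant: same position and the same
# original ID. The records and the 'rs_example' ID are made up.
split_pieces = [
    ("1", 12345, "A", "C", "rs_example"),
    ("1", 12345, "A", "G", "rs_example"),
]

# After renaming, the pieces no longer share an ID, so a same-position,
# same-ID merge rule would leave them alone.
new_ids = [make_variant_id(c, p, r, a) for c, p, r, a, _old_id in split_pieces]
print(new_ids)
```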

3 Feb: Fixed .pvar loading bug that triggered when FILTER values were relevant at the same time as either INFO/PR or CM values. .ped-derived filesets containing variants where both REF and ALT are missing are permitted again (such variants were prohibited in recent builds). --vcf-ref-n-missing flag added to simplify re-import of .ped-derived VCFs. Removed extra tabs from --pgen-diff output.

23 Jan: --chr-set now sets MT to haploid. ##chrSet .pvar header line without the corresponding command-line flag now initializes chrX, chrY, and MT ploidy correctly.

18 Jan: ##chrSet .pvar header lines now conform to VCFv4.3 specification (an ID field is included). VCF/BCF export now performs more header validation.

16 Jan: --update-parents now works properly with 'maybeparents' output column sets in other commands.

14 Jan: --update-ids now works properly with 'maybefid' output column sets in other commands when it creates a FID column.

4 Jan: --normalize now properly skips missing and '*' alleles.

3 Jan: Fixed --bcf bug that affected unphased multiallelic variants (usually resulting in a spurious "GT half-call" crash, but if you suppressed that with --vcf-half-call the data would not be imported correctly). Fixed "--export bcf" bug that occurred on headers with FILTER/INFO/FORMAT keys with identical names, and a crash that occurred on variants with multiple FILTER failures. --output-missing-genotype/--output-missing-phenotype bugfixes/cleanup.

2 Jan: --pgen-diff multiallelic-variant handling bugfix. --pgen-diff DS comparison implemented. --adjust cols= parsing bugfix ('cols=+qq' should work now).

1 Jan 2021: Several SID-handling bugfixes. --sample-diff 'dosage' and 'id-delim=' modifier command-line parsing bugfixes. --sample-diff no longer omits later ALT alleles when they're absent from the samples being compared. --pgen-diff GT comparison implemented (generalization of PLINK 1.x --merge-mode 6/7).

12 Dec 2020: --q-score-range score-average, ALLELE_CT, DENOM, and NAMED_ALLELE_DOSAGE_SUM column bugfix.

28 Oct: Multipass "--export A" bugfix. If you've previously run plink2 "--export A" on a file too large to fit in memory, we recommend that you rerun with this build.

20 Oct: --fst Weir-Cockerham method implemented. --fst ids= and chrX bugfixes. --fst variant-report OBS_CT is now specific to population pair.

19 Oct: Linux binaries should now yield reproducible results across machines unless --native is specified (previously, Intel MKL could select processor-dependent code paths with different floating-point rounding behavior). --fst Hudson method implemented. Categories within categorical phenotypes are now reported in natural-sorted order. --variant-score MISSING_CT/OBS_CT bugfix.
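
For readers unfamiliar with natural sorting, the following Python sketch shows the general idea: runs of digits are compared as numbers rather than character by character. It is only an illustration. The `natural_key` helper is hypothetical, and the exact tie-breaking rules PLINK 2 applies (case, leading zeros, and so on) are not described in this entry.

```python
import re


def natural_key(name):
    """Split a string into digit and non-digit runs; digit runs compare as integers."""
    tokens = re.findall(r"\d+|\D+", name)
    # Each token becomes a (kind, number, text) tuple so that tuples of mixed
    # tokens can always be compared without type errors.
    return [(0, int(tok), "") if tok.isdigit() else (1, 0, tok) for tok in tokens]


categories = ["batch10", "batch2", "batch1"]  # made-up category names
plain_order = sorted(categories)
natural_order = sorted(categories, key=natural_key)
print(plain_order)
print(natural_order)
```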

23 Sep: --update-ids no-FID bugfix.

14 Sep: --glm + --parameters chrX/chrY bugfix.

31 Aug: --data/--sample now supports QCTOOLv2's .sample dialect. --export 'sample-v2' exports it.

27 Jul: --glm 'cc-residualize' implemented. Note that the 'cc-residualize' and 'firth-residualize' approximations are not recommended if you have a significant number of missing genotypes.

25 Jul: --glm 'firth-residualize' modifier added. This implements the fast Firth approximation introduced in Mbatchou J et al. (2020) Computationally efficient whole genome regression for quantitative and binary traits.

6 Jul: --af-pseudocount flag implemented; this lets you specify a pseudocount other than 0 or 1 for allele frequency estimation.
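
The effect of a pseudocount can be sketched in a few lines of Python. This assumes ordinary additive smoothing, where the pseudocount is added to every allele's observed count before normalizing; the precise formula PLINK 2 uses should be checked against the flag's own documentation. The `allele_freqs` helper is hypothetical.

```python
def allele_freqs(counts, pseudocount=0.0):
    """Estimate allele frequencies with additive smoothing (assumed formula)."""
    total = sum(counts) + pseudocount * len(counts)
    if total == 0:
        raise ValueError("no observations and no pseudocount")
    return [(c + pseudocount) / total for c in counts]


# A variant where the first allele was never observed in 10 allele observations.
print(allele_freqs([0, 10]))                  # no pseudocount
print(allele_freqs([0, 10], pseudocount=1))   # pseudocount of 1
print(allele_freqs([0, 10], pseudocount=0.5)) # intermediate pseudocount
```

With a nonzero pseudocount, an unobserved allele gets a small positive frequency estimate instead of exactly zero.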

1 Jul: --make-[b]pgen 'fill-missing-from-dosage' modifier implemented, to support algorithms that require no missing hardcalls.

27 Jun: --hardy/--hwe chrX multiallelic-variant handling bugfixes.

25 Jun: Replaced a misleading "No such file or directory" file-read error message.

20 Jun: --het implemented.

15 Jun: --glm local-covar= no longer errors out on long RFMix2 header lines, as long as ID lengths are reasonable.

31 May: Added single-precision --variant-score mode.

11 May: Fixed --glm segfault that occurred when categorical covariates were present, but none had more than 2 categories.

9 Apr: Firth regression implementation now uses the same maxit=25 value as R logistf(). 'UNFINISHED' error code added to flag logistic/Firth regression results which would change with even more iterations.

28 Mar: Fixed --glm bug in 21 Mar build that caused segfaults when zero-MAF biallelic variants were present. --glm now errors out when no covariate file is specified, unless the 'allow-no-covars' modifier is specified.

21 Mar: Fixed --glm multiallelic-variant handling bugs that could occur when 'genotypic', 'hethom', 'dominant', 'recessive', 'interaction', or --tests was specified, and corrected 'dominant'/'recessive' documentation. It is no longer necessary to trim zero- (or other-constant-) dosage alleles from multiallelic variants to get --glm results for the other alleles.

14 Mar: --make-pgen/--make-just-pvar 'vcfheader' column set added (this makes it possible to directly generate a valid sites-only VCF). Bgzipping of the .pvar file is not directly supported, but you can use a named pipe to accomplish that with low overhead.

11 Mar: Fixed --glm segfault that could occur when no covariates were specified. VCF/BCF importers now default to compressing the temporary .pvar file, so that files with lots of INFO field content don't require a disproportionally large amount of free disk space to work with. --keep-autoconv now has a 'vzs' modifier to request compression of the .pvar file (and conversely, when --vcf/--bcf is used with bare --keep-autoconv, the .pvar is not compressed).

10 Mar: Fixed --make-pgen segfault that occurred when phased dosages were present without any phased hardcalls.

8 Mar: "--export bcf" implemented. VCF-export multiallelic HDS-force bugfixes. Added missing FILTER/fa header line to whole-genome 1000 Genomes phase 3 annotated .pvar files on Resources page.

25 Feb: --ld multiallelic-phased data handling bugfix.

22 Feb: --bcf n_allele=1 (ALT='.') bugfix.

19 Feb: --bcf GQ/DP-filtering bugfixes. --vcf and --bcf now enforce VCF contig naming restrictions.

17 Feb: --bcf implemented.

11 Feb: "--vcf-half-call reference" works properly again (it was behaving like "--vcf-half-call error" in recent builds).

8 Feb: BGZF-compressed text files should now work properly with all commands that make multiple passes over the file (previously they worked with --vcf, but almost no other commands of this type). Named-pipe input to these commands should now consistently result in an error message in a reasonable amount of time; previously this could hang forever.

3 Feb: --missing-code now works properly with --haps.

24 Jan: Fixed --extract/--exclude bug that could occur when another variant filter was applied earlier in the order of operations (e.g. --snps-only, --max-alleles, --extract-if-info). This bugfix has been backported to alpha 2.

23 Jan: Added --bed-border-{bp,kb} flags for extending all "--extract range"/"--exclude range" intervals.

21 Jan: "--extract range" and "--exclude range" no longer error out when their input files contain a chromosome code absent from the current dataset.

16 Jan: --pca allele/variant weight multithreading bugfix.

14 Jan: --make-king-table rel-check bugfix.

3 Jan 2020: Fixed --extract-if-info/--exclude-if-info numeric-argument bug introduced in late October.

30 Dec 2019 (alpha 3): This makes the following potentially compatibility-breaking changes:

  • --write-snplist and --indep-pairwise require all variant IDs to be unique. For --write-snplist, this can be overridden by adding the 'allow-dups' modifier.
  • .bgen/.gen import commands require the REF/ALT mode to be explicitly declared.
  • --glm defaults to 'firth-fallback' mode for binary phenotypes. The old behavior can be requested with the 'no-firth' modifier.
  • --glm errors out, instead of just skipping the phenotype and printing a warning, when there's a linear dependency between the phenotype and the covariates. The old behavior can be requested with the 'skip' modifier.
  • --pca's 'var-wts' subcommand has been replaced with 'allele-wts', which handles multiallelic variants properly. For datasets that contain only biallelic variants, the old output format can still be requested with 'biallelic-var-wts'.
  • PLINK 2 now errors out when you request an LD computation on a dataset with fewer than 50 founders. This can be overridden with --bad-ld.
  • --score's old NMISS_ALLELE_CT column (nonmissing allele count) has been renamed to ALLELE_CT, and the column set renamed accordingly, since in other contexts, 'nmiss' refers to the number of missing values, which is essentially the opposite...
  • --make-king-table's ID{1,2} columns have been renamed to IID{1,2}, for consistency with other PLINK 2 commands.

In addition, the GRM computation (along with "--pca approx" and "--score variance-standardize") now handles multiallelic variants properly, instead of just collapsing all minor alleles together; --score allows each allele in a multiallelic variant to be assigned its own score; and --glm handles categorical covariates in a manner that's less likely to cause VIF overflow.

The final alpha 2 build has been tagged in GitHub, and will remain downloadable from here for the next several months.

29 Dec: Fixed a bug which affected processing of some heterozygous-double-ALT multiallelic variants, and a bug that caused ALT2/ALT3/etc. allele frequencies to not be properly initialized in some circumstances.

13 Dec: Fixed bug introduced in 22 Nov build which caused some reported dosages/counts (such as --freq's OBS_CT column) to be doubled. --loop-cats bugfixes.

28 Nov: Fixed a VCF half-call handling bug introduced last month.

26 Nov: Fixed recent bug which caused a segfault when no-duplicate-allowed variant ID lookup was performed with more than 16 threads.

25 Nov: Fixed bug that caused --sort-vars to segfault when the number of contigs was a multiple of 16. --keep-fcol and --extract-fcol were judged to be poopy names, and have been renamed to --keep-col-match and --extract-col-cond respectively (the old names will still work in this build).
The online documentation is now almost complete. The sidebar search box works.

22 Nov: Firth regression speed improvement. "--freq counts" now exports dosages with enough precision for --read-freq to perfectly reconstruct the original allele frequencies from the .acount file, and --read-freq has been modified to do that.

20 Nov: Fixed --make-king[-table] + --parallel bug.

15 Nov: Fixed "--glm cols=+err" bug that could cause garbage output when 'hide-covar' was not specified. --covar-number retired (previously it was being incorrectly converted to --covar-col-nums, which does not have the same semantics).

12 Nov: All-vs.-all --make-king[-table] runs now handle MAF < 1% variants much more efficiently. --no-input-missing-phenotype option added. --variant-score now supports binary output.

10 Nov: Fixed bug introduced in 29 Oct build that caused a segfault when a 'NA'/'nan' phenotype or covariate value was encountered.

9 Nov: --variant-score (transpose of --score) implemented.

4 Nov: Restored "--export vcf" invalid-allele-code warning.

31 Oct: --split-cat-pheno 'omit-most' modifier implemented; it works better with --glm's built-in variance-inflation-factor check than 'omit-last', and --glm will switch to handling categorical covariates in this manner in alpha 3.

30 Oct: Fixed bug that caused --covar-col-nums and --covar iid-only to get mixed up. Stricter blank-line policy for most text input files: they're allowed at the end (since this happens every once in a while with manually edited files), but they're no longer allowed elsewhere. Removing the FILTER and/or INFO columns when generating a .pvar file (with e.g. 'pvar-cols=-info') now removes the corresponding header lines.

29 Oct: --q-score-range implemented. Strings which start with a number but contain nonnumeric content (e.g. "-123.4abc") now trigger an error when a floating-point number is expected; the example string was previously just parsed as -123.4.
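
The difference between the old and new parsing behavior can be illustrated with a small Python sketch. This is not PLINK's parser: the regular expression and the `parse_lenient`/`parse_strict` helpers are assumptions for illustration, and they handle only plain decimal numbers (no 'NA'/'nan' handling).

```python
import re

# Optional sign, digits with an optional fraction, optional exponent.
_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def parse_lenient(text):
    """Old-style behavior: parse the leading number and ignore what follows."""
    m = _NUMBER.match(text)
    if m is None:
        raise ValueError(f"no leading number in {text!r}")
    return float(m.group(0))


def parse_strict(text):
    """New-style behavior: the whole string must be a number."""
    if _NUMBER.fullmatch(text) is None:
        raise ValueError(f"invalid floating-point value: {text!r}")
    return float(text)


print(parse_lenient("-123.4abc"))
try:
    parse_strict("-123.4abc")
except ValueError as err:
    print("error:", err)
```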

25 Oct: --make-king-table 'rel-check' modifier added; this has the same effect as it did for PLINK 1.9 --genome. --pca 'var-wts' modifier deprecated: switch to 'biallelic-var-wts' when your data contains only biallelic variants and you want to continue generating only one weight per variant. (Alpha 3 will introduce an 'allele-wts' modifier which generates one weight per allele instead; this is necessary to support multiallelic variants in an analytically sound manner.)

22 Oct: --recover-var-ids implemented. (This is designed to reverse --set-all-var-ids.)

20 Oct: --sample-counts implemented; this provides the main (non-indel) sample counts reported by "bcftools stats"'s -s flag, and is >100x as fast for plink2-formatted large datasets. --extract-fcol extended to support substring matches.

15 Oct: Fixed bug in 12 Oct Linux builds that caused plink2 to hang on --extract/--exclude/--snps and similar variant ID filters. Implemented --extract-fcol, which filters variants based on a TSV column (this is an extension of PLINK 1.x --qual-scores).

12 Oct: "--hwe 0" no longer removes a small number of very-low-HWE-p-value variants.

9 Oct: --pheno/--covar 'iid-only' modifier added, supporting headerless files with a single ID column. Windows BGZF compression is now multithreaded. Improved read-error messages.

6 Oct: Windows --silent bugfix. Source code now supports dynamic linking with libzstd (though performance may suffer if you don't build the multithreaded version of that library).

4 Oct: --king-table-subset + --parallel bugfix. Automatic Zstd text-file decompression was broken for a few commands by the 28 Sep build; that should work properly now.

3 Oct: Fixed BGZF decompression bugs in 28 Sep build. (This did not affect VCF → .bed/.pgen conversion, though some rarer use cases were affected.) SID-loading bugfix.

28 Sep: Mixed-provisional-reference bugfixes. --ref-allele/--alt1-allele/--update-map/--update-name skip-count bugfix. --glm local-covar line-skipping bugfix. Automatic-rename when an input filename matches an output filename should work properly again instead of erroring out (though it should still be avoided).

10 Sep: --glm joint test p-value bug fix. (This bug only affected runs where --tests was invoked with 4 or more predictors.)

26 Aug: --read-freq now prints a warning, instead of segfaulting or entering an infinite loop, when all variants have already been filtered out.

21 Aug: Fixed --ref-from-fa/--ref-allele + VCF export interaction that caused spurious 'PR' INFO flags to be reported.

10 Aug: Open-fail and write-fail error messages now include a more detailed explanation of what went wrong. --bgen, --data, and --gen now have a 'ref-unknown' modifier for explicitly specifying that neither the first nor last allele is consistently REF.

31 Jul: --score prints an error message instead of segfaulting when an input-file line is truncated. Fixed rare --glm bug that could cause all results to be reported as 'NA' when exactly one covariate is defined. .log files print '--out' and '--d' properly again (this was broken by the 24 Jul build). --glm now has an optional output column ('err') which reports the reason for each 'NA' coefficient.

24 Jul: --d implemented.

8 Jul: --rm-dup/--sample-diff/--ld multiallelic variant bugfix.

5 Jul: --read-freq moved before usual allele frequency/count computation in order of operations. Loaded allele frequencies are not recomputed any more.

28 Jun: --king-table-subset should work properly again.

26 Jun: Fixed --glm multiallelic-variant bug that could cause one allele to be reported twice and one covariate test to be unreported, when neither 'hide-covar' nor 'intercept' was specified. Fixed issue that could cause --glm genotypic/hethom to segfault with no covariates.

17 Jun: Fixed rare underflow in --glm p-value computation which could cause an assertion failure.

27 May: Unbroke --adjust-file. "--export ind-major-bed" performance improvement.

12 May: Fixed --glm linear regression phenotype-batch handling bug that could cause a crash (or, on .bed-formatted data, generate incorrect results) on batches of size > 240.

29 Apr: BGEN 1.2/1.3 phased-dosage import bugfixes. --make-pgen + --dosage-erase-threshold without --hard-call-threshold no longer crashes.

28 Apr: PLINK 2-specific extensions to --update-ids and --update-parents simplified. The --id-delim/--sample-diff 'sid' modifier, which specified that single-delimiter sample IDs should be interpreted as IID-SID, has been replaced by the --iid-sid flag.

27 Apr: --haps bugfix for sample counts congruent to 17..31 (mod 32). This only affected the last few samples of the file, but if you used --haps with an earlier build, we strongly recommend rerunning it. --glm logistic regression 'SE' column renamed to LOG(OR)_SE when reporting odds ratio, to make it more obvious that the reported standard error does not use odds ratio units. --update-parents implemented.
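
For quick triage, the sample-count condition above can be checked with a one-line test. The Python sketch below only restates the "congruent to 17..31 (mod 32)" condition from this entry; the function name is hypothetical, and it is no substitute for rerunning the import with a fixed build.

```python
def haps_sample_count_affected(sample_ct):
    """True if sample_ct is congruent to 17..31 (mod 32), per the entry above."""
    return 17 <= sample_ct % 32 <= 31


for n in (16, 17, 31, 32, 49, 2504):
    print(n, haps_sample_count_affected(n))
```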

2 Apr: Fixed --hwe bug that could cause chrY and MT variants to be improperly filtered. --glm 'pheno-ids' now works for groups of quantitative phenotypes.

1 Apr: --glm without --adjust now detects groups of quantitative phenotypes with the same "missingness pattern", and processes them together, with a large speed increase. Be careful about disk space, though: you probably want to use the 'hide-covar' modifier, and 'zs' and/or --pfilter might also be useful. --glm linear regression local-covar= bugfix.

26 Mar: Minimac3-r2 computation bugfix. --glm no longer generates .id files listing all samples used for each phenotype, unless the 'pheno-ids' modifier is added. --update-ids implemented.

23 Mar: Fixed multiallelic-variant writer bug that could affect files where the largest number of alleles is 6 or 18. --minimac3-r2-filter and --freq minimac3r2 column implemented.

18 Mar: --write-covar can now be used when no covariates are loaded, if at least one phenotype is loaded and phenotype output was requested.

9 Mar: plink2 --version and --help no longer return nonzero exit codes.
A draft PGEN specification is now available.

6 Mar: Fixed allele frequency computation bug that could cause a spurious "Malformed .pgen file" error when a variant filter was active.

5 Mar: Multithreaded --extract/--exclude.

4 Mar: --tests linear-regression output bugfix.

3 Mar: Fix --glm odds-ratio printing bug introduced on 1 Mar.

2 Mar: More help text cleanup (now including online documentation).

1 Mar: --recode-allele implemented (and renamed to --export-allele for consistency). VCF import now errors out when a space-containing INFO value is imported. Brackets in command-line help text are now used in a manner more similar to other tools.

21 Feb: --glm joint tests are now based on F-statistics, for better small-sample accuracy.

20 Feb: --import-dosage-certainty now always produces a missing call, instead of falling back on the VCF GT field, when dosage certainty is inadequate. --extract-intersect flag added.

19 Feb: --glm works properly again with no covariates (it was exiting with a spurious "out of memory" error). --import-dosage-certainty now has the expected effect on single-valued dosages, instead of just genotype-probability triplets.

18 Feb: Fixed a bug that could cause --missing to crash on dosage data.

14 Feb: Command-line integer parameters can now use scientific notation.
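
As an illustration of what accepting scientific notation for an integer parameter involves, here is a minimal Python sketch. It is not PLINK's parser. The `parse_int_param` helper is hypothetical, and it accepts any string whose value is an exact integer; which forms PLINK 2 actually accepts (for example, whether "2.5e3" is allowed) is not stated in this entry.

```python
from decimal import Decimal, InvalidOperation


def parse_int_param(text):
    """Parse an integer that may be written in scientific notation, e.g. '1e6'."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        raise ValueError(f"not a number: {text!r}") from None
    # Reject NaN/Infinity and anything with a fractional part.
    if not value.is_finite() or value != value.to_integral_value():
        raise ValueError(f"not an integer: {text!r}")
    return int(value)


print(parse_int_param("1e6"))
print(parse_int_param("42"))
```

Decimal arithmetic is used here so that the integrality check is exact rather than subject to floating-point rounding.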

12 Feb: Phased-dosage import bugfix.

2 Feb: --tests + --parameters bugfix.

31 Jan: --pca approx now errors out instead of reporting inaccurate results when the number of variants is too small relative to the number of PCs. --pca approx eigenvalue bugfix.

30 Jan: --glm covariate-scale error is now propagated properly, instead of producing a mysterious out-of-memory error message.

27 Jan: --tests implemented.

22 Jan: --glm now errors out and recommends adding --covar-variance-standardize when covariates vary enough in scale for numeric instability to be a major concern.

2 Jan 2019: Phased-dosage import bugfix.

27 Dec 2018: --ref-allele/--alt1-allele skipchar was broken for the past few months; it should work properly again. Fixed a bug which occurred when importing an all-noninteger-dosage variant.

28 Oct: --keep-fam/--remove-fam bugfix.

2 Oct: Fixed bug that could occur when loading very long text lines (e.g. VCF lines longer than 5 MB).

22 Sep: Fixed rare bug that could occur when processing variants out of order. --sample-diff command implemented.

12 Sep: --normalize 'list' modifier added.

11 Sep: --rm-dup 'list' modifier added, for listing all duplicated variant IDs. (This can be run as a standalone command.)

9 Sep: Fixed rare race condition in text decompressor that could cause input lines to be skipped. (We believe this was the cause of the VCF-import "File read failure" crashes reported over the last few months.)

8 Sep: Fixed VCF-export bug that could occur when extra ##contig header lines were present. --sort-vars bugfix. --normalize now detects when post-normalization variants are no longer in sorted order, and prints a warning in that case.

7 Sep: --ld bugfix for phased multiallelic variants. --rm-dup flag added (removes duplicate-ID variants, can check for genotype/INFO/etc. equality).

4 Sep: Fixed A1_CASE_FREQ and related columns in --glm output broken by recent multiallelic update. Cleaned up a few column names in --geno-counts and --hardy output.

31 Aug: Fixed --glm bug with handling constant and all-constant-but-1 covariates.

30 Aug: AVX2 and 32-bit --export bgen-1.2/1.3 bugfixes (mainly affects missing genotypes). "--export vcf-4.2" mode added for compatibility with programs (e.g. SNPTEST) which reject VCF 4.3 files. Exported VCFs should now have more appropriate ##contig headers when PAR1 and/or PAR2 are present in the input. Left-normalization (--normalize) flag added.

26 Aug: Last column of --pca .eigenvec header line is no longer omitted.

21 Aug: Fixed --mac/--max-mac 'nref' and 'alt1' mode bugs in yesterday's build.

20 Aug: Fixed "--vcf dosage=GP" bug introduced on 7 May; if you used any build from the last three-and-a-half months to import VCF FORMAT/GP data, rerun with a newer build. "--vcf dosage=GP" now errors out with a suitable message when the file also contains a FORMAT/DS field, and a 'dosage=GP-force' option has been added to cover the rare cases where importing the GP field might still be worthwhile. --maf/--max-maf/--mac/--max-mac now let you filter on nonmajor (default), non-reference, alt1, or minor allele frequencies/counts; you can use bcftools notation for this (e.g. "--min-af 0.01:minor"), but keep the different default in mind.

18 Aug: plink2-formatted 1000 Genomes phase 3 files, with phased haplotypes and annotations included, and a few corrections to the official pedigree (determined via KING-robust analysis), can now be downloaded from the Resources page. --king-cutoff can now handle sample ID files containing a header line.

16 Aug: --glm logistic regression now supports multiallelic variants. Fixed --glm linear-regression dosage handling bug in yesterday's build.

15 Aug: --glm linear regression now supports multiallelic variants. --ld bugfix. --parameters + "--glm interaction" now works properly when a covariate is only involved as part of an interaction.

9 Aug: --make-king[-table] singleton/monomorphic-variant optimization implemented.

7 Aug: GRM construction and --missing no longer break with multiallelic data.

6 Aug: VCF multiallelic(-phased) import and export implemented. --hwe now tests each allele separately for multiallelic variants. --min-alleles/--max-alleles filtering flags added.
(--glm doesn't support multiallelic variants yet; that update is planned for next week.)

30 Jul: --vcf-max-dp flag added.

26 Jul: --vcf-half-call should now work properly on unphased data.

25 Jul: Fixed --sort-vars/low-memory-make-pgen dosage-handling bug that could trigger unwanted hardcall thresholding. If you used a build from 14 Apr - 19 Jul 2018 to work with dosage data, the hardcalls may not have been thresholded correctly. Unfiltered dosage datasets imported by an affected build can be corrected by running --make-pgen + explicit --hard-call-threshold. Hardcall-based filters such as --geno/--mind should be rerun (after the hardcalls have been corrected).

19 Jul: --update-alleles implemented.

16 Jul: Added more multithreaded-VCF-parse debug logging code.

13 Jul: Fixed chrX/Y/MT autoremoval bug in --make-king/--make-grm/--pca.

12 Jul: Unbroke --mach-r2-filter.

3 Jul: .fam/.psam files now load properly when only the IID column is requested or present.

29 Jun: .bim/.pvar files with more than ~134 million variants load properly again (given sufficient memory).

25 Jun: Fixed a few odd-sample-count export cases which were broken around 30 May.

22 Jun: Fixed a few log messages which were broken in the 19-20 Jun builds. Added debug-print code to support an ongoing multithread-VCF-dosage-import bug investigation (if you are encountering mysterious "File read failure" errors during VCF import or "Malformed .pgen" errors when reading the result, adding "--threads 1" to your VCF-import command will probably solve your immediate problem, but if you can also send me a .log file from the failing multithreaded run (or even better, test data) that would be very helpful).

20 Jun: Fix GRM/PCA/score-computation bug introduced on 30 May. If you used the 30 May or an early June build for GRM/--pca/--score, you should repeat the operation(s) with this build; apologies for the error.

19 Jun: Fixed rare --ref-allele/--alt1-allele corner case which could occur when a missing allele was replaced with a very long allele.

5 Jun: VCF import uninitialized-variable bugfix. --score 'ignore-dup-ids' modifier added.

30 May: "--export haps[legend]" bugfixes and bgzip support. "--export vcf vcf-dosage=DS" no longer exports undeclared HDS values when phase information is present. Unbreak --import-dosage + --map, for real this time.

21 May: --pgen-info command added (displays basic information about a .pgen file, such as whether it has any phase or dosage data).

17 May: --import-dosage and .gen import were broken for the last several weeks; this should be fixed now. A1 column added to --adjust output in preparation for multiallelic variants. --glm 'a0-ref' modifier renamed to 'omit-ref'.

15 May: Fixed chrX allele frequency computation bug when dosages are present. --ld modified to be based on major instead of reference alleles, to play better with multiallelic variants. --hardy header line and allele columns changed in preparation for multiallelic variant support.

8 May: --vcf dosage=HDS should now handle files with no DS field properly.

7 May: Fixed rare I/O deadlock. Improved VCF-import parallelism.

4 May: Fixed --bgen import/export when the number of dosage precision bits isn't a multiple of 8 (previously misinterpreted the spec for those cases, sorry about that).

3 May: --bgen can now import variant records with up to 28 bits of dosage precision (though only ~15 bits will survive). "--export vcf-dosage=HDS-force" bugfix.

2 May: --vcf dosage= import no longer requires GT field to be present. Fixed potential --vcf dosage=HDS buffer overflow.

28 Apr: Fixed a --glm bug which occurred when autosomes and sex chromosome(s) were both present, or both chrX and chrY were present. If you performed a whole-genome --glm run with the 9 Feb 2018 build or later, you should rerun with the latest build. However, single-chromosome and autosome-only --glm runs were unaffected by the bug.

24 Apr: VCF phased-dosage import ("--vcf dosage=HDS") and export ("--export vcf vcf-dosage=HDS"). --pca and GRM computation now use correct variance for all-haploid genomes.

22 Apr: --export bgen-1.2/bgen-1.3 should now work for chrX/chrY/chrM; also fixed import bugs for those chromosomes.

16 Apr: --ref-from-fa contig line parsing bugfix.

14 Apr: --export bgen-1.2/bgen-1.3 implemented for autosomal diploid data. Operations like --pca which require decent allele frequencies now error out when frequencies are being estimated from fewer than 50 samples, unless you add the --bad-freqs flag. Phased dosage support implemented. Sample missingness rate in exported .sample files is now based on dosages rather than hardcalls. Non-AVX2 phase subsetting bugfix. --vcf + --psam bugfix. --vcf dosage= now ignores the hardcall when a dosage is present; instead, it's regenerated under --hard-call-threshold 0.1 (unless you specified a different threshold). --bgen 'ref-second' modifier renamed to 'ref-last', to generalize properly to multiallelic variants.
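
To make the hardcall regeneration concrete, the Python sketch below shows one way a threshold of 0.1 could be applied to a biallelic diploid dosage on a 0..2 scale. It assumes the hardcall is kept when the dosage is within the threshold of the nearest integer and is set to missing otherwise. The `hardcall_from_dosage` helper is hypothetical, and details such as boundary handling and multiallelic variants should be checked against the --hard-call-threshold documentation.

```python
def hardcall_from_dosage(dosage, threshold=0.1):
    """Return 0, 1, or 2 if the dosage is close enough to an integer, else None."""
    if not 0.0 <= dosage <= 2.0:
        raise ValueError("dosage must be in [0, 2]")
    nearest = round(dosage)
    if abs(dosage - nearest) <= threshold:
        return nearest
    return None  # treated as a missing hardcall


for d in (0.05, 1.0, 1.92, 0.3, 1.5):
    print(d, hardcall_from_dosage(d))
```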

31 Mar: --export haps[legend] should now work properly when --ref-allele/--ref-from-fa/etc. flips some alleles in the same run.

29 Mar: --set-{missing,all}-var-ids non-AVX2 bugfix. --pheno/--covar autonaming bugfix.

28 Mar: --bgen 1-bit phased haplotype import implemented.

26 Mar: --make-bed + --indiv-sort bugfix.

23 Mar: Windows builds should work properly again (the 20-21 Mar Windows builds were badly broken). --glm now supports log-pvalue output (add the 'log10' modifier), and these remain accurate below the double-precision floating point limit of p=5e-324.
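
The floating-point limit mentioned above is easy to see in Python, which uses the same IEEE 754 double-precision format. This short sketch is only an illustration of why log-scale output helps, not a description of how PLINK 2 computes its p-values.

```python
import math

SMALLEST_POSITIVE_DOUBLE = 5e-324  # smallest positive (subnormal) double

# Anything much smaller cannot be represented and underflows to zero.
halved = SMALLEST_POSITIVE_DOUBLE / 2
p_tiny = float("1e-400")

# On a log10 scale, the same quantity is an ordinary finite number.
log10_p_tiny = -400.0

print(halved, p_tiny)
print(math.log10(SMALLEST_POSITIVE_DOUBLE))
print(log10_p_tiny)
```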

21 Mar: 3-column .sample file loading works properly again. Fixed a file-reading race condition.

20 Mar: Fix possible deadlock in recent builds when loading very long lines.

19 Mar: Fix --sample segfault in recent builds. .bgen import/export speed improvement. --oxford-single-chr wasn't extended correctly in the 4 Mar build; this should be fixed now.

11 Mar: Fix --pheno segfault in last week's builds that could occur when the file didn't have a header line.

9 Mar: Fix "File write failure" bug that occurred when a single write operation was larger than 2 GB (this could occur when running --make-bed with more than 128k samples). Reduced --make-bed memory requirement.

7 Mar: Fixed potential file-reading deadlock in recent builds (23 Feb or later).

5 Mar: --glm local-covar= should work properly again.

4 Mar: --oxford-single-chr can now be used on .bgen files. --make-pgen partially-phased data handling bugfix.

26 Feb: --keep/--remove/etc. should work properly now on IID-only files with no header line.

23 Feb: Fixed alpha 2 --vcf + --id-delim bug. Improved parsing speed for compressed VCF and .pvar files.

20 Feb: "--xchr-model 1" should work properly now.

16 Feb 2018 (alpha 2): This makes the following potentially compatibility-breaking changes:

  • FID is now an optional field: if it isn't in the input .psam file, it's omitted from several output files by default (these now have 'maybefid' and 'fid' column sets, where the default set includes 'maybefid'), and treated as always-'0' by any operation which requires FID values (such as --make-bed). When exporting genomic data files, 'maybefid' also treats the column as missing if all remaining values are '0'.
  • Relatedly, when importing sample IDs from a VCF or .bgen file, the default mode is now "--const-fid 0", and no FID column will be written to disk at all. --keep, --remove, and similar commands also now have "--const-fid 0" semantics when an input line contains only one token. You can now act as if IID is the only sample ID component, if that's what makes the most sense for your workflow. Conversely, it is now necessary to explicitly use --id-delim when you want to split the VCF/.bgen sample IDs into multiple components.
  • MT is treated as a haploid chromosome again. In PLINK 1.9 and earlier plink2 builds, MT was treated as diploid-ish to avoid throwing away information about heteroplasmic mutations; as a consequence, the --glm(/--linear/--logistic) genotype column and commands like "--freq counts" used a 0..2 scale. Now that plink2 has proper support for dosages, this kludge is no longer necessary.
  • --glm's 't' column set has been renamed to 'tz', to reflect it being a T-statistic for linear regression but a Wald Z-score for logistic/Firth. The corresponding column in .glm.logistic[.hybrid] and .glm.firth files now has 'Z_STAT' in the header line.

Also, --glm now defaults to regressing on minor instead of ALT allele dosages (this can be overridden with 'a0-ref').

The final alpha 1 build has been tagged in GitHub, and will remain downloadable from here for the next few months.
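
The 'maybefid' rule from the first item above can be sketched in Python as follows. This is a simplified model under stated assumptions, not plink2's implementation: the function name, the tuple representation of samples, and the header strings are all chosen here for illustration.

```python
def id_columns(samples, column_set="maybefid"):
    """Return (header, rows) for a sample-ID listing.

    samples: list of (fid, iid) tuples; fid is None when the input had no
    FID column. Under 'maybefid', FID is dropped when it is absent or every
    value is '0'. Under 'fid', it is always written, with '0' standing in
    for absent values.
    """
    fids = [fid if fid is not None else "0" for fid, _ in samples]
    write_fid = column_set == "fid" or any(f != "0" for f in fids)
    if write_fid:
        rows = [[f, iid] for f, (_, iid) in zip(fids, samples)]
        return ["#FID", "IID"], rows
    return ["#IID"], [[iid] for _, iid in samples]

# No FID in the input: only IID is written.
print(id_columns([(None, "s1"), (None, "s2")]))
# At least one informative FID: the column is kept.
print(id_columns([("fam1", "s1"), ("0", "s2")]))
```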

11 Feb: .king.cutoff.in/.king.cutoff.out files now end in .id, for consistency with other output files with sample IDs and no other information. Similarly, --mind's output file now has the extension .mindrem.id and defaults to having a header line. You can now use --no-id-header to suppress the header line (and force the columns to be FID/IID) in all .id output files.

10 Feb: --update-sex 'male0' option added, and custom column selection interface changed (now 'col-num='). --glm 'gcountcc' column names updated (now 'CASE_NON_A1_CT', 'CASE_HET_A1_CT', etc.) in preparation for switch to A1=major allele. --make-just-pvar + --ref-allele/--ref-from-fa no longer treats all initial reference alleles as provisional when the input .pvar has a header line.

9 Feb: Forcing .pvar QUAL/FILTER output when no such values are loaded no longer causes a segfault.

5 Feb: AVX2 phase-subsetting bugfix.

3 Feb: --score 'dominant' and 'recessive' modifiers added.

30 Jan: Fix .pgen writing bug which occurred when the number of variants was a multiple of 64 and the number of samples was large.

24 Jan: "--export oxford" now supports bgzipped output.

21 Jan: --glm now always reports an additional 'A1' column, indicating which allele(s) correspond to positive genotype column values. --glm column sets have been changed to revolve around A1 instead of ALT, so minor script modifications may be necessary when switching to this build.
In this build, A1 and ALT are still synonymous. This will change in alpha 2: A1 will default to the minor allele(s) to reduce multicollinearity (imitating PLINK 1.x's behavior in the absence of --keep-allele-order), though you will still have the option of forcing A1=ALT.

12 Jan: Fixed "--glm interaction" bug that occurred when multiple consecutive variants had no missing calls. We recommend redoing all --glm runs with the 'interaction' modifier which were performed with a build produced between 27 Nov 2017 and 10 Jan 2018 inclusive.

10 Jan: --adjust-file implemented (performs --adjust's multiple-testing correction on any association analysis file).

9 Jan: Added 'no-idheader' modifiers to a few commands, and made that the default for --make-grm-bin/--make-grm-list to avoid breaking interoperability.

7 Jan: --vcf can now be given a sites-only VCF when the run doesn't require genotype data. Sample ID files, such as those produced by --write-samples, now include a header line by default; this will be necessary to distinguish between FID-IID and IID-SID output in the future. (With --write-samples, you can suppress the header line by adding the 'noheader' modifier.)

5 Jan: --pheno-col-nums/--covar-col-nums implemented.

2 Jan 2018: --keep-fcol (equivalent to PLINK 1.x --filter) implemented.

19 Dec 2017: --adjust implemented. --zst-level implemented (lets you control Zstd compression level). Un-broke --rerun.

18 Dec: --extract/--exclude can now be used directly on UCSC interval-BED files (ok for coordinates to be 0-based or for no 4th column to be present). "--output-chr 26" now causes PAR1/PAR2 to be rendered as '25' (for humans), to restore interoperability with programs like ADMIXTURE which can't handle alphabetic chromosome codes. --merge-x implemented (usually needs to be combined with --sort-vars now). --pvar can usually handle 'sites-only' VCF files (e.g. those released by the gnomAD project) now. --thin, --thin-count, --thin-indiv, and --thin-indiv-count implemented.

16 Dec: Multithreaded zstd compression implemented (on Linux and macOS). --make-grm-gz renamed to --make-grm-list, and gzip mode removed.

15 Dec: Fixed --extract-if-info and --exclude-if-info's behavior for non-numeric values which start with a number. Existence-checking flags renamed to --require-info and --require-no-info for naming consistency.

13 Dec: --extract-if-info and --exclude-if-info flags added, for simple filtering on INFO key/value pairs or key existence.

11 Dec: --king-table-subset flag added. This makes it straightforward to perform two-stage relationship/duplicate detection: start with --make-king-table on a small number of higher-MAF variants scattered across the genome, and then rerun it with --king-table-subset on an appropriate subset of candidate sample pairs from the first stage. --bp-space implemented (useful for the first stage above).
The two-stage workflow was first implemented by Wei-Min Chen in a recent version of KING; contact him for citation information.

7 Dec: Fixed bug which could occur when filtering samples from a phased dataset. Windows AVX2 build now available.

28 Nov: --import-dosage 'format=infer' (this is now the default) and 'id-delim=' (needed for reimport of "--export A-transpose" data) options added. Fixed --import-dosage bug that caused it to error out on missing genotypes under format=1. --no-psam-pheno (or --no-pheno/--no-fam-pheno) can now be used to ignore all phenotypes in the sample file, while keeping the phenotype(s) in the --pheno file if one was specified.

27 Nov: Implemented fast path for --glm no-missing-genotype case (mainly affects linear regression). --make-king[-table] can now automatically handle matrices too large to fit in memory without explicit use of --parallel. AVX2 sample filtering performance improvement. --validate bugfix.

19 Nov: Fix VCF FORMAT/GT header line parsing bug introduced in 14 Nov build.

18 Nov: --make-king[-table] performance improvements.

16 Nov: Fixed bug in 14 Nov build that broke ##chrSet header line parsing.

14 Nov: Fixed bug that caused --export {A,AD} to hang when the number of variants was between 65 and about a thousand.

4 Nov: Linux and macOS prebuilt AVX2 binaries now available; these should work well on most machines built within the last 4 years. Fixed another Firth regression spurious NA bug. Fixed --score bug that occurred when sample filter(s) were applied simultaneously. Fixed a --ld phased-hardcall handling bug. Array-popcount upgrade in progress (thanks to recent work by Wojciech Muła, Nathan Kurz, Daniel Lemire, and Kim Walisch).

3 Nov: Fixed multipass --export {A,AD} bug. --dummy dosage-freq= now fills in hardcalls with the default --hard-call-threshold cutoff of 0.1 when --hard-call-threshold is not explicitly specified.

2 Nov: --export {A,AD} implemented (with dosage support). --dummy dosage-freq= modifier now works properly for dosage frequencies above 0.75.

16 Oct: --ref-from-fa flag implemented, to set reference alleles from a FASTA file. (Note that this may be unable to determine which allele is reference when length changes are involved, but it should always work for SNPs and multi-nucleotide polymorphisms.) --update-name implemented. Fixed column-set parsing bug in 13 Oct build.

13 Oct: Fixed --glm logistic/Firth regression bug which could produce spurious NA results.

9 Oct: Fixed --ld's handling of some dosage and haploid cases. Fixed bug which could cause --make-pgen to discard phase/dosage information when extracting a small variant subset. --geno-counts no longer double-reports chrY counts.

8 Oct: --ld implemented, with support for phased genotypes and dosages (try "--ld <var1> <var2> dosage"). Fixed tiny bgen-1.1 import bug that triggered when the number of threads exceeded the number of variants. Allele frequency computation no longer crashes on chrX when dosages are present but only hardcalls are needed.

1 Oct: Fixed GRM computation bug which sometimes caused segfaults when both dosages and missing values were present. --glm is now a bit faster when many covariates are present.

20 Sep: Firth regression Hessian matrix inversion step raised to double-precision, after last week's builds revealed that single-precision inversion could be unreliable.

15 Sep: --vif/--max-corr per-variant checks are now working. These are no longer skipped during logistic regression.

8 Sep: Alternative VCF INFO/PR fields are now tolerated. Removed debug code that slowed down yesterday's --make-pgen.

7 Sep: --score uninitialized memory bugfix. Partially-phased data handling bugfix.

6 Sep: Fix macOS stack size issue (could cause --pca and some other commands to crash in recent builds; 1 Sep build had an incomplete workaround).

4 Sep: --[covar-]variance-standardize missing value handling bugfix. --ref-allele/--alt1-allele implemented (--a2-allele and --a1-allele are treated as aliases).

1 Sep: --{pheno,covar}-quantile-normalize missing-phenotype handling bugfix.

29 Aug: --glm 'gcountcc' column set option added (reports genotype hardcall counts, stratified by case/control status). --write-samples command added (analogous to --write-snplist).

2 Aug: --sort-vars implemented.

25 Jul: --loop-cats now works properly with genotype-based variant filters.

24 Jul: Fixed "--pca approx" allele frequency handling bug introduced in 4 Jun build; we recommend redoing any "--pca approx" runs performed with an affected build. (Regular --pca was not affected.) --loop-cats implemented (similar to PLINK 1.x --loop-assoc, except it's not restricted to association tests). VCF export now supports 'vcf-dosage=DS-force' mode. --dummy multithread + dosage bugfix.

17 Jul: BGEN v1.2/1.3 importer memory allocation bugfix. Size of failed allocation is now logged on most out-of-memory errors.

2 Jul: Improved multithreading in BGEN v1.2/1.3 importer. Python writer can now be called with multiple variants at a time.

25 Jun: Basic BGEN v1.2/1.3 import (unphased biallelic dosages; suffices for main UK Biobank data release). --warning-errcode flag added (causes an error code to be returned to the OS on exit when at least one warning is printed).

20 Jun: --condition-list + variant filter bugfix.

5 Jun: --make-pgen memory requirement greatly reduced. End time now printed to console in most situations.

4 Jun: --hwe no longer causes a segfault when chrX is present and no gender information is available. Fixed --dummy bug.

29 May: --import-dosage format=1 bugfix.

26 May: --glm 'standard-beta' modifier replaced with --variance-standardize flag. --quantile-normalize function added. Fixed a missing-sex allele counting bug.

25 May: --hardy/--hwe works properly again when chrX is present but not at the beginning of the dataset.

22 May: Fixed major dosage data + sample-filter bug; we recommend rerunning any operations involving both dosage data and sample filtering performed with earlier plink2 builds. --score 'list-variants' modifier added.

19 May: Fixed a bug with allele frequency computation on dosage data when sample filter(s) are applied.

18 May: Many categorical phenotype-handling flags (--within, --keep-cats, --split-cat-pheno, ...) implemented. Basic phenotype-based filtering implemented (e.g. "--remove-if PHENO1 '>' 2.5"; note that unnamed phenotypes are assigned the names 'PHENO1', 'PHENO2', etc., and that the '<' and '>' characters must be quoted in most shells). --write-covar implemented. --mach-r2-filter implemented, and raw MaCH r2 values can be dumped with "--freq cols=+machr2".

11 May: --condition[-list] + --covar bugfix.

8 May: Fix quantitative phenotype/covariate loading bug introduced in 6 May build.

7 May: --import-dosage implemented.

6 May: Fixed bug which caused '0' to be treated as control instead of missing for binary phenotypes. Minor change to --glm's column headers, in preparation for multiallelic data.

2 May: --score bugfix. --maj-ref bugfix. --vcf-min-dp and "--export A-transpose" implemented.

1 May: VCF dosage import/export, --vcf-min-gq, and --read-freq implemented. --score can now work with standard errors. --autosome[-par] now works properly. SNPHWE2 and SNPHWEX functions relicensed as GPL-2+, to enable inclusion in the HardyWeinberg R package.

20 April: .sample export bugfix (didn't work if file was over 256 KB and no phenotypes were present). --dummy implemented (can now generate dosages).

19 April: --hardy/--hwe chrX bugfix (thanks to Jan Graffelman for catching the problem and validating the fix). --new-id-max-allele-len now has three modes ('error', 'missing', and 'truncate'), and the default mode is now 'error' (i.e. --set-missing-var-ids and --set-all-var-ids now error out when an allele code longer than 23 characters is encountered, instead of silently truncating). --score implemented, and extended to support variance-normalization and multiple score columns (these two features provide a simple way to project new samples onto previously computed principal components).

11 April: --pca var-wts bugfix, and --pca eigenvalue ordering bugfix. --glm linear regression and --condition[-list] support added. --geno/--mind/--missing/--genotyping-rate can now refer to missing dosages instead of just missing hardcalls (note that, when importing dosage data, dosages in (0.1, 0.9) and (1.1, 1.9) are saved but there usually won't be associated hardcalls).
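
As a rough illustration of the note about dosages without hardcalls, the Python sketch below derives a hardcall from a diploid dosage using a 0.1 distance cutoff. The function name is hypothetical and the logic is a simplified model of the behavior described above, not plink2's code; values exactly at the cutoff are subject to floating-point rounding and are not treated specially here.

```python
def hardcall_from_dosage(dosage, threshold=0.1):
    """Return 0, 1, or 2 if the dosage lies within `threshold` of that
    integer; otherwise return None (missing hardcall)."""
    nearest = round(dosage)
    if nearest in (0, 1, 2) and abs(dosage - nearest) <= threshold:
        return int(nearest)
    return None

for d in (0.03, 0.5, 0.96, 1.4, 1.98):
    print(d, hardcall_from_dosage(d))
```

Under this model, 0.5 and 1.4 keep their dosages but have no hardcall, so a hardcall-based missingness count and a dosage-based one can differ for the same data.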

20 March 2017: Initial public release.


What's new?

  • Preservation of reference alleles (without requiring constant use of --keep-allele-order), phase information, and the VCF QUAL, FILTER, and INFO fields. Use --make-pgen instead of --make-bed when importing a VCF; the fileset can then be referenced with --pfile.
  • The new .pgen file format incorporates SNPack-style genotype compression, frequently reducing file sizes by 80+% with negligible computational cost. Note that this captures some major patterns that are missed by the usual general-purpose compression algorithms: our 1000 Genomes phase 3 downloads are 70+% smaller than the gzipped originals (and remain 45+% smaller after .pgen un-archiving), without throwing away any relevant information.
  • To allow users to take advantage of genotype compression without sacrificing compatibility with scripts expecting old-style .bim and .fam text files, PLINK 2.0 also supports a hybrid .pgen + .bim + .fam usage mode (--make-bpgen/--bpfile). We've also provided a Python library for reading and writing .pgen files, and an R library for reading them, to simplify migration to the new format. (PLINK 1 .bed files are valid .pgen files, so code written on top of the library is backward-compatible.)
  • Firth regression. Standard logistic regression fails to converge, yielding 'NA' or nonsense results, when the 2x2 allele/phenotype contingency table has an empty cell ("quasi-complete separation"); this is common, and especially likely to happen with the strongest associations. Firth regression can prevent you from missing these associations. --glm's default 'firth-fallback' mode (only use Firth regression when there's either an empty contingency table cell or regular-logistic-regression convergence failure) gets you most of the benefit for a fraction of the computational cost.
  • Relatedly, --glm now provides a reason for each 'NA' result.
  • --glm linear regression is often hundreds of times as fast as PLINK 1.9 --linear. When multiple phenotypes with the same "missingness pattern" are provided, the speedup can exceed 1000x, especially when imputation has been used to replace missing genotypes with dosages. (Note that mean-imputation of missing genotypes is deliberately not supported by --glm and many other plink2 functions: when filling in missing values can be justified at all, the dosage should come from your variant caller or modern imputation software.)
  • "--pca approx" (equivalent to EIGENSOFT 6+ fastmode with default parameters). If you have more than ten thousand samples, only need the top principal components, and can tolerate ~1% error in the last PC, this can save you a ton of compute time.
  • The 64-bit Linux build can handle linear algebra on matrices with more than 2^31 elements (so regular --pca is no longer limited to ~46000 samples), as long as your system has enough memory.
  • KING-robust kinship coefficients (--make-king, --make-king-table, --king-cutoff). These remain accurate when good population allele frequency estimates are unavailable. KING-robust still has limitations, but we have found --king-cutoff to be much more reliable than the PLINK 1.9 --rel-cutoff flag for general-purpose removal of close relations.
  • Proper support for dosages (decimal allele-count expected values). When .gen/.bgen files are imported, hardcalls and dosages are saved to the .pgen. Operations which naturally extend to decimals (e.g. --pca, --glm, --freq, --maf/--mac) use the dosage information when it's present, while methods that can only make use of hardcalls (e.g. KING-robust, Hardy-Weinberg exact test) simply ignore the dosages. --hard-call-threshold can now be used to change the saved hardcalls without changing the dosages.
  • Much more multithreaded code.
  • Most commands let you control which columns appear in the main output file(s).
  • Broad support for both gzipped and Zstd-compressed text input files.
  • Graffelman and Weir's extended chrX Hardy-Weinberg exact test, which takes male allele frequencies into account. We've found that this tends to identify quite a few obviously-miscalled chrX variants which were not caught by the usual QC filters.
  • Oxford-style haplotype filesets can now be imported and exported (--haps, "--export haps[legend]").
  • Sample-major PLINK binary files can now be efficiently exported ("--export ind-major-bed"). This is ~1000x as fast as the previous implementation (PLINK 1.07 --make-bed + --ind-major).
  • The relationship matrix (GRM) computation (as well as "--pca approx") now handles multiallelic variants properly, instead of just collapsing all minor alleles together.
  • --score allows each allele in a multiallelic variant to be assigned its own score.
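
The ~46000-sample figure in the matrix-size item above follows from simple arithmetic: a square sample-by-sample matrix with at most 2^31 elements can have at most floor(sqrt(2^31)) rows. The short Python check below works this out, and also estimates memory for a larger matrix under the assumption (made here, not stated above) of 8-byte double-precision entries.

```python
import math

max_elements = 2 ** 31
max_dim = math.isqrt(max_elements)  # largest n with n * n <= 2**31
print(max_dim)  # 46340

# Assumed storage: one 8-byte double per matrix entry.
n_samples = 100_000
bytes_needed = n_samples * n_samples * 8
print(bytes_needed / 2 ** 30)  # size in GiB, about 74.5
```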

Coming next

  1. Fully-powered merge. (Once this is ready, a stable version of the .pgen specification will be provided, and PLINK 2.0 beta testing will begin.)
  2. Variant split/join.
  3. Multiallelic dosage support.
