D: 18 Aug 2024

Miscellany

Flag/parameter reuse

--script <filename>

--script loads the specified text file and applies all the command-line flags and parameters contained within. This is handy if you use the same QC filters across multiple runs and datasets.

--rerun [log file]

--rerun loads the specified PLINK 2.0 log (defaulting to plink2.log) and causes all commands to be rerun. The same parameter(s) will be used for each flag, except when the same flag is included on the current command line with different parameter(s).
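The override rule can be pictured as a dictionary merge. The Python sketch below only illustrates the behavior described above; it is not how PLINK implements --rerun, and the function name merge_rerun_flags and the example parameter values are hypothetical.

```python
def merge_rerun_flags(logged, current):
    """Combine flags recovered from a log with flags on the current command line.

    Both arguments map a flag name to its list of parameters. A flag that
    appears in `current` replaces the logged entry; every other logged flag
    is kept unchanged.
    """
    merged = dict(logged)
    merged.update(current)
    return merged


# Hypothetical example: the logged run used --maf 0.01, the rerun asks for 0.05.
logged = {"--bfile": ["data"], "--maf": ["0.01"], "--out": ["run1"]}
current = {"--maf": ["0.05"]}
print(merge_rerun_flags(logged, current))
```

In this picture, --bfile and --out keep their logged parameters, and only --maf changes.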

Version information

--version

--version causes PLINK to only print its version number before exiting.

Console output suppression

--silent

--silent prevents PLINK from printing regular output to the console. (The usual logging will still occur, and error-output is not suppressed.)

System resource usage

--memory <main workspace size, in MB> ['require']

By default, PLINK tries to reserve half of your system's RAM for its main workspace. If this amount is insufficient for your current job, or if it causes unwanted interference with other running processes (e.g. you're using GNU parallel to run single-threaded instances of PLINK on each chromosome simultaneously), you can use --memory to adjust this behavior.

By default, if PLINK 2's first memory allocation attempt fails, it will retry with a smaller workspace size. If you want it to error out immediately in that case instead, add the 'require' modifier. (Warning: if memory overcommit is enabled on your system, the first memory allocation attempt is very unlikely to fail even when the system doesn't actually have enough memory. 'require' does not protect your job from being OOM-killed halfway through. --randmem might help you crash faster in that scenario, for whatever it's worth...)

When memory is moderately constrained, a reasonable guideline is to reserve 8000 MiB when working with datasets containing up to 50 million variants, and to add another 1000 MiB for every 10 million variants past that.
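As a rough illustration of that guideline, here is a small Python sketch. The function name suggested_memory_mib is hypothetical, and rounding a partial block of 10 million variants up to a full extra 1000 MiB is an assumption; the guideline itself does not say how to round.

```python
def suggested_memory_mib(variant_count):
    """Apply the 8000 MiB + 1000 MiB per extra 10 million variants guideline."""
    base_mib = 8000
    base_variants = 50_000_000
    block_variants = 10_000_000
    if variant_count <= base_variants:
        return base_mib
    excess = variant_count - base_variants
    # Assumption: any started block of 10 million variants counts in full.
    extra_blocks = (excess + block_variants - 1) // block_variants
    return base_mib + 1000 * extra_blocks


print(suggested_memory_mib(80_000_000))  # prints 11000
```

The result is only a starting point for the value you pass to --memory; the right number still depends on which commands you run.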

--threads <max>
  (aliases: --thread-num, --num_threads)

By default, multithreaded PLINK functions employ about as many CPU-intensive threads as your system has available logical cores. (More precisely, PLINK 2 sets the maximum compute-thread count to sysconf(_SC_NPROCESSORS_ONLN).) Occasionally, you'll want to change this number: perhaps sysconf() is reporting an inaccurate number (not uncommon with AMD processors), or some of your cores are already fully occupied with other tasks. This can be done with --threads.
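If you want to check what sysconf() reports on your machine before choosing a --threads value, the following Python sketch queries the same setting through the standard library. The helper names online_logical_cores and choose_thread_count are hypothetical, and subtracting the number of busy cores is just one possible policy.

```python
import os


def online_logical_cores():
    """Return sysconf(_SC_NPROCESSORS_ONLN) where the platform provides it."""
    try:
        return int(os.sysconf("SC_NPROCESSORS_ONLN"))
    except (AttributeError, ValueError, OSError):
        # os.sysconf is unavailable on some platforms (e.g. Windows);
        # fall back to Python's own logical-core count.
        return os.cpu_count() or 1


def choose_thread_count(busy_cores=0):
    """Leave `busy_cores` cores free, but always allow at least one thread."""
    return max(1, online_logical_cores() - busy_cores)


print(online_logical_cores(), choose_thread_count(busy_cores=2))
```

The second number is what you might then pass to --threads.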

--loop-cats <categorical phenotype/covariate name>

Given a categorical phenotype, --loop-cats executes the main body of PLINK 2 (variant filters and everything afterward; see the order of operations for details) once for each category. ".<current category name>" is appended to the output filename prefix.
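The output naming can be sketched in a few lines of Python. This only illustrates the ".<current category name>" suffix rule; the function name loop_cats_prefixes and the category names are hypothetical, and the order in which PLINK visits categories is not specified here.

```python
def loop_cats_prefixes(out_prefix, sample_categories):
    """Return one output prefix per distinct category, in first-seen order."""
    distinct = dict.fromkeys(sample_categories)
    return [f"{out_prefix}.{name}" for name in distinct]


# Hypothetical per-sample category values.
print(loop_cats_prefixes("plink2", ["EUR", "AFR", "EUR", "EAS"]))
```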

Zstandard

Most PLINK 2 commands capable of generating very large text files have a 'zs' or similar modifier for requesting Zstd compression of the main file(s), and practically all PLINK 2 text-file-accepting flags can handle Zstd-compressed (or gzipped) input.

--zst-decompress <.zst file> [output filename]
  (alias: --zd)

However, many systems do not have the zstd command-line program installed. To ensure .zst files are still usable by other programs in this context, --zst-decompress decompresses a .zst file to either standard output (default) or the specified output file. (This cannot be used with any other flags.)

--zst-level <level>

--zst-level lets you set the .zst compression level (default 3). Values from 1 (fastest) to 22 (smallest) are supported.

Name range delimiter

--d <delimiter>

By default, PLINK commands accepting multiple name ranges (e.g. --snps, --pheno-name, --covar-name) expect ranges to be denoted with a single dash, with no space on either side of the dash. E.g. in

--snps rs1111-rs2222, rs3333, rs4444

'rs1111-rs2222' denotes all variants between rs1111 and rs2222 inclusive. --d lets you designate a non-dash character for this purpose, which can be essential if your IDs contain dashes. E.g.

--d : --snps SNP_A-8395068:SNP_A-8303431

tells --snps to act on all variants between SNP_A-8395068 and SNP_A-8303431 inclusive.
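To see why a different delimiter matters, consider the following Python sketch of range-token splitting. It is an illustration only, not PLINK's parser: the function name parse_name_ranges is hypothetical, and rejecting a token that contains the delimiter more than once is an assumption of this sketch.

```python
def parse_name_ranges(tokens, delimiter="-"):
    """Split each token into a (first, last) pair of names.

    A token without the delimiter names a single item and is returned as
    (name, name). A token with more than one delimiter, or with an empty
    side, is treated as ambiguous and rejected.
    """
    ranges = []
    for token in tokens:
        parts = token.split(delimiter)
        if len(parts) == 1:
            ranges.append((token, token))
        elif len(parts) == 2 and all(parts):
            ranges.append((parts[0], parts[1]))
        else:
            raise ValueError(f"ambiguous range token: {token!r}")
    return ranges


print(parse_name_ranges(["rs1111-rs2222", "rs3333", "rs4444"]))
print(parse_name_ranges(["SNP_A-8395068:SNP_A-8303431"], delimiter=":"))
```

With the default dash, a token built from the two SNP_A IDs would contain three dashes and could not be split unambiguously, which is the situation --d is meant to resolve.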

Sample ID matching

--strict-sid0

By default, if there is no SID column in the .psam/.fam (or --update-ids) file, but there is one in another input file (e.g. --keep/--remove), the latter SID column is ignored; sample IDs are considered matching as long as FID and IID are equal (with missing FID treated as '0'). If you also want to require SID = '0' for a sample ID match in this situation, add --strict-sid0.
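The matching rule can be written out as a short Python sketch. This illustrates the behavior described above rather than PLINK's code; the function name sample_ids_match is hypothetical, and representing a missing FID as None is an assumption of the sketch.

```python
def sample_ids_match(main_id, other_id, strict_sid0=False):
    """Compare a (FID, IID) pair with a (FID, IID, SID) triple.

    `main_id` comes from a .psam/.fam file that has no SID column;
    `other_id` comes from another input file that does have one.
    """
    main_fid, main_iid = main_id
    other_fid, other_iid, other_sid = other_id
    # A missing FID (None in this sketch) is treated as '0'.
    if (main_fid or "0") != (other_fid or "0"):
        return False
    if main_iid != other_iid:
        return False
    if strict_sid0:
        # With --strict-sid0, the other file's SID must also be '0'.
        return other_sid == "0"
    # Default: the other file's SID column is ignored.
    return True


print(sample_ids_match((None, "S1"), ("0", "S1", "tumor")))
print(sample_ids_match((None, "S1"), ("0", "S1", "tumor"), strict_sid0=True))
```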

Reproducible pseudorandom number sequences

--seed <integer...>

--seed initializes the pseudorandom number generator with the given seed(s). Each seed must be a 32-bit unsigned integer (i.e. between 0 and 4294967295 inclusive).
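If you generate seeds programmatically, a range check along these lines avoids passing an out-of-range value. The function name validate_seeds is hypothetical; only the 0 to 4294967295 bound comes from the text above.

```python
MAX_SEED = 2**32 - 1  # 4294967295, the largest 32-bit unsigned integer


def validate_seeds(seeds):
    """Return the seeds as a list, or raise ValueError if any is out of range."""
    checked = []
    for seed in seeds:
        is_plain_int = isinstance(seed, int) and not isinstance(seed, bool)
        if not is_plain_int or not 0 <= seed <= MAX_SEED:
            raise ValueError(f"seed must be an integer in [0, {MAX_SEED}]: {seed!r}")
        checked.append(seed)
    return checked


print(validate_seeds([0, 12345, MAX_SEED]))
```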

Note that --threads, "--memory require", and (less frequently) retrieval of an older PLINK build may be necessary to reproduce a run.

Faster but less reproducible linear algebra

--native

By default, when the same plink2 binary is run with the same flags, workspace size, thread count, and random seed, the results should be reproducible across machines with different Intel processors. (This was not necessarily true on Linux before 19 Oct 2020.) To allow Intel MKL to use processor-dependent code paths that can yield slightly different linear algebra results, add the --native flag.

P-value underflow

--output-min-p <threshold>

PLINK 2.0 represents most p-values as log-p-values internally, and has a custom print-function for them, so tiny p-values like 1.23456e-7890 can appear in its output files. However, most other programs cannot usefully read such values: double-precision floating point variables cannot represent numbers between 0 and ~4.94066e-324, and this can create problems for log(p) plots and the like. One workaround is --output-min-p, which prevents PLINK from reporting non-empirical p-values below the given threshold. (Other reported statistics are not affected, so you can e.g. infer the true p-value from the reported Z-statistic.)
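The Python sketch below shows both halves of the problem: a tiny p-value silently becoming zero when parsed as a double, and one way to recover log10(p) from a Z-statistic without ever forming p itself. It assumes the statistic is standard normal under the null and that the p-value is two-sided; the function names are hypothetical, and the large-|Z| branch uses a truncated asymptotic tail expansion rather than anything taken from PLINK.

```python
import math

LN10 = math.log(10.0)


def asymptotic_log10_p(abs_z):
    """Two-sided log10(p) from the normal tail expansion, for large |Z|."""
    # Upper tail: sf(z) ~ pdf(z) / z * (1 - 1/z^2 + 3/z^4); p = 2 * sf(|z|).
    ln_pdf = -0.5 * abs_z * abs_z - 0.5 * math.log(2.0 * math.pi)
    correction = math.log1p(-1.0 / abs_z**2 + 3.0 / abs_z**4)
    return (math.log(2.0) + ln_pdf - math.log(abs_z) + correction) / LN10


def log10_p_from_z(z):
    """Two-sided log10(p) for a standard-normal Z, computed without underflow."""
    abs_z = abs(z)
    if abs_z < 30.0:
        # erfc(|z| / sqrt(2)) equals the two-sided p-value and is still
        # representable as a double in this range.
        return math.log10(math.erfc(abs_z / math.sqrt(2.0)))
    return asymptotic_log10_p(abs_z)


# A double cannot hold this value, so a naive parse yields zero.
print(float("1.23456e-7890"))  # prints 0.0
# The log-scale value is still easy to work with.
print(log10_p_from_z(190.0))
```

Working on the log scale like this is also convenient for plots, since log10(p) stays finite however small p gets.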

Warnings as errors

--warning-errcode

By default, PLINK 2 only returns a nonzero error code to the operating system when an error occurs. --warning-errcode makes this also happen when a warning is printed.

Debugging support

--debug

Normally, PLINK does not force log entries to be written to disk immediately. However, when PLINK crashes unexpectedly (e.g. via segmentation fault), this may cause the log to be incomplete. --debug prevents this from happening.

--validate

PLINK 2's .pgen format is far more complex than the PLINK 1 binary format. This enables efficient implementation of a large amount of new functionality, but there are also many more ways for the main genotype/dosage table to be malformed; and over the last five years of PLINK 2 development, we have explored quite a bit of this new error space. --validate carefully scans a .pgen for malformed records, to help localize or rule out several classes of low-level data conversion bugs.

--randmem

A common type of bug in C programs like PLINK is forgotten memory initialization. Since some operating systems preferentially provide zeroed-out memory to applications, this type of bug might not manifest the first several times a new function is used, and even with a good bug report, the first attempt to replicate the problem may fail. --randmem randomizes all of PLINK 2's workspace memory immediately after it's allocated, to make it easier to replicate and fix this type of bug.
