D: 20 Feb 2019
General usage
Getting started
First, if plink and/or plink2 are not installed on your system, download
and unzip the appropriate binaries (v1.9, v2.0). (Or clone from GitHub and recompile.) As
alpha and beta testing continue, plink2 will become increasingly usable on
its own, but for now it's better to think of it as a supplement to rather
than a replacement for v1.9.
Then you can verify that both programs are functional with the following
pair of commands:
./plink --dummy 2 2 --freq --make-bed --out toy_data
./plink2 --bfile toy_data --freq --out test2
You should see something like:
PLINK v1.90b6.4 64-bit (7 Aug 2018) www.cog-genomics.org/plink/1.9/
(C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to toy_data.log.
Options in effect:
--dummy 2 2
--freq
--make-bed
--out toy_data
16384 MB RAM detected; reserving 8192 MB for main workspace.
Dummy data (2 people, 2 SNPs) written to toy_data-temporary.bed +
toy_data-temporary.bim + toy_data-temporary.fam .
2 variants loaded from .bim file.
2 people (0 males, 2 females) loaded from .fam.
2 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2 founders and 0 nonfounders present.
Calculating allele frequencies... done.
--freq: Allele frequencies (founders only) written to toy_data.frq .
2 variants and 2 people pass filters and QC.
Among remaining phenotypes, 1 is a case and 1 is a control.
--make-bed to toy_data.bed + toy_data.bim + toy_data.fam ... done.
PLINK v2.00a2 AVX2 (21 Aug 2018) www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to test2.log.
Options in effect:
--bfile toy_data
--freq
--out test2
Start time: Tue Aug 21 21:38:28 2018
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
2 samples (2 females, 0 males; 2 founders) loaded from toy_data.fam.
2 variants loaded from toy_data.bim.
1 binary phenotype loaded (1 case, 1 control).
Calculating allele frequencies... done.
--freq: Allele frequencies (founders only) written to test2.afreq .
End time: Tue Aug 21 21:38:28 2018
(Remove the './' prefix if the program was
installed earlier, or if you've added it to the system PATH.) If either
command fails, verify that you downloaded the correct binaries for your
machine, and consult the plink2-users Google group if you're still
stuck.
Okay, what did these commands mean? And what just happened?
PLINK parses each command line as a collection of flags (each of which
starts with two dashes [1]), plus parameters (which immediately
follow a flag, and never start with a dash unless that dash is immediately
followed by a digit) for those flags. The first command included four
flags: --dummy, --freq, --make-bed, and --out.
They specify the following three things, which are part of almost every
PLINK run:
- Input data: '--dummy 2 2' tells
PLINK 1.9 to generate a new random dataset with 2 samples and 2 variants.
You'll see several other ways to specify input data on the next page.
- Operation(s) to perform: --freq tells PLINK to generate an
allele frequency report, and --make-bed tells PLINK to save the data in
PLINK 1 binary format. The full range of supported operations is
summarized under 'Main functions' in the sidebar, and the formats of all
reports are described in the file format appendices (v1.9, v2.0).
- An output file prefix: We'll elaborate on this in a moment.
So this particular combination makes PLINK 1.9 generate a new 2x2 dataset,
write an allele frequency report to toy_data.frq,
and save the dataset to toy_data.bed + .bim + .fam.
Similarly, the second command makes PLINK 2.0 write its own allele
frequency report to test2.afreq.
[1] Actually, that was a lie.
With the exceptions of --1 and --23file, PLINK 1.9 and 2.0 allow you to use
a single dash in front of each flag. In exchange for saving you some
keystrokes, please do yourself a favor and avoid filenames that begin with
a dash.
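The parsing rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated on this page, not PLINK's actual parser; the function names is_flag and group_flags are hypothetical.

```python
DIGITS = "0123456789"


def is_flag(token):
    """Apply the stated rule: a token is a flag if it starts with a dash
    that is not immediately followed by a digit."""
    return len(token) >= 2 and token[0] == "-" and token[1] not in DIGITS


def group_flags(argv):
    """Group a command line into (flag name, [parameters]) pairs.

    Assumption: leading dashes are simply stripped, so '--freq' and
    '-freq' produce the same flag name.
    """
    groups = []
    for token in argv:
        if is_flag(token):
            groups.append((token.lstrip("-"), []))
        elif groups:
            groups[-1][1].append(token)
        else:
            raise ValueError(f"parameter {token!r} appears before any flag")
    return groups


print(group_flags("--dummy 2 2 --freq --make-bed --out toy_data".split()))
# [('dummy', ['2', '2']), ('freq', []), ('make-bed', []), ('out', ['toy_data'])]
```

Under this rule '-1' reads as a parameter (a negative number) rather than a flag, which is presumably why --1 and --23file still need both dashes.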
The allele frequency reports are different?...
You may have noticed that the file extensions of the v1.9 and v2.0 allele
frequency reports aren't the same, and there are several formatting
differences between the two files, though they clearly contain the same
information. This is true for many commands; PLINK 2.0 cannot generally
be used as a drop-in replacement for previous PLINK versions. We
realize this can be a major annoyance, and will continue maintaining v1.9
for a long time to come for those who need full backward compatibility.
However, v2.0's reports are better-standardized (header lines preceded by
'#', tab-delimited, column headers are consistent with VCF, etc.) and more
flexible (lots of optional column sets); hopefully, this'll make your life
easier and be worth some minor transitional headaches.
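To illustrate why the standardized layout is convenient, here is a minimal Python sketch that reads a tab-delimited report whose header line starts with '#'. The function name read_report and the sample table are made up for illustration; check the file format appendix for the actual columns of any given report.

```python
def read_report(text):
    """Parse a tab-delimited report whose header line is preceded by '#'.

    Returns one dict per data row, keyed by column header. If several
    lines start with '#', the last one before the data is used as the
    column header (an assumption of this sketch).
    """
    header = None
    rows = []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("#"):
            header = line.lstrip("#").split("\t")
            continue
        if header is None:
            raise ValueError("data row found before any '#' header line")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


# Hypothetical two-variant table; the column names follow VCF conventions.
sample = "#CHROM\tID\tREF\tALT\n1\tsnp0\tA\tG\n1\tsnp1\tC\tT\n"
for row in read_report(sample):
    print(row["ID"], row["REF"], row["ALT"])
```

All values are returned as strings; converting numeric columns is left to the caller.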
Interpreting our flag usage summaries
The rest of this documentation has many one-line summaries describing the
parameter sets accepted by particular flags, followed by discussions of
flag functionality and the effects of optional parameters. We use the
following conventions in our one-line usage summaries:
- [square brackets] denote a required parameter, where the text
between the brackets describes its nature.
- <angle brackets> denote an optional modifier (or if '|' is
present, a set of mutually exclusive optional modifiers). To invoke one,
you need to use the EXACT text given in our summary, e.g. '--freq counts'
is valid given the summary
--freq <counts> ...
- There's one exception to the angle brackets/exact text rule: when a
modifier name in angle brackets ends with '=[value]', '[value]'
designates a variable parameter. E.g. '--glm perm' and '--glm mperm=10000' are both valid given the
summary
--glm <perm | mperm=[value]> ...
- {curly braces} denote an optional parameter, where the text
between the braces describes its nature.
- An ellipsis (...) indicates that you can enter many parameters
of the specified type.
- Many PLINK 2.0 commands accept a "column set
descriptor". For example, the help text for --make-king-table
is
--make-king-table <zs> <counts> <cols=[column set descriptor]>
Similar to --make-king, except results are reported in the original .kin0
text table format (with minor changes, e.g. row order is more friendly to
incremental addition of samples), and --king-table-filter can be used to
restrict the report to high kinship values.
Supported column sets are:
maybefid: FID1/FID2, if that column was in the input. Requires 'id'.
id: IID1/IID2 (column headers are actually 'ID1'/'ID2' to match KING).
maybesid: SID1/SID2, if that column was in the input. Requires 'id'.
sid: Force SID1/SID2 even when SID was absent in the input.
nsnp: Number of variants considered (autosomal, neither call missing).
hethet: Proportion/count of considered call pairs which are het-het.
ibs0: Proportion/count of considered call pairs which are opposite homs.
ibs1: HET1_HOM2 and HET2_HOM1 proportions/counts.
kinship: KING-robust between-family kinship estimator.
The default is maybefid,id,maybesid,nsnp,hethet,ibs0,kinship.
hethet/ibs0/ibs1 values are proportions unless the 'counts' modifier is
present. If id is omitted, a .kin0.id file is also written.
A valid descriptor is either
- a comma-separated sequence of column set names (e.g.
'cols=maybefid,id,nsnp,hethet,ibs0,ibs1,kinship' would add HET1_HOM2
and HET2_HOM1 columns, while ensuring that SID columns do not appear),
or
- a comma-separated sequence of column set names where every name is
preceded by a plus or minus (in which case the column sets are
added/subtracted from the default, e.g. 'cols=+ibs1,-maybesid' is a
shorter way to add HET1_HOM2/HET2_HOM1 and exclude SID1/SID2).
- Background color summarizes degree of similarity to PLINK 1.9. Green
signals maximal compatibility: there will usually be a minor difference in
output file formats, but all information in the PLINK 1.9 output file will
also be present in the PLINK 2.0 output file when the same flag and
modifiers are used. (Note that green does not guarantee the absence of
additional options.) Yellow signals slightly different functionality and/or
command-line usage, and blue signals that the flag is new to PLINK 2.0.
- If parts of our current implementation are known or strongly suspected
to be incomplete, that is signaled with red text. So red text on a green
background indicates that we plan to provide perfect
compatibility, but we have more coding and/or testing to do before we get
there.
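The two column set descriptor forms described in the list above can be sketched as follows. This is an illustration of the stated rules, not PLINK's implementation: the function name resolve_cols is hypothetical, rejecting a mix of signed and unsigned names is an assumption, and dependencies such as "Requires 'id'" are not checked. The column set names and default are taken from the --make-king-table help text.

```python
SUPPORTED = ["maybefid", "id", "maybesid", "sid", "nsnp",
             "hethet", "ibs0", "ibs1", "kinship"]
DEFAULT = ["maybefid", "id", "maybesid", "nsnp", "hethet", "ibs0", "kinship"]


def resolve_cols(descriptor, supported=SUPPORTED, default=DEFAULT):
    """Return the selected column sets, listed in the order of 'supported'."""
    names = descriptor.split(",")
    if any(not name for name in names):
        raise ValueError("empty column set name")
    signed = [name[0] in "+-" for name in names]
    if any(signed) and not all(signed):
        # Assumption: the +/- form requires a sign on every name.
        raise ValueError("cannot mix signed and unsigned column set names")
    if all(signed):
        selected = set(default)  # start from the default, then add/subtract
        for name in names:
            if name[1:] not in supported:
                raise ValueError(f"unknown column set: {name[1:]}")
            if name[0] == "+":
                selected.add(name[1:])
            else:
                selected.discard(name[1:])
    else:
        for name in names:
            if name not in supported:
                raise ValueError(f"unknown column set: {name}")
        selected = set(names)  # explicit list replaces the default
    return [name for name in supported if name in selected]


print(resolve_cols("+ibs1,-maybesid"))
# ['maybefid', 'id', 'nsnp', 'hethet', 'ibs0', 'ibs1', 'kinship']
```

Passing 'maybefid,id,nsnp,hethet,ibs0,ibs1,kinship' yields the same list, matching the equivalence described above.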
If you're already familiar with PLINK, this should help you skim over
stuff you already know. If there are just one or two flags you need to
look up, you can quickly find what you need in the sidebar; try the search
box if the correct page isn't immediately apparent.
For the newer bioinformaticians out there, here's our first full flag
description.
Setting the output file prefix
--out [prefix]
By default, the output files generated by PLINK 2.0 all have names of the
form 'plink2.xyz', where '.xyz'
is one of these extensions.
This is fine for a single run, but as soon as you run PLINK more than once
in the same directory, later runs will overwrite the results of earlier ones.
Therefore, you usually want to choose a different output file prefix for
each run. --out causes 'plink2' to be replaced with the prefix you
provide. E.g. in the example above, '--out test2'
caused PLINK 2 to create a file named test2.afreq instead of plink2.afreq.
Since the prefix is a required parameter, invoking --out without it will
cause PLINK 2 to quit during command line parsing:
[chrchang:~/plink-ng]$./plink2 --bfile toy_data --freq --out
PLINK v2.00a2 AVX2 (21 Aug 2018) www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3
Error: Missing --out parameter.
For more info, try 'plink2 --help [flag name]' or 'plink2 --help | more'.
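As a small illustration of the naming rule, the following Python sketch builds an output filename from a prefix and lists files that a new run with the same prefix could overwrite. Both helper names are hypothetical, and the sketch assumes the prefix is a bare filename prefix without a directory component.

```python
import glob
from pathlib import Path


def output_name(extension, prefix="plink2"):
    """Name of an output file: the --out prefix (default 'plink2') plus extension."""
    return f"{prefix}.{extension}"


def existing_outputs(prefix, directory="."):
    """List files in 'directory' that already start with 'prefix.'."""
    pattern = glob.escape(prefix) + ".*"
    return sorted(path.name for path in Path(directory).glob(pattern))


print(output_name("afreq"))           # plink2.afreq
print(output_name("afreq", "test2"))  # test2.afreq
```

Calling existing_outputs("test2") before launching a run is one way to notice that a prefix has already been used.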
In the rest of this documentation, we will continue highlighting full
command lines in purple, default parameter values in orange, and sample
parameter values you can freely change in green.
Citation instructions
If you use PLINK 2.0 in any published work, please cite both the software
(as an electronic resource/URL):
Package : PLINK [version]
Authors : Shaun Purcell, Christopher Chang
URL : www.cog-genomics.org/plink/2.0/
and the manuscript(s) describing the methods you used. Our primary
methods paper is:
Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ (2015)
Second-generation PLINK: rising to the challenge of larger and richer
datasets. GigaScience, 4.
PLINK 2.0 includes implementations of many analyses that were developed by
other teams. The original sources are summarized below.
- Methods introduced in PLINK 1.0:
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D, Maller J,
Sklar P, de Bakker P, Daly MJ, Sham PC (2007) PLINK: A Tool Set for
Whole-Genome and Population-Based Linkage Analyses. American Journal of
Human Genetics, 81.
- --hardy/--hwe:
Wigginton JE, Cutler DJ, Abecasis GR (2005) A note on exact tests of
Hardy-Weinberg equilibrium. American Journal of Human Genetics, 76.
Graffelman J, Weir BS (2016) Testing for Hardy-Weinberg equilibrium at
biallelic genetic markers on the X chromosome. Heredity, 116.
Graffelman J, Moreno V (2013) The mid p-value in exact tests for
Hardy-Weinberg equilibrium. Statistical Applications in Genetics and
Molecular Biology, 12. (if mid-p adjustment is applied)
- --ld:
Gaunt T, Rodríguez S, Day I (2007) Cubic exact solutions for the estimation
of pairwise haplotype frequencies: implications for linkage disequilibrium
analyses and a web tool 'CubeX'. BMC Bioinformatics, 8.
- GRM-related functions:
Yang J, Lee SH, Goddard ME, Visscher PM (2011) GCTA: A Tool for Genome-wide
Complex Trait Analysis. American Journal of Human Genetics, 88.
- KING-robust kinship analysis:
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010)
Robust relationship inference in genome-wide association studies.
Bioinformatics, 26.
- --glm logistic regression:
Hill A, Loh PR, Bharadwaj RB, Pons P, Shang J, Guinan E, Lakhani K, Kilty I,
Jelinsky SA (2017) Stepwise Distributed Open Innovation Contests for
Software Development - Acceleration of Genome-Wide Association Analysis.
GigaScience, 6.
- --pca approx:
Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ,
Price AL (2016) Fast Principal-Component Analysis Reveals Convergent
Evolution of ADH1B in Europe and East Asia. American Journal of Human
Genetics, 98.