Introduction, downloads

S: 11 Dec 2023 (b7.2)

D: 11 Dec 2023

Data management

Generate binary fileset

--make-bed

--make-bed creates a new PLINK 1 binary fileset, after applying sample/variant filters and other operations below. For example,

plink --file text_fileset --maf 0.05 --make-bed --out binary_fileset

does the following:

  1. Autogenerate binary_fileset-temporary.bed + .bim + .fam. (The MAF filter has not yet been applied at this stage. See the Order of operations page for more details.)
  2. Read binary_fileset-temporary.bed + .bim + .fam. Calculate MAFs. Remove all variants with MAF < 0.05 from the current analysis.
  3. Generate binary_fileset.bed + .bim + .fam. Any samples/variants removed from the current analysis are also not present in this fileset. (This is the --make-bed step.)
  4. Delete binary_fileset-temporary.bed + .bim + .fam.

In contrast, the fileset left behind by --keep-autoconv is just the result of step 1.

--make-just-bim
--make-just-fam

--make-just-bim is a variant of --make-bed which only generates a .bim file, and --make-just-fam plays the same role for .fam files. Unlike most other PLINK commands, these do not require the main input to include a .bed file (though you won't have access to many filtering flags when using these in no-.bed mode).

Use these cautiously. It is very easy to desynchronize your binary genotype data and your .bim/.fam indexes if you use these commands improperly. If you have any doubt, stick with --make-bed.

Generate text fileset

--recode [{01 | 12}] [{23 | A | A-transpose | AD | beagle | beagle-nomap | bimbam | bimbam-1chr | compound-genotypes | fastphase | fastphase-1chr | HV | HV-1chr | lgen | lgen-ref | list | oxford | rlist | structure | transpose | vcf | vcf-fid | vcf-iid}] [{tab | tabx | spacex | bgz | gen-gz}] ['include-alt'] ['omit-nonmale-y']

--recode-allele <filename>

--recode creates a new text fileset, after applying sample/variant filters and other operations. By default, the fileset includes a .ped and a .map file, readable with --file.

  • The '12' modifier causes A1 (usually minor) alleles to be coded as '1' and A2 alleles to be coded as '2', while '01' maps A1→0 and A2→1. (PLINK forces you to combine '01' with --[output-]missing-genotype when this is necessary to prevent missing genotypes from becoming indistinguishable from A1 calls.)
  • The '23' modifier causes a 23andMe-formatted file to be generated. This can only be used on a single sample's data (a one-line --keep file may come in handy here). There is currently no special handling of the XY pseudo-autosomal region.
  • The 'AD' modifier causes an additive (0/1/2) + dominant (het = 1, otherwise 0) component file, suitable for loading from R, to be generated. 'A' is the same, except without the dominance component.
    • By default, A1 alleles are counted; this can be customized with --recode-allele. --recode-allele's input file should have variant IDs in the first column and allele IDs in the second.
    • By default, the header line for .raw files only names the counted alleles. To include the alternate allele codes as well, add the 'include-alt' modifier.
    • Haploid additive components are 0/2-valued instead of 0/1-valued, to maintain a consistent scale on the X chromosome.
    See also --R.
  • The 'A-transpose' modifier causes a variant-major additive component file to be generated. This can also be used with --recode-allele.
  • The 'beagle' modifier causes unphased per-autosome .dat and .map files, readable by BEAGLE 3.3 and earlier, to be generated, while 'beagle-nomap' generates a single .dat file (no chromosome splitting occurs in this case).
  • The 'bimbam' modifier causes a BIMBAM-formatted fileset to be generated. If your input data only contains one chromosome, you can use 'bimbam-1chr' instead to write a two-column .pos.txt file.
  • If all allele codes are single-character, you can use the 'compound-genotypes' modifier to omit the space between each pair of allele codes in a single genotype call when generating a .ped + .map fileset. You will need to use the --compound-genotypes flag to load this data in PLINK 1.07, but it's not needed for PLINK 1.9.
  • The 'fastphase' modifier causes per-chromosome fastPHASE files to be generated. If your input data only contains one chromosome, you can use 'fastphase-1chr' instead to exclude the chromosome number from the file extension.
  • The 'HV' modifier causes a Haploview-format .ped + .info fileset to be generated per chromosome. 'HV-1chr' is analogous to 'fastphase-1chr'.
  • The 'lgen' modifier causes a long-format fileset, loadable with --lfile, to be generated. 'lgen-ref' is equivalent to PLINK 1.07 --recode-lgen --with-reference.
  • The 'list' modifier causes a genotype-based list to be generated. This does not produce a .fam or .map file.
  • The 'oxford' modifier causes an Oxford-format .gen + .sample fileset to be generated. If you also include the 'gen-gz' modifier, the .gen file is gzipped.
  • The 'rlist' modifier causes a rare-genotype fileset to be generated (similar to --list's output, but with .fam and .map files, and without homozygous major genotypes).
  • With the 'list' and 'rlist' formats, the 'omit-nonmale-y' modifier causes nonmale genotypes to be omitted on the Y chromosome.
  • The 'structure' modifier causes a Structure-format file to be generated.
  • The 'transpose' modifier causes a transposed text fileset, loadable with --tfile, to be generated.
  • The 'vcf', 'vcf-fid', and 'vcf-iid' modifiers result in production of a VCFv4.2 file. 'vcf-fid' and 'vcf-iid' cause family IDs and within-family IDs respectively to be used for the sample IDs in the last header row, while 'vcf' merges both IDs and puts an underscore between them (in this case, a warning will be given if an ID already contains an underscore).
    If the 'bgz' modifier is added, the VCF file is block-gzipped. (Gzipping of other --recode output files is not currently supported.)
    The A2 allele is saved as the reference and normally flagged as not based on a real reference genome ('PR' INFO field value). When it is important for reference alleles to be correct, you'll usually also want to include --a2-allele and --real-ref-alleles in your command.
  • The 'tab' modifier makes the output mostly tab-delimited instead of mostly space-delimited when the format permits both delimiters. 'tabx' and 'spacex' force all tabs and all spaces, respectively. (See this page for guidelines on swapping tabs/spaces in other contexts.)
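The 'A' and 'AD' codings described in the list above can be illustrated with a short Python sketch. This is not PLINK's implementation: the function name and the genotype representation (a tuple of allele codes, with a one-element tuple for a haploid call and None for a missing call) are assumptions made for illustration, as is the choice to report a dominance value of 0 for haploid calls.

```python
def additive_dominant(genotype, counted_allele):
    """Return (additive, dominant) components for one genotype call.

    genotype: tuple of allele codes (length 2 for diploid, length 1 for
    haploid), or None for a missing call. Hypothetical representation.
    """
    if genotype is None:
        return None, None
    count = sum(1 for allele in genotype if allele == counted_allele)
    if len(genotype) == 1:
        # Haploid additive components are 0/2-valued to keep a consistent scale.
        return 2 * count, 0
    # Dominant component: 1 for heterozygotes, 0 otherwise.
    dominant = 1 if count == 1 else 0
    return count, dominant
```

With A1 = 'A' as the counted allele, a heterozygous ('A', 'G') call yields (1, 1) and a homozygous ('A', 'A') call yields (2, 0). The 'A' modifier corresponds to keeping only the first component.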

For example,

plink --bfile binary_fileset --recode --out new_text_fileset

generates new_text_fileset.ped and new_text_fileset.map from the data in binary_fileset.bed + .bim + .fam, while

plink --bfile binary_fileset --recode vcf-iid --out new_vcf

generates new_vcf.vcf from the same data, removing family IDs in the process.

Irregular output coding

--output-chr <MT code>

Normally, autosomal/sex/mitochondrial chromosome codes in PLINK output files are numeric, e.g. '23' for human X. --output-chr lets you specify a different coding scheme by providing the desired human mitochondrial code; supported options are '26' (default), 'M', 'MT', '0M', 'chr26', 'chrM', and 'chrMT'. (PLINK 1.9 correctly interprets all of these encodings in input files.)

--output-missing-genotype <char>
--output-missing-phenotype <string>

--output-missing-genotype allows you to change the character (normally the --missing-genotype value) used to represent missing genotypes in PLINK output files, while --output-missing-phenotype changes the string (normally the --missing-phenotype value) representing missing phenotypes.

Note that these flags do not affect --[b]merge/--merge-list or the autoconverters, since they generate files that may be reloaded during the same run. Add --make-bed if you want to change missing genotype/phenotype coding when performing those operations.

Set blocks of genotype calls to missing

--zero-cluster <filename>

If clusters have been defined, --zero-cluster takes a file with variant IDs in the first column and cluster IDs in the second, and sets all the corresponding genotype calls to missing. See the PLINK 1.07 documentation for an example.

This flag must now be used with --make-bed and no other output commands (since PLINK no longer keeps the entire genotype matrix in memory).

Heterozygous haploid errors

--set-hh-missing

Normally, heterozygous haploid and nonmale Y chromosome genotype calls are logged to plink.hh and treated as missing by all analysis commands, but left undisturbed by --make-bed and --recode (since, once gender and/or chromosome code errors have been fixed, the calls are often valid). If you actually want --make-bed/--recode to erase this information, use --set-hh-missing. (The scope of this flag is a bit wider than for PLINK 1.07, since commands like --list and --recode-rlist which previously did not respect --set-hh-missing have been consolidated under --recode.)

Note that the most common source of heterozygous haploid errors is imported data which doesn't follow PLINK's convention for representing the X chromosome pseudo-autosomal region. This should be addressed with --split-x below, not --set-hh-missing.

--set-mixed-mt-missing

Mitochondrial DNA is subject to heteroplasmy, so PLINK 1.9 permits 'heterozygous' genotypes and treats MT more like a diploid than a haploid chromosome. However, some analytical methods don't use mixed MT genotype calls, and instead assume that no 'heterozygous' MT calls exist. The --set-mixed-mt-missing flag can be used with --make-bed/--recode to export a dataset with mixed MT calls erased.

X chromosome pseudo-autosomal region

--split-x <last bp position of head> <first bp position of tail> ['no-fail']
--split-x <build code> ['no-fail']
--merge-x ['no-fail']

PLINK prefers to represent the X chromosome's pseudo-autosomal region as a separate 'XY' chromosome (numeric code 25 in humans); this removes the need for special handling of male X heterozygous calls. However, this convention has not been widely adopted, and as a consequence, heterozygous haploid 'errors' are commonplace when PLINK 1.07 is used to handle X chromosome data. The new --split-x and --merge-x flags address this problem.

Given a dataset with no preexisting XY region, --split-x takes the base-pair position boundaries of the pseudo-autosomal region, and changes the chromosome codes of all variants in the region to XY. As (typo-resistant) shorthand, you can use one of the following build codes:

  • 'b36'/'hg18': NCBI build 36/UCSC human genome 18, boundaries 2709521 and 154584237
  • 'b37'/'hg19': GRCh37/UCSC human genome 19, boundaries 2699520 and 154931044
  • 'b38'/'hg38': GRCh38/UCSC human genome 38, boundaries 2781479 and 155701383

By default, PLINK errors out if no variants would be affected by the split. This behavior may break data conversion scripts which are intended to work on e.g. VCF files regardless of whether or not they contain pseudo-autosomal region data; use the 'no-fail' modifier to force PLINK to always proceed in this case.

Conversely, in preparation for data export, --merge-x changes chromosome codes of all XY variants back to X (and 'no-fail' has the same effect). Both of these flags must be used with --make-bed and no other output commands.
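As a rough illustration of what --split-x does to chromosome codes, the following Python sketch relabels X variants that fall in the head or tail of the pseudo-autosomal region. The boundary values are the ones listed above; the function name, the (chromosome, ID, position) tuple layout, and the use of the strings 'X' and 'XY' are assumptions for this sketch, not PLINK internals.

```python
# (last bp position of head, first bp position of tail), as listed above.
PAR_BOUNDARIES = {
    "b36": (2709521, 154584237), "hg18": (2709521, 154584237),
    "b37": (2699520, 154931044), "hg19": (2699520, 154931044),
    "b38": (2781479, 155701383), "hg38": (2781479, 155701383),
}


def split_x(variants, build="b37", no_fail=False):
    """variants: list of (chrom, variant_id, bp) tuples (hypothetical layout).

    Returns a new list in which X variants inside the pseudo-autosomal
    region carry the chromosome code 'XY'.
    """
    head_end, tail_start = PAR_BOUNDARIES[build]
    result = []
    changed = 0
    for chrom, variant_id, bp in variants:
        if chrom == "X" and (bp <= head_end or bp >= tail_start):
            chrom = "XY"
            changed += 1
        result.append((chrom, variant_id, bp))
    if changed == 0 and not no_fail:
        # Mirrors the default behavior of erroring out when nothing is split.
        raise ValueError("no variants would be affected by the split")
    return result
```

Passing no_fail=True mimics the 'no-fail' modifier: the input is returned unchanged instead of raising an error.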

Mendel errors

--set-me-missing

In combination with --make-bed, --set-me-missing scans the dataset for Mendel errors and sets implicated genotypes (as defined in the --mendel table) to missing.

  • --mendel-duos causes samples with only one parent in the dataset to be checked, while --mendel-multigen causes grandparental (and great-grandparental, etc.) data to be referenced when a parental genotype is missing.
  • It is no longer necessary to combine this with e.g. "--me 1 1" to prevent the Mendel error scan from being skipped.
  • Results may differ slightly from PLINK 1.07 when overlapping trios are present, since genotypes are no longer set to missing before scanning is complete.

Fill in missing calls

--fill-missing-a2

It can be useful to fill in all missing calls in a dataset, e.g. in preparation for using an algorithm which cannot handle them, or as a 'decompression' step when all variants not included in a fileset can be assumed to be homozygous reference matches and there are no explicit missing calls that still need to be preserved.

For the first scenario, a sophisticated imputation program such as BEAGLE or IMPUTE2 should normally be used, and --fill-missing-a2 would be an information-destroying operation bordering on malpractice. However, sometimes the accuracy of the filled-in calls isn't important for whatever reason, or you're dealing with the second scenario. In those cases you can use the --fill-missing-a2 flag (in combination with --make-bed and no other output commands) to simply replace all missing calls with homozygous A2 calls. When used in combination with --zero-cluster/--set-hh-missing/--set-me-missing, this always acts last.

You may want to combine this with --a2-allele below.

Update variant information

--set-missing-var-ids <template string>
--new-id-max-allele-len <n>
--missing-var-code <missing ID string>

Whole-exome and whole-genome sequencing results frequently contain variants which have not been assigned standard IDs. If you don't want to throw out all of that data, you'll usually want to assign them chromosome-and-position-based IDs.

--set-missing-var-ids provides one way to do this. The parameter taken by this flag is a special template string, with a '@' where the chromosome code should go, and a '#' where the base-pair position belongs. (Exactly one @ and one # must be present.) For example, given a .bim file starting with

chr1 . 0 10583 A G
chr1 . 0 886817 C T
chr1 . 0 886817 CATTTT C
chrMT . 0 64 T C

"--set-missing-var-ids @:#[b37]" would name the first variant 'chr1:10583[b37]', the second variant 'chr1:886817[b37]'... and then error out when naming the third variant, since it would be given the same name as the second variant. (Note that this position overlap is actually present in 1000 Genomes Project phase 1 data.)

To maintain unique IDs in this situation, you can include '$1' and '$2' in your template string as well; these refer to the first and second allele names in ASCII-sort order. So, if we're using a bash shell, we can try again with

--set-missing-var-ids @:#[b37]\$1,\$2

which would name the first variant 'chr1:10583[b37]A,G', the second variant 'chr1:886817[b37]C,T', the third variant 'chr1:886817[b37]C,CATTTT', and the fourth variant 'chrMT:64[b37]C,T'. Note the extra backslashes: they are necessary in bash because '$' is a reserved character there.
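The template expansion described above can be sketched in a few lines of Python. This is only an illustration of the naming rule, not PLINK's code: the function names and the row layout are assumptions, and the duplicate check here only compares newly generated IDs with each other.

```python
def make_variant_id(template, chrom, bp, allele_a, allele_b):
    """Expand '@', '#', '$1' and '$2' in a --set-missing-var-ids style template."""
    if template.count("@") != 1 or template.count("#") != 1:
        raise ValueError("template needs exactly one '@' and one '#'")
    # '$1' and '$2' refer to the allele names in ASCII-sort order.
    first, second = sorted([allele_a, allele_b])
    return (template.replace("@", chrom)
                    .replace("#", str(bp))
                    .replace("$1", first)
                    .replace("$2", second))


def name_variants(template, bim_rows, missing_code="."):
    """bim_rows: (chrom, id, cm, bp, a1, a2) tuples, as in a .bim file."""
    seen = set()
    ids = []
    for chrom, variant_id, _cm, bp, a1, a2 in bim_rows:
        if variant_id == missing_code:
            variant_id = make_variant_id(template, chrom, bp, a1, a2)
            if variant_id in seen:
                raise ValueError(f"duplicate variant ID: {variant_id}")
            seen.add(variant_id)
        ids.append(variant_id)
    return ids


BIM_ROWS = [
    ("chr1", ".", 0, 10583, "A", "G"),
    ("chr1", ".", 0, 886817, "C", "T"),
    ("chr1", ".", 0, 886817, "CATTTT", "C"),
    ("chrMT", ".", 0, 64, "T", "C"),
]
```

With these rows, name_variants("@:#[b37]", BIM_ROWS) raises an error at the third variant, while name_variants("@:#[b37]$1,$2", BIM_ROWS) produces the four names given in the text. (No backslashes are needed here because '$' is not special inside a Python string.)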

You may still get a small number of duplicate ID errors when using '$1' and '$2'. If indels are involved, it is likely that the ambiguity cannot be resolved by PLINK 1 at all, because it matters which allele is the reference allele (see note 1 below). Instead, you must use e.g. PLINK 2 --set-{all,missing}-var-ids or bcftools, which support REF/ALT-based naming templates. We apologize for the inconvenience.

Allele names associated with indels are occasionally very, very long, and the synthetic variant ID names which would be generated from such long alleles are very inconvenient to work with. As a result, when an allele name exceeds 23 characters, it is automatically truncated in the variant ID generated by --set-missing-var-ids. You can use --new-id-max-allele-len to change the limit.

If your pipeline does not use '.' to represent unnamed variants, you can use --missing-var-code to specify a different string to match. For example, "--missing-var-code NA" would be appropriate for a .bim file starting with

chr1 NA 0 10583 A G
chr1 NA 0 886817 C T
chr1 NA 0 886817 CATTTT C
chrMT NA 0 64 T C

1: Technically, if you never forget to use --keep-allele-order between VCF file conversion and variant naming, an A1/A2-allele-based naming template would work. But we cannot justify directly supporting this workflow, since (i) it's too error-prone, and (ii) if you're an advanced user who can get this right, you can just use an awk one-liner on the post-conversion .bim file.

--update-chr <filename> [chr col. number] [variant ID col.] [skip]
--update-cm <filename> [cm col. number] [variant ID col.] [skip]
--update-name <filename> [new ID col. number] [old ID col.] [skip]

--update-map <filename> [bp col. number] [variant ID col.] [skip]
--update-alleles <filename>
--allele1234 ['multichar']
--alleleACGT ['multichar']

(Also see --cm-map, which is an alternative to --update-cm.)

--update-chr, --update-cm, --update-map, and --update-name update variant chromosomes, centimorgan positions, base-pair positions, and IDs, respectively. By default, the new value is read from column 2 and the (old) variant ID from column 1, but you can adjust these positions with the second and third parameters. The optional fourth 'skip' parameter is either a nonnegative integer, in which case it indicates the number of lines to skip at the top of the file, or a single nonnumeric character, which causes each line with that leading character to be skipped. (Note that, if you want to specify '#' as the skip character, you need to surround it with single- or double-quotes in some Unix shells.)
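To make the column and 'skip' parameters concrete, here is a minimal Python sketch of how such an update file could be read. It only mirrors the description above; the function name and the whitespace-splitting behavior are assumptions, not PLINK's parser.

```python
def read_update_file(lines, value_col=2, id_col=1, skip=None):
    """Return {variant ID: new value} from an --update-... style file.

    value_col and id_col are 1-based column numbers. skip is either a
    nonnegative integer (number of leading lines to drop) or a single
    nonnumeric character (lines starting with it are dropped).
    """
    lines = list(lines)
    skip_char = None
    if skip is not None:
        if str(skip).isdigit():
            lines = lines[int(skip):]
        else:
            skip_char = str(skip)
    updates = {}
    for line in lines:
        if skip_char is not None and line.startswith(skip_char):
            continue
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        updates[fields[id_col - 1]] = fields[value_col - 1]
    return updates
```

For example, read_update_file(["#id pos", "rs1 1000"], skip="#") and read_update_file(["id pos", "rs1 1000"], skip=1) both yield {"rs1": "1000"}.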

Strictly speaking, you can use Unix tail, cut, paste, and/or sed to perform the same job (albeit with more time and hassle) as the three optional parameters we have introduced. If you have not used these Unix commands before, we recommend that you familiarize yourself with what they do because they are still likely to come in handy in other scenarios.

You can combine --update-chr, --update-cm, and/or --update-map in the same run. (However, to avoid confusion regarding whether old or new variant IDs apply, we force --update-name to be run separately.)

When invoking --update-chr, you now must use --make-bed in the same run, and no other output commands. Otherwise, we still recommend that you use --make-bed once instead of --update-... over and over, but it's not absolutely required.

--update-alleles updates variant allele codes. Its input should have the following five fields:

  1. Variant ID
  2. One of the old allele codes
  3. The other old allele code
  4. New code for the first named allele
  5. New code for the second named allele

Note that, if you just want to swap A1/A2 allele assignments in the .bim files without changing the real genotype data, you must use --a1-allele/--a2-allele instead.

--allele1234 interprets and/or recodes A/C/G/T alleles in the input as 1/2/3/4, while --alleleACGT does the reverse. With the 'multichar' modifier, these will translate multi-character alleles as well, e.g. "--allele1234 multichar" converts 'TT' to '44'.
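The translation performed by --allele1234 and --alleleACGT amounts to a character mapping, which the following Python sketch illustrates. The names are hypothetical, and leaving multi-character alleles untouched when 'multichar' is absent is an assumption inferred from the description above.

```python
TO_1234 = str.maketrans("ACGT", "1234")
TO_ACGT = str.maketrans("1234", "ACGT")


def recode_allele(allele, table, multichar=False):
    """Translate one allele code with the given table (TO_1234 or TO_ACGT)."""
    if len(allele) > 1 and not multichar:
        # Assumption: multi-character alleles are only translated on request.
        return allele
    return allele.translate(table)
```

For instance, recode_allele("TT", TO_1234, multichar=True) returns '44', matching the example in the text.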

Update sample information

--update-ids <filename>

--update-parents <filename>

--update-sex <filename> [n]

These update sample IDs, parental codes, and sexes, respectively. --update-parents now also updates founder/nonfounder status in the current run when appropriate.

--update-ids expects input with the following four fields:

  1. Old family ID
  2. Old within-family ID
  3. New family ID
  4. New within-family ID

--update-parents expects the following four fields:

  1. Family ID
  2. Within-family ID
  3. New paternal within-family ID
  4. New maternal within-family ID

--update-sex expects a file with FIDs and IIDs in the first two columns, and sex information (1 or M = male, 2 or F = female, 0 = missing) in the (n+2)th column. If no second parameter is provided, n defaults to 1. It is frequently useful to set n=3, since sex defaults to the 5th column in .ped and .fam files.
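The (n+2)th-column convention can be easy to misread, so here is a small Python sketch of it. The function name and the returned dictionary layout are assumptions for illustration; only the codes listed above (1/M, 2/F, 0) are handled.

```python
SEX_CODES = {"1": 1, "M": 1, "2": 2, "F": 2, "0": 0}


def read_update_sex(lines, n=1):
    """Return {(FID, IID): sex code} from an --update-sex style file."""
    updates = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        fid, iid = fields[0], fields[1]
        # The (n+2)th column has 0-based index n + 1.
        updates[(fid, iid)] = SEX_CODES[fields[n + 1]]
    return updates
```

With n=3, a .fam line such as "FAM1 IND1 0 0 2 -9" is read as female, since the sex code sits in the 5th column.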

--update-ids cannot be used in the same run as --update-parents or --update-sex.

Flip DNA strand for SNPs

--flip <SNP ID list>
--flip-subset <sample ID list>

Given a file containing a list of SNPs with A/C/G/T alleles, --flip swaps A↔T and C↔G. A warning will be given if any alleles are not named A, C, G, or T.

To save the results instead of only applying the swap to the current run, combine this with --make-bed/--make-just-bim. If --make-bed is the only other operation in the run, you can also use --flip-subset, which only flips alleles for samples in the given ID list ('FID' family IDs in the first column, and 'IID' within-family IDs in the second column), and fails if any SNPs are not A/T or C/G.
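The strand flip itself is a simple complement swap. The Python sketch below shows the idea on (variant ID, A1, A2) rows; the function name and row layout are hypothetical, and collecting the IDs of variants with non-A/C/G/T alleles stands in for the warning mentioned above.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_alleles(bim_rows, flip_ids):
    """Swap A<->T and C<->G for variants whose ID is in flip_ids.

    Returns (new rows, IDs of flipped variants with non-A/C/G/T alleles).
    """
    flipped = []
    warnings = []
    for variant_id, a1, a2 in bim_rows:
        if variant_id in flip_ids:
            if a1 not in COMPLEMENT or a2 not in COMPLEMENT:
                warnings.append(variant_id)
            # Alleles outside A/C/G/T are left unchanged.
            a1 = COMPLEMENT.get(a1, a1)
            a2 = COMPLEMENT.get(a2, a2)
        flipped.append((variant_id, a1, a2))
    return flipped, warnings
```

Flipping the same list twice restores the original allele codes, which is what makes the "trial flip" approach to merge errors reversible.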

LD-based scan for incorrect strand assignment

--flip-scan ['verbose']
--flip-scan-window <max variant ct + 1>
--flip-scan-window-kb <maximum kb distance>
--flip-scan-threshold <minimum correlation>
  (aliases: --flipscan, --flipscan-window, etc.)

If you are working with case/control data where cases and controls were genotyped separately, --flip-scan can help you identify strand inconsistencies in A/T and C/G SNPs which didn't get caught during data merging. This procedure computes signed correlations between nearby variants, considering cases and controls separately; the idea is, when adjacent variants are highly correlated, a strand flip causes the sign of the correlation to be different in cases and controls. It should be performed before LD-based pruning (which removes much of the signal) if at all possible.

The main report is written to plink.flipscan; see the PLINK 1.07 documentation for a detailed discussion of how the results should be handled.

  • The 'verbose' modifier causes raw correlations to be dumped as well (to plink.flipscan.verbose).
  • By default, only pairs of variants less than 10 variants apart, and no more than 1000 kilobases apart, are considered. These values can be adjusted with --flip-scan-window and --flip-scan-window-kb, respectively.
  • By default, when case-only and control-only correlations both have absolute value smaller than 0.5, the variant pair is ignored. This threshold can be changed with --flip-scan-threshold.
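The core idea of the scan (compare the sign of the correlation in cases and in controls, ignoring weakly correlated pairs) can be sketched as follows. This is a simplified illustration rather than PLINK's algorithm: coding genotypes as 0/1/2 allele counts, using a plain Pearson correlation, and the function names are all assumptions.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def sign_mismatch(case_a, case_b, ctrl_a, ctrl_b, threshold=0.5):
    """True if variants a and b correlate with opposite signs in cases vs. controls."""
    r_case = pearson(case_a, case_b)
    r_ctrl = pearson(ctrl_a, ctrl_b)
    if abs(r_case) < threshold and abs(r_ctrl) < threshold:
        return False  # both correlations are weak, so the pair is ignored
    return r_case * r_ctrl < 0
```

A pair that is positively correlated in cases but negatively correlated in controls is flagged, which is the pattern a strand flip in one of the two groups would produce.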

Force A1/A2 alleles

--keep-allele-order

--real-ref-alleles

--a1-allele <filename> [A1 allele col. number] [variant ID col.] [skip]
  (aliases: --reference-allele, --update-ref-allele)

--a2-allele <filename> [A2 allele col. number] [variant ID col.] [skip]

If a binary fileset was originally loaded, --keep-allele-order forces the original A1/A2 allele encoding to be preserved; otherwise, the major allele is set to A2. (We plan to turn off this behavior and properly separate the "major/minor allele" and "reference allele" concepts in PLINK 2.0, but for our very first release we feel it is best to maximize backward compatibility.) --real-ref-alleles has that effect as well, and also removes 'PR' from the INFO values emitted by "--recode vcf{,-fid,-iid}".

With --a1-allele, all alleles in the provided file are set to A1; --a2-allele does the reverse. If the original .bim file only has a single allele code and the --a1-allele/--a2-allele file names a second allele, a concurrent --make-bed will save both allele codes. If there are already two allele codes loaded and --a1-allele/--a2-allele names a third, a warning with the variant ID will be printed (you will usually want to resolve this with --exclude or --flip).

Column and skip parameters work the same way as with --update-chr and friends.

Note that most PLINK analyses treat the A1 (usually minor) allele as the reference allele, which makes sense when only biallelic variants are involved. However, since it is conventional for VCF files to set the major allele as the reference allele instead, you should generally use

--a2-allele <uncompressed VCF filename> 4 3 '#'

to scrape allele codes from them. ("--a1-allele <uncompressed, biallelic VCF> 5 3 '#'" is occasionally useful as well, for filling in missing alternate allele codes.)

Sort by FID/IID

--indiv-sort <mode name> [filename]

This allows you to specify how samples should be sorted when generating new datasets. The four modes are:

  • 'none'/'0': Stick to the order the samples were loaded in. This is what PLINK 1.07 does, and is the PLINK 1.9 default for all operations except merges.
  • 'natural'/'n': "Natural sort" of family and within-family IDs, similar to the logic used in macOS and Windows file selection dialogs; e.g. 'id2' < 'ID3' < 'id10'. This is the PLINK 1.9 default when merging datasets.
  • 'ascii'/'a': Sort in ASCII order, e.g. 'ID3' < 'id10' < 'id2'. This may be more appropriate than natural sort if you need an ordering that's trivial to regenerate in other software, or if your IDs mix letters and digits in a random and meaningless fashion.
  • 'file'/'f': Use the order in another file (named in the second parameter). The file should be space/tab-delimited, family IDs should be in the first column, and within-family IDs should be in the second column.
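The difference between the 'natural' and 'ascii' modes can be reproduced with a short Python sketch. The key function below is one common way to implement a natural sort (case-insensitive text, digit runs compared as numbers); it is an assumption that this matches PLINK's ordering in every corner case, though it does reproduce the two examples given above.

```python
import re


def natural_key(sample_id):
    """Split an ID into text and digit runs so that 'id2' sorts before 'id10'."""
    parts = re.split(r"([0-9]+)", sample_id)
    # Odd positions hold the digit runs captured by the regular expression.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


ids = ["id10", "ID3", "id2"]
natural_order = sorted(ids, key=natural_key)  # 'natural'/'n' mode
ascii_order = sorted(ids)                     # 'ascii'/'a' mode
```

Here natural_order is ['id2', 'ID3', 'id10'] and ascii_order is ['ID3', 'id10', 'id2']. To sort whole samples, the same key could be applied to the family ID first and the within-family ID second.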

For now, only --[b]merge/--merge-list and explicit --make-bed/--make-just-fam (i.e. not the text-to-binary autoconverters) respect this flag; this may change in the future.

Covariate files

--write-covar

If a covariate file is loaded, --make-bed/--make-just-fam and --recode automatically write an updated version (after application of sample filters) to plink.cov. However, if you do not wish to rewrite the genotype file, --write-covar lets you just write an updated covariate file.

--with-phenotype ['no-parents'] [{no-sex | female-2}]
--dummy-coding [max categories] ['no-round']

--with-phenotype normally adds parental IDs, sex, and the main phenotype to the updated covariate file. You can exclude parents and/or sex with the 'no-parents' and 'no-sex' modifiers, respectively. By default, sex is coded as male = 1, female/unknown = 0; the 'female-2' modifier requests male = 1, female = 2, unknown = 0 coding instead.

--dummy-coding recodes categorical covariates as a collection of n-1 binary dummy variables. By default, categories are identified by 32-bit integer values, with decimals rounded toward zero; to turn off rounding and give each distinct decimal value (up to six significant figures) its own category, use the 'no-round' modifier. Covariates which would have fewer than three or more than 49 categories are normally not recoded; change the upper bound by providing a numeric parameter to --dummy-coding.
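As an illustration of the n-1 dummy-variable scheme, the following Python sketch recodes one covariate column. It is not PLINK's implementation: the function name is hypothetical, and treating the smallest category as the reference level (the one without its own dummy variable) is an assumption.

```python
def dummy_code(values, max_categories=49):
    """Recode a categorical covariate as n-1 binary dummy variables.

    Returns one row of 0/1 values per input value, or None if the
    covariate has fewer than three or more than max_categories categories.
    """
    # int() truncates toward zero, matching the default rounding rule.
    categories = sorted({int(v) for v in values})
    if len(categories) < 3 or len(categories) > max_categories:
        return None
    others = categories[1:]  # assumed: the smallest category is the reference
    return [[1 if int(v) == c else 0 for c in others] for v in values]
```

For example, the values 1, 2, 3 and 2.9 fall into three categories (1, 2, 3), so each sample receives two dummy variables, and 2.9 is treated the same as 2.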

Cluster files

--write-cluster ['omit-unassigned']

If a cluster file is loaded, --write-cluster writes a pruned version (after application of sample filters) to plink.clst. For backwards compatibility, samples not assigned to any cluster will be given a cluster name of 'NA' by default; to instead omit them from the output file, use the 'omit-unassigned' modifier.

Set files

--write-set
--set-table

If sets have been defined, --write-set dumps 'END'-terminated set membership lists to plink.set, while --set-table writes a variant-by-set membership table to plink.set.table.

Merge filesets

--merge <.ped filename> <.map filename>
--merge <text fileset prefix>
--bmerge <.bed filename> <.bim filename> <.fam filename>
--bmerge <binary fileset prefix>

--merge-mode <mode number>

--merge-equal-pos

--merge allows you to merge exactly one text fileset with the reference fileset. --bmerge merges exactly one binary fileset with the reference.

The new fileset plink.bed + .bim + .fam is automatically created in the process. (Corner case exception: if "--recode lgen" is part of the same run, the prefix is plink-merge instead.) Thus, it is no longer necessary to combine --merge with --make-bed if you aren't simultaneously applying some filters to the data.

The order of sample IDs in the new fileset can now be controlled with --indiv-sort. Note that this order will normally be different from PLINK 1.07's merged fileset sample ID order; use "--indiv-sort 0" to replicate the original behavior.

The following modes (set this with --merge-mode) are available for resolving merge conflicts:

  1. (default) Ignore missing calls, otherwise set mismatches to missing.
  2. Only overwrite calls which are missing in the original file.
  3. Only overwrite calls which are nonmissing in the new file.
  4. Never overwrite.
  5. Always overwrite.
  6. (no merge) Report all mismatching calls.
  7. (no merge) Report mismatching nonmissing calls.

The last two modes generate a .diff file describing merge conflicts, instead of actually performing a merge.
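The five merging modes can be summarized as a per-call decision rule. The Python sketch below restates the list above for a single genotype call present in both filesets; the function name and the use of None for a missing call are assumptions, and the two reporting modes (6 and 7) are deliberately left out.

```python
MISSING = None


def merge_call(original, new, mode=1):
    """Resolve one genotype call present in both filesets (modes 1-5)."""
    if mode == 1:
        # Ignore missing calls, otherwise set mismatches to missing.
        if original is MISSING:
            return new
        if new is MISSING:
            return original
        return original if original == new else MISSING
    if mode == 2:
        # Only overwrite calls which are missing in the original file.
        return new if original is MISSING else original
    if mode == 3:
        # Only overwrite calls which are nonmissing in the new file.
        return original if new is MISSING else new
    if mode == 4:
        return original  # never overwrite
    if mode == 5:
        return new       # always overwrite
    raise ValueError("modes 6 and 7 report mismatches instead of merging")
```

For instance, merging an original call of 'AA' with a new call of 'AG' gives a missing call under mode 1, 'AA' under modes 2 and 4, and 'AG' under modes 3 and 5.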

If two variants have the same position, PLINK 1.9's merge commands will always notify you. If you wish to try to merge them, use --merge-equal-pos. (This will fail if any of the same-position variant pairs do not have matching allele names.) Unplaced variants (chromosome code 0) are not considered by --merge-equal-pos.

Note that you are permitted to merge a fileset with itself; doing so with --merge-equal-pos can be worthwhile when working with data containing redundant loci for quality control purposes.

Merge failures

If binary merging fails because at least one variant would have more than two alleles, a list of offending variant(s) will be written to plink.missnp. (For efficiency reasons, this list is no longer generated during a failed text fileset merge; convert to binary and remerge when you need it.) There are several possible causes for this: the variant could be known to be triallelic; there could be a strand flipping issue, or a sequencing error, or a previously unseen variant... manual inspection of some variants in this list is generally advisable. Here are a few pointers.

  • If you are merging files that are missing some allele codes (e.g. single-sample files that were imported from .ped or 23andMe format, or multi-sample files imported from .ped and never filtered on MAF), and they were processed by PLINK 2.0 before merging, you may need to update your PLINK 1.9 build. PLINK 1.9 builds before Jan 2023 did not explicitly recognize the missing allele code emitted by PLINK 2.0; this was usually harmless, but it broke merging.
  • To check for strand errors, you can do a "trial flip". Note the number of merge errors, use --flip with one of the source files and the .missnp file, and retry the merge. If most of the errors disappear, you probably do have strand errors, and you can use --flip on the second .missnp file to 'un-flip' any other errors. For example:

plink --bfile source2 --flip merged.missnp --make-bed --out source2_trial

plink --bfile source1 --bmerge source2_trial --make-bed --out merged_trial

plink --bfile source2_trial --flip merged_trial.missnp --make-bed --out source2_corrected

  • If the first .missnp file did contain strand errors, it probably did not contain all of them. After you're done with the basic merge, use --flip-scan to catch the A/T and C/G SNP flips that slipped through (using --make-pheno to temporarily redefine 'case' and 'control' if necessary):

plink --bfile merged --make-pheno source1.fam '*' --flip-scan

  • If, on the other hand, your "trial flip" results suggest that strand errors are not an issue (i.e. most merge errors remained), and you don't have much time for further inspection, you can use the following sequence of commands to remove all offending variants and remerge:

plink --bfile source1 --exclude merged.missnp --make-bed --out source1_tmp

plink --bfile source2 --exclude merged.missnp --make-bed --out source2_tmp

plink --bfile source1_tmp --bmerge source2_tmp --make-bed --out merged

rm source1_tmp.*

rm source2_tmp.*

  • PLINK cannot properly resolve genuine triallelic variants. We recommend exporting that subset of the data to VCF, using another tool/script to perform the merge in the way you want, and then importing the result. Note that, by default, when more than one alternate allele is present, --vcf keeps the reference allele and the most common alternate. (--[b]merge's inability to support that behavior is by design: the most common alternate allele after the first merge step may not remain so after later steps, so the outcome of multiple merges would depend on the order of execution.)
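
To judge the outcome of the "trial flip" described in the pointers above, it is enough to compare the number of lines in the two .missnp files. The following is a minimal sketch: the file names follow the example commands, the stand-in files exist only so that the snippet runs without PLINK, and the 50% cutoff is an arbitrary choice for illustration.

```shell
# Sketch only: compare the number of offending variants before and after a
# trial flip. The stand-in .missnp files below replace real PLINK output.
workdir=$(mktemp -d)
printf 'rs%s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/merged.missnp"
printf 'rs%s\n' 3 7 > "$workdir/merged_trial.missnp"

before=$(wc -l < "$workdir/merged.missnp")
after=$(wc -l < "$workdir/merged_trial.missnp")
before=$((before)) after=$((after))   # strip whitespace padding from wc

echo "merge errors before flip: $before, after flip: $after"
# The 50% cutoff is an arbitrary illustration, not a PLINK rule.
if [ $((after * 2)) -lt "$before" ]; then
  verdict="likely strand errors"
else
  verdict="strand flips do not explain most errors"
fi
echo "$verdict"
```

With real data, point the two wc commands at your own .missnp files and pick a cutoff that suits your dataset.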

VCF reference merge example
When working with whole-genome sequence data, it is usually more efficient to only track differences from a reference genome, vs. explicitly storing calls at every single variant. Thus, it is useful to be able to manually reconstruct a PLINK fileset containing all the explicit calls given a smaller 'diff-only' fileset and a reference genome in e.g. VCF format.

This is a two-step process:

  1. Convert the relevant portion of the reference genome to PLINK 1 binary format.
  2. Use --merge-mode 5 to use the reference genome call whenever the 'diff-only' fileset does not contain the variant.

For a VCF reference genome, you can start by converting to PLINK 1 binary, while skipping all variants with 2+ alternate alleles:

plink --vcf reference.vcf --biallelic-only strict --out reference

Sometimes, the reference VCF contains duplicate variant IDs. This creates problems down the line, so you should scan for and remove/rename all affected variants. Here's the simplest approach (removing them all):

grep -v '^#' reference.vcf | cut -f 3 | sort | uniq -d > reference.dups

plink --bfile reference --exclude reference.dups --make-bed --out reference

That's it for step 1. You can use --extract/--exclude to perform further pruning of the variant set at this stage.

With this reference fileset in hand, you can then use the following commands to fill out a genome based on the same reference:

mv indiv_diff.fam indiv_diff.orig_fam

cp reference.fam indiv_diff.fam

plink --bfile reference --bmerge indiv_diff --merge-mode 5 --out indiv_full

mv -f indiv_diff.orig_fam indiv_diff.fam

cp indiv_diff.fam indiv_full.fam

(It's also possible to do this with merge + --fill-missing-a2 when you don't need to preserve any explicit missing calls.)

For more discussion and examples of dataset merging, refer to the PLINK 1.07 documentation.

--merge-list <filename>

This allows you to merge more than one fileset to the reference fileset. (Also, this can be used without a reference; in that case, the newly created fileset is then treated as the reference by most other PLINK operations.) The parameter must be the name of a text file specifying one fileset per line.

  • If a line contains only one name, it is assumed to be the prefix for a binary fileset.
  • If a line contains exactly two names, they are assumed to be the full filenames for a text fileset (.ped, then .map). These filesets may not contain multi-character alleles.
  • If a line contains exactly three names, they are assumed to be the full filenames for a binary fileset (.bed, then .bim, then .fam).
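
For illustration, the sketch below writes a merge list containing one line of each accepted shape and then classifies the lines by their number of names. All fileset names (batch2, batch3, batch4) and the file name merge_list.txt are hypothetical, and the awk check is only a convenience for catching malformed lines; it is not something PLINK requires.

```shell
# Sketch only: write a merge list and sanity-check its line shapes.
# All fileset names here are hypothetical.
workdir=$(mktemp -d)
cat > "$workdir/merge_list.txt" <<'EOF'
batch2
batch3.ped batch3.map
batch4.bed batch4.bim batch4.fam
EOF

# Classify each line by its number of whitespace-separated names.
kinds=$(awk '
  NF == 1 { print "binary prefix"; next }
  NF == 2 { print "text fileset (.ped, .map)"; next }
  NF == 3 { print "binary fileset (.bed, .bim, .fam)"; next }
  { print "unexpected line " NR; bad = 1 }
  END { exit bad }
' "$workdir/merge_list.txt")
echo "$kinds"
```

A file like this is what you would pass as the --merge-list parameter.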

On merge failure, if a .missnp file is generated, it now only lists conflicts between the binary filesets. When this is problematic, convert the text filesets to binary and retry the merge.

--merge-list can also be combined with --merge-mode. However, we do not recommend this except when you are certain that no genotype appears in more than one --merge-list fileset; otherwise the final result may depend on the order of the --merge-list entries. (--merge-list + merge mode 1 has been implemented in a manner which avoids this nontransitivity problem.)

Variant filtering

--write-snplist

--list-23-indels

--write-snplist writes IDs of all variants which pass the filters and inclusion thresholds you've specified to plink.snplist, while --list-23-indels writes the subset with 23andMe-style indel calls (D/I allele codes) to plink.indel.

--list-duplicate-vars ['require-same-ref'] ['ids-only'] ['suppress-first']

When multiple variants share the same bp coordinate and allele codes, it is likely that they are not actually distinct, and the duplicates should be merged or removed. (In fact, some tools, such as BEAGLE 4, require such duplicate variants to be merged/removed during preprocessing.) --list-duplicate-vars identifies them, and writes a report to plink.dupvar.

  • This is not based on variant IDs. Use PLINK 2.0's --rm-dup for ID-based deduplication.
  • By default, this ignores A1/A2 allele assignments, since PLINK 1 normally does not preserve them. If you want two variants with identical positions and reversed allele assignments to not be considered duplicates (this matters in some real-world situations), use the 'require-same-ref' modifier, along with --keep-allele-order/--a2-allele.
  • Normally, the report has a header line, and contains positions and allele codes in the first 3-4 columns. However, if you just want an input file for --extract/--exclude, the 'ids-only' modifier removes the header and the position/allele columns, and 'suppress-first' prevents the first variant in each group from being reported (since, if you're removing duplicates, you probably want to keep one member of each group).
  • --list-duplicate-vars fails in 'ids-only' mode if any of the reported variant IDs are not unique. It may be necessary to temporarily change some variant IDs to work around this; the position/allele report should contain the information you need.
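
Before relying on 'ids-only' mode, it can help to check whether any variant ID occurs more than once in the .bim file. The sketch below is one way to do that. The file demo.bim is made up, and the check covers all variant IDs rather than only the ones --list-duplicate-vars would report, so it may flag more than is strictly necessary.

```shell
# Sketch only: list variant IDs that occur more than once in a .bim file
# (column 2). 'demo.bim' is a made-up file for illustration.
workdir=$(mktemp -d)
printf '%s\n' \
  '1 rs10 0 1000 A G' \
  '1 . 0 1500 C T' \
  '2 rs10 0 5000 A G' \
  '2 . 0 7000 G T' \
  '3 rs30 0 9000 C G' > "$workdir/demo.bim"

# LC_ALL=C keeps the sort order independent of the user's locale.
repeated_ids=$(awk '{ print $2 }' "$workdir/demo.bim" | LC_ALL=C sort | uniq -d)
echo "$repeated_ids"
```

Repeated '.' entries are a common cause; --set-missing-var-ids is one way to give such variants distinct names before running the report.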
