Introduction, downloads

S: 11 Dec 2023 (b7.2)

D: 11 Dec 2023

Recent version history

What's new?

Future development

Limitations

Note to testers

[Jump to search box]

General usage

Getting started

Citation instructions

Standard data input

PLINK 1 binary (.bed)

Autoconversion behavior

PLINK text (.ped, .tped...)

VCF (.vcf[.gz], .bcf)

Oxford (.gen[.gz], .bgen)

23andMe text

Generate random

Unusual chromosome IDs

Recombination map

Allele frequencies

Phenotypes

Covariates

Clusters of samples

Variant sets

Binary distance matrix

IBD report (.genome)

Input filtering

Sample ID file

Variant ID file

Positional ranges file

Cluster membership

Set membership

Attribute-based

Chromosomes

SNPs only

Simple variant window

Multiple variant ranges

Sample/variant thinning

Covariates (--filter)

Missing genotypes

Missing phenotypes

Minor allele frequencies

Hardy-Weinberg

Mendel errors

Quality scores

Relationships

Main functions

Data management

--make-bed

--recode

--output-chr

--zero-cluster

--split-x/--merge-x

--set-me-missing

--fill-missing-a2

--set-missing-var-ids

--update-map...

--update-ids...

--flip

--flip-scan

--keep-allele-order...

--indiv-sort

--write-covar...

--[b]merge...

Merge failures

VCF reference merge

--merge-list

--write-snplist

--list-duplicate-vars

Basic statistics

--freq[x]

--missing

--test-mishap

--hardy

--mendel

--het/--ibc

--check-sex/--impute-sex

--fst

Linkage disequilibrium

--indep...

--r/--r2

--show-tags

--blocks

Distance matrices

Identity-by-state/Hamming

  (--distance...)

Relationship/covariance

  (--make-grm-bin...)

--rel-cutoff

Distance-pheno. analysis

  (--ibs-test...)

Identity-by-descent

--genome

--homozyg...

Population stratification

--cluster

--pca

--mds-plot

--neighbour

Association analysis

Basic case/control

  (--assoc, --model)

Stratified case/control

  (--mh, --mh2, --homog)

Quantitative trait

  (--assoc, --gxe)

Regression w/ covariates

  (--linear, --logistic)

--dosage

--lasso

--test-missing

Monte Carlo permutation

Set-based tests

REML additive heritability

Family-based association

--tdt

--dfam

--qfam...

--tucc

Report postprocessing

--annotate

--clump

--gene-report

--meta-analysis

Epistasis

--fast-epistasis

--epistasis

--twolocus

Allelic scoring (--score)

R plugins (--R)

Secondary input

GCTA matrix (.grm.bin...)

Distributed computation

Command-line help

Miscellaneous

Tabs vs. spaces

Flag/parameter reuse

System resource usage

Pseudorandom numbers

Resources

1000 Genomes

Teaching materials

Gene range lists

Functional SNP attributes

Errors and warnings

Output file list

Order of operations

For developers

GitHub repository

Compilation

Core algorithms

Partial sum lookup

Bit population count

Ternary dot product

Vertical population count

Exact statistical tests

Multithreaded gzip

Adding new functionality

Google groups

plink2-users

plink2-dev

Credits

File formats

Quick index search

Association analysis

Basic reports, case/control phenotype

--assoc ['perm' | 'mperm='<value>] ['genedrop'] ['perm-count'] [{fisher | fisher-midp}] ['counts'] ['set-test']
--model ['perm' | 'mperm='<value>] ['genedrop'] ['perm-count'] [{fisher | fisher-midp | trend-only}] ['set-test'] [{dom | rec | gen | trend}]
--cell <contingency table threshold>

Given a case/control phenotype, --assoc writes the results of a 1df chi-square allelic test to plink.assoc (or .assoc.fisher with 'fisher'/'fisher-midp'), while --model performs four other tests as well (1df dominant gene action, 1df recessive gene action, 2df genotypic, and Cochran-Armitage trend) and writes the combined results to plink.model. Refer to the PLINK 1.07 documentation for more details about the statistical tests employed.

  • When the 'fisher' or 'fisher-midp' modifier is present, Fisher's exact test is used to generate p-values. This is recommended, since the additional computation time is now usually minimal. 'fisher-midp' also applies Lancaster's mid-p adjustment.
  • 'perm' causes an adaptive Monte Carlo permutation test to be performed. Results are written to e.g. plink.model.fisher.perm. Similarly, 'mperm=<value>' causes a max(T) permutation test with the specified number of replications to be performed, with results written to e.g. plink.model.fisher.mperm. See below for more discussion of permutation procedures and associated options.
  • 'genedrop' causes offspring genotypes to be regenerated via gene-dropping in the permutation test. (Also, if no permutation test is specified, it causes an adaptive test to be performed.) Note that, by default, no label-swapping occurs in the test; to request limited label-swapping, see below. Not implemented yet.
  • 'perm-count' causes the permutation test report to include permutation counts instead of frequencies.
  • 'counts' causes --assoc to report allele counts instead of frequencies.
  • 'set-test' tests the significance of variant sets; see below for details.
  • With --model, the 'dom', 'rec', 'gen', and 'trend' modifiers force the corresponding test to be used as the basis for --model permutation. If none of these modifiers is present, the most significant result among the allelic, dominant, and recessive tests is used.
  • 'trend-only' causes only the trend test to be performed.

With --model, --cell specifies the minimum number of observations per cell in the 2-by-3 phenotype-by-genotype table below which the genotypic, dominant, and recessive tests are skipped. This threshold defaults to 0 with 'fisher'/'fisher-midp', and 5 without it.

Stratified case/control analyses

--mh ['perm' | 'mperm='<value>] ['perm-count'] ['set-test']
  (alias: --cmh)
--bd ['perm' | 'perm-bd' | 'mperm='<value>] ['perm-count'] ['set-test']

--mh2
--homog

Motivation
(The following example has been shamelessly stolen from the Boston University School of Public Health.)

Suppose you have information on age, BMI, and cardiovascular disease status for 1000 people, and the data can be summarized as follows:

Age < 50 Age 50+
CVDNo CVDTotalCVDNo CVDTotal
Obese1090100Obese36164200
Not obese35465500Not obese25175200
Total45555600Total61339400

The obesity-CVD odds ratio for the pooled dataset is (46 * 640) / (254 * 60) = ~1.932. But this is an overestimate of the true obesity-CVD association: age is associated with both obesity and CVD, so the age-stratified odds ratios are both substantially lower: (10 * 465) / (90 * 35) = ~1.476 and (36 * 175) / (164 * 25) = ~1.537. There appears to be a common odds ratio to extract, but the pooled dataset value clearly isn't a good estimator of it.

A solution
That's one scenario where the 2x2xK Cochran-Mantel-Haenszel statistics may come in handy. Given a case/control phenotype and a set of clusters defined via --within/--family, --mh computes a weighted average of the per-stratum odds ratios for each variant, along with a 1df chi-square statistic and p-value (for the null hypothesis that odds ratios for all strata are equal to 1). --bd combines this procedure with the Breslow-Day test of the hypothesis that all the strata share a single odds ratio. Both of these commands write results to plink.cmh. Permutation testing based on the CMH (default) or Breslow-Day (when --bd's 'perm-bd' modifier is present; this always uses adaptive permutation) statistic is supported, as well as CMH-based variant set testing.

With only 2 or 3 clusters, the exact tests introduced in Mehta CR, Patel NR, Gray R (1985) Computing an Exact Confidence Interval for the Common Odds Ratio in Several 2x2 Contingency Tables and Zelen (1971) The analysis of several 2x2 contingency tables are computationally practical on even the largest datasets, and noticeably more accurate when some table entries are very small; contact us if you want these special cases implemented in PLINK.

The following similar analyses are also available:

  • --mh2 swaps the roles of case/control status and cluster membership, performing a phenotype-stratified IxJxK Cochran-Mantel-Haenszel test on association between cluster assignments and genotypes. Results are written to plink.cmh2.
  • --homog executes an alternative to the Breslow-Day test, based on partitioning of the chi-square statistic. Results are written to plink.homog. (They may be slightly different from PLINK 1.07 since clusters which don't contain at least one case and one control are now thrown out.)

Note that some types of stratification should usually be handled with other methods. For instance, principal component analysis is an effective way to deal with population stratification when family structure is not a concern: see Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies.

Basic reports, quantitative phenotype

--assoc ['perm' | 'mperm='<value>] ['perm-count'] ['qt-means'] ['lin'] ['set-test']
--gxe [covariate index]

Given a quantitative phenotype, --assoc writes regression statistics and Wald test results to plink.qassoc. The 'qt-means' modifier causes an additional .qassoc.means file to be generated, reporting trait means and standard deviations stratified by genotype. 'lin' causes the Lin (2006) statistic to be reported as well; in this case, if multiple test adjustments and/or permutation test results are also requested, they will be based on the Lin statistic.

Given both a quantitative phenotype and a case/control covariate (loaded with --covar) defining two groups, --gxe compares the regression coefficient derived from considering only members of one group to the regression coefficient derived from considering only members of the other, writing a report to plink.qassoc.gxe. By default, the first covariate in the --covar file defines the groups; use "--gxe n" to base them on the nth covariate instead. (--gxe ignores --covar-name/--covar-number.) Covariate values are interpreted as -9/0/nonnumeric = missing, 2 = second group, any other number = first group.

--gxe does not currently support permutation or the miscellaneous options below.

For mixed model analysis of categorical covariate interactions, you should use GCTA's --gxe flag instead.

Regression with multiple covariates

--linear ['perm' | 'mperm='<value>] ['genedrop'] ['perm-count'] ['set-test'] [{genotypic | hethom | dominant | recessive | no-snp}] ['hide-covar'] [{sex | no-x-sex}] ['interaction'] ['beta'] ['standard-beta'] ['intercept']
--logistic ['perm' | 'mperm='<value>] ['genedrop'] ['perm-count'] ['set-test'] [{genotypic | hethom | dominant | recessive | no-snp}] ['hide-covar'] [{sex | no-x-sex}] ['interaction'] ['beta'] ['intercept']

--condition <variant ID> [{dominant | recessive}]
--condition-list <variant ID file> [{dominant | recessive}]

--parameters <number(s)/range(s)...>
--tests <all> [number(s)/range(s)...]
--vif <max VIF>

--xchr-model <mode number>

Given a quantitative phenotype and possibly some covariates (in a --covar file), --linear writes a linear regression report to plink.assoc.linear. Similarly, --logistic performs logistic regression given a case/control phenotype and some covariates. If either flag is used with --all-pheno, the type of regression will automatically adapt based on whether the current phenotype is case/control or not.

  • 'perm' and 'mperm=<value>' normally request an adaptive or max(T) permutation test on the additive effect.
  • 'genedrop', 'perm-count', and 'set-test' have the same effects as they do with --assoc/--model.
  • The 'genotypic' modifier adds an additive effect/dominance deviation 2df joint test (with two genotype-dependent variables in the regression, one with 0/1/2 coding and the second with 0/1/0 coding), while 'hethom' uses 0/0/1 and 0/1/0 coding instead. If permutation is also requested, these modifiers cause permutation to be based on the joint test instead of just the additive effect (unless --tests overrides this).
  • 'dominant' and 'recessive' specify a model assuming full dominance or recessiveness, respectively, for the A1 allele.
  • 'no-snp' requests a regression only on the phenotype and the covariates, without reference to genomic data (except from --condition[-list]). If permutation is also requested (via 'mperm'; adaptive permutation is not supported in this subcase), this modifier causes permutation test results to be reported for every covariate.
  • 'hide-covar' removes covariate-specific lines from the main report.
  • By default, when at least one male and one female is present, sex (male = 1, female = 0) is automatically added as a covariate on X chromosome SNPs, and nowhere else. The 'sex' modifier causes it to be added everywhere, while 'no-x-sex' excludes it.
  • 'interaction' adds genotype x covariate interactions to the model. This cannot be used with permutation unless --tests is also specified.
  • The 'intercept' modifier causes intercepts to be included in the main report.
  • For logistic regressions, the 'beta' modifier causes regression coefficients instead of odds ratios to be reported.
  • With --linear, the 'standard-beta' modifier standardizes the phenotype and all predictors to zero mean and unit variance before regression. (This happens separately for each variant, since different samples can have missing genotypes for different variants.)

By default, --condition adds a single variant's allelic dosage as a covariate, while --condition-list does that for all variants named in the given file. The 'dominant' modifier causes 0/1/1 coding to be used instead of 0/1/2, while 'recessive' calls for 0/0/1 coding.

--parameters takes a list of 1-based indices (or ranges of them; syntax is similar to --chr) referring to the sequence of regression covariates which would normally be included in the model, and removes the unlisted covariates. The sequence is:

  1. Allelic dosage additive effect (or homozygous minor dummy variable)
  2. Dominance deviation, if present
  3. --condition[-list] covariate(s), if present
  4. --covar covariate(s), if present
  5. Genotype x non-sex covariate 'interaction' terms, if present
  6. Sex, if present
  7. Sex-genotype interaction(s), if present

For example, if tmp.cov contains two covariates, the command

plink --bfile mydata --linear genotypic interaction --covar tmp.cov --parameters 1-4, 7

induces the following parameter sequence:

  1. Additive effect ('ADD')
  2. Dominance deviation ('DOMDEV')
  3. First covariate ('COV1' if not named in the file)
  4. Second covariate
  5. Dosage-first covariate interaction ('ADDxCOV1')
  6. Dominance deviation-first covariate interaction ('DOMDEVxCOV1')
  7. Dosage-second covariate interaction ('ADDxCOV2')
  8. Dominance deviation-second covariate interaction ('DOMDEVxCOV2')

so "--parameters 1-4, 7" causes the dosage-first covariate interaction term, and both dominance deviation interaction terms, to be excluded from the model.

We have made two changes from PLINK 1.07 to keep indices consistent across all chromosomes:

  • When the 'genotypic' or 'hethom' modifier is in effect, dominance deviation is no longer removed from the parameter sequence on haploid chromosomes, even though it is omitted from the regression there. (The same goes for dominance deviation interaction terms.)
  • Sex-based covariates have been moved to the end.

--tests performs a (joint) test on the specified term(s). If used with e.g. "--linear genotypic" or "--linear hethom", this replaces the 2df joint test that would normally occur. Also, if permutation was requested, it is based on this test instead of the main effect (to support permutation tests on interaction terms). Syntax is similar to --parameters, except that

  • you can use the 'all' modifier to include all terms in the test, and
  • if --tests is used with --parameters, the --tests indices refer to the term sequence after --parameters has acted on it. For example,

plink --bfile mydata --linear interaction --covar tmp.cov --parameters 1-4, 7 --tests 1, 5

adds an ADD=0, ADDxCOV2=0 joint test, since ADDxCOV2 is the fifth remaining term after --parameters has been processed.

With --linear, no regression is performed (and an 'NA' result is reported) if any two terms have sample correlation exceeding 0.999 or the variance inflation factor for any term is greater than 50. You can change the latter threshold with --vif. (See the PLINK 1.07 documentation for some more discussion of multicollinearity.)

Treatment of the X chromosome can be adjusted with the --xchr-model flag. There are currently four modes:

  • 0. Exclude all sex and haploid chromosomes from the analysis. If the 'genotypic', 'hethom', 'dominant', or 'recessive' modifier was used with --linear/--logistic, this mode is forced.
  • 1. (default) Add sex as a covariate on the X chromosome; don't do anything else differently.
  • 2. Add sex as a covariate on the X chromosome, and code male X chromosome genotypes 0/2 instead of 0/1. If any --condition[-list] variants are on the X-chromosome (and 'dominant'/'recessive' has not been specified with --condition[-list]), this coding change applies to them as well.
  • 3. Add sex as a covariate on the X chromosome, and also add a dosage-sex interaction term to the model there. If a permutation test was requested without the --tests flag, this mode causes the X chromosome to be omitted from the permutation test.
Dosage data

--dosage <allele dosage file> ['noheader'] ['skip0='<i>] ['skip1='<j>] ['skip2='<k>] ['dose1'] ['format='<m>] ['Zout'] [{occur | standard-beta}] ['sex'] ['case-control-freqs']
--dosage <list file> list [{sepheader | noheader}] ['skip0='<i>] ['skip1='<j>] ['skip2='<k>] ['dose1'] ['format='<m>] ['Zout'] [{occur | standard-beta}] ['sex'] ['case-control-freqs']
--write-dosage

--dosage processes one or more (possibly gzipped) text files with variant-major allelic dosage data, and normally performs a linear or logistic regression-based association analysis on it (results written to plink.assoc.dosage). It must be used with the --fam input flag, and can be combined with --map (this adds CHR and BP columns to the association analysis output file, and causes variants not named in the .map or which fail a variant filter to be excluded). Other primary input flags, e.g. --bfile, cannot be present.

This command is poorly integrated with the rest of PLINK 1: genotype-based filters (e.g. --maf, --hwe) and many other flags are not supported. (See the dosage branch of the order of operations for details.) It will be replaced with a proper data import flag in PLINK 2.0.

  • By default, --dosage assumes that only one allelic dosage file (named in the first parameter) should be loaded. To specify multiple files,
    1. Create a master list with one entry per line. There are normally two supported formats for this list: variant batch numbers (all files with the same batch number should involve the same variants, but different samples) in the first column and filenames in the second; or just a filename per line (every file must contain the same variants).
    2. Provide the name of that list as the first --dosage parameter.
    3. Add the 'list' modifier.
  • By default, --dosage assumes the allelic dosage file(s) contain a header line, which has 'SNP' in column i+1, 'A1' in column i+j+2, 'A2' in column i+j+3, and sample FID/IIDs starting from column i+j+k+4. (i/j/k are normally zero, but can be changed with 'skip0', 'skip1', and 'skip2' respectively.) If such a header line is not present,
    • when all samples appear in the same order as they do in the .fam file, you can use the 'noheader' modifier.
    • Otherwise, use the 'sepheader' modifier, and append sample ID filenames to your 'list' file entries. The PLINK 1.07 documentation explains this in detail and provides several worked examples1.
  • The 'format' modifier lets you specify the number of values used to represent each dosage. 'format=1' normally indicates a single 0..2 A1 expected count; 'dose1' modifies this to a 0..1 frequency. 'format=2' (the default) indicates a 0..1 homozygous A1 likelihood followed by a 0..1 het likelihood, while 'format=3' indicates 0..1 hom A1, 0..1 het, 0..1 hom A2.
    Note that, for 'format=3', the third value in each triplet is not actually parsed by PLINK; as a result, '0 0 0' is interpreted as homozygous A2 rather than a missing call. (We suggest using --data or GTOOL/SNPTEST for proper handling of Oxford-format files which use a triple-zero missing call representation and do not otherwise guarantee a triplet sum of 1. Alternatively, you can use a script to transmute triple-zero to e.g. triple-'NA'.)
  • 'Zout' causes the output file to be gzipped.
  • With an association analysis, 'standard-beta' behaves as it does with --linear. 'sex' and --vif are also supported. 'case-control-freqs' causes case and control allele frequencies to be reported separately.
  • There are three alternate modes which cause the association analysis to be skipped.
    • 'occur' requests a simple variant occurrence report (written to plink.occur.dosage).
    • --write-dosage causes a simple merged file matching the 'format' specification (not including 'dose1') to be written to plink.out.dosage.
    • --score applies a linear scoring system to the dosages.

1: We have made one change from PLINK 1.07 --dosage: --keep/--remove now work like they do for regular datasets.

LASSO regression

--lasso <h2 estimate> [min lambda] ['report-zeroes']
--lasso-select-covars [covariate name(s)/range(s)...]

--lasso estimates variant effect sizes via LASSO regression, writing a report to plink.lasso. --covar can be used to add covariates. You must provide a h2 estimate to calibrate the regression.

Genotype, phenotype, and covariate values are automatically standardized to zero mean and unit variance. By default, only variants with nonzero effect size estimates are included in the report; use the 'report-zeroes' modifier to force all variants to be listed.

By default, model selection is not applied to covariates; --lasso-select-covars changes this. With no further parameters, all covariates are subject to model selection; alternatively, you can provide a space/comma-delimited list of covariate names (use 1-based 'COV1', 'COV2'... if your covariates are unnamed) to only subject the named covariates to model selection.

Note that this method may require a very large sample size (e.g. hundreds of thousands for complex human traits) to be effective.

Since it is not a conventional association test, --lasso does not support permutation or other miscellaneous association test options.

Case/control nonrandom missingness test

--test-missing ['perm' | 'mperm='<value>] ['perm-count'] ['midp']

Given a case/control phenotype, --test-missing tries to detect platform/batch differences between case and control genotype data by performing Fisher's exact test on case/control missing call counts at each variant. (Het. haploids are treated as missing. Variants with 0% or 100% missing call frequency are now skipped.) Results are written to plink.missing. Multiple-testing corrected p-values can be obtained from --adjust or permutation tests.

The 'midp' modifier causes Lancaster's mid-p adjustment to be applied to Fisher's exact test. We recommend its use, since the conservative bias (which the mid-p adjustment eliminates) exhibited by the regular Fisher's exact test has no value in this context.

Any variant which comes up as highly significant under this test should be treated with great caution; spurious association results are likely.

Miscellaneous options

--adjust ['gc'] ['log10'] ['qq-plot']

--lambda <value>
--ci <confidence interval size>
--pfilter <threshold>

--adjust causes an .adjusted file to be generated with each association test report, containing several basic multiple testing corrections for the raw p-values. These values tend to be overly conservative when significant LD is present; more useful p-values can be obtained from e.g. appropriate permutation tests. Entries in this file are sorted by significance value instead of genomic location.

For --model and case/control --assoc, '--ci X' causes size-X centered confidence intervals to be reported for odds ratios. (E.g. "--ci 0.95" corresponds to a 95% confidence interval.)

--pfilter causes only associations with p-values no larger than the given threshold to be reported. (Variants with 'NA' p-values are excluded as well; if that's the only effect you want, use "--pfilter 1".) This can dramatically reduce the report's size, and you can extract the remaining variant IDs with a command sequence like the following:

plink --bfile main_data --assoc --pfilter 0.001 --out report1

tail -n +2 report1.qassoc | sed s/^\ *//g | tr -s ' ' ' ' | cut -f 2 -d ' ' > candidates.txt

The second command

  1. removes the top line of the report,
  2. strips leading spaces from each line,
  3. collapses subsequent groups of spaces into single spaces,
  4. and finally, extracts the second column (variant ID) from each line. The resulting file can be used with e.g. --extract.

Monte Carlo permutation procedures

PLINK 1.9 includes an accelerated implementation of PLINK 1.07's Monte Carlo permutation procedures, applying several ideas from PRESTO and PERMORY to the full range of PLINK 1.07's association tests.

Reproduction

Due to the use of pseudorandom numbers, it is necessary to specify the initial random number seed(s) to make a run reproducible; use the --seed flag to accomplish this. --threads and --perm-batch-size are sometimes necessary as well.

Customization

Within-cluster restriction

You can combine any permutation test with --within/--family to force permutation to only occur within each cluster.

Adaptive permutation

--aperm <min perms/variant - 1> [max perms/variant] [pruning alpha threshold] [pruning beta] [initial pruning interval] [interval increase rate]

--aperm sets six parameters governing the behavior of PLINK's adaptive permutation tests.

  • The first two parameters control the minimum and maximum number of permutations that a variant may be included in; without --aperm, these values default to 5 and 1000000, respectively. For backwards compatibility, the actual permutation counts are strictly greater than the 'minimum' (i.e. at least 6 by default), and no larger than the maximum.
  • The next two parameters control the pruning test. If they are alpha and beta in that order, a 100% * (1 - (beta / 2T)) confidence interval is calculated for each empirical p-value, where T is the total number of variants; whenever this confidence interval does not contain alpha, the variant is exempted from further permutation testing. The default values are 0 and 0.0001, respectively.
  • The last two parameters control when pruning occurs: if they are b and a in that order, then the first pruning cycle occurs after max(b, <min perms/variant> + 1) permutations, and the number of permutations between subsequent pruning cycles is ap + b (where p is the number of completed permutations so far). The default values are b=1, a=0.001.

Max(T) permutation

--mperm-save
--mperm-save-all

--mperm-save writes each permutation's best test statistic to plink.mperm.dump.best, while --mperm-save-all produces a .mperm.dump.all file with every single permutation x SNP test statistic. The nature of the test statistic will be mentioned in the log and printed to the console. (It can differ slightly from PLINK 1.07's choice; e.g. when Fisher's exact test is used, PLINK 1.07 uses one minus the p-value while PLINK 1.9 just uses the p-value.)

Label-swapping + gene-dropping

--swap-sibs
--swap-parents
--swap-unrel

(Not implemented yet.)

Normally, no label-swapping occurs under a gene-dropping permutation test. --swap-sibs enables label-swapping for full siblings when parental genotype information is not present (e.g. it was excluded during quality control), while --swap-parents enables label-swapping for pairs of parents, and --swap-unrel enables label-swapping for singletons (i.e. those unrelated to everyone else).

Permutation without association testing

--make-perm-pheno <ct>

--make-perm-pheno simply generates the specified number of phenotype permutations and writes them to plink.pphe, without invoking an association test. Cluster restrictions apply.

Set-based tests

Note that PLINK/SEQ includes several other useful gene-based tests, such as SKAT-O. We currently restrict development to tests not already packaged with PLINK/SEQ.

--set-p <p-value>
--set-r2 [val] ['write']
--set-max <ct>
--set-test-lambda <value>

When an association test is run with the 'set-test' modifier, a set-based test is conducted as follows:

  1. Standard single-variant analysis is performed on all variants contained in at least one set. If this generates a t-statistic, it is converted to a 1df chi-square statistic with the same p-value.
  2. Raw chi-square statistics are divided by the --set-test-lambda parameter, if it's present.
  3. Sets containing no significant variants (p-value ≤ 0.05; adjust this with --set-p) are removed from the permutation test (though not entirely excluded from the report).
  4. Same-remaining-set variant pairs with r20.5 are identified. You can adjust this threshold by passing a number to --set-r2, and/or cause all identified pairs to be listed in plink.ldset with its 'write' modifier.
  5. For each set, a most-significant-variants subset is greedily constructed from the lowest p-value significant variants, avoiding the identified high-r2 pairs. Construction stops when the subset reaches size 5 (adjust this ceiling with --set-max), or no significant variants remain.
  6. Each set is assigned a score equal to the average of the chi-square statistics in its most-significant-variants subset.
  7. Phenotype permutations are generated, and steps 1, 5, and 6 are repeated for each permutation.
  8. Empirical set p-value (EMP1) is defined by how often permutation-based set scores are greater than or equal to the original score.

Results are written to a file with the .set.{perm|mperm} extension. Multiple-testing corrections (for the number of sets containing at least one variant, i.e. nonempty sets removed in step 3 are counted) can be obtained with --adjust. 'mperm' simply fixes the number of set test permutations here; it does not invoke a proper max(T) permutation test.

For some discussion of parameter choices, etc., refer to the PLINK 1.07 documentation.

REML additive heritability estimate (beta)

--unrelated-heritability ['strict'] [tol] [initial covg] [initial covr]

--unrelated-heritability computes Visscher et al.'s restricted maximum likelihood estimate of additive heritability, iterating with an accelerated variant of the EM algorithm until the rate of change of the log likelihood function is less than 'tol' (default 1e-7). Scalar phenotype data is required.

The 'strict' modifier forces regular EM to be used; otherwise, an ad hoc form of momentum is employed to speed up convergence (but it's still slow; we recommend using GCTA 1.1+ or REACTA to execute an alternative to the EM algorithm if speed is important). The genomic covariance prior (covg) defaults to 0.45, and the residual covariance prior defaults to (1 - covg).

You can run this command directly on a previously computed GCTA-format relationship matrix and a phenotype, by combining this with --grm/--grm-bin and --pheno.

For more details about the method, see Vattikuti S, Guo J, Chow CC (2012) Heritability and Genetic Correlations Explained by Common SNPs for Metabolic Syndrome Traits.

We do not feel this implementation has been adequately tested; if you try it and get anomalous results, please contact us and we will try to figure out what is going on (you may help us find an error). Also note that, since this function spends most of its time executing LAPACK calls, rebuilding from source while linking to a machine-tuned BLAS is likely to substantially improve its performance.

Family-based association analysis >>