S: 15 Feb 2019 (b6.8) D: 15 Feb 2019 Main functions (--distance...) (--make-grm-bin...) (--ibs-test...) (--assoc, --model) (--mh, --mh2, --homog) (--assoc, --gxe) (--linear, --logistic)
Quick index search |
## Association analysis## Basic reports, case/control phenotype--assoc <perm |
mperm=[value]> <genedrop>
<perm-count> <fisher | fisher-midp> <counts>
<set-test> Given a case/control phenotype, - When the '
**fisher**' or '**fisher-midp**' modifier is present, Fisher's exact test is used to generate p-values. This is recommended, since the additional computation time is now usually minimal. 'fisher-midp' also applies Lancaster's mid-p adjustment. - '
**perm**' causes an adaptive Monte Carlo permutation test to be performed. Results are written to e.g. plink.model.fisher.perm. Similarly, '**mperm**=[value]' causes a max(T) permutation test with the specified number of replications to be performed, with results written to e.g. plink.model.fisher.mperm. See below for more discussion of permutation procedures and associated options. - '
**genedrop**' causes offspring genotypes to be regenerated via gene-dropping in the permutation test. (Also, if no permutation test is specified, it causes an adaptive test to be performed.) Note that, by default, no label-swapping occurs in the test; to request limited label-swapping, see below. Not implemented yet. - '
**perm-count**' causes the permutation test report to include permutation counts instead of frequencies. - '
**counts**' causes --assoc to report allele counts instead of frequencies. - '
**set-test**' tests the significance of variant sets; see below for details. - With --model, the '
**dom**', '**rec**', '**gen**', and '**trend**' modifiers force the corresponding test to be used as the basis for --model permutation. If none of these modifiers is present, the most significant result among the allelic, dominant, and recessive tests is used. - '
**trend-only**' causes only the trend test to be performed.
With --model, ## Stratified case/control analyses--mh <perm | mperm=[value]>
<perm-count> <set-test> --mh2
Suppose you have information on age, BMI, and cardiovascular disease status for 1000 people, and the data can be summarized as follows:
The obesity-CVD odds ratio for the pooled dataset is (46 * 640) / (254 *
60) = ~1.932. But this is an overestimate of the true obesity-CVD
association: age is associated with both obesity and CVD, so the
age-stratified odds ratios are
With only 2 or 3 clusters, the exact tests introduced in Mehta CR, Patel NR, Gray R (1985) Computing an Exact Confidence Interval for the Common Odds Ratio in Several 2x2 Contingency Tables and Zelen (1971) The analysis of several 2x2 contingency tables are computationally practical on even the largest datasets, and noticeably more accurate when some table entries are very small; contact us if you want these special cases implemented in PLINK. The following similar analyses are also available: **--mh2**swaps the roles of case/control status and cluster membership, performing a phenotype-stratified IxJxK Cochran-Mantel-Haenszel test on association between cluster assignments and genotypes. Results are written to plink.cmh2.**--homog**executes an alternative to the Breslow-Day test, based on partitioning of the chi-square statistic. Results are written to plink.homog. (They may be slightly different from PLINK 1.07 since clusters which don't contain at least one case and one control are now thrown out.)
Note that some types of stratification should usually be handled with other methods. For instance, principal component analysis is an effective way to deal with population stratification when family structure is not a concern: see Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies. ## Basic reports, quantitative phenotype--assoc <perm |
mperm=[value]> <perm-count> <qt-means> <lin>
<set-test> Given a quantitative phenotype, Given both a quantitative phenotype and a case/control covariate (loaded
with --covar) defining two
groups, --gxe does not currently support permutation or the miscellaneous options below. For mixed model analysis of categorical covariate interactions, you should use GCTA's --gxe flag instead. ## Regression with multiple covariates--linear <perm | mperm=[value]> <genedrop> <perm-count>
<set-test> <genotypic | hethom | dominant | recessive | no-snp>
<hide-covar> <sex | no-x-sex> <interaction> <beta>
<standard-beta> <intercept> --condition [variant ID] <dominant | recessive> --parameters [number(s)/range(s)...] --xchr-model [mode number] Given a quantitative phenotype and possibly some covariates (in a --covar file), - '
**perm**' and '**mperm**=[value]' normally request an adaptive or max(T) permutation test on the additive effect. - '
**genedrop**', '**perm-count**', and '**set-test**' have the same effects as they do with --assoc/--model. - The '
**genotypic**' modifier adds an additive effect/dominance deviation 2df joint test (with two genotype-dependent variables in the regression, one with 0/1/2 coding and the second with 0/1/0 coding), while '**hethom**' uses 0/0/1 and 0/1/0 coding instead. If permutation is also requested, these modifiers cause permutation to be based on the joint test instead of just the additive effect (unless --tests overrides this). - '
**dominant**' and '**recessive**' specify a model assuming full dominance or recessiveness, respectively, for the A1 allele. - '
**no-snp**' requests a regression*only*on the phenotype and the covariates, without reference to genomic data (except from --condition{-list}). If permutation is also requested (via 'mperm'; adaptive permutation is not supported in this subcase), this modifier causes permutation test results to be reported for every covariate. - '
**hide-covar**' removes covariate-specific lines from the main report. - By default, when at least one male and one female is present, sex (male
= 1, female = 0) is automatically added as a covariate on X chromosome
SNPs, and nowhere else. The '
**sex**' modifier causes it to be added everywhere, while '**no-x-sex**' excludes it. - '
**interaction**' adds genotype x covariate interactions to the model. This cannot be used with permutation unless --tests is also specified. - The '
**intercept**' modifier causes intercepts to be included in the main report. - For logistic regressions, the '
**beta**' modifier causes regression coefficients instead of odds ratios to be reported. - With --linear, the '
**standard-beta**' modifier standardizes the phenotype and all predictors to zero mean and unit variance before regression. (This happens separately for each variant, since different samples can have missing genotypes for different variants.)
By default,
- Allelic dosage additive effect (or homozygous minor dummy variable)
- Dominance deviation, if present
- --condition{-list} covariate(s), if present
- --covar covariate(s), if present
- Genotype x non-sex covariate 'interaction' terms, if present
- Sex, if present
- Sex-genotype interaction(s), if present
For example, if tmp.cov contains two covariates, the command plink --bfile mydata --linear genotypic interaction --covar tmp.cov --parameters 1-4, 7 induces the following parameter sequence: - Additive effect ('ADD')
- Dominance deviation ('DOMDEV')
- First covariate ('COV1' if not named in the file)
- Second covariate
- Dosage-first covariate interaction ('ADDxCOV1')
- Dominance deviation-first covariate interaction ('DOMDEVxCOV1')
- Dosage-second covariate interaction ('ADDxCOV2')
- Dominance deviation-second covariate interaction ('DOMDEVxCOV2')
so '--parameters 1-4, 7' causes the dosage-first covariate interaction term, and both dominance deviation interaction terms, to be excluded from the model. We have made two changes from PLINK 1.07 to keep indices consistent across all chromosomes: - When the 'genotypic' or 'hethom' modifier is in effect, dominance deviation is no longer removed from the parameter sequence on haploid chromosomes, even though it is omitted from the regression there. (The same goes for dominance deviation interaction terms.)
- Sex-based covariates have been moved to the end.
- you can use the '
**all**' modifier to include all terms in the test, and - if --tests is used with --parameters, the --tests indices refer to the
term sequence
*after*--parameters has acted on it. For example,
plink --bfile mydata --linear interaction --covar tmp.cov --parameters 1-4, 7 --tests 1, 5 adds an ADD=0, ADDxCOV2=0 joint test, since ADDxCOV2 is the fifth remaining term after --parameters has been processed. With --linear, no regression is performed (and an 'NA' result is reported) if any two terms have sample correlation
exceeding 0.999 or the variance
inflation factor for any term is greater than 50. You can change the latter threshold
with Treatment of the X chromosome can be adjusted with the - 0. Exclude all sex and haploid chromosomes from the analysis. If the 'genotypic', 'hethom', 'dominant', or 'recessive' modifier was used with --linear/--logistic, this mode is forced.
- 1. (default) Add sex as a covariate on the X chromosome; don't do anything else differently.
- 2. Add sex as a covariate on the X chromosome, and code male X chromosome genotypes 0/2 instead of 0/1. If any --condition{-list} variants are on the X-chromosome (and 'dominant'/'recessive' has not been specified with --condition{-list}), this coding change applies to them as well.
- 3. Add sex as a covariate on the X chromosome, and also add a dosage-sex interaction term to the model there. If a permutation test was requested without the --tests flag, this mode causes the X chromosome to be omitted from the permutation test.
## Dosage data--dosage [allele
dosage file] <noheader> <skip0=[i]> <skip1=[j]>
<skip2=[k]> <dose1> <format=[m]> <Zout> <occur |
standard-beta> <sex> <case-control-freqs>
- By default, --dosage assumes that only one allelic dosage file (named in the first parameter) should be loaded. To specify multiple files,
- Create a master list with one entry per line. There are normally two supported formats for this list: variant batch numbers (all files with the same batch number should involve the same variants, but different samples) in the first column and filenames in the second; or just a filename per line (every file must contain the same variants).
- Provide the name of that list as the first --dosage parameter.
- Add the '
**list**' modifier. - By default, --dosage assumes the allelic dosage file(s) contain a
header line, which has 'SNP' in column i+1, 'A1' in column i+j+2, 'A2' in
column i+j+3, and sample FID/IIDs starting from column i+j+k+4. (i/j/k
are normally zero, but can be changed with '
**skip0**', '**skip1**', and '**skip2**' respectively.) If such a header line is not present, - when all samples appear in the same order as they do in the .fam
file, you can use the '
**noheader**' modifier. - Otherwise, use the '
**sepheader**' modifier, and append sample ID filenames to your 'list' file entries. The PLINK 1.07 documentation explains this in detail and provides several worked examples^{1}. - The '
**format**' modifier lets you specify the number of values used to represent each dosage. 'format=1' normally indicates a single 0..2 A1 expected count; '**dose1**' modifies this to a 0..1 frequency. 'format=2' (the default) indicates a 0..1 homozygous A1 likelihood followed by a 0..1 het likelihood, while 'format=3' indicates 0..1 hom A1, 0..1 het, 0..1 hom A2. Note that, for 'format=3', the third value in each triplet is not actually parsed by PLINK; as a result, '0 0 0' is interpreted as homozygous A2 rather than a missing call. (We suggest using --data or GTOOL/SNPTEST for proper handling of Oxford-format files which use a triple-zero missing call representation and do not otherwise guarantee a triplet sum of 1. Alternatively, you can use a script to transmute triple-zero to e.g. triple-'NA'.) - '
**Zout**' causes the output file to be gzipped. - With an association analysis, '
**standard-beta**' behaves as it does with --linear. '**sex**' and --vif are also supported. '**case-control-freqs**' causes case and control allele frequencies to be reported separately. - There are three alternate modes which cause the association analysis to be skipped.
- '
**occur**' requests a simple variant occurrence report (written to plink.occur.dosage). **--write-dosage**causes a simple merged file matching the 'format' specification (not including 'dose1') to be written to plink.out.dosage.- --score applies a linear scoring system to the dosages.
1: We have made one change from PLINK 1.07 --dosage: --keep/--remove now work like they do for regular datasets. ## LASSO regression--lasso [h2
estimate] {min lambda} <report-zeroes>
Genotype, phenotype, and covariate values are automatically standardized
to zero mean and unit variance. By default, only variants with nonzero
effect size estimates are included in the report; use the ' By default, model selection is not applied to covariates; Note that this method may require a very large sample size (e.g. hundreds of thousands for complex human traits) to be effective. Since it is not a conventional association test, --lasso does not support permutation or other miscellaneous association test options. ## Linear mixed model association--mlma (Not implemented yet. We'll wait until it's clear that this part of the GCTA codebase has stabilized.) These will be ports of the GCTA 1.21+ flags of the same name, primarily for the convenience of bioinformaticians (such as ourselves) who do some work in OS X, Windows, and/or systems with insufficient memory for the GCTA implementation. Refer to the GCTA website for documentation and citation instructions. ## Case/control nonrandom missingness test--test-missing <perm | mperm=[value]> <perm-count> <midp> Given a case/control phenotype, The ' Any variant which comes up as highly significant under this test should be treated with great caution; spurious association results are likely. ## Miscellaneous options--adjust <gc> <log10> <qq-plot> --lambda [value]
- The '
**gc**' modifier causes genomic-controlled p-values to be used in the formulas. This can only be used with basic additive models; for genomic control of nonadditive models, see Tsepilov YA et al. (2013) Development and Application of Genomic Control Methods for Genome-Wide Association Studies Using Non-Additive Models. - '
**log10**' replaces the p-values in the .adjusted file with their negative base 10 logarithms. - '
**qq-plot**' adds a quantile column to simplify QQ plotting. - By default, the genomic control lambda is estimated from the data; to
set it manually, use the
**--lambda**flag.
For --model and case/control --assoc, '
plink --bfile main_data --assoc --pfilter 0.001 --out report1 tail -n +2 report1.qassoc | sed s/^\ *//g | tr -s ' ' ' ' | cut -f 2 -d ' ' > candidates.txt The second command - removes the top line of the report,
- strips leading spaces from each line,
- collapses subsequent groups of spaces into single spaces,
- and finally, extracts the second column (variant ID) from each line. The resulting file can be used with e.g. --extract.
## Monte Carlo permutation proceduresPLINK 1.9 includes an accelerated implementation of PLINK 1.07's Monte Carlo permutation procedures, applying several ideas from PRESTO and PERMORY to the full range of PLINK 1.07's association tests. ## ReproductionDue to the use of pseudorandom numbers, it is necessary to specify the initial random number seed(s) to make a run reproducible; use the --seed flag to accomplish this. --threads and --perm-batch-size are sometimes necessary as well. ## Customization
You can combine any permutation test with --within/--family to force permutation to only occur within each cluster. --aperm [min perms/variant - 1] {max perms/variant} {pruning alpha threshold} {pruning beta} {initial pruning interval} {interval increase rate}
- The first two parameters control the minimum and maximum number of
permutations that a variant may be included in; without --aperm, these
values default to 5 and 1000000, respectively. For backwards
compatibility, the actual permutation counts are
*strictly*greater than the 'minimum' (i.e. at least 6 by default), and no larger than the maximum. - The next two parameters control the pruning test. If they are alpha and beta in
that order, a 100% * (1 - (beta /
2
**T**)) confidence interval is calculated for each empirical p-value, where**T**is the total number of variants; whenever this confidence interval does not contain alpha, the variant is exempted from further permutation testing. The default values are 0 and 0.0001, respectively. - The last two parameters control when pruning occurs: if they are
**b**and**a**in that order, then the first pruning cycle occurs after max(**b**, [min perms/variant] + 1) permutations, and the number of permutations between subsequent pruning cycles is**ap**+**b**(where**p**is the number of completed permutations so far). The default values are**b**=1,**a**=0.001.
--mperm-save
## Label-swapping + gene-dropping--swap-sibs (Not implemented yet.) Normally, no label-swapping occurs under a gene-dropping permutation
test. ## Permutation without association testing--make-perm-pheno [ct]
## Set-based testsNote that PLINK/SEQ includes several
other useful gene-based tests, such as SKAT-O. We currently restrict
development to tests --set-p [p-value] When an association test is run with the 'set-test' modifier, a set-based test is conducted as follows: - Standard single-variant analysis is performed on all variants contained in at least one set. If this generates a t-statistic, it is converted to a 1df chi-square statistic with the same p-value.
- Raw chi-square statistics are divided by the
**--set-test-lambda**parameter, if it's present. - Sets containing no significant variants (p-value ≤ 0.05; adjust this with
**--set-p**) are removed from the permutation test (though not entirely excluded from the report). - Same-remaining-set variant pairs with r
^{2}≥ 0.5 are identified. You can adjust this threshold by passing a number to**--set-r2**, and/or cause all identified pairs to be listed in plink.ldset with its '**write**' modifier. - For each set, a most-significant-variants subset is greedily
constructed from the lowest p-value significant variants, avoiding the
identified high-r
^{2}pairs. Construction stops when the subset reaches size 5 (adjust this ceiling with**--set-max**), or no significant variants remain. - Each set is assigned a score equal to the average of the chi-square statistics in its most-significant-variants subset.
- Phenotype permutations are generated, and steps 1, 5, and 6 are repeated for each permutation.
- Empirical set p-value (EMP1) is defined by how often permutation-based set scores are greater than or equal to the original score.
Results are written to a file with the .set.{m}perm extension. Multiple-testing corrections (for the number of sets containing at least one variant, i.e. nonempty sets removed in step 3 are counted) can be obtained with --adjust. 'mperm' simply fixes the number of set test permutations here; it does not invoke a proper max(T) permutation test. For some discussion of parameter choices, etc., refer to the PLINK 1.07 documentation. (Not implemented yet.) This will execute the VEGAS-sum gene-based test discussed in Liu JZ et al. (2010) A Versatile Gene-Based Test for Genome-wide Association Studies. LD will be estimated from the input data. (Not implemented yet.) This will execute the GATES test discussed in Li MX, Gui HS, Kwan J, Sham PC (2011) GATES: A Rapid and Powerful Gene-Based Association Test Using Extended Simes Procedure. ## REML additive heritability estimate (beta)--unrelated-heritability <strict> {tol} {initial covg} {initial covr}
The ' You can run this command directly on a previously computed GCTA-format relationship matrix and a phenotype, by combining this with --grm/--grm-bin and --pheno. For more details about the method, see Vattikuti S, Guo J, Chow CC (2012) Heritability and Genetic Correlations Explained by Common SNPs for Metabolic Syndrome Traits. We do not feel this implementation has been adequately tested; if you try it and get anomalous results, please contact us and we will try to figure out what is going on (you may help us find an error). Also note that, since this function spends most of its time executing LAPACK calls, rebuilding from source while linking to a machine-tuned BLAS is likely to substantially improve its performance. |