S: 18 Aug 2024 (b7.4) D: 18 Aug 2024 Main functions (--distance...) (--make-grm-bin...) (--ibs-test...) (--assoc, --model) (--mh, --mh2, --homog) (--assoc, --gxe) (--linear, --logistic) Core algorithms Quick index search |
Association analysisBasic reports, case/control phenotype--assoc ['perm' | 'mperm='<value>] ['genedrop'] ['perm-count'] [{fisher | fisher-midp}] ['counts'] ['set-test'] Given a case/control phenotype, --assoc writes the results of a 1df chi-square allelic test to plink.assoc (or .assoc.fisher with 'fisher'/'fisher-midp'), while --model performs four other tests as well (1df dominant gene action, 1df recessive gene action, 2df genotypic, and Cochran-Armitage trend) and writes the combined results to plink.model. Refer to the PLINK 1.07 documentation for more details about the statistical tests employed.
With --model, --cell specifies the minimum number of observations per cell in the 2-by-3 phenotype-by-genotype table below which the genotypic, dominant, and recessive tests are skipped. This threshold defaults to 0 with 'fisher'/'fisher-midp', and 5 without it. Stratified case/control analyses--mh ['perm' | 'mperm='<value>] ['perm-count'] ['set-test'] --mh2 Motivation Suppose you have information on age, BMI, and cardiovascular disease status for 1000 people, and the data can be summarized as follows:
The obesity-CVD odds ratio for the pooled dataset is (46 * 640) / (254 * 60) = ~1.932. But this is an overestimate of the true obesity-CVD association: age is associated with both obesity and CVD, so the age-stratified odds ratios are both substantially lower: (10 * 465) / (90 * 35) = ~1.476 and (36 * 175) / (164 * 25) = ~1.537. There appears to be a common odds ratio to extract, but the pooled dataset value clearly isn't a good estimator of it. A solution With only 2 or 3 clusters, the exact tests introduced in Mehta CR, Patel NR, Gray R (1985) Computing an Exact Confidence Interval for the Common Odds Ratio in Several 2x2 Contingency Tables and Zelen (1971) The analysis of several 2x2 contingency tables are computationally practical on even the largest datasets, and noticeably more accurate when some table entries are very small; contact us if you want these special cases implemented in PLINK. The following similar analyses are also available:
Note that some types of stratification should usually be handled with other methods. For instance, principal component analysis is an effective way to deal with population stratification when family structure is not a concern: see Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies. Basic reports, quantitative phenotype--assoc ['perm' | 'mperm='<value>] ['perm-count'] ['qt-means'] ['lin'] ['set-test'] Given a quantitative phenotype, --assoc writes regression statistics and Wald test results to plink.qassoc. The 'qt-means' modifier causes an additional .qassoc.means file to be generated, reporting trait means and standard deviations stratified by genotype. 'lin' causes the Lin (2006) statistic to be reported as well; in this case, if multiple test adjustments and/or permutation test results are also requested, they will be based on the Lin statistic. Given both a quantitative phenotype and a case/control covariate (loaded with --covar) defining two groups, --gxe compares the regression coefficient derived from considering only members of one group to the regression coefficient derived from considering only members of the other, writing a report to plink.qassoc.gxe. By default, the first covariate in the --covar file defines the groups; use "--gxe n" to base them on the nth covariate instead. (--gxe ignores --covar-name/--covar-number.) Covariate values are interpreted as -9/0/nonnumeric = missing, 2 = second group, any other number = first group. --gxe does not currently support permutation or the miscellaneous options below. For mixed model analysis of categorical covariate interactions, you should use GCTA's --gxe flag instead. Regression with multiple covariates--linear ['perm' | 'mperm='<value>] ['genedrop'] ['perm-count'] ['set-test'] [{genotypic | hethom | dominant | recessive | no-snp}] ['hide-covar'] [{sex | no-x-sex}] ['interaction'] ['beta'] ['standard-beta'] ['intercept'] --condition <variant ID> [{dominant | recessive}] --parameters <number(s)/range(s)...> --xchr-model <mode number> Given a quantitative phenotype and possibly some covariates (in a --covar file), --linear writes a linear regression report to plink.assoc.linear. Similarly, --logistic performs logistic regression given a case/control phenotype and some covariates. If either flag is used with --all-pheno, the type of regression will automatically adapt based on whether the current phenotype is case/control or not.
By default, --condition adds a single variant's allelic dosage as a covariate, while --condition-list does that for all variants named in the given file. The 'dominant' modifier causes 0/1/1 coding to be used instead of 0/1/2, while 'recessive' calls for 0/0/1 coding. --parameters takes a list of 1-based indices (or ranges of them; syntax is similar to --chr) referring to the sequence of regression covariates which would normally be included in the model, and removes the unlisted covariates. The sequence is:
For example, if tmp.cov contains two covariates, the command plink --bfile mydata --linear genotypic interaction --covar tmp.cov --parameters 1-4, 7 induces the following parameter sequence:
so "--parameters 1-4, 7" causes the dosage-first covariate interaction term, and both dominance deviation interaction terms, to be excluded from the model. We have made two changes from PLINK 1.07 to keep indices consistent across all chromosomes:
--tests performs a (joint) test on the specified term(s). If used with e.g. "--linear genotypic" or "--linear hethom", this replaces the 2df joint test that would normally occur. Also, if permutation was requested, it is based on this test instead of the main effect (to support permutation tests on interaction terms). Syntax is similar to --parameters, except that
plink --bfile mydata --linear interaction --covar tmp.cov --parameters 1-4, 7 --tests 1, 5 adds an ADD=0, ADDxCOV2=0 joint test, since ADDxCOV2 is the fifth remaining term after --parameters has been processed. With --linear, no regression is performed (and an 'NA' result is reported) if any two terms have sample correlation exceeding 0.999 or the variance inflation factor for any term is greater than 50. You can change the latter threshold with --vif. (See the PLINK 1.07 documentation for some more discussion of multicollinearity.) Treatment of the X chromosome can be adjusted with the --xchr-model flag. There are currently four modes:
Dosage data--dosage <allele dosage file> ['noheader'] ['skip0='<i>] ['skip1='<j>] ['skip2='<k>] ['dose1'] ['format='<m>] ['Zout'] [{occur | standard-beta}] ['sex'] ['case-control-freqs'] --dosage processes one or more (possibly gzipped) text files with variant-major allelic dosage data, and normally performs a linear or logistic regression-based association analysis on it (results written to plink.assoc.dosage). It must be used with the --fam input flag, and can be combined with --map (this adds CHR and BP columns to the association analysis output file, and causes variants not named in the .map or which fail a variant filter to be excluded). Other primary input flags, e.g. --bfile, cannot be present. This command is poorly integrated with the rest of PLINK 1: genotype-based filters (e.g. --maf, --hwe) and many other flags are not supported. (See the dosage branch of the order of operations for details.) It will be replaced with a proper data import flag in PLINK 2.0.
1: We have made one change from PLINK 1.07 --dosage: --keep/--remove now work like they do for regular datasets. LASSO regression--lasso <h2 estimate> [min lambda] ['report-zeroes'] --lasso estimates variant effect sizes via LASSO regression, writing a report to plink.lasso. --covar can be used to add covariates. You must provide a h2 estimate to calibrate the regression. Genotype, phenotype, and covariate values are automatically standardized to zero mean and unit variance. By default, only variants with nonzero effect size estimates are included in the report; use the 'report-zeroes' modifier to force all variants to be listed. By default, model selection is not applied to covariates; --lasso-select-covars changes this. With no further parameters, all covariates are subject to model selection; alternatively, you can provide a space/comma-delimited list of covariate names (use 1-based 'COV1', 'COV2'... if your covariates are unnamed) to only subject the named covariates to model selection. Note that this method may require a very large sample size (e.g. hundreds of thousands for complex human traits) to be effective. Since it is not a conventional association test, --lasso does not support permutation or other miscellaneous association test options. Case/control nonrandom missingness test--test-missing ['perm' | 'mperm='<value>] ['perm-count'] ['midp'] Given a case/control phenotype, --test-missing tries to detect platform/batch differences between case and control genotype data by performing Fisher's exact test on case/control missing call counts at each variant. (Het. haploids are treated as missing. Variants with 0% or 100% missing call frequency are now skipped.) Results are written to plink.missing. Multiple-testing corrected p-values can be obtained from --adjust or permutation tests. The 'midp' modifier causes Lancaster's mid-p adjustment to be applied to Fisher's exact test. We recommend its use, since the conservative bias (which the mid-p adjustment eliminates) exhibited by the regular Fisher's exact test has no value in this context. Any variant which comes up as highly significant under this test should be treated with great caution; spurious association results are likely. Miscellaneous options--adjust ['gc'] ['log10'] ['qq-plot'] --lambda <value> --adjust causes an .adjusted file to be generated with each association test report, containing several basic multiple testing corrections for the raw p-values. These values tend to be overly conservative when significant LD is present; more useful p-values can be obtained from e.g. appropriate permutation tests. Entries in this file are sorted by significance value instead of genomic location.
For --model and case/control --assoc, '--ci X' causes size-X centered confidence intervals to be reported for odds ratios. (E.g. "--ci 0.95" corresponds to a 95% confidence interval.) --pfilter causes only associations with p-values no larger than the given threshold to be reported. (Variants with 'NA' p-values are excluded as well; if that's the only effect you want, use "--pfilter 1".) This can dramatically reduce the report's size, and you can extract the remaining variant IDs with a command sequence like the following: plink --bfile main_data --assoc --pfilter 0.001 --out report1 tail -n +2 report1.qassoc | sed s/^\ *//g | tr -s ' ' ' ' | cut -f 2 -d ' ' > candidates.txt The second command
Monte Carlo permutation proceduresPLINK 1.9 includes an accelerated implementation of PLINK 1.07's Monte Carlo permutation procedures, applying several ideas from PRESTO and PERMORY to the full range of PLINK 1.07's association tests. ReproductionDue to the use of pseudorandom numbers, it is necessary to specify the initial random number seed(s) to make a run reproducible; use the --seed flag to accomplish this. --threads and --perm-batch-size are sometimes necessary as well. CustomizationWithin-cluster restriction You can combine any permutation test with --within/--family to force permutation to only occur within each cluster. --aperm <min perms/variant - 1> [max perms/variant] [pruning alpha threshold] [pruning beta] [initial pruning interval] [interval increase rate] --aperm sets six parameters governing the behavior of PLINK's adaptive permutation tests.
--mperm-save --mperm-save writes each permutation's best test statistic to plink.mperm.dump.best, while --mperm-save-all produces a .mperm.dump.all file with every single permutation x SNP test statistic. The nature of the test statistic will be mentioned in the log and printed to the console. (It can differ slightly from PLINK 1.07's choice; e.g. when Fisher's exact test is used, PLINK 1.07 uses one minus the p-value while PLINK 1.9 just uses the p-value.) Label-swapping + gene-dropping--swap-sibs (Not implemented yet.) Normally, no label-swapping occurs under a gene-dropping permutation test. --swap-sibs enables label-swapping for full siblings when parental genotype information is not present (e.g. it was excluded during quality control), while --swap-parents enables label-swapping for pairs of parents, and --swap-unrel enables label-swapping for singletons (i.e. those unrelated to everyone else). Permutation without association testing--make-perm-pheno <ct> --make-perm-pheno simply generates the specified number of phenotype permutations and writes them to plink.pphe, without invoking an association test. Cluster restrictions apply. Set-based tests--set-p <p-value> When an association test is run with the 'set-test' modifier, a set-based test is conducted as follows:
Results are written to a file with the .set.{perm|mperm} extension. Multiple-testing corrections (for the number of sets containing at least one variant, i.e. nonempty sets removed in step 3 are counted) can be obtained with --adjust. 'mperm' simply fixes the number of set test permutations here; it does not invoke a proper max(T) permutation test. For some discussion of parameter choices, etc., refer to the PLINK 1.07 documentation. REML additive heritability estimate (beta)--unrelated-heritability ['strict'] [tol] [initial covg] [initial covr] --unrelated-heritability computes Visscher et al.'s restricted maximum likelihood estimate of additive heritability, iterating with an accelerated variant of the EM algorithm until the rate of change of the log likelihood function is less than 'tol' (default 1e-7). Scalar phenotype data is required. The 'strict' modifier forces regular EM to be used; otherwise, an ad hoc form of momentum is employed to speed up convergence (but it's still slow; we recommend using GCTA 1.1+ or REACTA to execute an alternative to the EM algorithm if speed is important). The genomic covariance prior (covg) defaults to 0.45, and the residual covariance prior defaults to (1 - covg). You can run this command directly on a previously computed GCTA-format relationship matrix and a phenotype, by combining this with --grm/--grm-bin and --pheno. For more details about the method, see Vattikuti S, Guo J, Chow CC (2012) Heritability and Genetic Correlations Explained by Common SNPs for Metabolic Syndrome Traits. We do not feel this implementation has been adequately tested; if you try it and get anomalous results, please contact us and we will try to figure out what is going on (you may help us find an error). Also note that, since this function spends most of its time executing LAPACK calls, rebuilding from source while linking to a machine-tuned BLAS is likely to substantially improve its performance. |