D: 5 May 2021 Main functions (--make-grm-bin...) Quick index search |
## Association analysis## Linear and logistic/Firth regression with covariates--glm ['zs'] ['omit-ref'] [{sex | no-x-sex}] ['log10'] ['pheno-ids'] --ci <size> --vif <max VIF> --max-corr <val>
For quantitative phenotypes, --glm fits the linear model for every variant (one at a time), where For binary phenotypes, --glm fits a logistic or Firth regression model instead, with the same Before we continue, three usage notes. - It is now standard practice to include top principal components (usually computed by --pca) as covariates in any association analysis, to correct for population stratification. See Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies for discussion.
- This method does not properly adjust for small-scale family structure. As a consequence, it is usually necessary to prune close relations with e.g. --king-cutoff before using --glm for genome-wide association analysis. (Note that biobank data usually comes with a relationship-pruned sample ID list; you can use --keep on that list, instead of performing your own expensive --king-cutoff run.) If this throws out more samples than you'd like, consider using mixed model association software such as SAIGE, BOLT-LMM, GCTA, or FaST-LMM instead; or regenie's whole genome regression.
- Finally, the statistics computed by --glm are not calibrated well
^{1}when the minor allele count is very small. "--mac 20" is a reasonable filter to apply before --glm; it's possible to make good use of --glm results for rarer variants (e.g. they could be input for a gene-based test), but some sophistication is required. Also, when working with unbalanced binary phenotypes, be aware that Firth regression can be similar to adding a pseudocount of 0.5 to the number of case and control minor allele observations, so weird things happen when the*expected*number of case minor allele observations is less than 0.5. You probably don't want to throw out every variant with MAC < 300 when your case:control ratio is 1:600 (you may still have excellent power to detect*positive*association between the minor allele and case status, after all), but you shouldn't take reported odds-ratios or p-values literally for those variants.
Now, the technical details: - For biallelic variants,
**G**normally contains a single column with*minor*allele dosages. To make it always contain ALT allele dosages instead, add the '**omit-ref**' modifier. (Why isn't omit-ref the default? We'll get to that.) This allele is listed in the A1 column of the main report. (Note that 'A1' just means "current allele" in PLINK 2.0 output files; it no longer is the overloaded global setting that it was in PLINK 1.x.) - Be aware that minor alleles are dataset-dependent. If rs71220063 has {freq(C)=0.493, freq(T)=0.507} in one dataset and {freq(C)=0.502, freq(T)=0.498} in another, C is the minor allele for rs71220063 in the first dataset and T is the minor allele in the second dataset, and --glm results will differ accordingly. When this is a problem, you can use --read-freq on an agreed set of allele frequencies to force all --glm runs to use the same set of minor alleles.
- Similarly, for multiallelic variants,
**G**normally contains one column for each nonmajor^{2}allele. 'omit-ref' changes this to one column for each ALT allele. If some but not all of these allele columns are constant, the constant columns are omitted. (Before 20 Mar 2020, the entire variant was skipped in this case.) For each such variant, the main report normally contains one line for each nonmajor allele, followed by a line for each covariate. The allele-specific lines have just one allele in the A1 column, while the covariate lines list*all*nonmajor alleles in the A1 column. - chrX is special in two ways:
- First, sex (as defined in the .fam/.psam input file) is normally included as an additional covariate. If you don't want this, add the '
**no-x-sex**' modifier. Or you can add the '**sex**' modifier to include .fam/.psam sex as a covariate everywhere. Whatever you do,**don't include sex from the .fam/.psam file and the --covar file at the same time; otherwise the duplicated column will cause the regression to fail.** Note that PLINK 2.0 encodes the .fam/.psam sex covariate as male = 1, female = 2, to match the actual numbers in the input file. This is a minor change from PLINK 1.x. - See --xchr-model below.
- First, sex (as defined in the .fam/.psam input file) is normally included as an additional covariate. If you don't want this, add the '
- Outside of chrX, dosage is on a 0..1 scale on haploid chromosomes (chrY, chrM).
- For each phenotype, --glm writes a regression report to
**plink2.<pheno name>.glm.<regression type>[.zst]**. - Yes, 'each' phenotype. This is a change from PLINK 1.x; the old --all-pheno flag is now effectively always on.
- If you have multiple quantitative phenotypes
*with either no missing values, or missing values for the same samples*,**analyze them all in a single --glm run!**PLINK 2.0's linear regression 'only' tends to be a few hundred times as fast as PLINK 1.9 when you analyze one quantitative phenotype at a time. But --glm also has a quantitative-phenotype-group optimization that can multiply the speedup by another factor of ~10. - The regression-type file extension is always 'linear' for quantitative phenotypes.
- For binary phenotypes, there are now three regression modes:
- The '
**no-firth**' modifier requests PLINK 1.x's basic logistic regression. - The '
**firth-fallback**' modifier requests logistic regression, followed by Firth regression whenever the logistic regression fails to converge. This is now the default. - The '
**firth**' modifier requests Firth regression all the time. This is unlikely to be worth the additional computational cost for common variants (see e.g. Ma C, et al. (2013) Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants, which suggests that Firth regression becomes relevant when minor allele count is less than 400).
- The '
- By default, for every variant, this file contains a line for each genotype column
*and a line for each non-intercept covariate column*. If you're not actually using any information in the covariate lines, the '**hide-covar**' modifier can greatly reduce file sizes. (See also --pfilter below.) Or, going in the other direction, the '**intercept**' modifier lets you also see the intercept-column fit. - Firth regression can be slow. To trade off some accuracy for speed, you can use the '
**firth-residualize**' modifier, which implements the shortcut described in Mbatchou J et al. (2020) Computationally efficient whole genome regression for quantitative and binary traits. '**cc-residualize**' causes the shortcut to be applied to ordinary logistic regression as well. - This shortcut must be used with 'hide-covar', and disables some other --glm features.
- It is not recommended if you have a significant number of missing genotypes, or if you have any other reason to expect covariate betas to change in a relevant way between variants.
- Since running --glm without at least e.g. principal component covariates is usually an analytical mistake, the '
**allow-no-covars**' modifier is now required when you're intentionally running --glm without a covariate file. (This modifier did not exist, and the corresponding check was not performed, before 28 Mar 2020.) - The '
**log10**' modifier causes p-values to be reported in -log10(p) form. This works^{3}for p-values smaller than DBL_MIN. **--ci**causes confidence intervals with the given width to be reported for each beta or odds-ratio.- Refer to the
**file format entry**for a list of supported column sets. - A multicollinearity check is performed before each regression. When it fails, the regression is skipped and 'NA' results are reported.
- This is a change from PLINK 1.9, which only performed the check for linear regressions.
- The main part of this check is a variance inflation factor calculation. If that value is larger than 50, the check fails. You can change the upper bound with
**--vif**. - Correlations between predictors are also checked; if any correlation is larger than 0.999, the check fails. You can change this upper bound with
**--max-corr**. - The ERRCODE column reports which variants failed the multicollinearity check.
This column distinguishes some other error types, too. The following error codes are currently reported: - '.': No error.
- 'SAMPLE_CT<=PREDICTOR_CT': Too few samples.
- 'CONST_OMITTED_ALLELE': The omitted allele had constant dosage, so the entire variant was skipped. (This usually means that all alleles are constant.)
- 'CONST_ALLELE': This allele has constant dosage, and was skipped; at least one other allele was nonconstant. (Meaning was different before 20 Mar 2020.)
- 'CORR_TOO_HIGH': The correlation between two predictors exceeded the --max-corr threshold. (Note that, in 'genotypic' mode, this happens for every biallelic variant with no homozygous-minor genotypes, since the additive and dominance-deviation columns are identical in that case. Those variants must be analyzed without 'genotypic'.)
- 'VIF_INFINITE': The predictor correlation matrix couldn't be inverted at all.
- 'VIF_TOO_HIGH': VIF exceeded the --vif threshold.
- 'SEPARATION': [Quasi-]complete separation was detected, and --glm was operating in no-firth mode.
- 'RANK_DEFICIENT': The final predictor matrix could not be inverted.
- 'LOGISTIC_CONVERGE_FAIL': Logistic regression failed to converge, and --glm was operating in no-firth mode.
- 'FIRTH_CONVERGE_FAIL': Firth regression failed to converge.
- 'UNFINISHED': Logistic/Firth regression didn't fail in an obvious manner, but the result didn't satisfy the usual convergence criteria when the iteration limit was hit. (The result is still reported in this case, but it's less accurate than usual.)
- 'INVALID_RESULT': While the underlying regression ran to completion, the result was extreme enough that inversion of the predictor matrix plausibly 'should' have failed.
- The VIF check is known to be overly strict in several common scenarios; in particular, categorical covariates with a large number of categories will set it off. "When Can You Safely Ignore Multicollinearity?" has more discussion of this. Do not be afraid to greatly increase the --vif threshold after you have studied the problem and confirmed that moderate multicollinearity does not interfere with your analysis.
- Covariates are also checked before any variants are processed. This includes a covariate-only version of the multicollinearity check described above, along with a covariate-scale check (which identifies scenarios where --covar-variance-standardize can be expected to help a lot). By default, if this check fails, PLINK 2 errors out; to just skip the affected regressions instead, add the '
**skip-invalid-pheno**' modifier. - Finally, if PLINK 2 determines that any samples and covariates are irrelevant to all regressions (e.g. a covariate could be zero-valued for all but one sample), they are removed before any variants are processed. You can use the '
**pheno-ids**' modifier to make PLINK 2 report the remaining samples to (per-phenotype) .id files. (When the sample set changes on chrX or chrY, .x.id and/or .y.id files are also written.) - If the phenotype is constant across the remaining samples at this point, PLINK 2 errors out (or, if 'skip-invalid-pheno' was specified, skips the phenotype).
- Occasionally, it is useful to include selected variants in the immediate dataset as fixed covariates. This can be accomplished by running "--export A" on those variants and cut+pasting the data columns onto the end of the --covar input file. But there's also a shorthand:
**--condition**adds a single variant as a fixed covariate, while**--condition-list**does the same for all variants named in a file. The '**dominant**'/'**recessive**' modifiers let you change how these covariate columns are encoded (see below). - It is also possible to include "local covariates", which are not constant across all variants, in the regression. (These can be e.g. local ancestry coefficients, or polygenic effect predictions from a whole-genome fitting step.) To do so, add the '
**local-covar=**' and '**local-psam=**' modifiers, use full filenames for each, and use either '**local-pvar=**' or '**local-pos-cols=**' to provide variant ID or position information. - Normally, the local-covar file should have
*cn*real-valued columns, where the first*c*columns correspond to the first sample in the local-psam file, columns (*c*+1) to 2*c*correspond to the second sample, etc.; and the*m*th line of the local-covar file corresponds to the*m*th nonheader line of the local-pvar file. (Variants not mentioned in the local-pvar file are excluded from the regression.) The local covariates are assigned the names LOCAL1, LOCAL2, etc. To exclude the last local covariate from the regression (necessary if they are e.g. local ancestry coefficients which sum to 1), add the '**local-omit-last**' modifier. - Alternatively, when '
**local-cats=**'<*k*> is specified, the local-covar file is expected to have*n*columns with integer-valued entries in [1,*k*]. (This range is [0,*k*-1] with '**local-cats0=**'.) These category assignments are expanded into (*k*-1) local covariates, with the last category omitted. - When position information is in the local-covar file, this should be indicated by '
**local-pos-cols=**'<number of header rows>,<chrom col #>,<pos start col #>,<first covariate col #>. - '
**local-haps**' indicates that there's one column or column-group per haplotype instead of per sample; they are averaged by --glm. - As a practical matter, if you only have a single set of local covariate values per chromosome, you're probably better off with per-chromosome --glm runs which don't use local-covar= at all; that enables some additional optimizations.
- The '
**genotypic**' modifier adds a dominance-deviation column (heterozygous-A1 = 1, any other genotype = 0, linear interpolation applied to dosages; "0..1..0" for short) to**G**, and adds genotype + dominance-deviation joint F-test^{4}results to the main report. The '**hethom**' modifier does almost the same thing, except that the first genotype column is also replaced, by 'HOM' column(s) with 0..0..1 encoding. - The '
**dominant**' modifier specifies a model assuming full dominance for the A1 allele, i.e. the first genotype column is changed to 0..1..1 encoding. Similarly, '**recessive**' makes the first genotype column use 0..0..1 encoding. - The '
**interaction**' modifier adds genotype x covariate interaction terms to**G**. More precisely, the additional columns are entrywise (Hadamard) products between a genotype/dosage column and a (non-intercept) covariate column. - When
**G**contains a major allele with >90% frequency, the interaction terms can be very highly correlated with the genotype column. This is likely to cause the multicollinearity check to fail, and it*isn't*a situation where overriding the multicollinearity-check defaults is wise—numerical stability problems are likely. So you probably don't want to use 'omit-ref' when performing interaction testing. (And this is why omit-ref is no longer --glm's default setting; it was, back in 2017, until the ~5th time this specific problem came up...) - For multiallelic variants,
*'interaction' causes a separate model to be fitted for each A1 allele*. In each model, interaction terms are only added for one of the alleles in**G**; the other genotype/dosage columns are treated much like fixed covariates (except that no interaction term is created between them and the A1 allele). - If you want to include some, but not all, interaction terms, use
**--parameters**to specify your choices. This flag takes a list of 1-based indices (or ranges of them; syntax is similar to --chr) referring to the sequence of predictors which would normally be included in the model, and removes the unlisted predictors. The sequence is: - Genotype/dosage additive effect column (or 'HOM' column)
- Dominance deviation, if present
- Local covariate(s), if present
- --condition[-list] covariate(s), if present
- --covar covariate(s), if present
- Genotype x non-sex covariate 'interaction' terms, if present
- Sex, if present
- Sex-genotype interaction(s), if present
For example, if tmp.cov contains two covariates, the command plink2 --pfile mydata \ induces the following parameter sequence: - Additive effect ('ADD')
- Dominance deviation ('DOMDEV')
- First covariate ('COVAR1' if not named in the file)
- Second covariate
- Dosage-first covariate interaction ('ADDxCOVAR1')
- Dominance deviation-first covariate interaction ('DOMDEVxCOVAR1')
- Dosage-second covariate interaction ('ADDxCOVAR2')
- Dominance deviation-second covariate interaction ('DOMDEVxCOVAR2')
so "--parameters 1-4, 7" causes the dosage-first covariate interaction term, and both dominance deviation interaction terms, to be excluded from the model. No, this interface won't be winning any ease-of-use awards. But it does let you get the job done; it's backward-compatible; it isn't actually restricted to interaction testing; and when all else fails you can look at the order in which predictors appear in --glm's main report (the same sequence is used).
- you can use the '
**all**' modifier to include all predictors in the test, and - if --tests is used with --parameters, the --tests indices refer to the term sequence
*after*--parameters has acted on it. For example,
plink2 --pfile mydata \ adds an ADD=0, ADDxCOVAR2=0 joint test, since ADDxCOVAR2 is the fifth remaining term after --parameters has been processed. One last tip. Since --glm linear regression is now much faster than logistic/Firth regression, it is reasonable to recode binary phenotypes as quantitative phenotypes (by e.g. adding 2 to all the values, and ensuring missing values are encoded as 'NA') for exploratory analysis of very large datasets. See "Details and considerations of the UK Biobank GWAS" from the Neale lab blog for detailed discussion of an example (executed with the Hail data analysis platform, but equivalent to the standard PLINK-based workflow). However, the results should only be treated as a rough approximation. There is no guarantee that a genome-wide significant association which would be revealed by logistic/Firth regression will be significant under the misspecified linear model (especially when you have less than ~25 minor allele observations in the case group). 1: PLINK 1.x --linear/--logistic's adaptive permutation mode mostly addresses the calibration problem, and this mode will be added to --glm before PLINK 2.0 is finished. However, the imprecision and high computational cost of this mode make it a last resort; realistically, its main function is to provide an unbiased ground-truth-approximation for researchers developing more computationally practical methods to compare against.
plink2 --pfile main_data --glm hide-covar --pfilter 0.001 --out report1 tail -n +2 report1.PHENO1.glm.linear | sed s/^\ *//g | tr -s ' ' ' ' | cut -f 2 -d ' ' > candidates.txt The second command - removes the top line of the report,
- strips leading spaces from each line,
- collapses subsequent groups of spaces into single spaces,
- and finally, extracts the second column (variant ID) from each line. The resulting file can be used with e.g. --extract.
PLINK 2 dosages are on a 0..2 scale on regular diploid chromosomes, and 0..1 on regular haploid chromosomes. However, chrX doesn't fit neatly in either of those categories. The following three modes are supported: - 0. Skip chrX. (This no longer causes other haploid chromosomes to be skipped.)
- 1. Male dosages are on a 0..1 scale on chrX, while females are 0..2. This was the PLINK 1.x default.
- 2. Males and females are both on a 0..2 scale on chrX.
**This is the PLINK 2 default.**
(PLINK 1.x's mode 3 is no longer supported since it duplicates --glm's interaction testing options, and does not apply to the other commands now covered by --xchr-model.) ## Basic multiple testing correction--adjust-file <filename> ['zs'] ['gc'] ['cols='<col set descriptor>] --adjust ['zs'] ['gc'] ['log10'] ['cols='<col set descriptor>] --lambda <value> Given an unfiltered PLINK association analysis report, - By default, the genomic-control lambda is estimated from the data (as <median 1df chi-square stat> / 0.456).
**--lambda**lets you manually set it. - '
**gc**' causes genomic-controlled instead of unadjusted p-values to be used as input for the main multiple-testing correction algorithms. - When specified without --lambda, this tends to be overly conservative. We note that LD Score regression usually does a better job of calibrating the --lambda value than (median / 0.456); see Lee JJ, McGue M, Iacono WG, Chow CC (2018) The accuracy of LD Score regression as an estimator of confounding and genetic correlations in genome-wide association studies.
- '
**log10**' causes negative base 10 logs of p-values to be reported, instead of raw p-values. '**input-log10**' specifies that the input file contains -log10(p) values. - The
**--adjust-...-field**flags let you set the field-name search order for each field --adjust-file might scrape. (The default settings are designed to work with PLINK 1.x --linear/--logistic and PLINK 2.0 --glm output.) When multiple arguments are provided for one of these flags, --adjust-file will first search for the first argument in the header line, and then only search for the second argument if the first search fails, etc. - Refer to the file format entry for a list of supported column sets. (That's where the old 'qq-plot' functionality now lives.)
In combination with --glm, |