D: 20 Oct 2024 Main functions (makegrmbin...) Quick index search

GWAS 2: PostHocSignificanceThe goal of GWAS is to run large genotypephenotype analyses with the intent of discovering predictive or causal genetic variants using a somewhat hypothesis free approach. GWAS often employs marginal regression (MR) (as with Plink 2's GLM) to estimate effect sizes and their Pvalues. MR breaks up multivariate regressions into many univariate (single) regressions. This brings up two important issues: 1) adjusting the univariate Pvalue for multiple hypotheses being tested (i.e., the multiple comparisons problem, MCP), and 2) accounting for correlations among genetic variants. A third factor that we also explore is unrelated to MR; accounting for (residual) population structure or relationships (aka genomic inflation or genomic control).
ObjectiveHere we continue with the exercises from GWAS 1. You will need those results to run the exercises here. You can also skip that tutorial and generate the files via Tutorial Shortcuts  GWAS 1. We will explore a few posthoc methods offered by Plink 2 to assess and mitigate spurious effects, namely: adjust and clump.
R:
Part 1: Adjusting PvaluesAdjust allows you to adjust the GWAS statistics based on multiple comparison problem (MCP) considerations and (residual) population/kin structure aka genomic inflation (GI). There are two main ways to run this, posthoc (adjustfile) or during the GLM (adjust). Let's try the former adjustfile using results from a previous tutorial. Plink 2:
adjustfile
When you run the command, you may notice in the terminal output and log file that the genomic inflation (GI) lambda is reported (66.5 in this case). This is an inflation factor used to adjust the GLM statistic (e.g., tscore). The tscore is shrunk by this factor and, thus, results in an larger (less significant) adjusted Pvalue. This is reported in the adjust file. Open the *.adjusted file in R. R:
Let's focus on 4:87071601G,A a true positive. We see the SNP ID and some other information, and then Pvalue columns: UADJ, GC, BONF, ... GC is the GI lambda adjusted Pvalue. Notice that the unadjusted Pvalue is 4.5e21 but the GCadjusted is 0.25. So, we went from very significant to not at all. As noted above, the estimated lambda is very large but why is this so high? GI (GC) adjusts for population structure. However, we used the 10 population PCs as covariates. Thus, the regression coefficients and their Pvalues should already account for this structure. At least, to a large degree. Let's investigate.
We will now run a similar regression as in GWAS 1, but increase the pfilter to 1e1, and rerun the posthoc adjustfile on this. Plink 2:
Plink 2:
As you can see from ther terminal/log, the genomic inflation lambda dropped dramatically from 66.5 to 8.5. Let's look at 4:87071601G,A . R:
Unadjusted Pvalue is 4.5e21 (as expected, it's the same as above). The GC adjusted Pvalue is now 0.001. What happened? Same dataset, same GLM model... GI lambda is an estimate of genomewide "spurious" associations with the phenotype. It is estimated by randomly sampling the genome. If you recall, our quant1b GLM run used a pfilter of 1e6. This is far from random but instead is selecting SNPs that are strongly associated with our trait. So, of course, randomly sampling on true positives will result in GI being high and misleading. In our new GLM run, we increased the Pvalue cutoff to 1e1; a more liberal threshold. So, we are now selecting more SNPs from the null (not trait specific) distribution. This exercise mainly highlights the fact that estimating/adjusting for cryptic population structure (e.g., GC, LDscore regression) requires randomly sampling the genome. This is important to remember for posthoc adjustments. If you run adjust during the regression, then Plink 2 will estimate GI (GC) using a broader (more random) set of variants assuming that is what you feed it. Let's test this by running the original regression that gave a lambda of 66.5 but add adjust. Plink 2:
adjust
Way better. GI lambda went from 66.5 to 1. There is no spurious population effect as expected since we included population covariates in the regression. The high lambda was an artifact of the posthoc adjustment. On your own, you can run the above without the population PCs and see if GI (GC) picks up the population structure. Replace glm hidecovar covar ./results/qc/qc3/all_hg38_qcd_LE1_pca.eigenvec with glm allownocovars.
We leave this section noting that the genomic inflation (genomic control) and MCP adjusted Pvalues like BONF are not combined by default. Run adjust or adjustfile with the gc argument to incorporate GI (GC) into the MCP calculations.
Part 2: ClumpIn GWAS 1, we were left with a number of false positives (FPs). This was even after controlling for population structure. One hypothesis, is that the FPs are variants correlated to our known true positives (TPs). Let's use Plink 2's posthoc clump methods to explore this. Plink 2
pfile
clump
clumpunphased
clumpr2
R:
Of 341 variants that were returned for simgwas_quant1b, we previously saw that this consisted of 48 (of 49) true positives. We now see that 81% of all the 341 variants also fall into correlated clumps with true positives. This will change to some degree with the clumping parameters. Thus, if we recall the original GWAS from the GWAS 1 tutorial, much of the false positives (that passed statistical tests) are due to population structure and correlations among variants. 