Data Exploration 2  Genomic Structure  Linkage
Significance
Characterizing population genomic structure (including familial relationships) is both a compelling research objective, and can have a significant impact on GWAS (genotypephenotype association) findings.
Studies of genomic structure can roughly be broken down into two related spaces. These are appreciated by defining a genomics matrix, G, consisting of nsamples (individuals) and pmeasures (genetic variants). This defines a row (samples/individualspace) and column (geneticvariantsspace). In matrix form, it can be appreciated that these spaces are intrinsically related, thus operations in one space impact the other. This is further appreciated by considering that G=UΣV.T, where U are the samplespace eigenvectors and V the variantspace eigenvectors. When G is standardnormalized these are the respective principal components (PCs) for the sample and variantspaces. In other words, let's say that we estimated the samplespace PCs U. These samples (individuals) coordinates can be used to visualize similarities and differences between individuals. Let's say now that we want to see where new individuals lie against these reference samples. To do this, we need V. We can either blindly run a PCA() computer command (or SVD() in many cases) as we did to get U or note that we can derive V from G, U, and Σ. V is the weighted sum of individual genotypes scaled by Σ. The take away is that principal genomic vectors are exactly determined by the individuals in your dataset and viceversa. On one hand, this fundamental relationship complicates "ancestry" applications due to sampling biases in both the samples/individual and variantspace. On the other hand, purposeful biased sampling methods can be leveraged to glean different insights from the same GWAS dataset. For example, Vattikuti et al., 2012 and Zaitlen et al., 2013, have taken advantage of these properties to learn different aspects of heritability by common and rare variants.
Objective
In this tutorial we will explore common Plink 2 methods used to characterize structure in the genetic variant and individual spaces.
We will use the 1000 Genomes Phase 3 data (1kGP3), see 1000 Genomes website and Nature paper. You should have this raw data in your /GWAS/project1/data/raw/ folder. If not see Setting Up the Plink 2 Tutorial. The latter also explains the expected compute environment. Do not proceed until you have read through it.
Part 1: Preprocess Data
Let's create a cleaned up dataset for the following examples. We will also recode the variant IDs to avoid duplicate ID issues. This can be an issue for these exercises. The resultant processed data ./data/processed/all_hg38_qcd.* will be our starting point for all the following analyses.
Plink 2:
time plink2 \
pfile 'vzs' ./data/raw/all_hg38 \
makepgen \
autosome \
snpsonly \
hwe 0.05 keepfewhet \
maf 0.05 \
setallvarids @:#\$r,\$a \
out ./data/processed/all_hg38_qcd
autosomes and snpsonly
Restrict the variants to autosomal chromosomes and single nucleotide polymorphisms.
hwe 0.05 keepfewhet
Threshold (remove) variants with a Pvalue <0.05 (higher p removes more) that violate hardyweinberg equilibrium  see Data Exploration 1  Basic Statistics (HWE, Allele Frequency Spectrum).
keepfewhet
Keeps variants that show excess homozygosity. These may be due to population structure rather than genotyping error.
maf 0.05
Removes variants with minor allele frequencies less than 0.05.
setallvarids @:#\$r,\$a
Recodes the variant IDs to unique labels with @=chromosome, #=basepair position, \$r=reference allele, and \$a=alternative allele.
Usage Note: Plink 2 requires unique variant IDs for some functions. If this is not the case, it will error out and log this issue. This dataset has nonunique IDs (likely due to multiallelic variants being split into separate lines). Here we recode them as above. Another option is to remove duplicate IDs. For this use rmdup.
Part 2: Linkage Disequilibrium (LD)
Here we will estimate the nonrandom cosegregation of alleles at different loci; aka, linkage disequilibrium (LD). In other words, how often does an allele A/a at one genomic location cosegregate with an allele B/b at another location. We only address biallelic two loci, pairwise, estimates here.
About:Coefficient of Linkage Disequilibrium, D
There are several methods for estimating LD. Historically, they have their roots in what is called the coefficient of linkage disequilibrium, D. The measure D is essentially the mathematical covariance between two loci. It is calculated by the difference in their observed joint occurrence (of AB, for example) to that expected by chance. For example, the D for allele A at locus 1 and allele B at locus 2 is given by:
D_{AB}=Π_{AB}Π_{A}Π_{B},
where Π:=frequency
Note the subscript AB. That is only one of several allelic combinations or haplotypes. Typically, we want a more general metric over all pairwise allele combinations. It can be shown for two biallelic loci, that only one set needs to be computed to get the magnitude of the general D, which is what we often operate on (i.e., filter on). The sign will depend on the specific allelic pair.
Let's estimate LD using Plink 2 r[2][un]phased... wait! Plink 2 needs to know if we want to run "phased" or "unphased".. Looking into the reference docs, it says that we should run "phased" for D and the like. Okay, but let's dive a little deeper into what this means.
About:Phase
Phase refers to whether the alleles at the two loci originate from the same parental chromosome. Recall that the historic definition of LD involves cosegregation and not just cooccurrence. Using (inferred) phase is one major difference between LD and a method that more naively takes the innerproduct of two vectors. Traditionally, LD often referred to coinheritance which was used for tracing causal genetic factors.
Plink 2 can use phase information if it was precomputed and part of your data (e.g., pgen). Plink 2 can also estimate phase without family trios, etc. It does this via a simple statistical model based on genotypes alone, using the cubic exact equation from Gaunt, Rodriguez, Day (2007) and maximum likelihood to select between multiple solutions. You can inspect whether multiple solutions exist and other information using Plink 2's ld method.
Back to our exercise. Let's estimate D for 1kGP3.
Plink 2:
time plink2 \
pfile ./data/processed/all_hg38_qcd \
chr 22 \
thincount 1000 \
seed 111 threads 1 memory 8000 require \
rphased cols=+d \
ldwindowr2 0 \
out ./results/qc/qc2/all_hg38_qcd_prephased_phasedr
chr 22
Restricts the analysis to chromosome 22 variants.
thincount
Randomly selects 1000 variants.
seed 111 threads 1 memory 8000 require
Sets the random seed for replicating results.
rphased cols=+d
Calculates the phased r and D statistics.
ldwindowr2 0
This removes LD filtering except for NaN. We do this for our comparisons below.
out ./results/qc/qc2/all_hg38_qcd_prephased
Add the descriptor 'prephased' to note that the originating 1kGP3 data has precomputed phase information.
To check if your data has phase information use:
pgeninfo
About: r and different D statistics
Thus far, we only talked about D. The metric r is a correlation, aka normalized transformation of the D (covariance) value. It is given by:
r=D/(Π_{A}(1Π_{A})Π_{B}(1Π_{B}))^{0.5}
This brings up an important aspect of using the D statistic. Although it gives information about the magnitude of associations between loci, it is a function of their allele frequencies and thus difficult to interpret across pairwise associations.
Thus it is common to rescale by some function of the allele frequencies. The statistic r uses the geometric mean of the loci variances (thus r is a correlation statistic). Another popular normalization is called Dprime that uses something like the correlation normalization (it can be called in Plink 2 using rphased cols=+d,+dprime). The statistics r, D, and Dprime are related but can give quite different values. In addition, the unphased r can also be different from the phased r statistic due to how the haplotype frequencies are estimated.
Let's calculate and compare four common LD statistics: 1) D using 1kGP3 prephasing, 2) phasedr using 1kGP3 prephasing, 3) phasedr using Plink 2 phasing, and 4) unphasedr. We already calculated (1) and (2) above.
Phasedr using Plink 2
1. Remove phase information from 1kGP3
Plink 2:
time plink2 \
pfile ./data/processed/all_hg38_qcd \
makepgen erasephase \
out ./data/processed/all_hg38_qcd_unphased
makepgen erasephase
Makes a pgen set of files with phase information removed.
2. Calculate phasedr using Plink 2
Plink 2:
time plink2 \
pfile ./data/processed/all_hg38_qcd_unphased \
chr 22 \
thincount 1000 \
seed 111 threads 1 memory 8000 require \
rphased \
ldwindowr2 0 \
out ./results/qc/qc2/all_hg38_qcd_unphased_phasedr
Unphasedr
Plink 2:
time plink2 \
pfile ./data/processed/all_hg38_qcd_unphased \
chr 22 \
thincount 1000 \
seed 111 threads 1 memory 8000 require \
runphased \
ldwindowr2 0 \
out ./results/qc/qc2/all_hg38_qcd_prephased_unphasedr
runphased
Use the unphased flag here to ignore phase information and estimate r.
Inspect Results
Now in the R console, run ...
R:
source("../R/plotLD.R")
As expected there is generally good agreement between the r statistics compared to D. While the unphasedr deviates to some degree with the phasedr using 1kGP3phasing, this is somewhat corrected by Plink 2's inferred phase.
We'll end this tutorial with another common operation, LD pruning. LD pruning primarily reduces computational burden and mitigates bias in our analyses. However, it is important to note that removing correlated variants does not save from the posthoc analyses of determining what is the causal loci (signal) among these correlated sets. We will revisit this concept and Plink 2's clump operation in tutorials where we perform genotypephenotype association analyses. For now, let's create an LDpruned dataset that we will use for other tutorials.
time plink2 \
pfile ./data/processed/all_hg38_qcd \
indeppairphase 100 0.8 \
out ./results/qc/qc3/all_hg38_qcd
time plink2 \
pfile ./data/processed/all_hg38_qcd \
makepgen \
extract ./results/qc/qc3/all_hg38_qcd.prune.in \
out ./data/processed/all_hg38_qcd_LE1
Data Exploration 2 — Genomic Structure — Relationship Matrix >>
