D: 19 May 2022
## Sample-distance and similarity matrices

## Relationship/covariance

--make-rel ['cov'] ['meanimpute'] [{square | square0 | triangle}]
- The '**square**', '**square0**', and '**triangle**' modifiers affect the shape of the output matrix. 'square' yields a symmetric matrix; 'triangle' (normally the default) yields a lower-triangular matrix where the first row contains only the <sample #1-sample #1> relationship, the second row has the <sample #1-sample #2> and <sample #2-sample #2> relationships in that order, etc.; and 'square0' yields a square matrix with all cells in the upper right triangle zeroed out.
- The '**bin**' modifier causes the matrix to be written to plink2.rel.bin using little-endian IEEE-754 double encoding (suitable for loading from R). When using 'bin', the default output shape is 'square' instead of 'triangle'.
- '**bin4**' uses IEEE-754 single-precision encoding, and is otherwise identical to 'bin'. This saves disk space, but you'll need to specify 4-byte single-precision input for your next analysis step. The following does so in R: readBin('<filename>', what="numeric", n=<number of entries>, **size=4**)
- As usual, '**zs**' requests Zstd compression of the matrix file. Note that it cannot be combined with binary output, since general-purpose compression is much less effective in that context.
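To make the three output shapes and the two binary encodings concrete, here is a minimal Python sketch. The function names are hypothetical, and the decoder assumes the binary file is a plain headerless array of n × n little-endian values in 'square' shape; check the file format documentation before relying on that layout.

```python
import struct


def triangle_to_square(rows, zero_upper=False):
    """Expand lower-triangular rows into a full square matrix.

    rows[i] holds i+1 values: the relationships between sample i and
    samples 0..i. With zero_upper=True the upper-right triangle stays
    zero (the 'square0' shape); otherwise it mirrors the lower triangle
    (the 'square' shape).
    """
    n = len(rows)
    square = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            square[i][j] = value
            if not zero_upper:
                square[j][i] = value
    return square


def decode_square_bin(data, n_samples, single_precision=False):
    """Decode little-endian IEEE-754 bytes into an n x n list of rows.

    ASSUMPTION: no header, square shape, n_samples * n_samples entries.
    single_precision=True corresponds to 4-byte ('bin4'-style) values,
    False to 8-byte doubles ('bin'-style).
    """
    code, size = ("f", 4) if single_precision else ("d", 8)
    n_entries = n_samples * n_samples
    if len(data) != n_entries * size:
        raise ValueError("byte count does not match n_samples and precision")
    flat = struct.unpack("<%d%s" % (n_entries, code), data)
    return [list(flat[i * n_samples:(i + 1) * n_samples])
            for i in range(n_samples)]
```

Passing the wrong precision changes the expected byte count, so the length check catches the most common mistake (reading a 4-byte file as doubles) before any values are misinterpreted.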
Other notes:

- This calculation is not LD-sensitive; if that's a problem, an alternative is Doug Speed et al.'s LDAK software.
- Dosages are used when available.
- Multiallelic variants are handled properly; as of 30 Dec 2019, PLINK 2 no longer collapses all minor allele dosages together. (The sum over variants in the numerator of the original biallelic formula is replaced by a sum over alleles, and the denominator is doubled.)
- By default, mean-imputation is not performed for missing values, and we generally recommend using dedicated imputation software instead. However, "--pca approx" is based on the relationship matrix with mean-imputed values, and in practice this has been good enough for --pca's usual applications when the missingness rate isn't too high. To force mean-imputation here, add the '**meanimpute**' modifier.
- If you have hundreds of thousands of samples, you may not have a machine with enough RAM to compute the entire matrix at once. In this case, you can use --parallel to divide this computation into manageable pieces (though it'll still be tricky to use the output effectively...).
- Special handling of the diagonal is no longer supported.
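Two of the notes above lend themselves to a short illustration: what mean-imputation does to one variant's dosages, and why the full matrix becomes unwieldy at large sample counts. The Python sketch below uses hypothetical helper names and is not PLINK's implementation. The size estimate only counts the bytes needed to hold a lower triangle of 8-byte values, which is an assumption about storage and says nothing about PLINK's actual working memory.

```python
def mean_impute(dosages):
    """Replace missing dosages (None) with the mean of the observed ones.

    `dosages` holds one variant's values across all samples.
    """
    observed = [d for d in dosages if d is not None]
    if not observed:
        return list(dosages)  # nothing to impute from; leave untouched
    mean = sum(observed) / len(observed)
    return [mean if d is None else d for d in dosages]


def triangle_bytes(n_samples, bytes_per_entry=8):
    """Bytes needed to store a lower triangle (diagonal included)."""
    n_entries = n_samples * (n_samples + 1) // 2
    return n_entries * bytes_per_entry


# Example: 500,000 samples at double precision.
size_tb = triangle_bytes(500_000) / 1e12  # roughly 1 TB
```

At 500,000 samples the triangle alone is about one terabyte, which is the scale at which splitting the job with --parallel becomes relevant.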
## Exporting to GCTA

--make-grm-list ['cov'] ['meanimpute'] ['zs'] [{id-header | iid-only}]
These computations can be subdivided with --parallel.

## KING-robust kinship estimator

The relationship matrix computed by --make-rel/--make-grm-list/--make-grm-bin can be used to reliably identify close relations within a single population, if your MAFs are decent. However, Manichaikul et al.'s KING-robust estimator can also be mostly trusted on mixed-population datasets (with one uncommon exception noted below), and doesn't require MAFs at all. Therefore, we have added this computation to PLINK 2, and the relationship-based pruner is now based on KING-robust. The exception is that KING-robust underestimates kinship when the parents are from very different populations. You may want to have some special handling of this case; --pca can help detect it.
--make-king [{square | square0 | triangle}] [{zs | bin | bin4}]
- Only autosomes are included in this computation.
- Pedigree information is currently ignored; the between-family estimator is used for all pairs.
- For multiallelic variants, REF allele counts are used.
- --make-king jobs with the 'square0' or 'triangle' output shapes and all --make-king-table jobs can be subdivided with --parallel.
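To show why the KING-robust estimator needs no allele frequencies, here is a Python sketch of the between-family estimator for a single pair of samples, written from the formula as it is usually stated for Manichaikul et al.'s method. It is an illustration, not code taken from PLINK or KING: the function name and the 0/1/2 REF-allele-count encoding with None for missing calls are assumptions, and the formula should be checked against the paper before being relied on.

```python
def king_robust_kinship(geno_i, geno_j):
    """Between-family KING-robust kinship estimate for one sample pair.

    geno_i, geno_j: per-variant REF-allele counts (0, 1 or 2), with None
    for a missing call. Only variants called in both samples are used.
    Returns None when the sample with fewer heterozygous calls has none,
    since the estimator is undefined in that case.
    """
    hethet = 0      # both samples heterozygous
    het_i_only = 0  # sample i heterozygous, sample j homozygous
    het_j_only = 0  # sample j heterozygous, sample i homozygous
    ibs0 = 0        # opposite homozygotes (no shared allele)
    for a, b in zip(geno_i, geno_j):
        if a is None or b is None:
            continue
        if a == 1 and b == 1:
            hethet += 1
        elif a == 1:
            het_i_only += 1
        elif b == 1:
            het_j_only += 1
        elif a != b:
            ibs0 += 1
    # Smaller of the two samples' heterozygous-call counts.
    min_het = hethet + min(het_i_only, het_j_only)
    if min_het == 0:
        return None
    return 0.5 - (4 * ibs0 + het_i_only + het_j_only) / (4 * min_het)
```

Every term is a count of genotype configurations within the pair, so no population-level MAF estimate enters the calculation. A sample compared with itself has no opposite homozygotes and no discordant heterozygous calls, and the estimate comes out as 0.5.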
In addition, with --make-king-table,

- The '**counts**' modifier causes counts rather than 0..1 frequencies to be reported in the output columns that support both.
- The '**rel-check**' modifier causes only same-FID pairs to be reported. (The between-family KING estimator is still used.)
- **--king-table-filter** causes only kinship coefficients ≥ the given threshold to be reported.
- **--king-table-subset** causes only sample-pairs mentioned in the given .kin0 file (and optionally passing a kinship-coefficient threshold) to be processed. This allows you to start with a screening step which considers all sample pairs but only a small number of variants scattered across the genome (try --maf + --bp-space), and follow up with accurate kinship-coefficient computations for just the sample pairs identified as possible relations during the screening step. (This two-step approach remains practical with millions of samples!)
- Refer to the file format entry for other output details and optional columns.

--make-king-table now covers much of PLINK 1.x --genome's functionality.
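The thresholding used by the filter and subset options can be mimicked outside PLINK when inspecting a screening table by hand. The Python sketch below assumes a tab-delimited table with a header line that includes a KINSHIP column; the function name, the sample rows and the other column names are hypothetical illustrations, so consult the file format entry for the real layout.

```python
import csv
import io


def filter_kin0(lines, threshold):
    """Return the rows of a .kin0-style table with KINSHIP >= threshold.

    `lines` is any iterable of text lines (an open file works).
    ASSUMPTION: tab-delimited, first line is a header with a KINSHIP
    column.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if float(row["KINSHIP"]) >= threshold]


# Hypothetical screening output for three sample pairs.
example = io.StringIO(
    "#IID1\tIID2\tNSNP\tKINSHIP\n"
    "A\tB\t5000\t0.26\n"
    "A\tC\t5000\t-0.01\n"
    "B\tC\t5000\t0.12\n"
)
candidates = filter_kin0(example, threshold=0.1)
candidate_pairs = [(row["#IID1"], row["IID2"]) for row in candidates]
```

The threshold of 0.1 here is arbitrary. In a two-step workflow the screening threshold would normally be set loosely enough that true relations are not lost before the accurate second pass.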
See also the original KING software package, which has some useful two-step workflows directly built in, along with handy additional features like pedigree inference.

## Relationship-based pruning

--king-cutoff [.king.bin + .king.id fileset prefix] <threshold>

If used in conjunction with a later calculation (see the order of operations page for details), PLINK tries to maximize the final sample size, but this maximum independent set problem is NP-hard, so we use a greedy algorithm which does not guarantee an optimal result. In practice, --king-cutoff does yield a maximum set whenever there aren't too many intertwined close relations, but if you want to try to beat it (or optimize a fancier function that takes the exact kinship-coefficient values into account), use the --make-king and --keep/--remove flags and patch your preferred algorithm in between.

--king-cutoff usually computes kinship coefficients from scratch. However, you can provide a precomputed kinship-coefficient matrix (must be --make-king binary format, triangular shape, either precision ok) as input; this is a time-saver when experimenting with different thresholds.
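For readers who want to patch in their own pruning step between --make-king and --keep/--remove, the Python sketch below shows one generic greedy heuristic for the independent-set problem. It is not PLINK's actual algorithm, whose details are not given here. The function name is hypothetical, and `related_pairs` is assumed to already contain only the pairs whose kinship coefficient exceeds your chosen threshold.

```python
def greedy_prune(samples, related_pairs):
    """Remove samples until no related pair remains; return those kept.

    Heuristic (an assumption, not PLINK's method): repeatedly drop the
    sample that currently has the most remaining relatives. Like any
    greedy approach, it does not guarantee a maximum-size result.
    """
    neighbors = {s: set() for s in samples}
    for a, b in related_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    removed = set()
    while True:
        # Sample with the most remaining relatives, if any samples remain.
        worst = max(neighbors, key=lambda s: len(neighbors[s]), default=None)
        if worst is None or not neighbors[worst]:
            break  # no related pairs left
        # Detach the dropped sample from each of its relatives.
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)
        removed.add(worst)
    return [s for s in samples if s not in removed]
```

The returned list preserves the input order and could be written out as a --keep file. A variant that breaks ties using the exact kinship values is one example of the "fancier function" mentioned above.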