S: 28 Apr 2020 (b6.17) D: 28 Apr 2020
## Population stratification

## Clustering

--cluster ['cc'] [{group-avg | old-tiebreaks}] ['missing'] ['only2']

- By default, everyone starts in their own cluster. However, if the --within or --family flag is present, that cluster assignment is used as the starting point instead.
- The '**cc**' modifier prevents two all-case or two all-control clusters from being merged. For consistency with --mcc, missing-phenotype samples are treated as controls (this is a change from PLINK 1.07).
- By default, the distance between two clusters is defined as the maximum pairwise distance between a member of the first cluster and a member of the second cluster. The '**group-avg**' modifier causes average pairwise distance to be used instead.
- The '**missing**' modifier causes clustering to be based on identity-by-missingness instead of identity-by-state. It also causes an identity-by-missingness matrix to be written to plink.mdist.missing.
- By default, **three files describing the dendrogram** are produced; the '**only2**' modifier causes only the .cluster2 file (which describes only the final cluster configuration, and can be read with --within) to be written.
- When equal IBS values are present, PLINK 1.9 normally does not try to break the tie in the same manner as PLINK 1.07, so the final cluster solutions tend to differ. This is generally harmless: there is no *a priori* reason to prefer one tiebreak scheme to the other. However, for testing purposes, you can use the '**old-tiebreaks**' modifier to force PLINK 1.9 to emulate the old algorithm. (Due to floating point imprecision, it would be pointless to use this with 'group-avg'.)

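The difference between the default (maximum pairwise distance) and 'group-avg' cluster distances can be illustrated with a small Python sketch. This is not PLINK's implementation: the function names, the toy distance matrix, and the "first minimum wins" tiebreak are all assumptions made for illustration, and distances here stand in for something like 1 - IBS.

```python
from itertools import combinations


def cluster_distance(a, b, dist, group_avg=False):
    """Distance between clusters a and b (lists of sample indices)."""
    pair_dists = [dist[i][j] for i in a for j in b]
    if group_avg:
        # 'group-avg' style: mean over all cross-cluster pairs.
        return sum(pair_dists) / len(pair_dists)
    # Default style: the largest cross-cluster pairwise distance.
    return max(pair_dists)


def agglomerate(dist, n_clusters, group_avg=False):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]  # everyone starts alone
    while len(clusters) > n_clusters:
        # min() keeps the first minimum it sees; this is an arbitrary
        # tiebreak chosen for the sketch, not PLINK's scheme.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(
                clusters[p[0]], clusters[p[1]], dist, group_avg
            ),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


# Hypothetical symmetric distance matrix for four samples.
toy_dist = [
    [0.0, 0.1, 0.5, 0.6],
    [0.1, 0.0, 0.4, 0.7],
    [0.5, 0.4, 0.0, 0.2],
    [0.6, 0.7, 0.2, 0.0],
]
final_clusters = agglomerate(toy_dist, 2)
```

With this toy matrix, samples 0 and 1 merge first and samples 2 and 3 merge second under either distance definition; the two definitions only disagree on how far apart the resulting clusters are.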
--cluster automatically launches an appropriate IBS calculation when necessary, so you don't have to use it with --distance/--ibs-matrix/--genome unless you want to save the distance matrix to disk.

## Reusing an IBS/IBD calculation

--read-genome <filename>

You can also invoke --read-dists to reuse the results of a "--distance triangle bin ibs" run. If --read-dists and --ppc are present in the same run, PPC test p-values are calculated from scratch (or loaded via --read-genome) while distances are loaded from the .mibs.bin file.

## Adding clustering constraints

--ppc <minimum p-value>
--mc <maximum cluster size>

- **--ppc** prevents two clusters from being merged if, for any cross-cluster pair of samples, the --genome PPC test p-value is below the given threshold.
- **--mc** prevents two clusters from being merged if the new cluster would be larger than the given size.
- **--mcc** prevents two clusters from being merged if the new cluster would contain too many cases or controls. Missing-phenotype samples are treated as controls.
- **--K** stops cluster merging once there are no more than the given number of clusters remaining.
- **--ibm** prevents two clusters from being merged if any cross-cluster pair of samples has identity-by-missingness smaller than the given value.

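The size-based constraints (--mc and --mcc) amount to a check that runs before each candidate merge. The Python sketch below shows one way such a check could look. The function and parameter names (`merge_allowed`, `max_size`, `max_cases`, `max_controls`) are hypothetical and do not correspond to PLINK internals; the only behaviour taken from the text above is that missing-phenotype samples count as controls.

```python
def merge_allowed(a, b, is_case, max_size=None, max_cases=None,
                  max_controls=None):
    """Return True if merging clusters a and b respects the given limits.

    a, b     -- lists of sample IDs
    is_case  -- dict: sample ID -> True (case), False (control),
                or None (missing phenotype)
    """
    merged = a + b
    # --mc style limit: total size of the merged cluster.
    if max_size is not None and len(merged) > max_size:
        return False
    cases = sum(1 for s in merged if is_case.get(s) is True)
    # Missing-phenotype samples are counted as controls.
    controls = len(merged) - cases
    # --mcc style limits: separate caps on cases and on controls.
    if max_cases is not None and cases > max_cases:
        return False
    if max_controls is not None and controls > max_controls:
        return False
    return True


# Hypothetical phenotypes: two cases, one control, one missing.
pheno = {"a": True, "b": True, "c": False, "d": None}
```

For example, `merge_allowed(["c"], ["d"], pheno, max_controls=1)` is False here, because the missing-phenotype sample "d" is counted as a second control.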
If the initial cluster assignment violates any of these constraints, a warning will be printed.

--match <filename> [missing value]

Given a file where each line has the following fields:

- 1. Family ID
- 2. Within-family ID
- 3. Covariate 1
- 4. Covariate 2
- ...
- **M**+2. Covariate **M**

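As an illustration of the file layout just listed, the following Python sketch reads whitespace-separated lines into a lookup table keyed by (family ID, within-family ID). Several things here are assumptions rather than documented PLINK behaviour: the helper names, the requirement that every line has the same number of fields, and the rule in `covariates_agree` that a missing value never counts as a mismatch.

```python
def parse_match_file(lines, missing=None):
    """Map (FID, IID) -> list of covariates, with None for missing values."""
    table = {}
    width = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError("expected FID, IID and at least one covariate")
        if width is None:
            width = len(fields)
        elif len(fields) != width:
            # Assumption for this sketch: all lines have the same width.
            raise ValueError("inconsistent number of fields")
        fid, iid, covs = fields[0], fields[1], fields[2:]
        table[(fid, iid)] = [None if c == missing else c for c in covs]
    return table


def covariates_agree(a, b):
    """Assumed rule: compare only covariates present in both samples."""
    return all(x == y for x, y in zip(a, b)
               if x is not None and y is not None)


# Hypothetical file contents; "NA" plays the role of [missing value].
rows = parse_match_file(
    ["F1 I1 EUR M", "F2 I2 EUR F", "F3 I3 NA M"], missing="NA"
)
```

In this toy input, samples F1/I1 and F2/I2 disagree on the second covariate, while F3/I3 is compared with the others on the second covariate only.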
To instead force members of the same cluster to differ on some or all of these covariates, you can combine --match with [...]

To enforce within-cluster similarity (but not uniformity) on some quantitative trait(s), you can use [...]

For backwards compatibility, if no second parameter is provided to --qmatch, the --missing-phenotype value (default '-9') is still treated as missing. If there are fewer than [...]

--match and --qmatch can be used in the same run (in which case their input files don't have to contain the same number of covariates). If the initial cluster assignment violates a --match or --qmatch constraint, a warning will be printed.

## Dimension reduction

PLINK 1.9 provides two dimension reduction routines: --pca, for principal components analysis (PCA) based on the variance-standardized relationship matrix, and --mds-plot, for multidimensional scaling (MDS) based on raw Hamming distances. Top principal components are generally used as covariates in association analysis regressions to help correct for population stratification, while MDS coordinates help with visualizing genetic distances.

--pca [count] ['header'] ['tabs'] ['var-wts']
--pca-cluster-names <name(s)...>

By default, [...]

This is a simple port of GCTA's --pca flag, which generates the same files from a previously computed relationship matrix. For more full-featured principal component analysis, including automatic outlier removal, high-speed randomized approximation for very large datasets, and LD regression, try EIGENSOFT 6.

If clusters are defined (via --within), you can base the principal components off a subset of samples and then project everyone else onto those PCs with [...]

--mds-plot <dimension count> ['by-cluster'] ['eigendecomp'] ['eigvals']

In combination with --cluster, [...]

The default, singular value decomposition-based algorithm is designed to give the same results as PLINK 1.07 and the R cmdscale() function (up to rounding errors and sign flips, anyway).

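For readers unfamiliar with classical MDS of the cmdscale() kind, the Python sketch below computes the first MDS coordinate from a symmetric distance matrix. It is a teaching aid under stated assumptions, not PLINK's algorithm: it uses double centring followed by plain power iteration instead of a singular value decomposition, it returns only one dimension, and the function name and toy matrix are invented for the example. The sign of the output is arbitrary, which is the "sign flips" caveat mentioned above.

```python
import math


def classical_mds_1d(dist, iterations=200):
    """First classical-MDS coordinate for each sample (pure Python)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J, with J = I - (1/n) 11^T.
    b = [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
         for j in range(n)]
        for i in range(n)
    ]
    # Power iteration for the leading eigenvector of B. The start vector
    # is deliberately not constant, since constant vectors are in B's
    # null space.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the matching eigenvalue (v has unit length).
    eigval = sum(
        v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    # Coordinates are the eigenvector scaled by sqrt(eigenvalue).
    return [x * math.sqrt(eigval) for x in v]


# Three hypothetical samples lying on a line at positions 0, 1 and 3.
line_dist = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
coords = classical_mds_1d(line_dist)
```

Because the three toy samples lie exactly on a line, the single recovered coordinate reproduces all pairwise distances; real genotype distances would need more dimensions and a proper eigensolver.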
## Outlier detection diagnostics

--neighbour <n1> <n2>

For each sample, [...]

Note that PLINK 1.9 does not require --neighbour to be used with --cluster.