D: 18 Aug 2024
Input filtering

The following flags allow you to exclude samples and/or variants from an analysis batch based on a variety of criteria. Two general notes:
ID lists

--keep <filename(s)...>
--keep-fam <filename(s)...>

--keep accepts one or more space/tab-delimited text files with sample IDs, and removes all unlisted samples from the current analysis; --remove does the same for all listed samples. Similarly, --keep-fam and --remove-fam accept text files with family IDs in the first column, and keep or remove entire families.

--keep/--remove now support a wider variety of sample ID file formats:
Single sample ID

--indv <sample ID>

--indv accepts a single 1-3 part sample ID, and removes all samples with different IDs. Separate sample ID parts with spaces.

--extract [{bed0 | bed1}] <filename(s)...>
--extract-intersect [{bed0 | bed1}] <filename(s)...>

--extract normally accepts one or more text file(s) with variant IDs (usually one per line, but it's okay for them to just be separated by spaces), and removes all unlisted variants from the current analysis. With the 'bed0' or 'bed1' modifier, the input file should be in 0-based or 1-based interval-BED format instead. For backward compatibility, 'range' is an alias for 'bed1'. --exclude does the same for all listed variants.

--extract-intersect is just like --extract, except that a variant must be in the intersection, rather than just the union, of the --extract-intersect files to be kept.

--bed-border-bp <#>

--bed-border-bp extends all the intervals in an input BED file (for e.g. "--extract bed0") by the given number of base-pairs on both sides. --bed-border-kb interprets its argument as a kilobase count, and is otherwise identical.

--extract-col-cond <filename> [value col. number] [ID col.] [skip]
--extract-col-cond-match <(sub)string(s)...>
--extract-col-cond-min <min>

--extract-col-cond excludes all variants which either don't appear in the given input file, or are associated with a value which doesn't satisfy the given condition. (This is a generalization of PLINK 1.x's --qual-scores flag.) It is designed to support filtering on INFO-like values stored in a separate tab-delimited file.
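The interaction between the 'bed0'/'bed1' modifiers and --bed-border-bp described above can be sketched in a few lines of Python. This is only an illustration, not PLINK 2's implementation: the function name is hypothetical, variant positions are taken to be 1-based, and it assumes the usual conventions that a 0-based BED interval is half-open (start excluded, end included once converted to 1-based coordinates) while a 1-based interval includes both endpoints.

```python
def in_bed_interval(pos, start, end, one_based=False, border_bp=0):
    """Return True if 1-based variant position `pos` lies in the interval.

    Assumptions (not taken from the PLINK 2 source):
      - bed0: `start` is 0-based and `end` is exclusive, so the interval
        covers 1-based positions start+1 .. end.
      - bed1: the interval covers 1-based positions start .. end.
      - border_bp widens the interval by the same amount on both sides.
    """
    if one_based:
        lo, hi = start, end
    else:
        lo, hi = start + 1, end
    return lo - border_bp <= pos <= hi + border_bp


# The same numbers cover different positions under the two conventions.
print(in_bed_interval(100, 100, 200))                  # bed0
print(in_bed_interval(100, 100, 200, one_based=True))  # bed1
```

Under these assumptions, position 100 is outside the bed0 interval "100 200" but inside the bed1 interval with the same numbers, which is the usual source of off-by-one surprises when mixing the two formats.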
QUAL, FILTER, INFO

--var-min-qual <value>

--var-min-qual causes all variants with QUAL value smaller than the given number, or with no QUAL value at all, to be skipped.

--var-filter [exception(s)...]

To skip variants which failed one or more filters tracked by the FILTER field, use --var-filter. This can be combined with one or more (space-delimited) filter names to ignore.

  ##FILTER=<ID=q10,Description="Quality below 10">

For example, given the .pvar file above:
--extract-if-info <key> <operator> <value>

--extract-if-info removes all variants which don't satisfy a comparison predicate on an INFO key. For numbers, the supported operators are '!=', '<', '<=', '==' (single '=' also ok), '>=', and '>'; for strings, only '!=' and '='/'==' can be used. If the key or value is missing, the predicate evaluates to true iff the operator is '!='. As a special case, you can specify the empty-string value as ';'. (This was not supported before 8 Jan 2023.)

Note that the '<', '>', and ';' characters have special meanings in practically all shells; it is necessary to wrap them in quoted expressions. In bash, you can either quote the special characters individually, e.g.

  --extract-if-info AFR_AF '>'= 0.05

or put the entire predicate in double-quotes (recommended):

  --extract-if-info "AFR_AF >= 0.05"

Similarly, --exclude-if-info removes all variants which do satisfy such a comparison predicate.

--require-info removes all variants which don't have all of the listed INFO keys. (A key is treated as if it isn't present when the associated value is '.'.) Similarly, --require-no-info removes all variants which have any of the listed keys.

Chromosomes

--chr <number(s)/range(s)...>

--chr excludes all variants not on the listed chromosome(s). Normally, valid choices for humans are 0 (i.e. unknown), 1-22, X, Y, MT, PAR1/PAR2 (pseudo-autosomal region of X; see --split-par/--merge-par), and XY (deprecated PLINK 1.x code intended to refer to the pseudo-autosomal region). Separate multiple chromosomes with spaces or commas, and use dashes to specify ranges. Spaces are not permitted immediately before or after a range-denoting dash. For example, the following are all valid and equivalent:

  --chr 1-4, 22, xy

You might wonder about the '25'. Several non-autosomal chromosomes can also be identified by numeric code: if there are n autosomes, n+1 is the X chromosome, n+2 is Y, n+3 is XY, and n+4 is MT. (However, no numeric codes are associated with PAR1/PAR2.)
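The numeric-code rule above is simple arithmetic, and a small Python sketch makes the origin of '25' concrete. The function name and the choice to return None for PAR1/PAR2 are illustrative assumptions, not part of PLINK 2.

```python
def numeric_chr_code(name, n_autosomes=22):
    """Map a chromosome name to its numeric code under the n+1..n+4 rule.

    Returns None for PAR1/PAR2, which have no numeric code.
    Hypothetical helper for illustration only.
    """
    offsets = {"X": 1, "Y": 2, "XY": 3, "MT": 4}
    key = name.upper()
    if key in offsets:
        return n_autosomes + offsets[key]
    if key in ("PAR1", "PAR2"):
        return None
    code = int(key)  # raises ValueError for unrecognized names
    if not 0 <= code <= n_autosomes + 4:
        raise ValueError(f"chromosome code out of range: {name}")
    return code


# With the human default of 22 autosomes, XY maps to 22 + 3.
print(numeric_chr_code("xy"))
```

For humans (n = 22) this gives 23 for X, 24 for Y, 25 for XY, and 26 for MT.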
--not-chr is the reverse of --chr: variants on listed chromosome(s) are excluded. So

  --not-chr 0 5-21 x y mt par1 par2

is equivalent to the three --chr examples above (assuming human data). (Yes, if your data uses PAR1/PAR2 codes, "--chr xy" will not cause them to be included. If this is problematic, see --autosome-par below.)

If you specified --allow-extra-chr, you can refer to the extra chromosome codes by name, e.g.

  --allow-extra-chr --not-chr chr1_gl000191_random

--autosome-par

--autosome excludes all unplaced and non-autosomal variants, while --autosome-par does not exclude XY/PAR1/PAR2. They can be combined with --not-chr, e.g.

  --autosome-par --not-chr 5-21

is also equivalent to the three --chr examples.

Keep only SNPs

--snps-only ['just-acgt']

--snps-only excludes all variants with one or more multi-character allele codes. With 'just-acgt', variants with single-character allele codes outside of {'A', 'C', 'G', 'T', 'a', 'c', 'g', 't', <missing code>} are also excluded.

Simple variant window

--from <variant ID>

--from excludes all variants on different chromosomes than the named variant, as well as those with smaller base-pair position values. --to is similar, excluding variants with larger position values instead. If they are used together but the --from variant is after the --to variant, they are automatically swapped.

--snp <variant ID>

--snp specifies a single variant to load by name. If it's combined with --window, all variants with physical position no more than half the specified kb distance (decimal permitted) from the named variant are loaded as well. Similarly, --exclude-snp specifies a single variant to exclude; this can also be combined with --window.

--from-bp <pos>

These flags let you use physical positions to specify a variant range to load. Kilobase and megabase values can include decimals. You are required to specify a single chromosome when using these.
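The "half the specified kb distance" rule for --snp with --window is easy to misread as a radius, so here is a minimal Python sketch of the membership test. The function name and the (chromosome, position) tuple representation are hypothetical, and restricting the window to the named variant's chromosome is an assumption on my part.

```python
def in_snp_window(center, variant, window_kb):
    """Return True if `variant` falls within the --window around `center`.

    `center` and `variant` are (chromosome, bp_position) tuples.
    The window is the full width in kb, so the allowed distance on each
    side is half of it. Same-chromosome restriction is an assumption.
    """
    center_chrom, center_pos = center
    chrom, pos = variant
    half_width_bp = window_kb * 1000 / 2
    return chrom == center_chrom and abs(pos - center_pos) <= half_width_bp


# A 1 kb window reaches 500 bp to either side of the named variant.
print(in_snp_window(("1", 10000), ("1", 10500), 1))
print(in_snp_window(("1", 10000), ("1", 10501), 1))
```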
Multiple ranges

--snps <variant ID(s)/range(s)...>

--snps accepts a collection of individual variant IDs and variant ranges. For example,

  --snps rs1111-rs2222, rs3333, rs4444

tells PLINK to load all variants between rs1111 and rs2222 inclusive, as well as rs3333 and rs4444. (Syntax works the same way as --chr. If your variant IDs contain dashes, you'll want to use the --d flag as well.) If rs1111 and rs2222 are on different chromosomes i < j, then all variants on chromosomes numbered between i and j are loaded, as well as the last variants on chromosome i and the first variants on chromosome j. (You can exclude some intermediate chromosomes by combining --snps with --not-chr.)

--exclude-snps excludes all the specified variants/ranges instead.

To reduce the potential for confusion, PLINK 2 normally errors out when multiple variant-inclusion filters (--extract[-intersect], --extract-col-cond, --from/--to, --from-bp/--to-bp, --snp, --snps) are specified, since it may not be obvious whether the intersection or union will be taken. --force-intersect allows the run to proceed; the set intersection will be taken.

Deduplicate variants

--rm-dup [mode] ['list']

--rm-dup usually removes all but one instance of each duplicate-ID variant (ignoring the missing ID). With the 'list' modifier, the original duplicated IDs are written to plink2.rmdup.list. The following modes of operation are supported:
Arbitrary thinning

--thin <p>

--thin removes variants at random by retaining each variant with probability p, --thin-count removes variants at random until only n remain, and --bp-space excludes one variant from each pair closer than the given bp count. (Yes, --bp-space is equivalent to VCFtools --thin; we can't do much about this mixup without breaking backward compatibility.) Note that LD-based pruning also has a variant thinning effect, and is normally more useful than these three commands.

Similarly, --thin-indiv removes samples at random by retaining each sample with probability p, while --thin-indiv-count removes samples at random until only n remain.

Phenotype/covariate-based

--keep-if <phenotype/covariate name> <operator> <value>

--keep-if removes all samples which don't satisfy a comparison predicate on a phenotype or covariate, while --remove-if does the reverse.
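Returning to the thinning flags above: the difference between probability-based and count-based thinning is that the first yields a random number of survivors while the second yields exactly n. A minimal Python sketch, assuming independent draws per item and uniform sampling without replacement (the function names are hypothetical and this is not PLINK 2's random number generator):

```python
import random


def thin(ids, p, rng):
    """Keep each ID independently with probability p (cf. --thin / --thin-indiv)."""
    return [v for v in ids if rng.random() < p]


def thin_count(ids, n, rng):
    """Keep exactly n IDs chosen uniformly at random, preserving input order
    (cf. --thin-count / --thin-indiv-count)."""
    if n >= len(ids):
        return list(ids)
    keep = set(rng.sample(range(len(ids)), n))
    return [v for i, v in enumerate(ids) if i in keep]


rng = random.Random(0)  # fixed seed so repeated runs give the same subset
variant_ids = [f"rs{i}" for i in range(1, 11)]
print(len(thin(variant_ids, 0.5, rng)))        # varies around 5
print(len(thin_count(variant_ids, 3, rng)))    # always 3
```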
--require-pheno [phenotype name(s)...]

When parameters are provided, --require-pheno removes samples missing any of the named phenotypes; otherwise, it removes samples missing any loaded phenotype. --require-covar does the same things for covariates.

--keep-cats <filename>
--keep-cat-pheno <phenotype/covariate name>
--remove-cat-pheno <phenotype/covariate name>

If exactly one categorical phenotype/covariate is loaded, --keep-cats and --keep-cat-names can be used individually or in combination to define a list of categories to keep; all samples not in one of those categories are then removed from the current analysis. --keep-cats accepts a text file with one category name per line, and --keep-cat-names takes a space-delimited sequence of category names on the command line.

If multiple categorical phenotypes/covariates are loaded, use --keep-cat-pheno to specify which variable --keep-cats/--keep-cat-names should apply to. (This is still safe when only one categorical variable is present.)

Similarly, --remove-cats removes all samples in categories named in a file, --remove-cat-names removes all samples in categories named on the command line, and --remove-cat-pheno specifies which variable --remove-cats/--remove-cat-names should apply to.

String match

--keep-col-match <filename> <string(s) to match...>
--keep-col-match-name <column name>
--keep-col-match-num <n>

--keep-col-match accepts a space/tab-delimited text file with sample IDs in the first column(s) and a string to filter on in a later column. You can specify this column with either --keep-col-match-num or --keep-col-match-name (the latter requires a header line starting with #FID or #IID); with neither, "--keep-col-match-num 3" is assumed. All samples that are either missing from the file or have a string value which doesn't match any of the strings you provided are removed from the analysis. The string comparison is case-sensitive, and numbers are not parsed, so '9', '9.0', '9e0', and '9E0' all compare unequal.
(This is a minor extension of PLINK 1.x's --filter flag.)

Missing genotype rates

--geno [maximum per-variant] [{dosage | hh-missing}]

--geno filters out all variants with missing call rates exceeding the provided value (default 0.1), while --mind does the same for samples. If any samples were removed by --mind, their IDs are written to plink2.mindrem.id.

By default, when a dosage is present but a hardcall is not, the genotype is treated as missing; add the 'dosage' modifier to treat this case as nonmissing. Alternatively, you can use 'hh-missing' to also treat heterozygous haploid calls as missing.

Number of distinct alleles

--min-alleles <count>

--min-alleles excludes variants with fewer than the given number of alleles in the .pvar/.bim file, while --max-alleles excludes variants with more. For example, "--max-alleles 2" filters out the multiallelic variants which would otherwise make --make-bed error out. When a variant has exactly one ALT allele and it's a missing-code, these filters treat it as having only one allele.

--import-max-alleles <count>

--import-max-alleles is similar to --max-alleles, but applied during VCF/BCF/BGEN dataset import. This allows e.g. VCF/BCF files containing a few records with 255+ ALT alleles to be (partially) imported by PLINK 2 without a slow bcftools preprocessing step. Count must be at least 2.

Allele frequencies/counts

--maf [minimum freq] [mode]

--maf filters out all variants with allele frequency below the provided threshold (default 0.01), while --max-maf imposes an upper bound. Similarly, --mac and --max-mac impose lower and upper allele count bounds, respectively.

By default, these flags operate on 'nonmajor' (i.e. sum of all but the largest value) allele frequencies/counts. Three other modes are supported: 'nref' (nonreference), 'alt1', and 'minor' (smallest). You can use bcftools-style freq:mode notation for this.
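The four modes only differ for multiallelic variants, or when the reference allele is not the most common one. The Python sketch below shows one reading of the definitions above; the function name is hypothetical, the input is assumed to be ordered REF, ALT1, ALT2, ..., and 'nref' is interpreted as the sum over all ALT alleles.

```python
def allele_stat(values, mode="nonmajor"):
    """Compute the quantity a --maf/--mac style filter would compare.

    `values` holds per-allele frequencies or counts, ordered REF, ALT1, ALT2, ...
    Illustrative sketch only; not PLINK 2's implementation.
    """
    if mode == "nonmajor":   # sum of all but the largest value
        return sum(values) - max(values)
    if mode == "nref":       # everything except the reference allele
        return sum(values[1:])
    if mode == "alt1":       # first ALT allele only
        return values[1]
    if mode == "minor":      # smallest value
        return min(values)
    raise ValueError(f"unknown mode: {mode}")


# Allele counts for a triallelic variant where ALT1 is the major allele.
counts = [30, 60, 10]
for mode in ("nonmajor", "nref", "alt1", "minor"):
    print(mode, allele_stat(counts, mode))
```

With these counts the four modes give 40, 70, 60, and 10 respectively, so the same threshold can keep or drop the variant depending on the mode.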
When pedigree information is present, --maf and --max-maf default to ignoring nonfounders when applying these filters; this can be changed with --nonfounders. There is no longer an analogous default for --mac/--max-mac; you now must explicitly specify how you want nonfounders to be handled (with --nonfounders or --ac-founders) when using those flags.

--af-pseudocount causes allele frequencies to be estimated as

  qhat := (x + <# of observations of current allele>) / (x · <# of distinct alleles> + <# of obs. of any allele>)

instead of the usual

  qhat := <# of observations of current allele> / <# of observations of any allele>.

When the --read-freq file contains observation counts, --af-pseudocount acts on those counts.

Hardy-Weinberg equilibrium tests

--hwe <p-value> ['midp'] ['keep-fewhet']

--hwe filters out all variants which have Hardy-Weinberg equilibrium exact test p-value below the provided threshold. We recommend setting a low threshold: serious genotyping errors often yield extreme p-values like 1e-50 [1] which are detected by any reasonable configuration of this test, while genuine SNP-trait associations can be expected to deviate slightly from Hardy-Weinberg equilibrium (so it's dangerous to choose a threshold that filters out too many variants). This HWE p-value calculator may be helpful.

On chrX, p-values are now computed using the method described in Graffelman J, Weir BS (2016) Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome.

--hwe's 'midp' modifier applies the mid-p adjustment described in Graffelman J, Moreno V (2013) The mid p-value in exact tests for Hardy-Weinberg equilibrium. The mid-p adjustment tends to bring the null rejection rate in line with the nominal p-value, and also reduces the filter's tendency to favor retention of variants with missing data. We recommend its use.
However, even with 'midp', you should not apply a single QC-filtering p-value threshold across a batch of variants with highly variable missing call rates: if you have e.g. 100000 genotypes for one variant and 1000 genotypes for another, the same p-value could be within the range of normal variation for the first variant while corresponding to obvious genotyping error for the second. A warning is given whenever observation counts vary by more than 10%.

For multiallelic variants, a separate biallelic test is performed for every allele, and the variant is filtered out iff any of the tests yields a [mid-]p-value below the threshold.

Only founders are considered by this test; use --nonfounders to change this.

When significant population stratification is present, this test can be expected to fail in the too-few-hets direction on some normal variants. When using --hwe for quality control, you probably want to keep these variants; use the 'keep-fewhet' modifier to do this. (This causes the two-sided [mid-]p-value to only be checked when the number of hets is above the equilibrium value; it's similar to performing a one-sided test with half the [mid-]p-value threshold.) On chrX, the ratio between the Graffelman/Weir p-value and the female-only p-value is considered here.

There is currently no special handling of case/control phenotypes; --keep-if <phenotype name> == control is frequently a good idea when using --hwe in a genome-wide association analysis (and matches PLINK 1.x's behavior).

1: As genomic datasets continue to grow, scenarios may arise where it actually makes sense to filter on a HWE p-value smaller than DBL_MIN (~2.22507e-308). Support for this was added to PLINK 2 in February-March 2024.

Imputation quality

--mach-r2-filter [min] [max]

--mach-r2-filter excludes variants where the MaCH Rsq imputation quality metric (frequently labeled as 'INFO') is outside [0.1, 2.0]; change the bounds by providing parameters.
Monomorphic variants, where Rsq == nan, are not excluded by this filter: the problem with them isn't imputation quality.

Similarly, --minimac3-r2-filter excludes variants where Minimac3's imputation quality metric is outside the given range. Note that this metric assumes that phased dosages have been imported with e.g. --vcf's dosage=HDS option; the computation still proceeds when unphased dosages are present, but the results will be underestimates. If you don't need phased dosages for any other reason, --{extract,exclude}-if-info is usually a more efficient way to do this properly. "--minimac3-r2-filter 1" can be used to keep only perfectly-imputed-and-phased variants.

Sex

--keep-males
--keep-nosex

--keep-females excludes all male and unknown-sex samples, --keep-males excludes females and unknown-sex samples, and --keep-nosex excludes all known-sex samples. Conversely, --remove-females only excludes known females, --remove-males only excludes known males, and --remove-nosex only excludes unknown-sex samples.

Founder status

--keep-founders

--keep-founders excludes all samples with at least one known parental ID from the current analysis (note that it is not necessary for that parent to be in the current dataset), while --keep-nonfounders does the reverse.

--ac-founders

By default, nonfounders are not counted by --freq or --maf/--max-maf/--mac/--max-mac/--hwe. Use the --nonfounders flag to include them.

Conversely, --ac-founders confirms that nonfounders should be excluded by --mac/--max-mac/"--freq counts". Why does this flag exist? Because we overlooked this detail when processing the preliminary 1000 Genomes hg38 callset. When nonfounders remain during execution of --mac/--max-mac/"--freq counts" and neither --ac-founders nor --nonfounders is specified, PLINK 2 now errors out.

--make-founders ['require-2-missing'] ['first']

By default, if parental IDs are provided for a sample, they are not treated as a founder even if neither parent is in the dataset.
With no modifiers, --make-founders clears both parental IDs whenever at least one parent is not in the dataset, and the affected samples are now considered founders. The 'require-2-missing' modifier causes this to only happen when both parents are missing. This normally happens after all sample-affecting filters have been applied (so it's too late to affect e.g. --filter-founders). If you want this to happen before all filters instead, add the 'first' modifier.
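The difference between the default --make-founders behavior and 'require-2-missing' can be sketched as follows. This is a simplified Python illustration rather than PLINK 2's logic: samples are keyed by a single ID, "0" is assumed to be the missing-parent code, and a parent coded as missing is counted as "not in the dataset".

```python
MISSING_PARENT = "0"  # assumed missing-parent code


def make_founders(parents, require_2_missing=False):
    """Clear parental IDs for samples whose parents are absent from the dataset.

    `parents` maps sample ID -> (father ID, mother ID).
    Default: clear both IDs if at least one parent is absent.
    require_2_missing=True: clear only if both parents are absent.
    Hypothetical helper for illustration only.
    """
    present = set(parents)
    needed = 2 if require_2_missing else 1
    result = {}
    for sample_id, (father, mother) in parents.items():
        n_absent = sum(
            1 for p in (father, mother)
            if p == MISSING_PARENT or p not in present
        )
        if n_absent >= needed:
            result[sample_id] = (MISSING_PARENT, MISSING_PARENT)
        else:
            result[sample_id] = (father, mother)
    return result


pedigree = {
    "dad": ("0", "0"),
    "kid1": ("dad", "mom"),   # only the father is in the dataset
    "kid2": ("gp1", "gp2"),   # neither parent is in the dataset
}
print(make_founders(pedigree))
print(make_founders(pedigree, require_2_missing=True))
```

In this example kid1 becomes a founder under the default behavior but keeps its parental IDs with require_2_missing=True, while kid2 becomes a founder in both cases.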