Linear scoring
Sample scores
--score <filename> [i] [j] [k] [{header | header-read}]
        [{center | variance-standardize | dominant | recessive}]
        ['no-mean-imputation'] ['se'] ['zs'] ['ignore-dup-ids']
        [{list-variants | list-variants-zs}]
        ['cols='<column set descriptor>]
--score-list <fnm> [i] [j] [k] [{header | header-read}]
             [{center | variance-standardize | dominant | recessive}]
             ['no-mean-imputation'] ['se'] ['zs'] ['ignore-dup-ids']
             ['cols='<column set descriptor>]
--score-col-nums <number(s)/range(s)...>
--q-score-range <range file> <data file> [i] [j] ['header'] ['min']
--score and --score-list apply one or more linear scoring systems to each sample, and report results to plink2.sscore. More precisely, if G is the full genotype/dosage matrix (rows = alleles, columns = samples) and a is a scoring-system vector with one coefficient per allele, --score[-list] computes the vector-matrix product a^{T}G, and then divides by the number of allele observations (i.e. usually twice the number of variants; online documentation incorrectly said "variants" here before 16 May 2023) when reporting score-averages.
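As a rough illustration of the product and the averaging step described above, here is a minimal pure-Python sketch. The toy matrix, the coefficients, and the function names are hypothetical, and the denominator assumes diploid data with no missing genotypes and one scored allele per variant; it is not PLINK's implementation.

```python
# Hypothetical toy data: 3 scored alleles (rows) x 2 samples (columns).
# Entries are diploid allelic dosages in 0..2.
G = [
    [0.0, 2.0],
    [1.0, 1.0],
    [2.0, 0.0],
]
a = [0.5, -1.0, 0.25]  # one coefficient per scored allele


def score_sums(a, G):
    """Return a^T G: one score sum per sample (column of G)."""
    n_samples = len(G[0])
    return [sum(a[r] * G[r][s] for r in range(len(a))) for s in range(n_samples)]


def score_averages(a, G, ploidy=2):
    """Divide each score sum by the number of allele observations.

    Assumption: no missing genotypes and one scored allele per variant,
    so every sample has ploidy * (number of variants) observations.
    """
    denom = ploidy * len(a)
    return [x / denom for x in score_sums(a, G)]


sums = score_sums(a, G)
avgs = score_averages(a, G)
print(sums, avgs)
```

With three variants the denominator here is 6, not 3, which is the distinction the parenthetical note above draws.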
For --score-list, the input file should contain a list of filenames, one per line; each of those files is then processed as if it were passed to --score, then results are merged together. Note that, if all your files contain the same variants and alleles, --score-list tends to be a lot slower than passing in a single wide file to --score and using --score-col-nums.
The rest of this section describes --score.
 The input file must have exactly one line per scored allele. Variant IDs are read from column #i and allele codes are read from column #j, where i defaults to 1 and j defaults to i+1.
 By default, a single column of coefficients is read from column #k, where k defaults to j+1. To specify multiple columns, use --score-col-nums.
 'header-read' causes the first line of the input file to be treated as a header line containing score names; otherwise, scores are assigned the names 'SCORE1', 'SCORE2', etc. 'header' just causes the first line to be ignored entirely.
 By default, copies of unnamed alleles contribute zero to the score, while missing genotypes contribute an amount proportional to the loaded (via --read-freq) or imputed allele frequency. To throw out missing observations instead (decreasing the denominator in the final average when this happens), use the 'no-mean-imputation' modifier.
 By default, G contains basic allelic dosages (0..2 on diploid chromosomes, 0..1 on haploid, male chrX encoding controlled by --xchr-model). The following modifiers affect this:
 'center' translates all dosages to mean zero. (More precisely, they are translated based on allele frequencies, which you can control with --read-freq.)
 'variance-standardize' linearly transforms each variant's dosage vector to have mean zero, variance 1.
 'dominant' causes dosages greater than 1 to be treated as 1, while 'recessive' uses max(dosage - 1, 0) on diploid chromosomes.
'dominant', 'recessive', and 'variance-standardize' cannot be used with chrX.
 The 'se' modifier causes the input coefficients to be treated as independent standard errors; in this case, standard errors for the score average/sum are reported, under a Gaussian approximation. (This will of course tend to underestimate standard errors when scored variants are in LD.)
 By default, --score errors out if a variant ID in the input file appears multiple times in the main dataset. Use the 'ignore-dup-ids' modifier to skip them instead (a warning is still printed if such variants are present).
 The 'list-variants[-zs]' modifier causes variant IDs used for scoring to be written to plink2.sscore.vars[.zst].
 Refer to the file format entry for a list of supported column sets.
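The dosage modifiers and the missing-genotype handling described in the list above can be sketched for a single diploid sample as follows. This is an illustration under stated assumptions, not PLINK's code: the function names are hypothetical, the centering mean is taken to be 2p and the standardizing variance 2p(1-p) for allele frequency p, and a missing genotype is imputed as dosage 2p. How the two steps interact inside PLINK is not covered here.

```python
import math


def transform(dosage, freq, mode=None):
    """Apply one dosage modifier to a diploid dosage.

    Assumptions: mean 2*freq for 'center'; variance 2*freq*(1-freq) for
    'variance-standardize' (requires 0 < freq < 1).
    """
    if mode is None:
        return dosage
    if mode == "center":
        return dosage - 2.0 * freq
    if mode == "variance-standardize":
        return (dosage - 2.0 * freq) / math.sqrt(2.0 * freq * (1.0 - freq))
    if mode == "dominant":
        return min(dosage, 1.0)
    if mode == "recessive":
        return max(dosage - 1.0, 0.0)
    raise ValueError(f"unknown mode: {mode}")


def sample_score_average(coefs, dosages, freqs, mean_impute=True):
    """Score average for one sample; None marks a missing genotype.

    With mean_impute=False, missing genotypes are dropped and the
    denominator shrinks by two allele observations each.
    """
    total = 0.0
    n_obs = 0
    for coef, d, p in zip(coefs, dosages, freqs):
        if d is None:
            if not mean_impute:
                continue
            d = 2.0 * p  # assumed imputed dosage
        total += coef * d
        n_obs += 2
    return total / n_obs if n_obs else float("nan")


coefs, dosages, freqs = [1.0, 2.0], [2.0, None], [0.5, 0.25]
imputed = sample_score_average(coefs, dosages, freqs)
dropped = sample_score_average(coefs, dosages, freqs, mean_impute=False)
print(imputed, dropped)
```

In this toy case the imputed average uses four allele observations while the 'no-mean-imputation' style average uses only two, so the two results differ even though the observed data are identical.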
--q-score-range can be used to apply --score to many variant subsets at once, based on e.g. p-value ranges.
 The "range file" should have range labels in the first column, p-value lower bounds in the second column, and upper bounds in the third column, e.g.
S1 0.00 0.01
S2 0.00 0.20
S3 0.10 0.50
(Lines with too few entries, or non-numeric values in the second or third column, are ignored.) This would cause three sample-score reports to be generated: plink2.S1.sscore would only consider variants with p-values in [0, 0.01], plink2.S2.sscore would only consider [0, 0.2], and plink2.S3.sscore would only consider [0.1, 0.5].
 The "data file" should contain a variant ID and a p-value on each line (except possibly the first). Variant IDs are read from column #i and p-values are read from column #j, where i defaults to 1 and j defaults to i+1. The 'header' modifier causes the first non-empty line of this file to be skipped.
 By default, --q-score-range errors out when a variant ID appears multiple times in the data file (and is also present in the main dataset). To use the minimum p-value in this case instead, add the 'min' modifier.
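To make the range-file semantics concrete, the following sketch assigns variants to the S1/S2/S3 ranges from the example above. The variant IDs and p-values are made up, the helper names are hypothetical, and the bounds are treated as inclusive on both ends, matching the closed intervals quoted above.

```python
def parse_ranges(lines):
    """Parse 'label lower upper' lines, skipping malformed ones."""
    ranges = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # too few entries: ignored
        try:
            lo, hi = float(fields[1]), float(fields[2])
        except ValueError:
            continue  # non-numeric bound: ignored
        ranges.append((fields[0], lo, hi))
    return ranges


def variants_per_range(ranges, pvalues):
    """Map each range label to the variant IDs whose p-value lies in [lo, hi]."""
    return {
        label: sorted(v for v, p in pvalues.items() if lo <= p <= hi)
        for label, lo, hi in ranges
    }


range_lines = ["S1 0.00 0.01", "S2 0.00 0.20", "S3 0.10 0.50", "bad line"]
pvalues = {"rs1": 0.005, "rs2": 0.15, "rs3": 0.4}  # hypothetical data file
subsets = variants_per_range(parse_ranges(range_lines), pvalues)
print(subsets)
```

Because the ranges may overlap, a variant such as rs2 here is scored in more than one report.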
For more sophisticated polygenic risk scoring, we recommend looking at the LDpred2 and PRSice-2 software packages.
Since --score's new 'variance-standardize' modifier applies the same transformation to G as --pca does, --score can now execute the vector-matrix multiply corresponding to PCA projection.
The following command exports PCs to project onto, along with the allele frequencies needed to calibrate the 'variance-standardize' operation:
plink2 --pfile ref_data \
       --freq counts \
       --pca allele-wts vcols=chrom,ref,alt \
       --out ref_pcs
You can then project onto those PCs with
plink2 --pfile new_data \
       --read-freq ref_pcs.acount \
       --score ref_pcs.eigenvec.allele 2 5 header-read no-mean-imputation \
               variance-standardize \
       --score-col-nums 6-15 \
       --out new_projection
Note that these PCs will be scaled a bit differently from ref_data.eigenvec; you need to multiply or divide the PCs by a multiple of sqrt(eigenvalue) to put them on the same scale.
Also note that later PC coordinates for out-of-reference samples will tend to be shrunk toward zero; see e.g. Wang C, Zhan X, Liang L, Abecasis GR, Lin X (2015) Improved Ancestry Estimation for both Genotyping and Sequencing Data using Projection Procrustes Analysis and Genotype Imputation for discussion.
--variant-score <filename> ['bin' | 'bin4' | 'cols='<column set descriptor>]
                ['zs'] ['single-prec']
  (alias: --vscore)
--vscore-col-nums <number(s)/range(s)...>
--variant-score is roughly the transpose of --score: it applies one or more linear scoring systems to each variant, and reports results to plink2.vscore[.zst]. More precisely, if G is the full genotype/dosage matrix (rows = variants, columns = samples) and s is a scoring-system vector with one coefficient per sample, --variant-score computes the vector-matrix product Gs. However, there are some details which differ, since the main purpose of this command is different.
 The input file should contain one line per sample, each starting with a sample ID and followed by scoring weight(s). It can also have a header line with the sample ID representation (e.g. "#FID IID") and the score name(s).
 By default, all score columns are read. --vscore-col-nums lets you select a subset.
 Each entry of G is the sum of all non-REF dosages for that (sample, variant) combination; i.e. all ALT alleles in multiallelic variants are effectively collapsed together. Scaling is the same as for --score (including chrX being affected by --xchr-model). MAF-based mean imputation is always applied to missing dosages, since there's no option for computing a score-average.
 Refer to the file format entry for a list of column sets supported by the usual text report.
 The 'bin' and 'bin4' modifiers request binary output instead. In this case, the main plink2.vscore.bin output file contains floating-point values (double-precision with 'bin', single-precision with 'bin4'), column (score) ID(s) are saved to plink2.vscore.cols, and variant IDs are saved to plink2.vscore.vars[.zst].
 By default, the computation uses double-precision numbers internally (even when single-precision output is requested); you can use 'single-prec' to sacrifice some accuracy for speed.
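The per-variant product Gs described above can be sketched as follows. The data and the function name are hypothetical, and missing dosages are imputed here with the mean of the observed dosages in the same row, which is an assumption standing in for the MAF-based imputation mentioned above rather than a description of PLINK's exact procedure.

```python
def variant_scores(G, s):
    """Return Gs: one score per variant (row of G).

    G holds summed non-REF dosages; None marks a missing dosage, which is
    replaced by the row's observed mean (assumed stand-in for MAF-based
    mean imputation).
    """
    out = []
    for row in G:
        observed = [d for d in row if d is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        out.append(sum(w * (mean if d is None else d) for d, w in zip(row, s)))
    return out


# Hypothetical toy data: 2 variants (rows) x 3 samples (columns).
G = [
    [0.0, 1.0, 2.0],
    [2.0, None, 0.0],
]
s = [1.0, 0.5, -1.0]  # one weight per sample
vscores = variant_scores(G, s)
print(vscores)
```

Note that the result is a plain sum per variant; unlike the sample-score case there is no averaging step, which is why imputation cannot be switched off here.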
