D: 3 Mar 2023 Main functions (--make-grm-bin...) Quick index search |
File format reference

This page describes specialized PLINK 2.0 input and output file formats which are identifiable by file extension. (Most extensions not listed here have very simple one-entry-per-line or two-entry-per-line text formats.) Unless otherwise specified, all multicolumn text files generated by PLINK 2.0 are tab-delimited, with one header line starting with '#'. In the column summaries, columns which are present unless removed by the column set descriptor are boldface, and columns which only appear under some data/flag/modifier combination(s) are italicized.

Jump to: .acount | .adjusted | .afreq | .bcf | .bed | .bgen | .bim | .bins | .cov | .eigenvec{,.allele,.var} | .fam | .fst.summary | .fst.var | .gcount | .gen | .glm.firth | .glm.linear | .glm.logistic[.hybrid] | .grm | .grm.N.bin | .grm.bin | .haps | .hardy | .hardy.x | .het | .*.id | .kin0 | .king[.bin] | .legend | .map | .pdiff | .ped | .pgen{,.pgi} | .psam | .pvar | .raw | .rel[.bin] | .sample | .scount | .sdiff | .sdiff.summary | .smiss | .sscore | .tfam | .tped | .traw | .vcf | .vmiss | .vscore | .vscore.bin

.acount, .afreq (allele count/frequency report)
Produced by --freq. A text file with a header line, and then one line per variant with the following columns:
.adjusted (basic multiple-testing corrections)
Produced by --adjust[-file]. A text file with a header line, and then one line per tested allele with the following columns:
Entries are sorted in increasing p-value order. (Thus, if the QQ field is present, its values just increase linearly.)

.bcf (binary Variant Call Format)
Variant information + sample ID + genotype call binary file. Imported with --bcf, and produced by "--export bcf". Refer to the hts-specs GitHub repository for a detailed description of the format. "--export bcf" uses binary encoding v2.2.

.bed (PLINK 1 binary biallelic genotype table)
PLINK 1's preferred way to represent genotype calls. Must be accompanied by .bim and .fam files. Loaded with --bfile, and generated by --make-bed. Do not confuse this with the UCSC Genome Browser's BED format, which is totally different. (It is safe to change a PLINK 1 .bed file's extension to .pgen and use --bpfile to load it.) See the PLINK 1.9 documentation for a detailed description of the usual variant-major form, along with an example. PLINK 2 can also efficiently export the sample-major form ("--export ind-major-bed"); it has third byte equal to zero instead of one, but is otherwise analogous.

.bgen (Oxford variant info + genomic data binary file)
Native binary file format for Oxford statistical genetics tools, such as IMPUTE2 and SNPTEST. BGEN v1.1 files should always be accompanied by a .sample file. Loaded with --bgen, and produced by "--export bgen-1.{1,2,3}". Refer to https://www.well.ox.ac.uk/~gav/bgen_format/ for a detailed description of the format.

.bim (PLINK extended MAP file)
Variant information file accompanying a .bed or biallelic .pgen binary genotype table. (--make-just-bim can be used to update just this file.) A text file with no header line, and one line per variant with the following six fields:
A few notes:
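As a rough illustration of the six-field layout, the following Python sketch parses a single .bim line. The field order (CHROM, ID, CM, POS, ALT, REF) is the .bim order stated in the .pvar entry of this page; the names `BimRecord` and `parse_bim_line`, the whitespace splitting, and the sample line are hypothetical choices made for the example.

```python
from typing import NamedTuple


class BimRecord(NamedTuple):
    chrom: str
    variant_id: str
    cm: float   # centimorgan position
    pos: int    # base-pair coordinate
    alt: str
    ref: str


def parse_bim_line(line: str) -> BimRecord:
    # Assumption: fields are separated by tabs or spaces.
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    chrom, variant_id, cm, pos, alt, ref = fields
    return BimRecord(chrom, variant_id, float(cm), int(pos), alt, ref)


# Made-up example line, not taken from a real dataset.
rec = parse_bim_line("1\trs123\t0\t752566\tA\tG")
```

Keeping the chromosome and allele codes as strings avoids guessing at how nonstandard codes should be interpreted.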
.bins (allele count or frequency histogram)
A text file with a header line, followed by one line per [start, end) histogram bin with the following two fields:
The end of the current bin interval is the next line's BIN_START value (or positive infinity if there is no next line).

.cov (covariate table)
Produced by --write-covar, --make-[b]pgen/--make-bed, and --export when covariates have been loaded/specified. Valid input for --covar. A text file with a header line, and one line per sample with the following columns:
(Note that --covar can also be used with files lacking a header row.)

.eigenvec, .eigenvec.allele, .eigenvec.var (principal components)
Produced by --pca. Accompanied by an .eigenval file, which contains one eigenvalue per line. The .eigenvec file is a text file with a header line and between 1+V and 3+V columns per sample, where V is the number of requested principal components. The first columns contain the sample ID, and the rest are principal component weights in the same order as the .eigenval values (with column headers 'PC1', 'PC2', ...).

With the 'allele-wts' modifier, an .eigenvec.allele file is also generated. It's a text file with a header line, followed by one line per allele with the following columns:
Alternatively, with the 'biallelic-var-wts' modifier, an old-style .eigenvec.var file is generated. It's a text file with a header line, followed by one line per variant with the following columns:
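For the main .eigenvec file described above, the sample ID columns can be separated from the principal component weights by looking for the 'PC1', 'PC2', ... headers. The Python sketch below shows one way to do that. It assumes tab-delimited input with a '#'-prefixed header line, as stated in the introduction; `split_eigenvec` and the demo rows are hypothetical.

```python
import re


def split_eigenvec(lines):
    """Split .eigenvec rows into (sample ID fields, principal component weights)."""
    header = lines[0].rstrip("\n").lstrip("#").split("\t")
    # ID columns come first; PC columns are recognized by their 'PC<n>' headers.
    pc_start = next(i for i, name in enumerate(header)
                    if re.fullmatch(r"PC\d+", name))
    rows = []
    for line in lines[1:]:
        if not line.strip():
            continue  # tolerate a trailing blank line
        fields = line.rstrip("\n").split("\t")
        ids = tuple(fields[:pc_start])
        weights = [float(x) for x in fields[pc_start:]]
        rows.append((ids, weights))
    return header[:pc_start], rows


# Made-up two-component file with FID and IID columns.
demo = ["#FID\tIID\tPC1\tPC2\n", "fam1\tid1\t0.5\t-0.25\n"]
id_cols, samples = split_eigenvec(demo)
```

Detecting the boundary from the header rather than hard-coding it keeps the sketch usable whether one, two, or three ID columns are present.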
.fam (PLINK sample information file)
Sample information file accompanying a .bed or biallelic .pgen binary genotype table. (--make-just-fam can be used to update just this file.) A text file with no header line, and one line per sample with the following six fields:
.fst.summary (all-population-pairs Wright's FST report)
Produced by --fst. A text file with a header line, and then one line per population-pair with the following columns:
.fst.var (per-variant Wright's FST report for one population pair)
Produced by --fst when 'report-variants' is specified. A separate file is generated for each population pair. A text file with a header line, and then one line per autosomal variant with the following columns:
.gcount (genotype count report)
Produced by --geno-counts. A text file with a header line, and then one line per variant with the following columns:
.gen (Oxford text genotype file format)
Native text genotype file format for Oxford statistical genetics tools, such as IMPUTE2 and SNPTEST. Should always be accompanied by a .sample file. Imported with --data/--gen, and produced by "--export oxford[-v2]". A text file with no header line, and one line per variant with either 3N+5 or 3N+6 fields where N is the number of samples. Each line stores information for a single SNP. In the 3N+5 case (corresponding to the original specification), the first five fields are:
Unless the chromosome code was declared with --oxford-single-chr (in which case the SNP ID column is ignored), PLINK has no choice but to assume that the "SNP ID" column actually stores chromosome codes. (This is the convention when PLINK exports a 5-leading-column .gen file.) The newer 3N+6 column flavor has a dedicated chromosome column in front. This was not supported by PLINK 1.9 or 2.0 before 16 Apr 2021.

Each subsequent triplet of values then indicates likelihoods of homozygote A1, heterozygote, and homozygote A2 genotypes at this variant, respectively, for one sample. If they add up to less than one, the remainder is a no-call probability weight. The PLINK 2 binary format can represent allele count expected values, but it does not distinguish between e.g. {P(hom-ref)=0.28, P(het)=0.52, P(hom-alt)=0.2} and {P(hom-ref)=0.08, P(het)=0.92, P(hom-alt)=0}, and it ignores the no-call probability weight (though "0 0 0" will be correctly converted to a missing call). The --import-dosage-certainty flag can be used during import to replace some of the most uncertain genotype calls with missing values.

.glm.firth, .glm.logistic[.hybrid] (logistic/Firth regression association statistics)
Produced by --glm with a case/control phenotype. A text file with a header line, and then one line per variant with the following columns:
.glm.linear (linear regression association statistics)
Produced by --glm with a quantitative phenotype. A text file with a header line, and then one line per variant with the following columns:
1: For multiallelic variants, this column may contain multiple comma-separated alleles when the result doesn't depend on which allele is A1.

.grm (GCTA text relationship matrix)
Produced by --make-grm-list. A text file with no header line, and one line per pair of samples (not necessarily distinct) with the following four fields:
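The binary triangular files in the .grm.N.bin/.grm.bin entry that follows store the lower triangle row by row, so the position of any (i, j) value can be computed directly instead of scanning the file. The Python sketch below illustrates that arithmetic. It assumes little-endian byte order, which this page does not state; `tri_offset`, `read_tri_value`, and the demo bytes are hypothetical.

```python
import struct


def tri_offset(i: int, j: int) -> int:
    """0-based position of the (i, j) value, given 1-based matrix indices."""
    if j > i:
        i, j = j, i  # the matrix is symmetric, so map into the lower triangle
    # Rows 1..i-1 contribute 1 + 2 + ... + (i-1) = i*(i-1)/2 values.
    return i * (i - 1) // 2 + (j - 1)


def read_tri_value(data: bytes, i: int, j: int) -> float:
    # Each value is a 4-byte float; '<f' assumes little-endian storage.
    (value,) = struct.unpack_from("<f", data, 4 * tri_offset(i, j))
    return value


# Made-up 3-sample matrix, in the order (1,1), (2,1), (2,2), (3,1), (3,2), (3,3).
demo = struct.pack("<6f", 1.0, 0.25, 1.0, 0.0, 0.5, 1.0)
```

With this ordering a file for N samples holds N(N+1)/2 values, which gives a quick sanity check on file size (4 bytes per value).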
.grm.N.bin, .grm.bin (GCTA 1.1+ triangular binary relationship matrix)
Produced by --make-grm-bin. These files contain single-precision (4-byte) floating point values. Using 1-based matrix indices, the first value in each file is the (1, 1) relationship value (.grm.bin) or observation count (.grm.N.bin); the second and third values are the (2, 1) and (2, 2) relationships/counts; the fourth through sixth values are the (3, 1), (3, 2) and (3, 3) relationships/counts in that order; and so on. Note that .grm.bin files generated by GCTA versions before 1.1 have a different format.

.haps (Oxford phased haplotype file)
Reference panel haplotype file format for IMPUTE2. Must be accompanied by a .legend file when no variant info header columns are present. Imported with --haps, and produced by "--export haps[legend]". A text file with no header line, and either 2N+5 or 2N fields where N is the number of samples. In the former case, the first five columns are:
This is followed by a pair of 0/1-valued haplotype columns for the first sample, then a pair of haplotype columns for the second sample, etc. (For male samples on chrX, the second column may contain dummy '-' entries; otherwise, missing genotype calls are not permitted.)

.hardy (Hardy-Weinberg equilibrium exact test report)
Produced by --hardy when autosomal diploid variants are present. A text file with a header line, and one line per autosomal diploid variant with the following columns:
.hardy.x (Graffelman-Weir extended chrX HWE test report)
Produced by --hardy when chrX variants are present. A text file with a header line, and one line per chrX variant with the following columns:
.het (method-of-moments F coefficient estimates)
Produced by --het. A text file with a header line, and one line per sample with the following columns:
.id (Sample ID list)
When generated by PLINK 2, this is a text file which may or may not have a header line. If there's no header line (default with .grm.id files, can be forced for other .id files with --no-id-header), and there's a single column, they are IIDs; if there are two columns, they are FID/IID. Otherwise, there's one line per sample after the header line with the following columns:
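The header/no-header rule for .id files can be sketched in Python as follows. The function name `read_id_file` is hypothetical, the input is assumed to be tab-delimited, and a header is assumed to be recognizable by a leading '#' (as with the other header lines on this page); headerless files with more than two columns are rejected because their meaning is not defined above.

```python
def read_id_file(lines):
    """Return one {column name: value} dict per sample."""
    rows = [ln.rstrip("\n").split("\t") for ln in lines if ln.strip()]
    if not rows:
        return []
    if rows[0][0].startswith("#"):
        # Header present: it names the columns (strip the leading '#').
        header = [rows[0][0].lstrip("#")] + rows[0][1:]
        body = rows[1:]
    else:
        # Headerless: one column means IID, two columns mean FID/IID.
        width = len(rows[0])
        if width == 1:
            header = ["IID"]
        elif width == 2:
            header = ["FID", "IID"]
        else:
            raise ValueError("headerless .id file with more than two columns")
        body = rows
    return [dict(zip(header, row)) for row in body]


# Made-up inputs covering the three cases.
only_iid = read_id_file(["s1\n", "s2\n"])
fid_iid = read_id_file(["f1\ts1\n"])
with_header = read_id_file(["#FID\tIID\n", "f1\ts1\n"])
```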
.kin0 (KING-robust kinship coefficient report)
Produced by --make-king-table. A text file with a header line, and one line per sample pair with kinship coefficient no smaller than the --king-table-filter value. When --king-table-filter is not specified, all sample pairs are included. The following columns are present:
.king[.bin] (KING-robust kinship coefficient matrix)
Produced by --make-king. If text, a tab-delimited file that is either lower-triangular (excluding the diagonal) or square. If it's square, the upper-right triangle may be either zeroed out or the mirror-image of the lower-left triangle, depending on whether the 'square0' or 'square' modifier was used. The binary format is semantically identical; it just has nothing but single- (4-byte) or double-precision (8-byte) floating point values, instead of text+delimiters+linebreaks.

.legend (Oxford single-chromosome variant information file)
Single-chromosome variant information file accompanying a bare .haps reference panel haplotype file. Imported with --legend, and produced by "--export hapslegend". A text file with a header line, and one line per variant with the following four columns:
.map (PLINK 1 text fileset variant information file)
Variant information file accompanying a .ped text pedigree + genotype table. A text file with no header line, and one line per variant with the following 3-4 fields:
All lines must have the same number of columns (so either no lines contain the centimorgans column, or all of them do).

.pdiff (two-fileset genotype/dosage discordance report)
Produced by --pgen-diff. A text file with a header line, and then one line per discordance with the following columns:
.ped (PLINK 1/MERLIN/Haploview sample-major text genotype table)
Pedigree information + genotype call text file. Must be accompanied by a .map file. Loaded with --pedmap, and produced by "--export ped". This format is simultaneously highly inefficient, even relative to other text formats, and limited in scope (unobserved minor allele codes can't be stored); continued use is strongly discouraged.

Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are the same as those in a .fam file. The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on. All variants must be biallelic (or monomorphic, or all-missing).

If all alleles are single-character, PLINK 1.9 and 2.0 will correctly parse the more compact "compound genotype" variant of this format, where each genotype call is represented as a single two-character string. This does not require the use of an additional loading flag. You can produce such a file with "--export compound-genotypes". It is also possible to load .ped files missing some initial fields.

.pgen, .pgen.pgi (PLINK 2 binary genotype table)
PLINK 2's preferred way to represent genotype calls. Must be accompanied by .pvar/.bim and .psam/.fam files. Loaded with --pfile/--bpfile, and generated with --make-pgen/--make-bpgen and all import commands. Most .pgen files have an embedded index, and do not have an accompanying .pgen.pgi file. When the index is not embedded, PLINK 2 expects it to be stored in "<.pgen filename>.pgi". A draft specification of these formats is available. The first version will be finalized around the beginning of PLINK 2.0 beta testing.

.psam (PLINK 2 sample information file)
Sample information file accompanying a .pgen binary genotype table. (--make-just-psam can be used to update just this file.)
A text file which usually has at least one header line, where only the last header line starts with '#FID' or '#IID'. This final header line specifies the columns in the .psam file; the following intermediate column headers are recognized:
(FID must either be the first column, or absent. If it's absent, all FID values are now assumed to be '0'.) Any other value is treated as a phenotype/covariate name. If no header line is present, the columns are assumed to be in .fam file order (FID, IID, PAT, MAT, SEX, PHENO1).

.pvar (PLINK 2 variant information file)
Variant information file accompanying a .pgen binary genotype table. (--make-just-pvar can be used to update just this file.) A text file which usually has at least one header line, where only the last header line starts with '#CHROM'. This final header line specifies the columns in the .pvar file; the following intermediate column headers are recognized:
In particular, a VCF file, or a trimmed VCF file with all columns past the 5th (or 6th, etc.) removed, is valid input for anything expecting a .pvar-format file. The following VCF-style header lines are also recognized:
When no header line is present, the columns are assumed to be in .bim file order (CHROM, ID, CM, POS, ALT, REF; or if only 5 columns are present, CM is assumed to be omitted).

.raw (additive + dominant component file)
Produced by "--export {A,AD}"; suitable for loading from R. This format cannot be loaded by PLINK. A text file with a header line, and then one line per sample with V+6 (for "--export A") or 2V+6 (for "--export AD") fields, where V is the number of variants. The header line does not contain a preceding '#'. The first six fields are:
This is followed by one or two fields per variant:
If 'include-alt' was specified, the header line also names alternate allele codes in parentheses, e.g. 'rs5939319_G(/A)'.

.rel[.bin] (relationship matrix)
Produced by --make-rel. Contents are identical to those of a .grm/.grm.bin file. Possible shapes are essentially the same as for .king files; the only difference is that .king files have an omitted or constant-0.5 diagonal while .rel files do not.

.sample (Oxford sample information file)
Sample information file accompanying a .gen or .bgen genotype dosage file, or a .haps phased reference panel. Loaded with --data/--sample, and produced by --export in several cases. By default, the space-delimited .sample files emitted by --export have two header lines, and then one line per sample with 4+ fields:
(As of 6 Apr 2021, PLINK 2 accepts 'C' as a synonym for column type 'P' in .sample input files.) With --export's 'sample-v2' modifier, this is adjusted to:
Note that older programs are likely to support only the first .sample dialect. A specification for this format is on the QCTOOL v2 website.

.scount (sample variant-count report)
Produced by --sample-counts. A text file with a header line, and then one line per sample with the following columns:
The 'hetsnp', 'dipts'/'ts'/'diptv'/'tv', 'dipnonsnpsymb'/'nonsnpsymb', 'symbolic', and 'nonsnp' columns count each ALT allele in a heterozygous ALTx-ALTy genotype separately, since they can be of different subtypes. (I.e. if they are of the same subtype, the corresponding count is incremented by 2.) As a consequence, these columns are unaffected by variant split/join.

3: If the ALT allele in a chrX biallelic variant appears in exactly one female and one male, that counts as a singleton in this column for just the female.

.sdiff (sample-pair discordance report)
Produced by --sample-diff. A text file with a header line, and then one line per discordance with the following columns:
.sdiff.summary (sample-pair discordance count summary)
Produced by --sample-diff. A text file with a header line, and then one line per sample pair with the following columns:
.smiss (sample-based missing data report)
Produced by --missing. A text file with a header line, and then one line per sample with the following columns:
When dosages are present, MISSING_DOSAGE_CT will typically be slightly lower than MISSING_CT, since hardcalls normally aren't saved for dosages in (0.1, 0.9) or (1.1, 1.9).

.sscore (sample scores)
Produced by --score. A text file with a header line, and then one line per sample with the following columns:
.tfam (PLINK 1 sample information file)
Sample information file accompanying a .tped file; identical format to .fam files.

.tped (PLINK 1 variant-major text genotype table)
Variant information + genotype call text file. Must be accompanied by a .tfam file. Loaded with --tfile, and produced by "--export tped". Contains no header line, and one line per variant with 2N+4 fields where N is the number of samples. The first four fields are the same as those in a .map file. The fifth and sixth fields are allele calls for the first sample in the .tfam file ('0' = no call); the 7th and 8th are allele calls for the second sample; and so on. All variants must be biallelic (or monomorphic, or all-missing).

.traw (variant-major additive component file)
Produced by "--export Av"; suitable for loading from R. Loaded with --import-dosage (note that several modifiers must be specified). A text file with a header line without a leading '#', and then one line per variant with the following N+6 fields (where N is the number of samples):
.vcf, .bcf (1000 Genomes Project text Variant Call Format)
Variant information + sample ID + genotype call file; text if .vcf, binary if .bcf. Imported with --vcf/--bcf, and produced by "--export {b,v}cf".

Note that, while PLINK 2.0 supports a much larger subset of the VCF standard than PLINK 1.9, it still isn't appropriate for general-purpose VCF handling. Instead, the goal is to provide a very useful complement to bcftools. For example, PLINK 2.0 does not save per-call read depths, so any data management or analysis which requires them to be kept around should be done with bcftools or a similarly general tool; but once you're done with variant calling/imputation and are ready to treat your data as a single matrix of hardcalls or dosages (possibly with missing entries), PLINK 2.0 is much more efficient.

The VCFv4.3 files emitted by "--export vcf" start with the following three header lines:
This is usually followed by all the VCF header lines (if any) present in the loaded .pvar file, a "##chrSet=" chromosome set description when appropriate, and additional "##contig=", INFO/PR, and FORMAT header lines when necessary to make the file conform to the VCF standard. Next comes a tab-delimited header line with the following N+9 fields (where N is the number of samples), and one tab-delimited line per variant with the same fields:
Allele codes are supposed to either start with '<', only contain characters in the set {A,C,G,T,N,a,c,g,t,n}, be an isolated '*', or represent a breakend. --export issues a warning if an allele code does not satisfy this restriction. The full VCFv4.3 specification is in the hts-specs GitHub repository; this includes details on the BCF binary encoding.

.vmiss (variant-based missing data report)
Produced by --missing. A text file with a header line, and then one line per variant with the following columns:
When dosages are present, MISSING_DOSAGE_CT will typically be slightly lower than MISSING_CT, since hardcalls normally aren't saved for dosages in (0.1, 0.9) or (1.1, 1.9).

.vscore (text variant score report)
Produced by --variant-score. A text file with a header line, and then one line per variant with the following columns:
.vscore.bin (binary variant scores)
Produced by "--variant-score bin". Accompanied by .vscore.cols and .vscore.vars text files containing column (score) and row (variant ID) labels, respectively. A matrix of double-precision (8-byte) floating point variant scores.
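A matrix like this can be read back with nothing more than the label counts from the companion files. The Python sketch below splits raw bytes into rows of doubles. It assumes the values are stored row by row (one row per variant) in little-endian byte order, neither of which is stated on this page; `read_vscore_bin` and the demo bytes are hypothetical.

```python
import struct


def read_vscore_bin(data: bytes, n_cols: int):
    """Split raw bytes into rows of n_cols doubles (assumed variant-major)."""
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    row_bytes = 8 * n_cols  # 8 bytes per double-precision value
    if len(data) % row_bytes != 0:
        raise ValueError("size is not a whole number of rows")
    fmt = f"<{n_cols}d"  # '<' assumes little-endian storage
    return [list(struct.unpack_from(fmt, data, off))
            for off in range(0, len(data), row_bytes)]


# Made-up 2-variant x 2-score matrix.
demo = struct.pack("<4d", 0.5, 1.5, -2.0, 0.0)
rows = read_vscore_bin(demo, n_cols=2)
```

In practice `n_cols` would be the number of labels in the .vscore.cols file, and the number of rows returned should match the number of variant IDs in .vscore.vars.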