This page is under construction. If there's something you consider to be an essential PLINK resource which is not mentioned on this page, contact us and/or comment in the plink2-users Google group.
The linked files are currently hosted by Dropbox. If you are unable to download them, contact us for access to an alternate source; we understand that Dropbox is blocked in some locations.
The no-singleton dataset can be a good starting point if you were planning on filtering out low-MAF variants anyway, or you're constrained to ≤ 8 GiB of workspace memory.
The KING-robust algorithm is very effective at identifying 1st-degree relations within a population, and its output can also be used to distinguish parent-child from sibling relationships: IBS0 is much higher for the latter, all other things being equal. One relationship previously flagged by KING-robust (NA20317-NA20318) was formally acknowledged to be a probable clerical error, and added to the official pedigree on 2020-07-31. (phase3_orig.psam does not contain this relationship, since it is based on the 2016-05-05 snapshot of the official pedigree.)
The KING-corrected .psam files contain this relationship, along with a few others with similarly strong supporting evidence (see the .kin0 file below). The sex of sample HG02300 was corrected from male to female in the official pedigree after 2021-07-12. This was propagated to the KING-corrected file on 8 Jan 2023. See also the note on sample HG03511 below.
Due to a header line and an INFO annotation quirk, PLINK 2 builds older than 8 Jan 2023 are unable to convert this dataset to or from BCF.
.pgen.zst file(s) must be decompressed before use. (This isn't necessary for .pvar.zst files: see --pfile's 'vzs' modifier.) If you don't have another .zst decompressor installed, you can use PLINK 2 for this purpose: plink2 --zst-decompress all_hg38.pgen.zst > all_hg38.pgen
In addition to ~600 trios which were intentionally included, this dataset contains a few close relations which are not described in the .psam file, e.g. sibships where neither parent was sequenced. Use --remove with one of the following ID lists when you don't want close relations:
These lists were generated from the original dataset with "--king-cutoff 0.177" and "--king-cutoff 0.0884", respectively. If you're curious, here's the --make-king-table + --king-table-filter report listing all 1st/2nd-degree related sample pairs: deg2_hg38.kin0
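As a rough illustration of how a .kin0 report like this can be read, the Python sketch below labels each sample pair by its kinship coefficient and then lists the 1st-degree pairs in order of IBS0, so that parent-child-like pairs (IBS0 near zero) sort ahead of sibling-like pairs. It assumes the table has a header line containing IID1, IID2, IBS0 and KINSHIP columns. The 0.354 duplicate boundary is the usual KING convention rather than something stated on this page, and the three example rows are invented.

```python
import io

# 0.177 and 0.0884 are the cutoffs quoted above; 0.354 is the usual KING
# boundary for duplicates/monozygotic twins (an addition, not from this page).
DUPLICATE_MIN = 0.354
FIRST_DEGREE_MIN = 0.177
SECOND_DEGREE_MIN = 0.0884


def classify_kinship(kinship):
    """Map a KING-robust kinship coefficient to a coarse relationship label."""
    if kinship > DUPLICATE_MIN:
        return "duplicate/MZ twin"
    if kinship > FIRST_DEGREE_MIN:
        return "1st-degree"
    if kinship > SECOND_DEGREE_MIN:
        return "2nd-degree"
    return "more distant"


def read_kin0(handle):
    """Return one dict per sample pair from a whitespace-delimited .kin0 table."""
    # Assumed header columns: IID1, IID2, IBS0, KINSHIP (others are ignored).
    header = handle.readline().lstrip("#").split()
    pairs = []
    for line in handle:
        if not line.strip():
            continue
        row = dict(zip(header, line.split()))
        kinship = float(row["KINSHIP"])
        pairs.append({
            "id1": row["IID1"],
            "id2": row["IID2"],
            "ibs0": float(row["IBS0"]),
            "kinship": kinship,
            "label": classify_kinship(kinship),
        })
    return pairs


# Invented rows for illustration only; these are not real samples.
EXAMPLE_KIN0 = (
    "#IID1\tIID2\tNSNP\tHETHET\tIBS0\tKINSHIP\n"
    "S1\tS2\t100000\t0.12\t0.0001\t0.25\n"
    "S3\tS4\t100000\t0.13\t0.02\t0.24\n"
    "S5\tS6\t100000\t0.10\t0.04\t0.12\n"
)

pairs = read_kin0(io.StringIO(EXAMPLE_KIN0))

# Among 1st-degree pairs, the lowest IBS0 values are the parent-child candidates.
first_degree = sorted(
    (p for p in pairs if p["label"] == "1st-degree"), key=lambda p: p["ibs0"]
)
for p in first_degree:
    print(p["id1"], p["id2"], p["kinship"], p["ibs0"])
```

To use it on a real report, pass an open file handle to read_kin0 in place of the StringIO object. No IBS0 threshold is hard-coded, since a sensible value depends on the dataset.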
This dataset was intended to contain only unrelated samples; unfortunately, a few parent-child pairs, sibships, and second-degree relationships snuck in. Use --remove with one of the following ID lists when you don't want close relations:
These lists were generated from the original dataset with "--king-cutoff 0.177" and "--king-cutoff 0.0884", respectively. If you're curious, here's the --make-king-table + --king-table-filter report listing all 1st/2nd-degree related sample pairs: deg2_phase3.kin0
The sex of sample HG03511 was changed from female to male in the KING-corrected .psam file on 9 Jan 2023, due to having >30x fewer heterozygous calls on chrX (outside the pseudoautosomal regions) than average for females in the hg38 callset, no autosomal evidence of inbreeding, the largest fraction of non-SNPs among its chrX heterozygous calls, and the largest percentage decline in chrX heterozygous call count from build 37 to 38.
Unfortunately, this creates a few heterozygous haploid genotypes on chrX where there previously weren't any. When that's a problem, we suggest filtering out the small number of affected variants for now, since we expect them to be enriched for mapping errors.
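One way to picture the filtering suggested here is the Python sketch below, which flags chrX variants outside the pseudoautosomal regions where any male sample carries a heterozygous call. The in-memory genotype layout, the sample IDs and the variant IDs are all hypothetical. The GRCh38 boundary constants are an assumption and should be checked against the PAR definition your own pipeline uses.

```python
# GRCh38 chrX pseudoautosomal boundaries (assumed; check them against the
# PAR definition your pipeline uses).
PAR1_END = 2781479
PAR2_START = 155701383


def is_nonpar_chrx(pos):
    """True if a 1-based chrX position lies strictly between the two PARs."""
    return PAR1_END < pos < PAR2_START


def het_haploid_variant_ids(variants, male_ids):
    """Return IDs of non-PAR chrX variants where any male has a het call.

    variants: iterable of (variant_id, pos, {sample_id: (allele1, allele2)}).
    Missing calls are expected as (None, None) and are never flagged.
    """
    flagged = []
    for variant_id, pos, genotypes in variants:
        if not is_nonpar_chrx(pos):
            continue
        for sample_id, (a1, a2) in genotypes.items():
            if sample_id in male_ids and a1 != a2:
                flagged.append(variant_id)
                break
    return flagged


# Invented genotypes for illustration only.
example_variants = [
    ("varA", 5000000, {"M1": (0, 0), "F1": (0, 1)}),        # female het: fine
    ("varB", 6000000, {"M1": (0, 1), "F1": (0, 1)}),        # male het: flagged
    ("varC", 1000000, {"M1": (0, 1), "F1": (0, 0)}),        # inside PAR1: ignored
    ("varD", 7000000, {"M1": (None, None), "F1": (1, 1)}),  # missing: ignored
]

flagged = het_haploid_variant_ids(example_variants, male_ids={"M1"})
print(flagged)
```

The resulting IDs could be written one per line to a text file and passed to --exclude.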
This dataset fuses results from two different pipelines. The primary chr1..chrX genotypes are phased, contain no missing calls, and only have biallelic left-normalized variants (multiallelic variants were "split"). The chrY/chrM/contig genotypes are unphased and contain some missing calls; multiallelic variants there are unsplit, and a few variants aren't left-normalized.
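For readers unfamiliar with the term, the Python sketch below shows what left-normalizing a biallelic variant means: shared trailing bases are trimmed, the variant is shifted left along the reference as far as it will go, and redundant leading bases are removed. It follows the commonly described trim-and-shift procedure and is not PLINK's implementation. The function name and the toy reference sequence are made up for illustration.

```python
def left_normalize(pos, ref_allele, alt_allele, reference):
    """Trim and left-align one biallelic variant.

    pos is 1-based; reference is the chromosome sequence as a string, so
    reference[pos - 1] is the base at position pos.
    """
    if ref_allele == alt_allele:
        raise ValueError("REF and ALT must differ")
    alleles = [ref_allele, alt_allele]
    changed = True
    while changed:
        changed = False
        # Drop a trailing base shared by both alleles.
        if all(alleles) and alleles[0][-1] == alleles[1][-1]:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, prepend the reference base to the left.
        if any(len(a) == 0 for a in alleles):
            if pos == 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            left_base = reference[pos - 1]
            alleles = [left_base + a for a in alleles]
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while all(len(a) >= 2 for a in alleles) and alleles[0][0] == alleles[1][0]:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles[0], alleles[1]


# Toy reference with a CA repeat; deleting one repeat unit can be written
# several equivalent ways, all of which normalize to the same record.
toy_reference = "GGGCACACACAGGG"
print(left_normalize(8, "CAC", "C", toy_reference))
print(left_normalize(2, "GGCA", "GG", toy_reference))
print(left_normalize(5, "A", "T", toy_reference))
```

A multiallelic record would first need to be split into biallelic records, or the same loop would need to be applied across all of its alleles at once.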
There was previously an option to download "no-singleton" files. This is no longer available, since the Byrska-Bishop et al. quality-control pipeline removed almost all genuine singletons on chr1..chrX.
This dataset contains (unsplit) multiallelic variants, and a few variants which aren't left-normalized.
.pgen.zst file(s) must be decompressed before use. (This isn't necessary for .pvar.zst files: see --pfile's 'vzs' modifier.) If you don't have another .zst decompressor installed, you can use PLINK 2 for this purpose: plink2 --zst-decompress hgdp_all.pgen.zst > hgdp_all.pgen
This dataset was aligned to GRCh38, and variant calls were made on the autosomes, chrX, and chrY. There are 929 samples, with no 1st-degree relations. Samples have been sorted by ID.
The dataset contains (unsplit) multiallelic variants, and one variant on chrY which isn't left-normalized. ~6.57% of genotype calls are missing.
The source material contains per-genotype AD, DP, GQ, and PL fields which cannot be represented by the .pgen file format, and are consequently not preserved.
This dataset was aligned to GRCh38, and variant calls were made on only the autosomes. There are 929 samples, with no 1st-degree relations. Samples have been sorted by ID.
The dataset contains (unsplit) multiallelic variants, and ~4.27% of genotype calls are missing. All variants are left-normalized.
These are the reference genomes that the aforementioned 1000 Genomes and HGDP samples were aligned against. Note that --fa can directly read these compressed files.