September 2019 dataset: Exome sequencing data from 8921 individuals - East London, Birmingham, Bradford
The updated files below contain lists of predicted loss of function (LoF) and functional (missense or inframe indel) variants in the current Genes & Health callset, released in September 2019. Variants were called in 5236 East London Genes & Health volunteers (Bangladeshi and Pakistani, with self-stated related parents), 2624 Bradford volunteers (Pakistani, mostly self-stated or DNA autozygous individuals) and 1061 Birmingham volunteers (Pakistani, unselected). The Bradford and Birmingham samples are as described in Narasimhan et al., Science 2016, with additional Bradford samples added in 2017.
--
The vcf files include variant calls (single nucleotide variants and small insertions/deletions) from 8921 (mostly British Pakistani/British Bangladeshi) individuals from the following studies:
1. 5236 British Pakistani/British Bangladeshi adults from East London Genes and Health (ELGH)
2. 2624 British South Asian mothers from Born in Bradford (mostly Pakistani) (BiB)
3. 1061 British South Asian adults from Birmingham (mostly Pakistani) (Birm)
All of the Birmingham and most of the Born in Bradford samples were previously sequenced as part of PMID: 26940866.
In the sample list file, the columns of interest to most people will be:
· vcf.id - sample ID from the vcf
· cohort - which cohort they're in
· sex.assigned - sex inferred from coverage on the X and Y chromosomes. Individuals for whom this did not match their reported sex have been discarded
· total, chrX and chrY - coverage within bait regions across all chromosomes, chrX and chrY respectively
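As a rough illustration of working with these columns, the sketch below reads a sample list and counts individuals per cohort. The file layout (tab-delimited with a header row), the cohort and sex labels, and the three example rows are assumptions made up for this example; they are not taken from the real sample list.

```python
import csv
import io
from collections import Counter

# Hypothetical rows for illustration only; not real Genes & Health samples.
# The tab-delimited layout and the label values are assumptions.
EXAMPLE_SAMPLE_LIST = (
    "vcf.id\tcohort\tsex.assigned\ttotal\tchrX\tchrY\n"
    "S1\tELGH\tfemale\t60.0\t58.0\t0.5\n"
    "S2\tBiB\tfemale\t55.0\t54.0\t0.4\n"
    "S3\tBirm\tmale\t50.0\t26.0\t24.0\n"
)


def read_sample_list(handle):
    """Parse the sample list into dicts, converting coverage columns to float."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("total", "chrX", "chrY"):
            row[key] = float(row[key])
        rows.append(row)
    return rows


samples = read_sample_list(io.StringIO(EXAMPLE_SAMPLE_LIST))

# Number of individuals per cohort
cohort_counts = Counter(row["cohort"] for row in samples)

# chrX and chrY coverage relative to coverage across all chromosomes
relative_cov = {
    row["vcf.id"]: (row["chrX"] / row["total"], row["chrY"] / row["total"])
    for row in samples
}

print(cohort_counts)
```

To use this with the real file, replace the io.StringIO(...) argument with an open file handle, and adjust the delimiter if the sample list is not tab-delimited.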
Mapping was done with bwa-mem and variant calling was carried out with GATK HaplotypeCaller. We removed variant sites for which the following was true:
SNPs: "QD < 2.0 || FS > 30 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
Indels: "QD < 2.0 || FS > 30 || ReadPosRankSum < -20.0"
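The site filters above can be read as "remove the site if any one condition holds". The sketch below restates that logic in Python for a single site's INFO annotations. It is not the code used to produce the callset. The function name is_removed and the dict-based input are hypothetical, and treating a missing annotation as not triggering its condition is an assumption.

```python
import operator

# (annotation, comparison, threshold) triples copied from the filter expressions.
SNP_RULES = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 30.0),
    ("MQ", operator.lt, 40.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]
INDEL_RULES = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 30.0),
    ("ReadPosRankSum", operator.lt, -20.0),
]


def is_removed(info, is_snp):
    """Return True if any filter condition holds for this site.

    info maps annotation names to numeric values. Assumption: an annotation
    that is absent from info does not trigger its condition.
    """
    rules = SNP_RULES if is_snp else INDEL_RULES
    return any(
        name in info and compare(info[name], threshold)
        for name, compare, threshold in rules
    )


# Hypothetical annotation values for illustration only.
site = {"QD": 12.0, "FS": 3.1, "MQ": 60.0, "ReadPosRankSum": -10.0}
print(is_removed(site, is_snp=True))   # ReadPosRankSum < -8.0, so removed as a SNP
print(is_removed(site, is_snp=False))  # above the indel threshold of -20.0, so kept
```

The same annotation values can give different outcomes for SNPs and indels because the ReadPosRankSum threshold differs (-8.0 versus -20.0) and the MQ and MQRankSum conditions apply to SNPs only.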
The vcf data is available from the European Genome-phenome Archive (EGA) at EMBL-EBI. Users need to complete the standard Wellcome Sanger Institute Data Access Agreement.
The data is here: https://ega-archive.org/datasets/EGAD00001005469