The updated files below contain lists of predicted loss of function (LoF) and functional (missense or inframe indel) variants in the current Genes & Health callset, released in November 2017. Variants were called in 3781 East London Genes & Health volunteers (Bangladeshi and Pakistani, with self-stated related parents), 2624 Bradford volunteers (Pakistani, mostly self-stated or DNA autozygous individuals) and 1060 Birmingham volunteers (Pakistani, unselected). Bradford and Birmingham samples are as described in Narasimhan et al., Science 2016, with new additional Bradford samples.
Only variants that were present as homozygotes in at least 1 Genes & Health sample have been included in these files, which contain detailed variant annotation (from the Ensembl Variant Effect Predictor) as well as allele frequencies in other populations and from ExAC/gnomAD.
Mapping was done with bwa-mem and variant calling was carried out with GATK HaplotypeCaller. We removed variant sites for which the following was true:
SNPs: "QD < 2.0 || FS > 30 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
Indels: "QD < 2.0 || FS > 30 || ReadPosRankSum < -20.0"
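The two site-level expressions above can be read as simple predicates on a site's annotations. The following is a minimal Python sketch of that logic, not the pipeline actually used for this release. The function name `fails_site_filter` and the dictionary input are hypothetical, and the handling of missing annotations (a missing value never triggers its condition) is an assumption.

```python
from typing import Mapping, Optional


def _below(value: Optional[float], threshold: float) -> bool:
    # Assumption: a missing annotation never triggers its condition.
    return value is not None and value < threshold


def _above(value: Optional[float], threshold: float) -> bool:
    return value is not None and value > threshold


def fails_site_filter(info: Mapping[str, float], is_indel: bool) -> bool:
    """Return True if the site matches the removal expression quoted above."""
    if is_indel:
        return (
            _below(info.get("QD"), 2.0)
            or _above(info.get("FS"), 30.0)
            or _below(info.get("ReadPosRankSum"), -20.0)
        )
    return (
        _below(info.get("QD"), 2.0)
        or _above(info.get("FS"), 30.0)
        or _below(info.get("MQ"), 40.0)
        or _below(info.get("MQRankSum"), -12.5)
        or _below(info.get("ReadPosRankSum"), -8.0)
    )


# Hypothetical example sites
good_snp = {"QD": 15.2, "FS": 1.3, "MQ": 60.0, "MQRankSum": 0.1, "ReadPosRankSum": 0.4}
low_mq_snp = dict(good_snp, MQ=35.0)
print(fails_site_filter(good_snp, is_indel=False))    # site is kept
print(fails_site_filter(low_mq_snp, is_indel=False))  # site is removed
```

Note that the indel expression has no MQ or MQRankSum term, so the same low-MQ annotations would not remove an indel site.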
All files contain genotype counts. In Files 2, 4 and 6 the variants have been through the basic variant-level filtering only. In Files 1, 3 and 5 they have subsequently also been through genotype-level filtering, in which genotypes with GQ < 20 or an allele balance p-value < 0.01 are set to missing. Caution: This genotype-level filtering is not optimal (being probably too strict on homozygotes in low-coverage regions), so will be improved at a later stage.
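As an illustration of the genotype-level step, the sketch below sets a genotype to missing when GQ is below 20 or an allele balance p-value is below 0.01. How the allele balance p-value was computed for this release is not stated here. The sketch assumes a two-sided exact binomial test of the alternate read count against an expected fraction of 0.5, applied to heterozygous calls only. The function names and arguments are hypothetical.

```python
from math import comb


def binomial_two_sided_p(alt_reads: int, total_reads: int) -> float:
    """Two-sided exact binomial p-value against an expected alt fraction of 0.5."""
    if total_reads == 0:
        return 1.0
    probs = [comb(total_reads, i) / 2**total_reads for i in range(total_reads + 1)]
    observed = probs[alt_reads]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    tail = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


def keep_genotype(gq: int, ref_reads: int, alt_reads: int, is_het: bool) -> bool:
    """Return False if the genotype would be set to missing under the assumed rules."""
    if gq < 20:
        return False
    if is_het:  # assumption: allele balance is only tested at heterozygous calls
        if binomial_two_sided_p(alt_reads, ref_reads + alt_reads) < 0.01:
            return False
    return True


print(keep_genotype(gq=99, ref_reads=10, alt_reads=10, is_het=True))  # balanced het, kept
print(keep_genotype(gq=99, ref_reads=19, alt_reads=1, is_het=True))   # skewed het, set to missing
print(keep_genotype(gq=9, ref_reads=0, alt_reads=3, is_het=False))    # low-GQ homozygote, set to missing
```

The last call shows one way the caution above can arise: a homozygous call supported by only a few reads typically has a low GQ and is set to missing even when every read carries the variant.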
Please also note that whilst some samples have been sequenced to a read depth of ~40X, others have a read depth of only ~20X.
Files containing only predicted loss of function variants (Files 1, 2, 3, 4) are in pairs. The *all_transcripts_printed.txt files contain the annotations across all transcripts for which the variant is a predicted loss of function. The *annotation_not_in_last_exon_and_present_in_all_transcripts.txt files contain just the predicted loss of function variants which will be present in all transcripts of a gene and are not located in the last exon.
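The stricter selection rule (predicted loss of function in every transcript of the gene, and not in the last exon) can be sketched as follows. This is an illustrative reading of the rule, not the code used to produce the files. The record layout (`transcript`, `is_lof`, `exon` keys) and function names are hypothetical. The exon field is assumed to use the "n/total" form that Ensembl VEP reports, with an empty value treated as "not in the last exon".

```python
from typing import Iterable, Mapping, Optional


def in_last_exon(exon_field: Optional[str]) -> bool:
    """Interpret an exon field such as '5/12' or '11-12/12'."""
    if not exon_field or "/" not in exon_field:
        return False  # assumption: no exon number means not in the last exon
    number, _, total = exon_field.partition("/")
    last_affected = number.split("-")[-1]
    return last_affected == total


def lof_in_all_transcripts_not_last_exon(
    annotations: Iterable[Mapping[str, object]],
    gene_transcripts: Iterable[str],
) -> bool:
    """True if the variant is LoF in every gene transcript and never in a last exon."""
    lof = [a for a in annotations if a["is_lof"]]
    lof_ids = {a["transcript"] for a in lof}
    required = set(gene_transcripts)
    if not required or not required <= lof_ids:
        return False
    return not any(in_last_exon(a.get("exon")) for a in lof)


# Hypothetical gene with two transcripts, T1 and T2
variant_a = [
    {"transcript": "T1", "is_lof": True, "exon": "3/10"},
    {"transcript": "T2", "is_lof": True, "exon": "2/8"},
]
variant_b = [
    {"transcript": "T1", "is_lof": True, "exon": "10/10"},
    {"transcript": "T2", "is_lof": True, "exon": "7/8"},
]
print(lof_in_all_transcripts_not_last_exon(variant_a, ["T1", "T2"]))  # passes the rule
print(lof_in_all_transcripts_not_last_exon(variant_b, ["T1", "T2"]))  # last exon of T1
```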
For more information about the files see File 7.
ADDED 20 JULY 2018:
Files 8 and 9: Variants that were present in at least 1 sample from any cohort (East London Genes & Health, Bradford, Birmingham) have been included in these files.
Files
File 1. ** MOST USERS WILL PROBABLY WANT TO USE THIS FILE ** - Filtered list of predicted loss of function variants with basic GATK filtering (gatk_PASS) followed by more stringent genotype level filtering (gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01), containing only those variants which will be present in all annotated Ensembl transcripts of a gene and are also not located in the last exon. There is only one transcript for each variant listed.
Caution: This genotype-level filtering is not optimal (being probably too strict on homozygotes in low-coverage regions), so will be improved at a later stage.
File 2. Filtered list of predicted loss of function variants with basic GATK filtering (gatk_PASS) containing only those variants which will be present in all annotated Ensembl transcripts of a gene and are also not located in the last exon. There is only one transcript for each variant listed.
File 3. List of all predicted loss of function variants with basic GATK filtering (gatk_PASS), followed by stringent genotype level filtering (gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01) showing annotations across all the transcripts within a gene for which the variant is a predicted loss of function. The variant annotations with respect to all transcripts are printed.
Caution: This genotype-level filtering is not optimal (being probably too strict on homozygotes in low-coverage regions), so will be improved at a later stage.
File 4. List of all predicted loss of function variants with basic GATK filtering (gatk_PASS) showing annotations across all the transcripts within a gene for which the variant is a predicted loss of function. The variant annotations with respect to all transcripts are printed.
File 5. List of all predicted functional variants (missense or inframe indels) with basic GATK filtering, followed by more stringent genotype level filtering (gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01). The variant annotations with respect to all transcripts are printed.
Caution: This genotype-level filtering is not optimal (being probably too strict on homozygotes in low-coverage regions), so will be improved at a later stage.
File 6. List of all predicted functional variants (missense or inframe indels) with basic GATK filtering (gatk_PASS). The variant annotations with respect to all transcripts are printed.
File 7. UPDATED 13 JUNE 2018 - Word document giving more information about file column headings and processes.
File 8. ADDED 20 JULY 2018 List of all predicted loss of function variants with basic GATK filtering (gatk_PASS), followed by stringent genotype level filtering (gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01) showing annotations across all the transcripts within a gene for which the variant is a predicted loss of function. The variant annotations with respect to all transcripts are printed. Variants present in at least 1 sample from any cohort (East London Genes and Health, Bradford, Birmingham).
File 9. ADDED 20 JULY 2018 List of all predicted functional variants (missense or inframe indels) with basic GATK filtering, followed by more stringent genotype level filtering (gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01). Variants present in at least 1 sample from any cohort (East London Genes and Health, Bradford, Birmingham).