Deep phenotyping to aid identification of coding & non-coding rare disease variants
Melissa Haendel, PhD
March 2017
@monarchinit @ontowonka haendel@ohsu.edu
Acknowledgments
Charite: Max Schubach, Sebastian Koehler
Univ of Milan: Giorgio Valentini
RTI: Jim Balhoff
OHSU: Kent Shefchek, John Letaw, Julie McMurry, Nicole Vasilevsky, Matt Brush, Tom Conlin, Dan Keith
Genomics England/Queen Mary: Damian Smedley, Julius Jacobsen
Jackson Laboratory: Peter Robinson
Stanford: Shruti Marwaha, Matthew Wheeler, Euan Ashley
Lawrence Berkeley: Chris Mungall, Suzanna Lewis, Jeremy Nguyen, Seth Carbon
Garvan: Tudor Groza
https://monarchinitiative.org/page/team
The genome is sequenced, but we still don’t know very much about what it does
• OMIM: 3,398 Mendelian diseases with no known genetic basis
• ClinVar: at least 120,000* variants with no known pathogenicity
*This is more than twice what it was in 2016!
Prevailing clinical genomic pipelines leverage only a tiny fraction of the available data
[Diagram labels: PATIENT EXOME / GENOME; PATIENT CLINICAL PHENOTYPES; PUBLIC GENOMIC DATA; PUBLIC CLINICAL PHENOTYPE, DISEASE DATA; POSSIBLE DISEASES; DIAGNOSIS & TREATMENT; PATIENT ENVIRONMENT; PUBLIC ENVIRONMENT, DISEASE DATA; PATIENT OMICS PHENOTYPES; PUBLIC OMICS PHENOTYPES, CORRELATIONS; Under-utilized data]
The Human Phenotype Ontology
[Figure labels: Hyposmia; Abnormality of globe location; eyeball of camera-type eye; sensory perception of smell; Abnormal eye morphology; Motor neuron atrophy; Deeply set eyes; motor neuron (CL)]
[Annotation counts shown in the figure: 34,571 annotations in 22 species; 157,534 phenotype annotations; 2,150 phenotype annotations]
• 11,813 phenotype terms
• 127,125 rare disease - phenotype annotations
• 136,268 common disease - phenotype annotations
bit.ly/hpo-paper
Adding other species’ data helps fill knowledge gaps in the human genome
More species = more coverage
Number of human protein-coding genes in ExAC DB as per Lek et al. Nature 2016: 19,008
• 9,739 of 19,008 (51%)
• 14,779 of 19,008 (78%)
• Combined: 2,195 + 7,544 + 7,235 = 16,974 of 19,008 (89%; union of coverage in any species)
Even inclusion of just four species boosts phenotypic coverage of genes by 38 percentage points (51% → 89%)
Mungall et al Nucleic Acids Research bit.ly/monarch-nar-2016
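The coverage figures on this slide can be checked with a few lines of arithmetic. The Python sketch below treats the three addends as the regions of a two-set overlap (one set only, shared, other set only). That reading is an assumption inferred from the sums, not something the slide states, and the variable names are illustrative.

```python
TOTAL_GENES = 19_008  # human protein-coding genes in ExAC (Lek et al. 2016)

# Assumption: the three addends are the regions of a two-set overlap.
only_a, shared, only_b = 2_195, 7_544, 7_235

set_a = only_a + shared            # 9,739 genes
set_b = shared + only_b            # 14,779 genes
union = only_a + shared + only_b   # 16,974 genes


def pct(n: int) -> int:
    """Share of all protein-coding genes, rounded to a whole percent."""
    return round(100 * n / TOTAL_GENES)


gain_points = pct(union) - pct(set_a)  # percentage-point gain from combining
```

With these inputs the smaller set covers 51% of genes, the larger 78%, and the union 89%, so the "38" on the slide is a gain of 38 percentage points.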
Phenotypic profile matching
Combining G2P data for variant prioritization
• Whole exome
• Remove off-target and common variants
• Variant score from allele freq and pathogenicity
• Phenotype score from phenotypic similarity
• PHIVE score to give final candidates
• Mendelian filters
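The filtering and scoring steps above can be pictured as a small ranking function. This is a minimal Python sketch, not Exomiser's implementation: the `Variant` fields, the 1% frequency cut-off, the way the variant score is computed, and the simple average used to combine the two scores are all placeholder assumptions, and the Mendelian (inheritance) filters are left out.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    on_target: bool        # falls inside the captured exome regions
    allele_freq: float     # population allele frequency, 0..1
    pathogenicity: float   # predicted pathogenicity, 0..1


def variant_score(v: Variant, max_af: float) -> float:
    # Placeholder: scale pathogenicity down as the variant gets more common.
    return v.pathogenicity * (1.0 - v.allele_freq / max_af)


def prioritize(variants, phenotype_score, max_af=0.01):
    """Drop off-target and common variants, then rank the rest (best first)."""
    kept = [v for v in variants if v.on_target and v.allele_freq <= max_af]
    ranked = []
    for v in kept:
        pheno = phenotype_score.get(v.gene, 0.0)  # per-gene phenotypic similarity
        combined = 0.5 * (variant_score(v, max_af) + pheno)  # placeholder combination
        ranked.append((combined, v))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

In this sketch, two equally rare and equally damaging variants are separated only by how well their gene's known phenotypes match the patient's, which is the point the slide makes about adding phenotype data.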
Exomiser results for UDP diagnosed patients
Inclusion of phenotype data improves variant prioritization
• In 60% of the first 1,000 genomes at GEL, Exomiser predicts the top candidate
• In 86% of cases, Exomiser predicts within the top 5
Example case solved by Exomiser
[Figure: table with columns "Phenotypic profile" and "Genes"; two rows read "Heterozygous, missense mutation, STIM-1, N/A"; a third is labeled Stim1Sax/Sax]
Ranked STIM-1 variant maximally pathogenic based on cross-species G2P data, in the absence of traditional data sources
http://bit.ly/exomiser
How to make sense of whole genomes
…when there are 3.5 billion base pairs and so little is known about non-coding regions?
bit.ly/genomiser-2016
1) Gather all evidence at each position (3.5B)
• Ancestral conservation
• GC content
• Max methylation, acetylation, trimethylation levels
• DNase hypersensitivity
• Enhancer attributes (robust, permissive)
• # overlapping transcription factor binding sites
• # rare variants (<0.5% AF) +/- 500 nt
• # common variants (>0.5% AF) +/- 500 nt
• Overlapping CNVs (ISCA, dbVar, DGV)
• (… 26 features in total)
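Two of the per-position features listed above are simple enough to sketch directly: GC content of a sequence window, and counts of rare versus common variants within 500 nt of a position. The Python below is illustrative only. The function names and the `(position, allele_frequency)` input format are assumptions, and a variant at exactly 0.5% AF is counted in neither bin because the slide does not say where it belongs.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence window."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def count_nearby_variants(position, variants, window=500, af_threshold=0.005):
    """Count rare (< threshold) and common (> threshold) variants within +/- window nt.

    `variants` is an iterable of (position, allele_frequency) pairs.
    """
    rare = common = 0
    for pos, af in variants:
        if abs(pos - position) <= window:
            if af < af_threshold:
                rare += 1
            elif af > af_threshold:
                common += 1
    return rare, common
```

A feature vector for one genomic position would then hold these values alongside the other evidence types in the list.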
2) Predict negative controls
• > 5% prevalence
• 14.7 M putative non-deleterious positions
• Highly conserved in ancestral genomes
3) Hand-curate positives from literature
We curated 453 regulatory mutations judged as pathogenic by reported phenotypes (HPO) and other metrics
4) Address positive-negative imbalance
• 14.7 M putative non-deleterious positions
• 453 known regulatory mutations
Approximately 36,000 negative examples are available for every positive one
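A short calculation shows why this imbalance is a problem. The Python sketch below uses the rounded counts quoted on the slide (14.7 M and 453) and scores a hypothetical classifier that labels every position as non-deleterious. With these rounded inputs the ratio comes out near 32,000 to 1, the same order of magnitude as the figure quoted above; the exact counts behind the quoted figure are not shown in this deck.

```python
N_NEGATIVES = 14_700_000  # putative non-deleterious positions (rounded, from the slide)
N_POSITIVES = 453         # curated regulatory mutations (from the slide)

negatives_per_positive = N_NEGATIVES / N_POSITIVES


def always_negative_metrics(n_pos: int, n_neg: int):
    """Accuracy and sensitivity of a classifier that never predicts 'positive'."""
    true_pos, false_neg = 0, n_pos
    true_neg = n_neg
    accuracy = (true_pos + true_neg) / (n_pos + n_neg)
    sensitivity = true_pos / (true_pos + false_neg)
    return accuracy, sensitivity


accuracy, sensitivity = always_negative_metrics(N_POSITIVES, N_NEGATIVES)
```

The trivial classifier is more than 99.99% accurate while finding none of the pathogenic mutations, which is why accuracy alone is a misleading target here and the data are rebalanced before training.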
Synthetically oversample positives & undersample negatives
• 14.7 M putative non-deleterious positions
• 453 known regulatory mutations
1) Partition negatives into 100 groups
2) Add all 453 known positives to each negative group
3) In each group, oversample positives AND undersample negatives
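The three steps above can be sketched in plain Python. The editor's notes for this deck say the original was implemented in Java using Weka, with SMOTE (k = 5), synthetic positives numbering twice the real positives, and negatives undersampled to a set three times larger. This stand-in uses a simplified SMOTE-style interpolation written from scratch. The function names are illustrative, and reading "three times larger" as three times the oversampled positive set is an assumption.

```python
import random


def smote_like(positives, n_synthetic, k=5, rng=random):
    """Make synthetic points by interpolating between a positive and one of
    its k nearest positive neighbours (simplified SMOTE)."""
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(positives)
        neighbours = sorted(
            (p for p in positives if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, mate)))
    return synthetic


def build_training_sets(positives, negatives, n_partitions=100, k=5, seed=0):
    """Return one (positives, negatives) training set per negative partition."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    partitions = [shuffled[i::n_partitions] for i in range(n_partitions)]
    training_sets = []
    for part in partitions:
        # Oversample: real positives plus twice as many synthetic ones.
        pos = list(positives) + smote_like(positives, 2 * len(positives), k, rng)
        # Undersample: keep at most 3x as many negatives (assumed reading).
        neg = rng.sample(part, min(len(part), 3 * len(pos)))
        training_sets.append((pos, neg))
    return training_sets
```

According to the notes, one random forest is then trained per set and the per-forest probabilities are averaged to give the final score; that step is omitted here.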
Strongest predictors of deleterious mutation
• Higher DNase hypersensitivity
• Greater methylation
• Richer GC content
• Higher ratio of rare:common variation
• Higher conservation
5) Benchmark using synthetic genomes
• 10,235 simulated disease genomes using 1000 Genomes data
• Novel Regulatory Mendelian Mutation (ReMM) scoring method
Genomiser + ReMM outperforms other methods/tools across non-coding region types
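The editor's notes describe the benchmark metric as how often the seeded variant is ranked first among all variants in a simulated genome. A minimal Python sketch of that kind of top-k summary follows. The function name and the example ranks are invented for illustration and are not results from the benchmark.

```python
def top_k_rate(ranks, k=1):
    """Fraction of simulated genomes whose seeded variant is ranked within the top k.

    `ranks` holds the 1-based rank of the seeded variant in each genome.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Illustrative ranks only (not benchmark data).
example_ranks = [1, 1, 3, 7, 1, 2, 1, 12, 1, 1]
top1 = top_k_rate(example_ranks, k=1)
top5 = top_k_rate(example_ranks, k=5)
```

The same summary applies to the "top candidate" and "within top 5" percentages reported for Exomiser earlier in the deck.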
www.monarchinitiative.org
Leadership: Melissa Haendel, Chris Mungall, Peter Robinson, Tudor Groza, Damian Smedley, Sebastian Köhler, Julie McMurry
Funding: NIH Office of Director: 2R24OD011883; NHGRI UDP: HHSN268201300036C, HHSN268201400093P

Editor's Notes

  • #4: There is a lot we don’t know about the genome. As of March 2017, the OMIM numbers are 3,398 unknown and 4,964 known. The ClinVar number is at least 121,000, with the addition that these are variants that researchers have found suspicious, due to rarity in the population or something else, contextually. 160k variants in the entire genome is not much.
  • #8: If ClinVar + OMIM, 20 → 80%
  • #12: This was the novel case we solved. The UDP patient had a number of signs and symptoms including various platelet abnormalities. The same heterozygous, missense mutation was seen in 2 patients and ranked top by Exomiser. It had never been seen in any of the SNP databases and was predicted maximally pathogenic. Finally a mouse curated by MGI involving a heterozygous, missense point mutation introduced by chemical mutagenesis exhibited strikingly similar platelet abnormalities.
  • #13: Genomiser scores variants either through existing methods such as CADD or a bespoke machine learning method and combines these with allele frequency, regulatory sequences, chromosomal topological domains, and phenotypic relevance to discover variants associated with specific Mendelian disorders.
  • #14: https://pixabay.com/en/night-forest-trees-moonlight-1245875/ https://pixabay.com/en/taiwan-macaques-shoushan-macaque-1345438/
  • #15: https://pixabay.com/en/night-forest-trees-moonlight-1245875/ https://pixabay.com/en/taiwan-macaques-shoushan-macaque-1345438/
  • #16: We included only those variations and publications judged to provide plausible evidence of pathogenicity. First, the phenotypic abnormalities of the individual carrying the variant were assessed and a variant was included only if the disease association was regarded as plausible on the basis of evidence such as familial cosegregation or experimental validation, using techniques such as luciferase reporter assays, electrophoretic mobility assay, or telomerase activity assay. In some cases pathogenicity was assigned based on curator judgment or computational predictions; for instance, mutations in RNA genes that affected RNA secondary structure elements such as stem loops were included.
  • #17: Thus, approximately 36,000 negative examples are available for every positive one. In such extremely unbalanced conditions, classical computational and machine learning methods tend to perform poorly. This is because they learn overwhelmingly from negative examples, which leads to a sensitivity and precision close to zero on new (test) data.
  • #18: ANIMATION NOTE, CLICK ONCE FOR EACH STEP. In order to train the ReMM model, we first divided the majority class (probably non-deleterious variant sites) randomly into n = 100 partitions and then we added all the minority instances (non-coding Mendelian mutations) to every partition. We chose 100 partitions because no substantial performance improvements were observed when more partitions were utilized (data not shown). Moreover, in each partition we synthetically oversampled the minority positive class, using the synthetic minority over-sampling technique (SMOTE) with a number of nearest neighbors k = 5. With the SMOTE approach we generated synthetic instances two times the cardinality of the positive class. We then randomly undersampled the majority negative class to obtain a three times larger set of negative examples. The resulting dataset was used to train a random forest (RF) classifier (forest size 10; larger forests do not significantly improve performance; data not shown) that outputs a probability to estimate whether a given position in the non-coding genome can cause a Mendelian disease if mutated. The overall process of over- and undersampling and the training of the RF was repeated for all the n partitions. Finally, the probabilities estimated by each RF were averaged and the resulting “consensus” probability of the hyperensemble represents the final ReMM score. Our method was implemented in Java using Weka.
  • #20: Benchmarking experiments for Genomiser were performed using 10,419 simulated rare disease genomes based on the 453 regulatory Mendelian mutations and 1,092 whole-genome VCF files from the 1000 Genomes Project (05/02/2013 release). For autosomal-dominant diseases, one heterozygous mutation was added, and for autosomal-recessive diseases, either one homozygous mutation or two heterozygous mutations were added to the 1000 Genomes VCF file. For these experiments, the phenotypic (HPO) annotations for the corresponding disease in OMIM were taken on 8/7/2015 from the annotation files of the HPO team. To measure the ability of Genomiser to detect known disease-gene associations, we repeated the analysis with incomplete (maximum of three HPO annotations), noisy (two random HPO terms added), and imprecise (two of the original HPO annotations replaced by the more general parent terms in the ontology) annotations. These simulated genomes were run through the default settings of Genomiser. In the first step, genes and associated variants are removed where there is little similarity between observed phenotypes and direct or inferred knowledge from disease and model organism databases. Note that in this step, distal (>20 kb from a gene) variants that reside in predicted enhancers from FANTOM5 and Ensembl are associated with the most phenotypically similar gene in the topological domain containing the enhancer rather than simply taking the closest gene. Distal variants that do not reside in a predicted enhancer are removed, followed by the exclusion of any that are common (>1% minor allele frequency [MAF]) in the 1000 Genomes Project, NHLBI Exome Sequencing Project (ESP), and Exome Aggregation Consortium (ExAC) datasets. Finally, the remaining variants are prioritized by a composite score of the minor allele frequency, phenotypic similarity, and pathogenicity (using the ReMM score for non-coding and the existing hiPHIVE method for coding and splice sequences).
To assess our performance, we measured how often the seeded regulatory Mendelian variant was ranked first among the full set of the variants of the simulated Mendelian disease genomes. We note that the ReMM scores used in the Genomiser experiment were computed by ten-fold cross validation: the scores for the mutations included in each fold were obtained through a model trained on mutations not included in that fold, but only on those of the remaining nine. In other words, the ReMM score used in Genomiser for a specific mutation was obtained by a model not trained on this variant.
  • #21: Fully translational – from bench to bedside – group of stakeholders, contributors, and partners