Ontology-oriented databases: Chado and OBD Chris Mungall Lawrence Berkeley Labs
Outline Chado GMOD & Model Organism Databases Genomics data in Chado using SO OBD NCBO & OBD Requirements RDF and the semantic web SPARQL endpoints
Chado: what is it? A relational database schema for biological data Part of the Generic Model Organism Database (GMOD) project http://www.gmod.org Interoperable tools for Model Organism Databases Chado was originally built for MODs
A brief introduction to MODs Some Model Organism Databases: FlyBase  (D melanogaster) WormBase  (C elegans) MGD  (M musculus) … What does a MOD organisation do? Curate and integrate data on a specific species or taxon Provide a web portal for the community What are the database requirements for a MOD?
Must store representations of  genes  and  genomic entities Sequence data Exon-intron structure Noncoding genes Curated and computed features Entities with unusual transcriptional properties And more…
Must store other data types pertinent to that organism Including, but not limited to: Expression Interaction Genetic and phenotypic Priorities amongst MODs differ Different MOs have different biological and experimental characteristics E.g.  D melanogaster  and genetics
Must house rich annotation data using  ontologies   GO (Gene Ontology); Anatomical Ontologies; Phenotype Ontologies
Must track  provenance  and  evidence  for data MOD data is often curated from the  literature Other sources Computes High throughput data Imaging
Must be an integrated source of data Must drive Web Portal http://www.flybase.org http://www.wormbase.org http://www.yeastgenome.org Links out to external resources GO, Ensembl, UniProt, … Substantial number of records managed locally in single integrated database
Origins of Chado Chado was originally developed for FlyBase Integration of GadFly (Berkeley) and previous FlyBase database Chado later adopted by GMOD and some other individual MODs Popular amongst ‘newer’ MODs; eg Paramecium Also used outside MOD community TIGR Janelia Farm Research Campus
Chado key concepts Tightly Integrated foreign key relations between entities Contrast with federated model Module System New modules can be ‘slotted in’ Some modules are mandatory Generic and extensible uses ontologies and terminologies for typing Highly normalised Community & open source
Chado modules Core general  (dbxrefs) cv  (ontologies) pub (bibliographic) audit Domains sequence  (genomics) phenotype expression RAD map genetic phylogeny organism event
Identifiers: dbxrefs All public records identified using bipartite scheme Not just external cross-references DB Authority must be specified Distinct table Can be associated with URIs (db, accession, version[optional]) Records can also get secondary dbxrefs Examples: GO:0000001, FlyBase:FBgn0000001
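The bipartite scheme can be illustrated with a minimal Python sketch. The `Dbxref` tuple and both helper functions are hypothetical names chosen for illustration, not part of Chado or GMOD. The sketch assumes the `DB:accession` string form used in the examples above and leaves the optional version unset, because the slide does not show how a version is written.

```python
from typing import NamedTuple, Optional


class Dbxref(NamedTuple):
    """A (db, accession, version) identifier; version is optional."""
    db: str
    accession: str
    version: Optional[str] = None


def parse_dbxref(identifier: str) -> Dbxref:
    """Split 'DB:accession' on the first colon only.

    Splitting on the first colon keeps accessions that themselves
    contain colons intact. Both parts must be non-empty, reflecting
    the requirement that the DB authority is always specified.
    """
    db, sep, accession = identifier.partition(":")
    if not sep or not db or not accession:
        raise ValueError(f"not a bipartite identifier: {identifier!r}")
    return Dbxref(db=db, accession=accession)


def format_dbxref(xref: Dbxref) -> str:
    """Render the identifier back to its 'DB:accession' form."""
    return f"{xref.db}:{xref.accession}"


go_term = parse_dbxref("GO:0000001")
fly_gene = parse_dbxref("FlyBase:FBgn0000001")
```

Keeping the authority as a separate field, rather than storing one opaque string, is what lets a `db` row carry shared information such as a URI prefix.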
Ontologies and terminologies are central to Chado Ontology - A formal representation of some portion of biological reality what kinds of things exist? what are the relationships between these things? [Figure: example ontology graph with nodes eye, ommatidium, sense organ, eye disc and edge types is_a, part_of, develops from]
Ontologies: cv module Based on GO DB Schema and OBO format spec key concepts cvterm (a term, or class in an ontology) cvterm_relationship DAGs Subject-predicate-object cv (an ontology or terminology)
Subset of Sequence Ontology
Subject            Type     Object
exon               is_a     transcript region
transcript region  part_of  transcript
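As a rough illustration of how the cv module stores such a graph, the sketch below loads the two Sequence Ontology edges into an in-memory SQLite database. It is a simplified assumption, not the real schema: actual Chado tables carry more columns, the cv names `sequence` and `relationship` are chosen here for illustration, and the helper functions are hypothetical. The point it shows is that the predicate of each edge is itself a cvterm.

```python
import sqlite3

# Simplified, illustrative subset of the cv module; real Chado tables
# carry more columns (definitions, dbxrefs, obsolescence flags, ...).
SCHEMA = """
CREATE TABLE cv (
    cv_id INTEGER PRIMARY KEY,
    name  TEXT NOT NULL UNIQUE
);
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    cv_id     INTEGER NOT NULL REFERENCES cv (cv_id),
    name      TEXT NOT NULL,
    UNIQUE (cv_id, name)
);
CREATE TABLE cvterm_relationship (
    subject_id INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
    type_id    INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
    object_id  INTEGER NOT NULL REFERENCES cvterm (cvterm_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)


def term_id(cv_name, term_name):
    """Return the cvterm_id for a term, creating the cv and term if needed."""
    conn.execute("INSERT OR IGNORE INTO cv (name) VALUES (?)", (cv_name,))
    (cv_id,) = conn.execute(
        "SELECT cv_id FROM cv WHERE name = ?", (cv_name,)).fetchone()
    conn.execute("INSERT OR IGNORE INTO cvterm (cv_id, name) VALUES (?, ?)",
                 (cv_id, term_name))
    (cvterm_id,) = conn.execute(
        "SELECT cvterm_id FROM cvterm WHERE cv_id = ? AND name = ?",
        (cv_id, term_name)).fetchone()
    return cvterm_id


def add_relationship(subject, predicate, obj):
    """Store one subject-predicate-object edge; the predicate is a cvterm too."""
    conn.execute(
        "INSERT INTO cvterm_relationship (subject_id, type_id, object_id) "
        "VALUES (?, ?, ?)",
        (term_id("sequence", subject),
         term_id("relationship", predicate),
         term_id("sequence", obj)))


add_relationship("exon", "is_a", "transcript_region")
add_relationship("transcript_region", "part_of", "transcript")


def all_triples():
    """Read the edges back as (subject, predicate, object) term names."""
    return conn.execute("""
        SELECT s.name, t.name, o.name
        FROM cvterm_relationship r
        JOIN cvterm s ON s.cvterm_id = r.subject_id
        JOIN cvterm t ON t.cvterm_id = r.type_id
        JOIN cvterm o ON o.cvterm_id = r.object_id
        ORDER BY s.name
    """).fetchall()
```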
Genomics: Sequence module some key concepts (a subset): feature: a genomic entity (gene, intron, SNP, chromosome, ..); featureloc: a relative location in sequence coordinates; feature_relationship: a pairwise relation between two features, e.g. exon to transcript; featureprop: tag-value data for a feature; feature_cvterm: ontology-based annotation
Feature table Features have sequences Sequences are not independent entities Embedded in feature table All features reside in same table Genes, exons, chromosomes, SNPs, .. Typed using Sequence Ontology (SO) Optional extra: Automatically generated SQL view layer
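One way to picture the single feature table and a generated view layer is the sketch below, which uses SQLite in place of PostgreSQL. Everything specific in it is an assumption for illustration: the `type` text column stands in for a foreign key to an SO cvterm, the `v_` view-name prefix and the sample rows are invented, and this is not the generator that Chado ships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE,
    type       TEXT NOT NULL,   -- stands in for type_id -> cvterm (an SO term)
    residues   TEXT             -- the sequence lives in the feature row itself
);
""")
conn.executemany(
    "INSERT INTO feature (uniquename, type, residues) VALUES (?, ?, ?)",
    [("geneG", "gene", None),
     ("exon1", "exon", "ATGGCC"),
     ("exon2", "exon", "GGTTAA"),
     ("snp1", "SNP", "A")])


def create_type_views():
    """Create one view per feature type found in the table."""
    types = [t for (t,) in conn.execute("SELECT DISTINCT type FROM feature")]
    for so_type in types:
        # View definitions cannot take bound parameters, so the type name
        # is interpolated; restrict it to identifier characters first.
        if not so_type.isidentifier():
            raise ValueError(f"unsafe type name: {so_type!r}")
        conn.execute(
            f"CREATE VIEW v_{so_type.lower()} AS "
            f"SELECT feature_id, uniquename, residues FROM feature "
            f"WHERE type = '{so_type}'")
    return sorted("v_" + t.lower() for t in types)


VIEWS = create_type_views()
EXONS = conn.execute(
    "SELECT uniquename, residues FROM v_exon ORDER BY uniquename").fetchall()
```

Queries against `v_exon` read like queries against a dedicated exon table, while all rows still live in `feature`.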
Feature Graphs: the  feature_relationship  table Feature graphs (FGs) Subject-predicate-object Predicates (types) are cvterms
Example: alternately spliced gene 7 features: 1 gene 2 transcripts 4 exons Not shown: polypeptide
Subject         Predicate  Object
A (transcript)  part_of    G (gene)
1 (exon)        part_of    A (transcript)
B (transcript)  part_of    G (gene)
2 (exon)        part_of    B (transcript)
3 (exon)        part_of    A (transcript)
3 (exon)        part_of    B (transcript)
4 (exon)        part_of    A (transcript)
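The seven-feature example can be loaded into a toy feature graph to show how it might be queried. This is a sketch under stated assumptions: SQLite replaces PostgreSQL, the `type` text columns stand in for `type_id` references to cvterms, and the `exon1`..`exon4` names and the `parts_of` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE,
    type       TEXT NOT NULL    -- stands in for type_id -> cvterm (SO term)
);
CREATE TABLE feature_relationship (
    subject_id INTEGER NOT NULL REFERENCES feature (feature_id),
    type       TEXT NOT NULL,   -- stands in for type_id -> cvterm (predicate)
    object_id  INTEGER NOT NULL REFERENCES feature (feature_id)
);
""")

FEATURES = [("G", "gene"), ("A", "transcript"), ("B", "transcript"),
            ("exon1", "exon"), ("exon2", "exon"),
            ("exon3", "exon"), ("exon4", "exon")]
# (subject, object) pairs, all with predicate part_of.
PART_OF = [("A", "G"), ("B", "G"),
           ("exon1", "A"), ("exon3", "A"), ("exon4", "A"),
           ("exon2", "B"), ("exon3", "B")]

conn.executemany(
    "INSERT INTO feature (uniquename, type) VALUES (?, ?)", FEATURES)
conn.executemany("""
    INSERT INTO feature_relationship (subject_id, type, object_id)
    SELECT s.feature_id, 'part_of', o.feature_id
    FROM feature s, feature o
    WHERE s.uniquename = ? AND o.uniquename = ?
""", PART_OF)


def parts_of(whole):
    """Names of the features that are directly part_of the named feature."""
    rows = conn.execute("""
        SELECT s.uniquename
        FROM feature_relationship r
        JOIN feature s ON s.feature_id = r.subject_id
        JOIN feature o ON o.feature_id = r.object_id
        WHERE o.uniquename = ? AND r.type = 'part_of'
        ORDER BY s.uniquename
    """, (whole,)).fetchall()
    return [name for (name,) in rows]


# Exons shared by both transcripts fall out as a set intersection.
shared_exons = sorted(set(parts_of("A")) & set(parts_of("B")))
```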
Feature graph configurations are constrained by SO SO determines ontological relations between features Eg: Exon part_of transcript Standard rules for is_a E.g. X is_a Y, Y part_of Z => X part_of Z See OBO Relation ontology http://www.obofoundry.org/ro Rules must be encoded outside standard relational schema
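Because such rules sit outside the relational schema, something has to apply them. The following Python sketch is one hypothetical way to do so by repeated rule application. It implements the rule stated above together with transitivity of is_a and of part_of; treating both relations as transitive is an assumption added here for completeness, and the function name `infer` is invented.

```python
def infer(triples):
    """Return the closure of (subject, predicate, object) triples under:

    1. X is_a Y,    Y is_a Z    => X is_a Z     (transitivity, assumed)
    2. X part_of Y, Y part_of Z => X part_of Z  (transitivity, assumed)
    3. X is_a Y,    Y part_of Z => X part_of Z  (the rule on the slide)
    """
    facts = set(triples)
    while True:
        new = set()
        for x, p1, y in facts:
            for y2, p2, z in facts:
                if y != y2:
                    continue
                if p1 == "is_a" and p2 == "is_a":
                    new.add((x, "is_a", z))
                elif p1 in ("is_a", "part_of") and p2 == "part_of":
                    new.add((x, "part_of", z))
        if new <= facts:          # nothing new was derived: fixed point
            return facts
        facts |= new


asserted = {
    ("exon", "is_a", "transcript_region"),
    ("transcript_region", "part_of", "transcript"),
}
closure = infer(asserted)
```

The nested loop is quadratic per pass, which is fine for a handful of terms but would need indexing for a full ontology.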
Declarative programming: SQL Functions Powerful, but optional PostgreSQL only Can be ported Separation of interface from implementation Sequence operations Transcription, translation Feature Graph operations Deduction of implicit features (eg introns) Location Graph operations Projection, mereological relations Related: Tata S, Patel JM, Friedman JS, and Swaroop A Declarative querying for biological sequence databases Proc of the 22nd International Conference on Data Engineering (ICDE), April 3-7, Atlanta, GA, 2006.
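To make "deduction of implicit features" concrete, here is a small Python sketch of intron deduction from the exons of one transcript. It is not the PostgreSQL function set the slide refers to; the function name `implicit_introns` is hypothetical, and the sketch assumes half-open interbase `(fmin, fmax)` coordinates on a single strand.

```python
def implicit_introns(exons):
    """Deduce intron intervals from the exons of a single transcript.

    exons: iterable of (fmin, fmax) interbase intervals, in any order.
    Returns the gaps between consecutive exons as (fmin, fmax) intervals.
    Abutting or overlapping exons produce no intron between them.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:
            introns.append((prev_end, next_start))
    return introns


introns = implicit_introns([(100, 200), (500, 650), (300, 400)])
```

With interbase coordinates, an intron begins exactly where the previous exon ends, so no plus-one or minus-one adjustment is needed.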
Chado: ongoing work Chado for phenotype (EQ) data With FlyBase, ZFIN, DictyBase Chado for evolutionary science In collaboration with NESCENT Documentation! Helpdesk (NESCENT) More GMOD integration Unified Architecture for GMOD? Latest Obo format features Allow for post-composition of complex terms
NCBO: OBO and OBD OBO: Open Bio Ontologies http://obo.sourceforge.net http://www.obofoundry.org NCBO BioPortal; access to: OBO ontologies OBD annotations Current DBPs Fly & fish mutant phenotype annotation Linking to disease HIV Clinical trial analysis
OBD: Storing biomedical annotations Requirements different from Chado Domain scope All of biology and biomedicine Ontologies used for annotation Not just OBO Data integration Index minimum amount of data Link to external data where appropriate Provide and use data services Requirements partially met by  semantic web technology
The Semantic Web Datamodel Based on  RDF triples Subject-predicate-object Each element is a  URI Various serialisations: RDF/XML N3, N-Triples Multiple APIs, QLs and storage options RDF Graphs constrained by  ontologies Expressed in RDF Schema, OWL
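As a small illustration of the triple model, the sketch below writes URI-only triples in N-Triples, one of the serialisations listed above. The `http://example.org/` URIs and the `to_ntriples` helper are placeholders invented for this example. RDF in general also allows literal objects and blank nodes, which this sketch leaves out.

```python
def to_ntriples(triples):
    """Serialise (subject, predicate, object) URI triples as N-Triples.

    Each statement becomes one line: <s> <p> <o> .
    """
    lines = []
    for triple in triples:
        for uri in triple:
            # Reject characters that cannot appear raw inside <...>.
            if any(ch in uri for ch in '<>" \n\t'):
                raise ValueError(f"not a plain URI: {uri!r}")
        s, p, o = triple
        lines.append(f"<{s}> <{p}> <{o}> .")
    return "\n".join(lines)


EX = "http://example.org/"  # hypothetical namespace
document = to_ntriples([
    (EX + "exon", EX + "is_a", EX + "transcript_region"),
    (EX + "transcript_region", EX + "part_of", EX + "transcript"),
])
```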
OBD ‘Schema’: formal ontology of annotation Within OBO Foundry Framework - uses OBO upper ontology
Implementing OBD using SemWeb technology OBD-Sesame 3rd party triplestore Relational or in-memory Lacks native OWL support Performance issues OBD-SQL Developed at Berkeley Reuse Chado methodology, code ‘Triplestore’ with extras Reduces triple overhead with common patterns
Wrapping databases as SPARQL endpoints A lot of data in existing relational databases like Chado Goal: make available as distributed resource in OBD-compliant way Solution: D2RQ declarative mappings and SPARQL Progress: GO Database SPARQL endpoint: http://yuri.lbl.gov:9000/ Chado and OBD mappings coming soon Application: Integration of annotations through genome dashboard
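For readers unfamiliar with SPARQL endpoints, the sketch below shows how a client might address one over HTTP: the SPARQL protocol carries the query text in a `query` parameter. The endpoint URL, the `ex:` vocabulary, and the helper name are all hypothetical; they are not the GO endpoint or the D2RQ mapping mentioned above, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint

# Hypothetical vocabulary: each annotation links a gene to an ontology term.
QUERY = """\
PREFIX ex: <http://example.org/vocab#>
SELECT ?gene ?term
WHERE {
  ?annotation ex:subject ?gene ;
              ex:term    ?term .
}
LIMIT 10
"""


def sparql_get_url(endpoint, query):
    """Build a GET URL with the URL-encoded query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Because every wrapped database answers the same protocol, a client such as a genome dashboard could send the same kind of request to each source and merge the results.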
Usage scenario: AJAX Gbrowse (http://genome.biowiki.org) [Figure: architecture diagram. Data sources: GO annotations, OBD disease/pheno annotations, genome server, MOD. Access layers: D2rq, D2rq, DAS, Sesame. Annotation info reaches the browser over sparql, DAS/2, sparql, sparql]
Conclusions Flexible hypernormalized schemas Performance penalties Too much freedom of expression? Ontologies + reasoners provide some constraints; eg SO Open world assumption Federation vs tight integration Tight integration is required for MODs As more data types become available dynamic integration will be key RDF and SPARQL are one solution
Thanks LBL Shengqiang Shu Mark Gibson Nicole Washington Seth Carbon John Day Richter Chris Smith Karen Eilbeck Sima Misra Suzanna Lewis FlyBase Dave Emmert Pinglei Zhou Peili Zhang Aubrey de Grey Paul Leyland William Gelbart HHMI Gerry Rubin GMOD, Nescent Scott Cain Sohel Merchant Eric Just Sierra Moxon Andrew Uzilov Brian Osborne Ian Holmes Lincoln Stein
 
end
Feature localisation Interbase Simplifies code All localisations relative Location Graph  (LG) Recursive/nested locations allowed
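The claim that interbase coordinates simplify code can be illustrated with a short Python sketch. Interbase positions count the gaps between bases, so an interval is half-open. The function names are hypothetical and the 1-based inclusive convention is included only for comparison.

```python
def to_interbase(start, end):
    """Convert a 1-based inclusive (start, end) to interbase (fmin, fmax)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def from_interbase(fmin, fmax):
    """Convert interbase (fmin, fmax) to 1-based inclusive (start, end).

    A zero-length interbase interval (fmin == fmax), such as an insertion
    site between two bases, has no 1-based inclusive equivalent.
    """
    if fmin < 0 or fmax <= fmin:
        raise ValueError("expected 0 <= fmin < fmax")
    return fmin + 1, fmax


def length(fmin, fmax):
    """Interval length is a plain difference, with no +1 correction."""
    return fmax - fmin


def overlaps(a, b):
    """True if two interbase intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]
```

For example, the first ten bases are `(0, 10)` in interbase terms, and two abutting intervals such as `(0, 10)` and `(10, 20)` do not overlap.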
Recursive location graphs Locations can be nested Finished genomes typically flat; depth(LG)=1 Unfinished genomes, heterochromatin may require 2 (rarely more) levels Features located relative to contigs Contigs located relative to chromosomes May be a requirement to change coordinates at each level independently
Nested LGs Redundant localisations can be used to ‘flatten’ LG Group>0 indicates denormalised/flattened LG - must be recalculated if group=0 coordinates change
Feature  Loc              Srcfeature  group
exon1    100..200[+]      contig1     0
exon1    12100..13100[+]  chrom1      1
contig1  12000..13000[+]  chrom1      0
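Recalculating a flattened location amounts to projecting a nested location through its parent. The Python sketch below shows one way to do that, assuming half-open interbase coordinates and strands written as +1 or -1; the `Loc` tuple and `project` function are hypothetical names. Under plain offset arithmetic, an exon at 100..200 on a contig placed at 12000..13000 projects to 12100..12200, which differs from the 12100..13100 given on the slide; the sketch does not try to resolve that difference.

```python
from typing import NamedTuple


class Loc(NamedTuple):
    """A location: source feature name, interbase fmin/fmax, strand (+1/-1)."""
    src: str
    fmin: int
    fmax: int
    strand: int


def project(inner: Loc, outer: Loc) -> Loc:
    """Project a location one level up the location graph.

    inner: where a feature sits on some source feature F.
    outer: where F itself sits on its own source.
    Returns where the feature sits on F's source.
    """
    if outer.strand == 1:
        fmin = outer.fmin + inner.fmin
        fmax = outer.fmin + inner.fmax
    else:
        # F is reverse-complemented on its source: measure from F's far end.
        fmin = outer.fmax - inner.fmax
        fmax = outer.fmax - inner.fmin
    return Loc(outer.src, fmin, fmax, inner.strand * outer.strand)


exon_on_contig = Loc("contig1", 100, 200, 1)
contig_on_chrom = Loc("chrom1", 12000, 13000, 1)
exon_on_chrom = project(exon_on_contig, contig_on_chrom)
```

If the contig's coordinates change, only this projection has to be rerun to refresh the redundant row.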
Relational featurelocs A relation between two or more locations Matches, sequence variants Indicated using rank column Use case: SNPs Simple way to query for variants introducing premature termination of translation Combine relational featurelocs and redundant featurelocs 3+ featureloc pairs: Sequence of SNP on reference and variant genome (+ location on reference) Same on transcripts Same on polypeptides
OWL entailment genomics use case SO defines ‘TE gene’ as: A SO:gene which is part_of a SO:TE In OWL: Class(TE_Gene complete Gene part_of(TE)) Result: Queries for ‘SO:TE_gene’ return features not explicitly annotated as such Compare: Chado Equivalent rules to be added PostgreSQL functions? Oboedit reasoner adapter?
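The effect of the TE gene definition can be imitated with a hand-coded rule, as in the Python sketch below. This is not an OWL reasoner and not an existing Chado function: it hard-codes the single definition "a gene that is part_of a TE", and the feature names and helper functions are invented for illustration.

```python
def infer_types(asserted_types, part_of):
    """Add the inferred type TE_gene to every gene that is part_of a TE.

    asserted_types: {feature name: asserted SO type}
    part_of: iterable of (part, whole) feature-name pairs
    Returns {feature name: set of asserted and inferred types}.
    """
    inferred = {name: {so_type} for name, so_type in asserted_types.items()}
    for part, whole in part_of:
        if (asserted_types.get(part) == "gene"
                and asserted_types.get(whole) == "TE"):
            inferred[part].add("TE_gene")
    return inferred


def features_of_type(inferred, so_type):
    """Query by type, seeing asserted and inferred types alike."""
    return sorted(name for name, types in inferred.items() if so_type in types)


types = {"te1": "TE", "gene1": "gene", "gene2": "gene"}
relations = [("gene1", "te1")]        # gene1 part_of te1
classified = infer_types(types, relations)
te_genes = features_of_type(classified, "TE_gene")
```

Here `gene1` is returned for a TE_gene query even though it was only ever annotated as a gene, which is the behaviour the slide attributes to OWL entailment.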
