Wikidata for biomedical
knowledge integration and
curation
Benjamin Good
The Scripps Research Institute
@bgood
bgood@scripps.edu
“knowledge”
• A lot
• Important
• Text
What are the
functions of
Fibronectin?
37186 articles
What are the functions of
the 238 ‘significant’ genes
that came up in my high
throughput screen??
What are the
functions of
Fibronectin?
37186 articles
…
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
“knowledge integration”
“curation”
“knowledge base”
Answers
Knowledge Bases
1,500+ listed at http://www.oxfordjournals.org/nar/database/a/
Applications of knowledge bases
• Find information
• Plan research
• “Known unknowns?”
• Interpret data
• Gene Ontology
Enrichment Analysis
Interesting Gene List
Gene Ontology, Pathway,
Network interpretation
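Enrichment analysis of this kind is commonly framed as a hypergeometric test: how surprising is the overlap between an interesting gene list and the genes annotated to one term? The sketch below is only an illustration under that assumption, not the method of any particular tool. The function name and all counts in the usage lines are made up, apart from the list size of 238 borrowed from the screen mentioned earlier.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """Probability of seeing at least `overlap` annotated genes in a random
    gene list of size `sample`, drawn from `population` genes of which
    `annotated` carry the term (upper tail of the hypergeometric distribution)."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes, 100 annotated to a term, 238 in the list.
# If one true annotation is missing from the knowledge base, the observed
# overlap drops from 5 to 4 and the evidence for the term gets weaker.
p_complete = enrichment_p_value(20000, 100, 238, 5)
p_missing_one = enrichment_p_value(20000, 100, 238, 4)
```

The comparison of the two values is the point: every annotation missing from the knowledge base makes a real signal look less significant.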
Knowledge bases are important tools
and will only grow more important
over time
Great!
BUT
1. Knowledge bases are not complete
2. Will get to this later...
Annotation
missing from
human GO
annotation.
Should be here!
(‘5 HT Receptor’ means ‘Serotonin Receptor’)
Circa 2010
Added to GO
Jan. 2016
First characterized 1996
(Kohen et al J Neurochem)
Interesting Gene List
Gene Ontology, Pathway,
Network interpretation
We don’t know what we are missing
• inflammatory response
• defense response
• Serotonin receptor activity? ?
• response to wounding
• immune response
Interesting Gene List
“Gene Ontology, it’s great, right?”
• “It sucks”
• “I only use it out of desperation”
WHY?!
Process of building knowledge bases
1. Do science
2. Publish it
3. Manually extract the knowledge
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
why does he look so down?
Many scientists, powerful tools,
comparatively little reward for
curating knowledge
100’s of thousands 100’s
More than 2 articles
published/minute
Professional biocuration does not scale
up to the rate of production
1. Do science
2. Publish it
3. Manually extract the knowledge
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
1. Knowledge bases are not complete
2. Knowledge needs integration
Knowledge is scattered,
integration brings it together
Merging knowledge bases:
the language barrier
“Methadone”
Interacts with:
“Moxifloxacin”
May treat:
Opioid-Related Disorders
ID:
N0000000174
ID:
4095
Molecular Weight:
309.44518 g/mol
…
= ?
= ?
= ?
= ?
= ?
= ?
ID:
DB00333
Manufactured by:
Roxane laboratories inc
Good for business, bad for science
Google Scholar search shows 469 papers about
“identifier mapping” in bioinformatics
What can we do?
Global Knowledge Platform
What would happen if everyone
was literally working on the same
database?
1. Split up work more effectively
2. Make integration the default
behavior
Wikidata is to data
as Wikipedia is to text
“Giving more people more access to more knowledge”
A free and open repository of knowledge
Managed by the Wikimedia Foundation,
which operates Wikipedia
It’s a
knowledge
base!
• Anyone
can edit
• Anyone
can use
Item: Q84
Item: Q414043 (RELN)
Statement
• Claim: Genomic start (property) = 103471784 (numeric value)
• Qualifiers: GenLoc assembly: GRCh38
• References: Stated in: Ensembl Release 83; Retrieved: 19 January 2016
https://www.wikidata.org/wiki/Q414043
Item: Q414043 (RELN)
Statement
• Claim: Encodes (property) = Reelin (protein) (item value)
• Qualifiers: (none shown)
• References: Stated in: NCBI homo sapiens annotation release 107; Retrieved: 19 January 2016
https://www.wikidata.org/wiki/Q414043
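The anatomy of a statement on these two slides (a claim made of a property and a value, optional qualifiers, and references) can be sketched as a small data structure. This is a deliberately simplified model for illustration, not Wikidata's actual data model or API; the class and field names are hypothetical, and the values are the ones shown for RELN.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Simplified, hypothetical model of one statement about an item."""
    item: str                                        # subject, e.g. "Q414043"
    prop: str                                        # property of the claim
    value: object                                    # a number, or another item
    qualifiers: dict = field(default_factory=dict)   # context for the claim
    references: dict = field(default_factory=dict)   # provenance of the claim


def is_referenced(statement: Statement) -> bool:
    """A statement is traceable only if it carries at least one reference."""
    return bool(statement.references)


# Numeric value, qualified by the genome assembly it refers to.
reln_start = Statement(
    item="Q414043",
    prop="genomic start",
    value=103471784,
    qualifiers={"GenLoc assembly": "GRCh38"},
    references={"stated in": "Ensembl Release 83",
                "retrieved": "19 January 2016"},
)

# Item value: the gene points at the protein it encodes.
reln_encodes = Statement(
    item="Q414043",
    prop="encodes",
    value="Reelin (protein)",
    references={"stated in": "NCBI homo sapiens annotation release 107",
                "retrieved": "19 January 2016"},
)
```

Keeping qualifiers and references next to each claim is what allows a consumer to ask not only "what is the genomic start?" but also "on which assembly, and according to whom?".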
A Giant Global Graph
These statements link together into a queryable graph
https://query.wikidata.org
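As a rough illustration of querying that graph, the sketch below builds a SPARQL query asking what the RELN item encodes, plus the URL that would send it to the query service. It rests on assumptions that should be checked before use: that P688 is the identifier of the "encodes" property, and that the service accepts GET requests at the /sparql path with a format=json parameter. The helper names are hypothetical, and nothing is sent over the network here.

```python
import re
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"  # assumed endpoint path
ENCODES = "P688"  # assumed property ID for "encodes"; verify on wikidata.org


def encodes_query(gene_qid: str) -> str:
    """Build a SPARQL query listing the products encoded by a gene item."""
    if not re.fullmatch(r"Q[1-9]\d*", gene_qid):
        raise ValueError(f"not a Wikidata item ID: {gene_qid!r}")
    return (
        "SELECT ?product ?productLabel WHERE {\n"
        f"  wd:{gene_qid} wdt:{ENCODES} ?product .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def request_url(query: str) -> str:
    """Encode the query as a GET URL that asks for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


reln_query = encodes_query("Q414043")
reln_url = request_url(reln_query)
```

The resulting URL could be fetched with any HTTP client (for example `urllib.request.urlopen`), ideally with a descriptive User-Agent header.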
We are seeding it with
biomedical data
• All human, mouse genes
and proteins
• All Gene Ontology terms
• All FDA approved drugs
• 9,000+ human diseases
Burgstaller et al (2016) Database (preprint in BioRxiv)
Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv)
Our seeds are largely
concepts linked to many
identifier systems
N identifiers per item
• Genes: 8
• Drugs: 18
• Diseases: 11
Burgstaller et al (2016) Database (preprint in BioRxiv)
Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv)
Facilitate
integration
with key
external
knowledge
bases
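One way to picture why items carrying many identifiers ease integration: each item acts as a hub, so translating between any two identifier systems takes one lookup instead of a dedicated pairwise mapping. The sketch below is a toy model under that assumption. The three identifier values are the Methadone IDs from the earlier slide, but the system labels (`kb_a`, `kb_b`, `kb_c`), the item key and the function names are hypothetical placeholders, since the slide does not name the source databases.

```python
def build_index(items):
    """Map every (identifier system, external ID) pair to its hub item."""
    index = {}
    for item_id, xrefs in items.items():
        for system, external_id in xrefs.items():
            index[(system, external_id)] = item_id
    return index


def translate(items, index, system, external_id, target_system):
    """Translate an ID from one system to another via the shared item.
    Returns None if the ID or the target system is unknown."""
    item_id = index.get((system, external_id))
    if item_id is None:
        return None
    return items[item_id].get(target_system)


# One hub item with three external identifiers (labels are placeholders).
ITEMS = {
    "item:methadone": {
        "kb_a": "N0000000174",
        "kb_b": "4095",
        "kb_c": "DB00333",
    },
}
INDEX = build_index(ITEMS)
```

With N identifier systems, a hub needs N links per concept, whereas pairwise mapping tables grow with the number of system pairs.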
Nurturing a multi-community
garden of biomedical knowledge
Gene Drug Disease
A Platform for knowledge integration and curation
Open data
Wikipedia(s)
Your Apps Here!
Application #1 (of many)
Burgstaller et al (2016) Database (preprint in BioRxiv)
Impact of Wikidata on Wikipedia
Gene Wiki
Version 1.
{{GNF_Protein_box | Name = Reelin| image = |
image_source = | PDB = {{PDB2|4AD9}} | HGNCid = 18512 |
MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83 |
IUPHAR = | ChEMBL = | OMIM = None | ECnumber = |
Homologene = 9349 | GeneAtlas_image1 = |
GeneAtlas_image2 = | GeneAtlas_image3 = |
Protein_domain_image = | Function =
{{GNF_GO|id=GO:0005515 |text = protein binding}}
{{GNF_GO|id=GO:0016787 |text = hydrolase activity}}
{{GNF_GO|id=GO:0046872 |text = metal ion binding}} |
Component = {{GNF_GO|id=GO:0005739 |text =
mitochondrion}} | Process = {{GNF_GO|id=GO:0008152
|text = metabolic process}} | Hs_EntrezGene = 51110 |
Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA =
NM_016027 | Hs_RefseqProtein = NP_057111 |
Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 |
Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174
| Hs_Uniprot = Q53H82 | Mm_EntrezGene = 212442 |
Mm_Ensembl = ENSMUSG00000025937 |
Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein =
NP_663356 | Mm_GenLoc_db = mm10 | Mm_GenLoc_chr =
1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end =
13660546 | Mm_Uniprot = Q99KR3 | path = PBB/51110}}
=
Gene Wiki
Version 2.
{{Infobox gene}}
• All data in
Wikidata
• 1 Lua script works
for all genes
=
(1 of these for every gene)
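The difference between the two versions can be illustrated with a toy renderer: once the data lives in a shared structured store, one generic function can draw the infobox for any gene, and no per-gene template has to be maintained. The real module is a Lua script on Wikipedia; the Python below is only a sketch of the idea, with hypothetical field keys and function name, using a few values from the Version 1 template above.

```python
# (label shown in the infobox, key in the structured record): hypothetical keys
FIELDS = [
    ("Symbol", "symbol"),
    ("Entrez Gene", "entrez_gene"),
    ("Ensembl", "ensembl"),
    ("UniProt", "uniprot"),
]


def render_infobox(record):
    """Render a plain-text infobox from a structured record.
    Fields missing from the record are skipped rather than left blank."""
    lines = [f"{label}: {record[key]}" for label, key in FIELDS if record.get(key)]
    return "\n".join(lines)


# Values taken from the hand-maintained template shown above.
lactb2 = {
    "symbol": "LACTB2",
    "entrez_gene": "51110",
    "ensembl": "ENSG00000147592",
    "uniprot": "Q53H82",
}
lactb2_box = render_infobox(lactb2)
```

The same `render_infobox` call works unchanged for any other record, which is the sense in which one script serves all genes.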
Application #2: Web Apollo Genome Browser
• Genome annotation data retrieved from Wikidata via SPARQL queries to https://query.wikidata.org
• Prototype achieved at recent San Diego hackathon
1 Putman et al (2016) (under review) (preprint in BioRxiv)
Microbial Genetic Data
• Widely distributed
• Difficult to query
• Not structured in a meaningful way
• A lot of interest from this community!
Microbial Genetic Data
Microbial genomes in Wikidata
• Loading genes,
proteins,
annotations for
120 reference
genomes.
• Completed 21
genomes so far
Putman et al (2016) (under review) (preprint in BioRxiv)
Microbiome modeling in Wikidata
Putman et al (2016) (under review) (preprint in BioRxiv)
1. Knowledge bases are not complete
2. Knowledge needs integration
Wikidata can help
Centralizing content while distributing labor
Open data
Wikipedia(s)
Your Apps Here!
Thanks!
Gene Wikidata Team
Andra Waagmeester (Micelio)
* Sebastian Burgstaller (Scripps)
* Tim Putman (Scripps)
* Elvira Mitraka (U Maryland)
Julia Turner (Scripps)
Justin Leong (UBC)
Lynn Schriml (U Maryland)
Paul Pavlidis (UBC)
Andrew Su (Scripps)
Ginger Tsueng (Scripps)
Contact
bgood@scripps.edu
* First author on manuscript cited in this presentation
Ben Tim
Andra
Elvira
Sebastian
Some Gene Wiki team members
enjoying their best paper award
at SWAT4LS, Dec. 2015
Adapted logo


Editor's Notes

  • #6: Databases. Obviously much more flexible. You can ask them questions (and make pretty pictures that are dynamic).
  • #7: “known unknowns” ?? If I want X, what Y should I test?
  • #13: Though it is a child of the more generic GO annotation to ‘G protein coupled receptor activity’ Kohen 1996, J Neurochem.
  • #19: Given a list of active genes produced from an experiment: what key biological processes are happening in the cells? What diseases are these genes associated with? Given a list of genetic variations: what diseases is a patient more susceptible to? What drugs should they take or avoid? Etc.
  • #21: Given a list of active genes produced from an experiment: what key biological processes are happening in the cells? What diseases are these genes associated with? Given a list of genetic variations: what diseases is a patient more susceptible to? What drugs should they take or avoid? Etc.
  • #22: Knowledge is either not shared (stuck in your head or your notebook) or it is shared as text and images in journal articles. There are more than 1 million articles added to PubMed each year.
  • #23: Given a list of active genes produced from an experiment: what key biological processes are happening in the cells? What diseases are these genes associated with? Given a list of genetic variations: what diseases is a patient more susceptible to? What drugs should they take or avoid? Etc.
  • #25: Divide and conquer algorithm for creating the knowledge base of everything. Splitting is hard because it's very hard to know what other groups are doing, there is no centralized coordination, and decisions about what should be curated are made based on what gets funded rather than what is most useful for the collective.
  • #26: The principal problem of knowledge integration is establishing which entities are shared between different systems. Methadone N0000002109 (Opioid-Related Disorders) http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3422823/
  • #29: It would be much easier to see what other people were doing. By operating in the same database, it is much more likely that you will end up re-using entities that already exist rather than creating new ones and merging them later. Just like in your own local database.
  • #40: This is the first application of the work that we have done
  • #45: https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#Update