MetaCrowd: Crowdsourcing
Gene Expression Metadata
Quality Assessment
Amrapali Zaveri and Michel Dumontier
@AmrapaliZ
amrapali.zaveri@maastrichtuniversity.nl
Bio-ontologies 2017, July 24-25, 2017
BIOMEDICAL DATA ON THE WEB
BIOMEDICAL METADATA ON THE WEB — SIGNIFICANCE
➤ To (re-)use this data, we need to understand the
structure of the datasets and the experimental conditions under
which they were produced
➤ We require accurate, structured and complete descriptions of
the data, defined as metadata
➤ Good-quality metadata is essential for finding, interpreting, and
reusing existing data beyond what the original investigators
envisioned
➤ It facilitates a data-driven approach: combining and analyzing
similar data to uncover novel insights or more subtle
trends in the data
BIOMEDICAL METADATA ON THE WEB - CHALLENGES
➤ Size and complexity
➤ Quality measures
➤ Time-consuming
➤ Costly, requires experts
HYPOTHESIS
Crowdsourcing, i.e. relying on non-expert workers, can
be used to curate large-scale digital
biomedical metadata on the Web.
CROWDSOURCING - WHAT & WHY?
TIME MONEY
➤ Highly parallelizable tasks
➤ Work is broken down into
smaller — ‘micro’ — pieces
that can be solved
independently
➤ Tasks based on human skills
not easily replicable by machines
➤ Non-expert workers can perform
the tasks with a minimal
payment
Consolidated answers solve scientific problems!
RELATED WORK - CROWDSOURCING BIOMEDICAL RESEARCH
➤ Improve automated mining of biomedical text for annotating
diseases [1]
➤ Curation of gene-mutation relations [2]
➤ Identifying relationships between drugs and side-effects [3],
drugs and their indications [4]
➤ Annotation of microRNA functions [5].
GENE EXPRESSION OMNIBUS
➤ Unstructured
➤ Spreadsheet submission
➤ No controlled vocabulary
➤ Heterogeneity of terms
➤ Size complexity
➤ ~Billion records
Meta-analysis from GEO data
A common rejection module (CRM) for acute rejection across multiple
organs identifies novel therapeutics for organ transplantation
Khatri et al. JEM. 210 (11): 2205; DOI: 10.1084/jem.20122709
Metadata issues:
• Missing
• Incomplete
• Inaccurate
GEO METADATA - EXAMPLE
44,000,000 key: value pairs
GEO METADATA - QUALITY PROBLEMS FOR KEYS
➤ Minor spelling discrepancies
    • genotype/varaiation, genotype/varat,
      genotype/varation, genotype/variaion,
      genotype/variataion, genotype/variation
➤ Different syntactic representations
    • age (years), age(yrs) and age_year
➤ Different terms to denote one concept
    • disease, illness, healthy control
➤ Two different key categories in one key name
    • disease/cell type, tissue/cell line,
      treatment age
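The first two problem classes above (spelling discrepancies and differing syntactic representations) can be partly caught by simple string normalization and fuzzy matching. The following is a minimal sketch, not the method used in MetaCrowd: the function names and the 0.8 similarity threshold are illustrative assumptions, and only the key names come from the slide. String similarity does not address synonyms such as disease versus illness, which is one reason human judgment is brought in.

```python
import re
from difflib import SequenceMatcher

# Key names quoted on the slide.
KEYS = [
    "genotype/varaiation", "genotype/varat", "genotype/varation",
    "genotype/variaion", "genotype/variataion", "genotype/variation",
    "age (years)", "age(yrs)", "age_year",
]


def normalize(key: str) -> str:
    """Lower-case a key and collapse punctuation/whitespace runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", key.lower()).strip()


def group_similar(keys, threshold=0.8):
    """Greedily group keys whose normalized form is close to a group's first member.

    The threshold is an assumption chosen for illustration, not a value from the talk.
    """
    groups = []
    for key in keys:
        for group in groups:
            ratio = SequenceMatcher(None, normalize(key), normalize(group[0])).ratio()
            if ratio >= threshold:
                group.append(key)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([key])
    return groups


groups = group_similar(KEYS)
for group in groups:
    print(group)
```

On these nine keys the sketch yields two groups, one for the genotype/variation spellings and one for the age variants.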
METACROWD METHODOLOGY
➤ GEO metadata: 8 GEO keys, 5 values each
    • cell line
    • disease
    • gender/sex
    • genotype
    • strain
    • time
    • tissue
    • treatment
➤ Key definitions: Semanticscience Integrated Ontology (SIO)
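One way to picture the unit of work implied above is a small record holding a GEO key, the category it is being checked against, and up to five sample values. This is a hypothetical sketch: the class, the function, the question wording, and the sample values are all assumptions made for illustration. Only the eight categories, the five-values-per-key figure, and the example key name come from the slides.

```python
from dataclasses import dataclass, field

# The eight key categories listed on the slide.
CATEGORIES = ["cell line", "disease", "gender/sex", "genotype",
              "strain", "time", "tissue", "treatment"]
VALUES_PER_KEY = 5  # "5 values each"


@dataclass
class MicroTask:
    key: str                      # GEO metadata key shown to the worker
    category: str                 # category the key is checked against
    sample_values: list = field(default_factory=list)

    def question(self) -> str:
        # Hypothetical wording; the slides do not show the actual question text.
        shown = ", ".join(self.sample_values)
        return (f"Does the key '{self.key}' (example values: {shown}) "
                f"belong to the category '{self.category}'?")


def build_task(key, category, values):
    """Create a microtask with at most VALUES_PER_KEY distinct sample values."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    distinct = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order
    return MicroTask(key, category, distinct[:VALUES_PER_KEY])


# The key name appears later in the slides; the values are made up for the example.
task = build_task("cell line source age", "cell line",
                  ["12", "12", "30", "45", "51", "60", "72"])
print(task.question())
```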
MICRO TASKS — CROWDFLOWER
MICRO TASKS — SETTINGS
• 3 workers per task
• ‘Dynamic Judgment’: up to 7 workers per task, with a 0.8 confidence threshold
• No. of gold standard questions: 60
• Min. accuracy: 80%
• 5 cents per judgment
• 10 tasks per page
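The interplay of the first two settings can be sketched as a stopping rule: start with 3 judgments, then keep requesting more until the leading answer reaches 0.8 confidence or 7 judgments have been collected. This is a simplified illustration under one loud assumption: confidence is taken here as the plain vote share of the leading answer, whereas the platform's own confidence score may be computed differently (for example, weighted by worker trust). The function names are hypothetical.

```python
from collections import Counter

MIN_JUDGMENTS = 3     # "3 workers per task"
MAX_JUDGMENTS = 7     # 'Dynamic Judgment' ceiling
MIN_CONFIDENCE = 0.8  # confidence threshold


def aggregate(judgments):
    """Return (top_answer, confidence); confidence is the unweighted vote share."""
    answer, votes = Counter(judgments).most_common(1)[0]
    return answer, votes / len(judgments)


def collect(judgment_stream):
    """Consume judgments until the threshold or the ceiling is reached.

    Returns (answer, confidence, number_of_judgments_used).
    """
    seen = []
    for judgment in judgment_stream:
        seen.append(judgment)
        if len(seen) < MIN_JUDGMENTS:
            continue  # always gather the minimum first
        answer, confidence = aggregate(seen)
        if confidence >= MIN_CONFIDENCE or len(seen) >= MAX_JUDGMENTS:
            return answer, confidence, len(seen)
    raise ValueError("judgment stream ended before a decision was reached")


unanimous = collect(iter(["yes", "yes", "yes", "yes"]))
contested = collect(iter(["yes", "no", "yes", "yes", "yes", "no", "yes"]))
print(unanimous, contested)
```

A unanimous task stops after the initial 3 judgments, while a contested one draws extra workers, which is why the average number of judgments per task can fall anywhere between 3 and 7.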
RESULTS OVERVIEW
No. of microtasks (keys)                 1643
Total no. of workers                     145
Total no. of judgments                   7835
Overall accuracy                         0.934
No. of gold standard questions           60
Accuracy on gold standard questions      0.930
Total cost                               $451
Total time                               1 hour
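A few derived figures make the overview easier to relate to the task settings. The sketch below only does arithmetic on numbers reported in the slides (1643 microtasks, 7835 judgments, $451, 5 cents per judgment); the variable names are illustrative, and the slides do not say what accounts for the difference between the total cost and the per-judgment payments.

```python
microtasks = 1643
judgments = 7835
total_cost_usd = 451
price_per_judgment_usd = 0.05  # "5 cents per judgment"

# Average number of workers who judged each key.
judgments_per_task = judgments / microtasks

# Effective cost figures implied by the reported total.
cost_per_judgment = total_cost_usd / judgments
cost_per_task = total_cost_usd / microtasks

# Payments implied by the per-judgment price alone.
worker_payments = judgments * price_per_judgment_usd

print(f"{judgments_per_task:.2f} judgments per microtask")
print(f"${cost_per_judgment:.4f} per judgment, ${cost_per_task:.2f} per microtask")
print(f"${worker_payments:.2f} at 5 cents per judgment")
```

The average of roughly 4.8 judgments per microtask sits between the 3-worker minimum and the 7-worker ceiling of the settings.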
RESULTS FOR EACH KEY CATEGORY
Key category   No. of keys   True positive, false positive   Accuracy
Cell line      109           711, 21                         0.955
Disease        85            412, 10                         0.937
Gender         72            645, 23                         0.902
Genotype       112           566, 10                         0.984
Strain         181           788, 4                          0.966
Time           698           2489, 120                       0.908
Tissue         145           567, 6                          0.947
Treatment      242           846, 49                         0.944
RESULTS FOR EACH KEY CATEGORY — EXAMPLES (1)
Workers classified incorrectly for:
• Cell line
    • cell line initiation date, cell line source age
• Disease
    • diseasestatus
• Gender
    • cell sex
• Strain
    • strain ID
• Tissue
    • tissue & age, tissue/development stage
CONCLUSIONS & LIMITATIONS
• Crowdsourcing, i.e. relying on non-expert workers, can be used to curate
large-scale digital gene expression metadata on the Web.
• Several keys did not achieve consensus among the
workers, due to:
    • lack of semantically annotated values
    • ambiguous nomenclature of keys as well as values
    • values indicating that a key belongs to more than one
      category
    • inconsistent usage of the particular metadata key
CROWDSOURCING GEO METADATA QUALITY — FUTURE WORK
• Perform crowdsourcing on values and key: value pairs
• Implement a semi-automated approach to identify similar keys
using ontologies
• Design a pipeline that combines a semi-automated method,
crowdsourcing, and experts
REFERENCES
[1] Good, B. M., Nanis, M., Wu, C. & Su, A. I. in
Biocomputing 2015, 282–293 (World Scientific, 2014).
[2] Burger, J. D. et al. Hybrid curation of gene–mutation relations
combining automated extraction and crowdsourcing. Database
2014, bau094 (2014).
[3] Gottlieb, A., Hoehndorf, R., Dumontier, M. & Altman, R. B.
Ranking adverse drug reactions with crowdsourcing. J. Med.
Internet Res. 17, e80 (2015).
[4] Khare, R. et al. Scaling drug indication curation through
crowdsourcing. Database 2015, bav016 (2015).
[5] Vergoulis, T. et al. mirPub: a database for searching microRNA
publications. Bioinformatics 31, 1502–1504 (2015).
THANK YOU!
QUESTIONS?