Workshop overview
• Y/our backgrounds and interests and what we want
• How does mining work and what can it do for YOU/Cochrane?
• Demonstration with emphasis on dictionaries.
• What would YOU like a system to do?
• Your dictionary/ies in action
• Advanced (chemistry, diagram mining)
• ANY early adopter can obtain our (Open) software and run it at
home for any resource (medical, agricultural, government, climate,
etc.). We will help you during the next 24 hours.
• All material CC BY.
Cochrane UK & Ireland
Symposium 2016,
Birmingham, UK, 2016-03-15
Let the Machine Help
with your
Systematic Reviews
Peter Murray-Rust [1,2]
Christopher Kittel [2]
[1] University of Cambridge
[2] TheContentMine
Simple, Universal,
Knowledge creation and re-use
“The Right to Read is the Right to Mine” (Peter Murray-Rust, 2011)
http://contentmine.org
Resources
• Europe PubMedCentral http://europepmc.org/
• ContentMine toolkit https://github.com/ContentMine/
• Wikidata: https://www.wikidata.org/wiki/Wikidata:Main_Page
• Hypothes.is https://hypothes.is/ [1]
• Etherpad: http://pads.cottagelabs.com/p/cochrane2016
• Note: early adopters can obtain our (Open) software and
run it at home…
• [1] Not used in CochraneBham workshop
Europe PubMedCentral
Cochrane workshop2016
CONTENTMINE Complete OPEN Platform for Mining Scientific Literature
[Diagram. Sources: daily crawl of EPMC, arXiv, CORE, HAL, (UNIV repos), ToC services, publisher sites (PloSONE, BMC, peerJ… Nature, IEEE, Elsevier…). Inputs: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs. Tools: getpapers (query, catalogue), quickscrape (crawl), norma (Normalizer, Structurer, Semantic Tagger; text, data, figures), ami (search, lookup). Normalized form: Semantic ScholarlyHTML with abstract, methods, references, captioned figures, HTML tables. Throughput: 30,000 pages/day. Community contributions: scrapers, queries, taggers, dictionaries, plugins (Chem, Phylo, Trials, Crystal, Plants). Results: Facts; Visualization and Analysis.]
Dictionaries!
MINING with sections and dictionaries
[Diagram: document sections (abstract, methods, references, captioned figures, HTML tables) searched with dictionaries (Dict A, Dict B); image captions and table captions are included. Annotation: W3C Annotation / https://hypothes.is/]
Disease Dictionary (ICD-10)
<dictionary title="disease">
<entry term="1p36 deletion syndrome"/>
<entry term="1q21.1 deletion syndrome"/>
<entry term="1q21.1 duplication syndrome"/>
<entry term="3-methylglutaconic aciduria"/>
<entry term="3mc syndrome"/>
<entry term="corpus luteum cyst"/>
<entry term="cortical blindness" />
SELECT DISTINCT ?thingLabel WHERE {
?thing wdt:P494 ?wd .
?thing wdt:P279 wd:Q12136 .
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" }
}
wdt:P494 = ICD-10 (P494) identifier
wd:Q12136 = disease (Q12136) abnormal condition that
affects the body of an organism
Wikidata ontology for disease
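The step from a SPARQL result to a dictionary file can be sketched in Python. This is a minimal illustration, not the ContentMine implementation: the function name `bindings_to_dictionary`, the lowercasing of labels, and the sample data are assumptions; only the standard SPARQL JSON result layout (`results` / `bindings` / variable / `value`) and the `<dictionary>`/`<entry term="…"/>` shape shown above are taken as given.

```python
import xml.etree.ElementTree as ET


def bindings_to_dictionary(sparql_json, title, label_var="thingLabel"):
    """Build a <dictionary> element from SPARQL JSON results (hypothetical helper)."""
    root = ET.Element("dictionary", {"title": title})
    seen = set()
    for row in sparql_json["results"]["bindings"]:
        # Assumption: terms are stored lowercased and de-duplicated.
        term = row[label_var]["value"].strip().lower()
        if term and term not in seen:
            seen.add(term)
            ET.SubElement(root, "entry", {"term": term})
    return root


# Hypothetical sample shaped like a SPARQL JSON response.
sample = {
    "results": {
        "bindings": [
            {"thingLabel": {"type": "literal", "value": "Cortical blindness"}},
            {"thingLabel": {"type": "literal", "value": "corpus luteum cyst"}},
            {"thingLabel": {"type": "literal", "value": "cortical blindness"}},
        ]
    }
}

disease = bindings_to_dictionary(sample, "disease")
disease_terms = [entry.get("term") for entry in disease.findall("entry")]
print(ET.tostring(disease, encoding="unicode"))
```

Fetching the JSON from the Wikidata query service is left out so that the sketch runs offline.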
• ChEBI (chemicals at EBI)
ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/names_3star.tsv.gz
• combined with WIKIDATA: World Health Organisation International Nonproprietary Name (P2275)
• => 4947 items in the dictionary (inn.xml)
DRUGS
<dictionary title="inn">
<entry term="(r)-fenfluramine"/>
<entry term="abacavir"/>
<entry term="abafungin"/>
<entry term="abafungina"/>
<entry term="abafungine"/>
<entry term="abafunginum"/>
<entry term="abamectin"/>
<entry term="abarelix"/>
<entry term="abatacept"/>
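How a dictionary like this might be applied to text can be sketched as follows. This is an assumption-laden illustration rather than how ami itself works: `load_terms`, `find_terms`, the three-entry dictionary and the sample sentence are all hypothetical, and matching is done with a simple case-insensitive whole-word regular expression.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical three-entry excerpt in the same shape as inn.xml.
INN_XML = """<dictionary title="inn">
  <entry term="abacavir"/>
  <entry term="abamectin"/>
  <entry term="abatacept"/>
</dictionary>"""


def load_terms(xml_text):
    """Return the term attribute of every <entry> in a dictionary."""
    root = ET.fromstring(xml_text)
    return [entry.get("term") for entry in root.iter("entry")]


def find_terms(text, terms):
    """Return (term, start offset) for each whole-word, case-insensitive hit."""
    hits = []
    for term in terms:
        # re.escape protects terms such as "(r)-fenfluramine".
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((term, match.start()))
    return sorted(hits, key=lambda hit: hit[1])


text = "Patients received Abacavir; abatacept was given to controls."
hits = find_terms(text, load_terms(INN_XML))
print(hits)
```

Looping over every term is fine for a sketch; a dictionary with thousands of entries would probably want a single compiled pattern or a trie.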
<dictionary title="funders">
<!-- from http://help.crossref.org/funder-registry with thanks -->
<entry id="http://dx.doi.org/10.13039/100001436"
term="1675 Foundation"/>
<entry id="http://dx.doi.org/10.13039/100004343"
term="3M"/>
<entry id="http://dx.doi.org/10.13039/501100005957"
term="8020 Promotion Foundation"/>
<entry id="http://dx.doi.org/10.13039/501100007139"
term="A Richer Life Foundation"/>
<entry id="http://dx.doi.org/10.13039/100006543"
term="A World Celiac Community Foundation"/>
<entry id="http://dx.doi.org/10.13039/100001962"
term="A-T Children's Project"/>
<entry id="http://dx.doi.org/10.13039/100008456"
term="A. Alfred Taubman Medical Research Institute"/>
11566 entries
Funders Dictionary
Dengue Mosquito
<dictionary name="genus">
<entry term="Aa"/>
<entry term="Aaaba"/>
<entry term="Aacanthocnema"/>
<entry term="Aaosphaeria"/>
<entry term="Aaptos"/>
<entry term="Aaptosyax"/>
<entry term="Aaroniella"/>
<entry term="Aaronsohnia"/>
<entry term="Abablemma"/>
Genera from NCBI TaxDump
<dictionary title="hgnc">
<entry term="A1BG" name="alpha-1-B glycoprotein"/>
<entry term="A1BG-AS1" name="A1BG antisense RNA 1"/>
<entry term="A1CF"
name="APOBEC1 complementation factor"/>
<entry term="A2M" name="alpha-2-macroglobulin"/>
<entry term="A2M-AS1"
name="A2M antisense RNA 1 (head to head)"/>
<entry term="A2ML1" name="alpha-2-macroglobulin-like 1"/>
<entry term="A2ML1-AS1" name="A2ML1 antisense RNA 1"/>
Human Genes (HGNC)
<entry term="Aaas"
name="achalasia, adrenocortical insufficiency, alacrimia"/>
<entry term="Aacs" name="acetoacetyl-CoA synthetase"/>
<entry term="Aadac"
name="arylacetamide deacetylase (esterase)"/>
<entry term="Aadacl2"
name="arylacetamide deacetylase-like 2"/>
<entry term="Aadacl3"
name="arylacetamide deacetylase-like 3"/>
<entry term="Aadat" name="aminoadipate aminotransferase"/>
<entry term="Aaed1"
name="AhpC/TSA antioxidant enzyme domain containing 1"/>
<entry term="Aagab"
name="alpha- and gamma-adaptin binding protein"/>
<entry term="Aak1" name="AP2 associated kinase 1"/>
<entry term="Aamdc"
name="adipogenesis associated Mth938 domain containing"/>
<entry term="Aamp"
name="angio-associated migratory protein"/>
Mouse genes (JAXson)
Ebola!
<dictionary title="tropicalVirus">
<entry term="ZIKV" name="Zika virus"/>
<entry term="Zika" name="Zika virus"/>
<entry term="DENV" name="Dengue virus"/>
<entry term="Dengue" name="Dengue virus"/>
<entry term="CHIKV" name="Chikungunya virus"/>
<entry term="Chikungunya" name="Chikungunya virus"/>
<entry term="WNV" name="West Nile virus"/>
<entry term="West Nile" name="West Nile virus"/>
<entry term="YFV" name="Yellow fever virus"/>
<entry term="Yellow fever" name="Yellow fever virus"/>
<entry term="HPV" name="Human papilloma virus"/>
<entry term="Human papilloma virus"
name="Human papilloma virus"/>
</dictionary>
Terms co-occurring with “Zika”
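In the dictionary above, several surface forms share one `name`. A minimal sketch of how that could be used to collapse abbreviations and full names into one count per virus is given below. The helper `canonical_counts`, the shortened dictionary and the sample sentence are hypothetical, and matching is deliberately case-sensitive here (an assumption, chosen because many terms are abbreviations).

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened copy of the tropicalVirus dictionary, for illustration only.
VIRUS_XML = """<dictionary title="tropicalVirus">
  <entry term="ZIKV" name="Zika virus"/>
  <entry term="Zika" name="Zika virus"/>
  <entry term="DENV" name="Dengue virus"/>
  <entry term="Dengue" name="Dengue virus"/>
</dictionary>"""


def canonical_counts(text, xml_text):
    """Count hits per canonical name, summing over all synonyms."""
    counts = Counter()
    for entry in ET.fromstring(xml_text).iter("entry"):
        pattern = r"(?<!\w)" + re.escape(entry.get("term")) + r"(?!\w)"
        hits = len(re.findall(pattern, text))
        if hits:
            counts[entry.get("name")] += hits
    return counts


text = "ZIKV and DENV co-circulate; Zika and Dengue share a mosquito vector."
virus_counts = canonical_counts(text, VIRUS_XML)
print(virus_counts)
```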
<dictionary title="cochrane">
<entry term="Cochrane Library"/>
<entry term="Cochrane Reviews"/>
<entry
term="Cochrane Central Register of Controlled Trials"/>
<entry term="Cochrane"/>
<entry term="randomize"/>
<entry term="meta-analysis"/>
<entry term="Embase"/>
<entry term="MEDLINE"/>
<entry term="eligibility"/>
<entry term="exclusion"/>
<entry term="outcome"/>
<entry term="Review Manager"/>
<entry term="STATA"/>
<entry term="RCT"/>
</dictionary>
Terms lexically related to “meta-analysis”
Mining strategy
• Discover; negotiate permissions => bibliography
• Crawl / scrape (download) documents AND supplemental material
• Normalize: PDF => XML
• Index: facets => facts and snippets (“entities”)
• Interpret/analyze entities => relationships, aggregations (“Transformative”)
• Publish
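The "Index" step (facts and snippets) can be illustrated with a small sketch that records each hit together with some surrounding context. The function `snippets`, the `pre`/`exact`/`post` field names, the context width and the sample sentence are assumptions for illustration, not a description of the ContentMine output format.

```python
import re


def snippets(text, term, width=20):
    """Return one dict per hit: the matched text plus context on either side."""
    found = []
    for match in re.finditer(re.escape(term), text):
        found.append({
            "pre": text[max(0, match.start() - width):match.start()],
            "exact": match.group(0),
            "post": text[match.end():match.end() + width],
        })
    return found


text = "We searched MEDLINE and Embase for trials."
facts = snippets(text, "Embase", width=10)
print(facts)
```

Keeping the context next to each hit lets a reviewer judge a fact without reopening the paper.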
CONTENTMINE Complete OPEN Platform for Mining Scientific Literature
[Diagram. Sources: daily crawl of EuPMC, arXiv, CORE, HAL, (UNIV repos), ToC services, publisher sites (PloSONE, BMC, peerJ… Nature, IEEE, Elsevier…). Tools: getpapers (query, catalogue), quickscrape (crawl), norma (Normalizer, Structurer, Semantic Tagger), ami (search, lookup). Normalized form: Semantic ScholarlyHTML; 30,000 pages/day. Community: scrapers, queries, taggers, plugins (Chem, Phylo, Trials, Crystal, Plants). Results: Facts; Visualization and Analysis.]
Demo
PMR runs getpapers and ami
Chris runs Python visualization of drug co-occurrence
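A drug co-occurrence count of the kind mentioned in the demo could look roughly like this. It is only a sketch: the `cooccurrence` function and the per-paper drug lists are hypothetical stand-ins for whatever the mining step actually produced, and the demo's own code is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(papers):
    """Count, for each pair of drugs, the number of papers mentioning both."""
    counts = Counter()
    for drugs in papers.values():
        # set() so that a drug mentioned twice in one paper counts once.
        for pair in combinations(sorted(set(drugs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical mining output: drugs found in each paper.
papers = {
    "paper1": ["abacavir", "abatacept"],
    "paper2": ["abacavir", "abatacept", "abamectin", "abacavir"],
    "paper3": ["abamectin"],
}

pairs = cooccurrence(papers)
for pair, count in pairs.most_common():
    print(pair, count)
```

The resulting counts can be fed to any plotting library as edge weights of a graph or cells of a heat map.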
Systematic Reviews
Can we:
• eliminate true negatives automatically?
• extract data from formulaic language?
• mine diagrams?
• annotate existing sources?
• forward-reference clinical trials?
Polly has 20 seconds to read this paper…
…and 10,000 more
ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due
to time pressures, we split this between 6
researchers. It took about 2-3 days of work
(working only on this) to get through
~1,600 papers each. So, at a minimum this
equates to 12 days of full-time work (and
would normally be done over several weeks
under normal time pressures).”
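The figures in Polly's account can be checked with a few lines of arithmetic. The variable names are illustrative; the only inputs are the numbers quoted above (10,000 abstracts, 6 researchers, 2-3 days each) and the 20 seconds per paper from the previous slide.

```python
abstracts = 10_000
researchers = 6
days_each_low, days_each_high = 2, 3
seconds_per_abstract = 20

# Share per researcher: about 1,667, in line with the "~1,600 each" quoted.
per_researcher = abstracts / researchers

# Total effort in person-days: 12 at the low end, 18 at the high end.
person_days_low = researchers * days_each_low
person_days_high = researchers * days_each_high

# Pure reading time at 20 seconds per abstract, in hours.
reading_hours = abstracts * seconds_per_abstract / 3600

print(round(per_researcher), person_days_low, person_days_high, round(reading_hours, 1))
```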
400,000 Clinical Trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s happened in the last 6 years?
Search the whole scientific literature
For “2009-0100068-41”
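Searching a corpus for one trial identifier can be sketched as a literal, boundary-aware match. Everything here apart from the identifier string itself is hypothetical: the `find_identifier` helper and the three sample texts are made up for illustration, and no assumption is made about the identifier's official format.

```python
import re


def find_identifier(docs, identifier):
    """Return the ids of documents that contain the identifier as a whole token."""
    # Reject matches embedded in a longer run of digits, letters or hyphens.
    pattern = re.compile(r"(?<![\w-])" + re.escape(identifier) + r"(?![\w-])")
    return sorted(doc_id for doc_id, text in docs.items() if pattern.search(text))


# Hypothetical full texts keyed by a document id.
docs = {
    "paperA": "Registered as 2009-0100068-41 in a trial registry.",
    "paperB": "No registration number was reported.",
    "paperC": "See record 12009-0100068-410 for details.",
}

matching = find_identifier(docs, "2009-0100068-41")
print(matching)
```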
Diagram Mining
[Figure: plot of Ln bacterial load per fly (y-axis ticks from 6.0 to 11.5) against days post-infection (x-axis ticks 0 to 5)]
Bitmap Image and Tesseract OCR
