2/14/2014
Biotea, RDF4PMC

RDF4PMC, RDFizing PubMed Central
Alexander Garcia (1), Leyla Jael García Castro (2), Casey McLaughlin (1)
(1) Florida State University
(2) Universitat Jaume I

Outline
• The Biotea project
• Why Semantic Web Technologies?
• RDF4PMC in a nutshell
• Architecture
• RDFization process
  • PMC RDFization
  • Content enrichment
  • Some numbers for RDF4PMC
  • Architecture
• Using the data
  • SPARQL
  • Bio2RDF integration
  • Web services
  • A first prototype
• Challenges and Lessons
• Currently working on…
• Future Work
• Conclusions
• Acknowledgments

Biotea
“Scholarly data and documents are of most value when they are interconnected rather than independent” (Christine L. Borgman)
• Methodologies, methods and techniques supporting semantic enrichment of scholarly communication
• Once enriched, then how is this changing our user experience?

Biotea
“Scholarly data and documents are of most value when they are interconnected rather than independent” (Christine L. Borgman)
• How are publications connected to each other?
• Putting together explicit assertions from different papers to form new implicit assertions
• Semantic Web Technology supporting scholarly communication, Literature Based Discovery and the Search-Retrieval-and-Interacting-with-the-Document (SRID) processes

Why SWT for research documents
• Generates an adaptable open approach, the data becomes the platform
• The SW delivers an integrative platform
• Makes it easier for the community to build over the platform
• Simplifies programmatic access to information
  • Retrieve all papers that have a component X (CHEBI) and the cellular location in GO terms
  • As simple as relating terminologies
• Delivers Social Network ready content

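The kind of question on this slide (papers annotated with a CHEBI component and with a GO cellular location) could be sketched as a SPARQL query assembled in Python. This is only a sketch under assumptions: it borrows the ao:annotatesResource / ao:hasTopic annotation pattern from the SPARQL slides later in this deck, and the helper name papers_with_topics, the prefix IRIs and the two example term IRIs are illustrative choices, not part of the original slides.

```python
# Sketch: build a SPARQL query for articles annotated with ALL of the given topics.
# Assumption: annotations are modelled as in the deck's later SPARQL example
# (ao:annotatesResource / ao:hasTopic); helper and variable names are hypothetical.

PREFIXES = """\
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX ao: <http://purl.org/ao/core/>
"""


def papers_with_topics(topic_iris):
    """Return a SPARQL query selecting PMIDs of articles annotated with every IRI."""
    patterns = []
    for i, iri in enumerate(topic_iris):
        # One annotation variable per topic, so every topic has to be matched.
        patterns.append(
            f"  ?ann{i} ao:annotatesResource ?article ;\n"
            f"        ao:hasTopic <{iri}> ."
        )
    body = "\n".join(patterns)
    return (
        f"{PREFIXES}"
        "SELECT DISTINCT ?pmid\n"
        "WHERE {\n"
        "  ?article a bibo:AcademicArticle ;\n"
        "           bibo:pmid ?pmid .\n"
        f"{body}\n"
        "}"
    )


if __name__ == "__main__":
    # Example IRIs (assumed): CHEBI:60004 "mixture" and GO:0005739 "mitochondrion".
    print(papers_with_topics([
        "http://purl.obolibrary.org/obo/CHEBI_60004",
        "http://purl.obolibrary.org/obo/GO_0005739",
    ]))
```

Using one annotation variable per topic keeps the query a plain conjunction, which is the "relating terminologies" idea in its simplest form.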
RDF4PMC in a nutshell
• Delivers an interoperable, interlinked, and self-describing document model in the biomedical domain.
• A network of interconnected documents
• Semantic infrastructure for PMC
• An interface to the Web of Data
• A knowledge model for biomedical literature, easily extendible

RDF4PMC in a nutshell
• RDFizing biomedical literature by orchestrating ontologies such as
  • DoCO, BIBO, DC, FOAF, W3C PROV, and others
• Datasets are available
  • RDF for metadata and content
  • RDF for annotations from text-mining
• RDFizator will be available
  • Adding other ontologies and annotators is possible
  • Working with XML from other sources is possible

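To make the idea of orchestrating several vocabularies concrete, here is a minimal sketch that serialises one toy article as N-Triples. The predicates (bibo:pmid, dcterms:title, dcterms:isPartOf, cnt:chars) and classes are the ones that appear in the SPARQL slides of this deck; the example.org URIs, the PMID and the text are invented, and the code is not the RDFizator itself.

```python
# Sketch: emit N-Triples for a toy article using BIBO, DC Terms, DoCO and Content.
# Assumption: the example.org URIs, the PMID and the text are invented for
# illustration; the URI layout is hypothetical, not the one used by RDF4PMC.

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"
DOCO = "http://purl.org/spar/doco/"
CNT = "http://www.w3.org/2011/content#"


def literal(text):
    """Quote a string as an N-Triples literal (escapes backslash, quote, newline)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def triple(s, p, o, o_is_iri=True):
    """Format one N-Triples line; the object is an IRI unless o_is_iri is False."""
    obj = f"<{o}>" if o_is_iri else literal(o)
    return f"<{s}> <{p}> {obj} ."


def article_to_ntriples(base, pmid, title, sections):
    """sections is a list of (section_title, [paragraph_text, ...]) pairs."""
    article = f"{base}/article/{pmid}"
    lines = [
        triple(article, RDF_TYPE, BIBO + "AcademicArticle"),
        triple(article, BIBO + "pmid", pmid, o_is_iri=False),
        triple(article, DCTERMS + "title", title, o_is_iri=False),
    ]
    for s_idx, (sec_title, paragraphs) in enumerate(sections, start=1):
        section = f"{article}/section/{s_idx}"
        lines += [
            triple(section, RDF_TYPE, DOCO + "Section"),
            triple(section, DCTERMS + "isPartOf", article),
            triple(section, DCTERMS + "title", sec_title, o_is_iri=False),
        ]
        for p_idx, text in enumerate(paragraphs, start=1):
            para = f"{section}/paragraph/{p_idx}"
            lines += [
                triple(para, RDF_TYPE, DOCO + "Paragraph"),
                triple(para, DCTERMS + "isPartOf", section),
                triple(para, CNT + "chars", text, o_is_iri=False),
            ]
    return lines


if __name__ == "__main__":
    for line in article_to_ntriples(
        "http://example.org", "12345", "A toy article",
        [("Introduction", ['Cancer is a "broad" term.'])],
    ):
        print(line)
```

Each vocabulary covers one concern here: BIBO for the bibliographic record, DoCO for document structure, DC Terms for titles and containment, and Content for the text itself.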
PMC RDFization
[Figure: pipeline diagram with the labels PMC NXML, RDF Generation (RDFReactor), Metadata + Content + References, and References Enrichment]

Annotations: Content Enrichment
[Figure: pipeline diagram with the labels RDF Generation, Metadata + Content + References, Automatic Annotation, Web service (twice), and Enriched RDF]

RDF4PMC, some numbers

RDF4PMC Server Architecture
[Figure: architecture diagram with the labels PMC RDFization, Import scripts + RDF files, Master Server, RDF DB Master, RDF DB Slave (twice), Web & SPARQL Server (development), and Web & SPARQL Server (production)]

Consuming the data: SPARQL

Query expressed in natural language:
Retrieving PubMed identifier, article title, section title, and paragraphs for those articles containing the term “cancer” in any section whose title includes “introduction”

SPARQL query:
# Prefix declarations added so the query is self-contained
# (standard namespace IRIs for these vocabularies)
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX doco: <http://purl.org/spar/doco/>
PREFIX cnt: <http://www.w3.org/2011/content#>

SELECT ?pmid ?title ?secTitle ?text
WHERE {
  ?article a bibo:Document ;
           bibo:pmid ?pmid ;
           dcterms:title ?title .
  ?section a doco:Section ;
           dcterms:isPartOf ?article ;
           dcterms:title ?secTitle .
  FILTER (regex(str(?secTitle), "introduction", "i")).
  ?para a doco:Paragraph ;
        dcterms:isPartOf ?section ;
        cnt:chars ?text .
  FILTER (regex(str(?text), "cancer", "i")).
} LIMIT 50

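A query like the one on this slide can be parameterised and sent over HTTP. The sketch below fills in the search term and section title and builds a GET URL in the style of the SPARQL 1.1 Protocol (the query goes in the query parameter). The endpoint URL is a placeholder, since the slides do not give one, and the function names are hypothetical.

```python
# Sketch: parameterise the slide's query and build a SPARQL Protocol GET URL.
# Assumption: the endpoint URL is a placeholder; the deck does not give one.
from urllib.parse import urlencode

QUERY_TEMPLATE = """\
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX doco: <http://purl.org/spar/doco/>
PREFIX cnt: <http://www.w3.org/2011/content#>
SELECT ?pmid ?title ?secTitle ?text
WHERE {{
  ?article a bibo:Document ;
           bibo:pmid ?pmid ;
           dcterms:title ?title .
  ?section a doco:Section ;
           dcterms:isPartOf ?article ;
           dcterms:title ?secTitle .
  FILTER (regex(str(?secTitle), "{section}", "i"))
  ?para a doco:Paragraph ;
        dcterms:isPartOf ?section ;
        cnt:chars ?text .
  FILTER (regex(str(?text), "{term}", "i"))
}} LIMIT {limit}
"""


def escape_literal(value):
    """Escape backslashes and double quotes so the value is safe inside "..."."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_query(term, section="introduction", limit=50):
    """Fill the template; the term is still interpreted as a regular expression."""
    return QUERY_TEMPLATE.format(
        term=escape_literal(term),
        section=escape_literal(section),
        limit=int(limit),
    )


def build_request_url(endpoint, query):
    """SPARQL 1.1 Protocol: a query may be sent via GET in the 'query' parameter."""
    return f"{endpoint}?{urlencode({'query': query})}"


if __name__ == "__main__":
    print(build_request_url("http://example.org/sparql", build_query("cancer")))
```

Keeping LIMIT as a parameter matters in practice, because an unanchored regex FILTER over paragraph text can match a very large number of rows.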
Consuming the data: SPARQL

Query expressed in natural language:
Retrieving PubMed identifier for those articles that have been semantically annotated with the biological entity CHEBI:60004. The semantic annotation comes from the occurrence of the term “mixture” in any paragraph of the retrieved articles.

CHEBI:60004: A mixture is a chemical substance composed of multiple molecules, at least two of which are of a different kind

SPARQL query:
# Prefix declarations added so the query is self-contained
# (standard namespace IRIs for these vocabularies)
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX ao: <http://purl.org/ao/core/>
PREFIX aot: <http://purl.org/ao/types/>

SELECT DISTINCT ?pmid
WHERE {
  ?article a bibo:AcademicArticle ;
           bibo:pmid ?pmid .
  ?annotation a aot:ExactQualifier ;
              ao:annotatesResource ?article ;
              ao:hasTopic <http://purl.obolibrary.org/obo/CHEBI_60004> .
}

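The slide refers to the entity both as the compact identifier CHEBI:60004 and as a full IRI. A minimal sketch of that expansion follows, assuming the usual OBO PURL pattern (prefix and local ID joined by an underscore); the helper names are hypothetical and the query simply mirrors the shape of the one on the slide.

```python
# Sketch: expand an OBO-style compact identifier and plug it into the slide's query.
# Assumption: helper names are hypothetical; the PURL pattern is the usual OBO one
# (http://purl.obolibrary.org/obo/PREFIX_LOCALID).

OBO_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_purl(curie):
    """Turn 'CHEBI:60004' into 'http://purl.obolibrary.org/obo/CHEBI_60004'."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a compact identifier: {curie!r}")
    return f"{OBO_BASE}{prefix}_{local_id}"


def annotated_articles_query(curie):
    """Same shape as the slide's query, with the topic IRI filled in."""
    return (
        "PREFIX bibo: <http://purl.org/ontology/bibo/>\n"
        "PREFIX ao: <http://purl.org/ao/core/>\n"
        "PREFIX aot: <http://purl.org/ao/types/>\n"
        "SELECT DISTINCT ?pmid\n"
        "WHERE {\n"
        "  ?article a bibo:AcademicArticle ;\n"
        "           bibo:pmid ?pmid .\n"
        "  ?annotation a aot:ExactQualifier ;\n"
        "              ao:annotatesResource ?article ;\n"
        f"              ao:hasTopic <{curie_to_purl(curie)}> .\n"
        "}"
    )


if __name__ == "__main__":
    print(annotated_articles_query("CHEBI:60004"))
```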
Bio2RDF Integration
[Figure: diagram with the labels Annotations, Content, and Metadata & References]

Consuming the data: Web services

Retrieval | Service
A list of topics and their related vocabularies | http://biotea.idiginfo.org/api/topics
A list of terms and their related topics | http://biotea.idiginfo.org/api/terms
A list of vocabularies and their prefixes | http://biotea.idiginfo.org/vocabularies
All topics related to a term | e.g., http://biotea.idiginfo.org/api/topics?term=cancer
All vocabularies related to a term | e.g., http://biotea.idiginfo.org/api/vocabularies?term=cancer
All terms that start with a specific string (for autocompletion) | e.g., http://biotea.idiginfo.org/api/terms?prefix=canc
All topics related to a vocabulary | e.g., http://biotea.idiginfo.org/api/topics?vocabulary=po
RDF of articles that include a term | e.g., http://biotea.idiginfo.org/api/articles?term=cancer
Count of RDF of articles that include a term | e.g., http://biotea.idiginfo.org/api/articles?term=cancer&count=true
RDF of articles that include a vocabulary | e.g., http://biotea.idiginfo.org/api/articles?vocabulary=po

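The service URLs on this slide follow a regular pattern (a resource name plus optional query parameters under http://biotea.idiginfo.org/api), so a client could assemble them with a small helper. The sketch below only builds URLs and fetches nothing; the function name is hypothetical, and whether the service is still reachable or what it returns is not something the slide tells us. The slide lists the plain vocabularies list at /vocabularies without /api, which this sketch does not try to resolve.

```python
# Sketch: build request URLs for the web services listed on the slide.
# Assumption: the function name is hypothetical; paths and parameters are
# copied from the slide's examples, and nothing is fetched here.
from urllib.parse import urlencode

BASE = "http://biotea.idiginfo.org/api"
RESOURCES = {"topics", "terms", "vocabularies", "articles"}


def service_url(resource, **params):
    """Return the URL for one of the listed resources with optional parameters."""
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource!r}")
    # Booleans are written as 'true'/'false', matching '&count=true' on the slide.
    cleaned = {
        key: (str(value).lower() if isinstance(value, bool) else value)
        for key, value in params.items()
    }
    query = urlencode(cleaned)
    return f"{BASE}/{resource}?{query}" if query else f"{BASE}/{resource}"


if __name__ == "__main__":
    print(service_url("topics"))
    print(service_url("terms", prefix="canc"))
    print(service_url("articles", term="cancer", count=True))
```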
Consuming the data: a dashboard for semantic bio-publications
[Figure: diagram with the labels Semantically enriched publication, Metadata + Content + References, Automatically Annotated RDF, SPARQL, and Catalase]

Consuming the data: first prototype
[Figure: prototype screenshot with these labelled areas: Title & authors; Links; Abstract; Cloud of Bio-annotations (term + # of bio-entities); Paragraphs containing the annotation selected by the user; Graphical tools]

Consuming the data: A first prototype
[Figure]

Challenges and Lessons
• Content
  • Tables and images → Links
  • Inline tables → Format is lost
  • Supplementary material
  • Most of them follow one DTD but …
• References
  • At least 4 different styles
  • Sometimes they are just plain text
• Annotators
  • Not always available
  • Stop words are tricky

Challenges and Lessons
• Annotation is context dependent
• Where are the facts? How to validate the facts?
• Delivering the expressivity of the data set to the end user is a complex issue
• Maintaining the triple store has a learning curve of its own
• Building SW infrastructure is HARD

Currently working on: Literature Discovery Process
• Search
  • Usually string-based search mechanisms
  • Little cognitive support
• Retrieval
  • Simple list of DB entries
  • Little cognitive support
  • How, why and where are a set of documents similar?
• Interacting with the document
  • Straight into the PDF
  • Zero cognitive support
• Data availability

Future Work
• RDF
  • URI standardization following similar patterns to identifiers.org and Bio2RDF
  • Integration into Bio2RDF
  • Dataset identification and summary (VoID)
  • Improve data for references
• User Experience
  • Web services for data analysis
  • RDF browser
  • More visualization tools
  • Supporting and taking advantage of the structure of the document
• Collaborative element

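As a rough illustration of the URI standardization item, the sketch below mints URIs in the two styles the slide names. It is an assumption-laden sketch: the slide only says that similar patterns will be followed, so the pubmed namespace, both URI templates and the function names are illustrative rather than the project's actual scheme.

```python
# Sketch: mint URIs following identifiers.org- and Bio2RDF-style patterns.
# Assumption: the 'pubmed' namespace and both templates are illustrative;
# the slide only says that similar patterns will be followed.


def identifiers_org_uri(namespace, local_id):
    """identifiers.org style: namespace and identifier as path segments."""
    return f"http://identifiers.org/{namespace}/{local_id}"


def bio2rdf_uri(namespace, local_id):
    """Bio2RDF style: namespace and identifier joined by a colon."""
    return f"http://bio2rdf.org/{namespace}:{local_id}"


if __name__ == "__main__":
    # The identifier below is a made-up example value.
    print(identifiers_org_uri("pubmed", "12345"))
    print(bio2rdf_uri("pubmed", "12345"))
```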
Future Work
• Application in Clinical Psychology, the MSRC case
  • From PDF to XML to RDF to Enriched Metadata for the PDF
  • The PDF is gently introduced in the WoD
  • Once the metadata has been enriched then
    • Rich interaction supporting: SEARCH-RETRIEVAL-INTERACTION WITH THE DOCUMENT (PDF)

Conclusions
• We provide
  • the transformation into RDF from the original PMC files
  • the annotation of the RDF
  • an API which makes that data available.
• New vocabularies as well as annotators can easily be plugged in
• Our approach is useful for both open and non-open access datasets
  • Publishers may decide what to expose via RDF and what content to make available
• Our approach is also applicable for PDF-only environments

Acknowledgments
• The MSRC consortium
• Greg Riccardi, FSU
• Oscar Corcho, UPM
• Olga Giraldo, UPM
• Bob Morris, Harvard University
• Michel Dumontier, Carleton University
• Dietrich Rebholz-Schuhmann, University of Zurich
• Diane Leiva, FSU
• US DoD Grant MOMRP Grant w81xwh-10-2-0181
• All of those who gave us feedback about the RDFization and the quality of our RDF datasets

Contacts
• Alexander García: agarciac@gmail.com
• L. Jael García Castro: leylajael@gmail.com

Thanks for your attention


RDF for PubMedCentral


Editor's Notes

  • #4: Not limited to open access models. Not limited to closed business models.
  • #5: In spite of the advances, scientific publications remain poorly connected to each other as well as to external resources. Furthermore, most of the information remains locked up in discrete documents without machine-processable content.
  • #6: 3rd point: As easy as building mash-ups
  • #7: 4th point then The paper becomes an
  • #13: Distribution of the first 20 journals, corresponding to about 40% of the total of 270,834. Number of biological entities identified across the papers.
  • #15: SPARQL queries can be used to retrieve metadata and content. It is possible to specify words and sentences that should appear in the text or in the section title.
  • #16: As content has been semantically enriched, it is possible to retrieve articles based on either the annotated terms, e.g., “mixture,” or their corresponding biological entities, e.g., CHEBI:60004.
  • #20: a) Users search for a human gene name; the term is initially resolved against GeneWiki, and the associated UniProt accession is then used in the SPARQL query. The resulting set includes publication metadata, abstract, and a cloud of annotations. b) Enriched content based on annotations is displayed in the interactive zone; this may be the annotated paragraph, a chemical entity, protein-related information and so on.
  • #21: Graph-based retrieval for the term “catalase”; only shared terms with more than 30 associated biological terms are included in the results. We want to visualize the network of interconnected documents: how is a document related to another document, and what do they share?
  • #22: What are the challenges we have faced so far in the PMC RDFization case? A first challenge is related to content: tables and images are not part of the core RDF but are referenced as links, so we still have access to them. However, there are some inline tables whose rows and columns are described in XML; so far we recover the content but not the format, so we are missing information there. The supplementary material is still difficult to integrate: it could be a section, another paper, or a technical report following a totally different schema; it can be outside PMC; it can be a footnote... Also, most of the documents follow the same schema, but we have found a few with a different one, and those cannot currently be processed. Those cases are less than 5%. As for the references, we have at least 4 different styles, so the process is different in each case, and sometimes we get references that are just plain text, so we have to do some extra processing in order to get the associated metadata. And the annotators are web services, so they can be down or too busy. Also, the stop words are tricky; we have tried to avoid them as much as possible, but it is always possible to get some noisy terms in the annotations. Note: stop words are words such as “they” and “can” that the annotators skip because they are very common or very short. It is hard to cover them all; we have the most common ones. Another issue is that we may lose information, because some acronyms can be identical to a stop word. For example, CAN is an acronym used in CHEBI, but since it is a common word we skip it; if a paper uses CAN in the CHEBI sense and not as the auxiliary verb, we would be losing that information.
  • #30: Methods & materials, what Olga is working on. Collaborative element… still do not know but something should be done there.
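
The stop-word problem described in note #22 can be shown with a toy filter. In the sketch below, a naive case-insensitive filter drops the acronym CAN together with the auxiliary verb “can”, while a second variant keeps all-uppercase tokens. The stop-word list, the tokens and the uppercase heuristic are assumptions for illustration; the notes do not say how the project's annotators handle this case.

```python
# Sketch: a naive stop-word filter and the information loss described in note #22.
# Assumption: the stop-word list and the tokens are toy examples, and the
# uppercase heuristic is one possible mitigation, not the project's method.

STOP_WORDS = {"they", "can", "the", "a", "of"}


def filter_naive(tokens):
    """Drop every token whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def filter_keep_uppercase(tokens):
    """Drop stop words, but keep all-uppercase tokens as possible acronyms."""
    return [
        t for t in tokens
        if t.lower() not in STOP_WORDS or (t.isupper() and len(t) > 1)
    ]


if __name__ == "__main__":
    tokens = ["They", "can", "oxidise", "CAN"]
    print(filter_naive(tokens))
    print(filter_keep_uppercase(tokens))
```

The heuristic trades one kind of noise for another: it keeps CAN as a candidate acronym, but it would also keep ordinary words written in capitals, so context would still be needed to decide.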