RDF for PubMedCentral

2/14/2014
Biotea, RDF4PMC

RDF4PMC, RDFizing PubMed Central
Alexander Garcia (1), Leyla Jael García Castro (2), Casey McLaughlin (1)
(1) Florida State University
(2) Universitat Jaume I

Outline

• The Biotea project
• Why Semantic Web Technologies?
• RDF4PMC in a nutshell
• Architecture
• RDFization process
  • PMC RDFization
  • Content enrichment
  • Some numbers for RDF4PMC
  • Architecture
• Using the data
  • SPARQL
  • Bio2RDF integration
  • Web services
  • A first prototype
• Challenges and Lessons
• Currently working on…
• Future Work
• Conclusions
• Acknowledgments

Biotea

"Scholarly data and documents are of most value when they are interconnected rather than independent" (Christine L. Borgman)

• Methodologies, methods and techniques supporting semantic enrichment of scholarly communication
• Once enriched, then how is this changing our user experience?

Biotea

• How are publications connected to each other?
• Putting together explicit assertions from different papers to form new implicit assertions
• Semantic Web Technology supporting scholarly communication, Literature Based Discovery and the Search-Retrieval-and-Interacting-with-the-Document (SRID) processes

Why Semantic Web Technologies (SWT) for research documents

• Generates an adaptable open approach: the data becomes the platform
• The Semantic Web (SW) delivers an integrative platform
• Makes it easier for the community to build over the platform
• Simplifies programmatic access to information
• As simple as relating terminologies
• Delivers Social Network ready content
• Retrieve all papers that have a component X (CHEBI) and the cellular location in GO terms

• Delivers an interoperable, interlinked, and self-describing document model in the biomedical domain
• A network of interconnected documents
• Semantic infrastructure for PMC
• An interface to the Web of Data
• A knowledge model for biomedical literature, easily extensible

• RDFizing biomedical literature by orchestrating ontologies such as
  • DoCO, BIBO, DC, FOAF, W3C PROV, and others
• Datasets are available
  • RDF for metadata and content
  • RDF for annotations from text-mining
• RDFizator will be available
  • Adding other ontologies and annotators is possible
  • Working with XML from other sources is possible
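To make "RDF for metadata and content" concrete, here is a minimal sketch of how one article could be written out as N-Triples with the vocabularies named above. It is not the RDFizator itself: the property and class names are the ones that appear in the SPARQL slides of this deck, while the base URI, the function names and the input structure are hypothetical and chosen only for illustration.

```python
# Standard namespace URIs for the vocabularies named on the slides.
BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"
DOCO = "http://purl.org/spar/doco/"
CNT = "http://www.w3.org/2011/content#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical base URI: the real Biotea URI scheme is not given in the deck.
BASE = "http://example.org/pmc/"


def literal(text):
    """Quote a plain string as an N-Triples literal."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'


def triple(subject, predicate, obj):
    """Format one triple; obj is an already formatted literal or <uri>."""
    return f"<{subject}> <{predicate}> {obj} ."


def article_to_ntriples(pmid, title, sections):
    """sections is a list of (section_title, [paragraph_text, ...])."""
    article = f"{BASE}article/{pmid}"
    lines = [
        triple(article, RDF_TYPE, f"<{BIBO}AcademicArticle>"),
        triple(article, BIBO + "pmid", literal(pmid)),
        triple(article, DCTERMS + "title", literal(title)),
    ]
    for s, (sec_title, paragraphs) in enumerate(sections, start=1):
        section = f"{article}/section/{s}"
        lines += [
            triple(section, RDF_TYPE, f"<{DOCO}Section>"),
            triple(section, DCTERMS + "isPartOf", f"<{article}>"),
            triple(section, DCTERMS + "title", literal(sec_title)),
        ]
        for p, text in enumerate(paragraphs, start=1):
            para = f"{section}/paragraph/{p}"
            lines += [
                triple(para, RDF_TYPE, f"<{DOCO}Paragraph>"),
                triple(para, DCTERMS + "isPartOf", f"<{section}>"),
                triple(para, CNT + "chars", literal(text)),
            ]
    return lines
```

Calling `article_to_ntriples("123", "A title", [("Introduction", ["Some text."])])` yields nine triples: three for the article, three for the section and three for the paragraph.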

PMC RDFization

[Figure: pipeline diagram with the labels PMC NXML, RDFReactor, RDF Generation, References Enrichment, and Metadata + Content + References]

Annotations: Content Enrichment

[Figure: diagram with the labels RDF Generation, Metadata + Content + References, Automatic Annotation Web service, Web service, and Enriched RDF]
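As a rough illustration of what content enrichment produces, the sketch below swaps the automatic annotation web services for a toy dictionary matcher and emits one annotation per matched term, using the annotation pattern that appears in the SPARQL slides of this deck (aot:ExactQualifier, ao:annotatesResource, ao:hasTopic). The namespace URIs for ao: and aot:, the annotation URI scheme and all function names are assumptions, not taken from the project.

```python
import re

# Assumed namespace URIs for the ao: and aot: prefixes; verify them
# against the actual dataset before relying on them.
AO = "http://purl.org/ao/"
AOT = "http://purl.org/ao/types/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Toy dictionary standing in for a real annotation service.
TOY_DICTIONARY = {"mixture": "http://purl.obolibrary.org/obo/CHEBI_60004"}


def annotate(article_uri, paragraphs, dictionary=TOY_DICTIONARY):
    """Return N-Triples lines, three per dictionary term found in the text."""
    lines = []
    for n, (term, topic) in enumerate(sorted(dictionary.items()), start=1):
        # Whole-word, case-insensitive match of the dictionary term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        if any(pattern.search(text) for text in paragraphs):
            # Hypothetical URI scheme for annotation resources.
            annotation = f"{article_uri}/annotation/{n}"
            lines.append(f"<{annotation}> <{RDF_TYPE}> <{AOT}ExactQualifier> .")
            lines.append(f"<{annotation}> <{AO}annotatesResource> <{article_uri}> .")
            lines.append(f"<{annotation}> <{AO}hasTopic> <{topic}> .")
    return lines
```

A plain string matcher like this one ignores context entirely, which is the kind of limitation a real annotator has to deal with.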


RDF4PMC, some numbers


RDF4PMC Server Architecture

[Figure: server architecture diagram with the labels PMC RDFization; Import scripts + RDF files; Master Server (RDF DB Master); Web & SPARQL Server, development (RDF DB Slave); Web & SPARQL Server, production (RDF DB Slave)]

Consuming the data: SPARQL

Query expressed in natural language:
Retrieving PubMed identifier, article title, section title, and paragraphs for those articles containing the term “cancer” in any section whose title includes “introduction”

SPARQL query:

    SELECT ?pmid ?title ?secTitle ?text
    WHERE {
      ?article a bibo:Document ;
        bibo:pmid ?pmid ;
        dcterms:title ?title .
      ?section a doco:Section ;
        dcterms:isPartOf ?article ;
        dcterms:title ?secTitle .
      FILTER (regex(str(?secTitle), "introduction", "i")).
      ?para a doco:Paragraph ;
        dcterms:isPartOf ?section ;
        cnt:chars ?text .
      FILTER (regex(str(?text), "cancer", "i")).
    } LIMIT 50
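The query on this slide can be parameterised and sent to an endpoint over the SPARQL protocol. The sketch below only builds the query text and the GET URL and does not contact any server. The PREFIX declarations use the standard namespace URIs for these vocabularies, which the slide itself does not show, and the endpoint address in the usage note is a placeholder because the deck does not give one; the function names are hypothetical.

```python
import re
from string import Template
from urllib.parse import urlencode

# Standard namespace URIs; the slide omits the PREFIX declarations.
PREFIXES = (
    "PREFIX bibo: <http://purl.org/ontology/bibo/>\n"
    "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
    "PREFIX doco: <http://purl.org/spar/doco/>\n"
    "PREFIX cnt: <http://www.w3.org/2011/content#>\n"
)

QUERY = Template("""SELECT ?pmid ?title ?secTitle ?text
WHERE {
  ?article a bibo:Document ;
    bibo:pmid ?pmid ;
    dcterms:title ?title .
  ?section a doco:Section ;
    dcterms:isPartOf ?article ;
    dcterms:title ?secTitle .
  FILTER (regex(str(?secTitle), "$section_word", "i"))
  ?para a doco:Paragraph ;
    dcterms:isPartOf ?section ;
    cnt:chars ?text .
  FILTER (regex(str(?text), "$term", "i"))
} LIMIT $limit
""")


def _plain_word(value):
    """Accept only letters, digits and spaces so that the value cannot
    break out of the quoted regex pattern."""
    if not re.fullmatch(r"[A-Za-z0-9 ]+", value):
        raise ValueError(f"unsupported search word: {value!r}")
    return value


def build_query(term, section_word, limit=50):
    """Return the full query text for one term and one section-title word."""
    return PREFIXES + QUERY.substitute(
        term=_plain_word(term),
        section_word=_plain_word(section_word),
        limit=int(limit),
    )


def request_url(endpoint, query):
    """SPARQL protocol: a query sent via GET goes in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})
```

For example, `request_url("http://example.org/sparql", build_query("cancer", "introduction"))` gives a URL that a SPARQL 1.1 endpoint at that (placeholder) address would accept.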

Consuming the data: SPARQL

Query expressed in natural language:
Retrieving PubMed identifier for those articles that have been semantically annotated with the biological entity CHEBI:60004. The semantic annotation comes from the occurrence of the term “mixture” in any paragraph of the retrieved articles.

SPARQL query:

    SELECT DISTINCT ?pmid
    WHERE {
      ?article a bibo:AcademicArticle ;
        bibo:pmid ?pmid .
      ?annotation a aot:ExactQualifier ;
        ao:annotatesResource ?article ;
        ao:hasTopic <http://purl.obolibrary.org/obo/CHEBI_60004> .
    }

CHEBI:60004: A mixture is a chemical substance composed of multiple molecules, at least two of which are of a different kind
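The slide pairs the compact identifier CHEBI:60004 with its OBO PURL. A small helper can do that conversion and fill it into the same query pattern; this is a sketch with hypothetical function names, and it leaves the bibo:, ao: and aot: prefixes undeclared, so they must either be predefined by the endpoint or added by the caller.

```python
import re

OBO_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_obo_uri(curie):
    """Turn an OBO-style identifier such as 'CHEBI:60004' into its PURL."""
    match = re.fullmatch(r"([A-Za-z][A-Za-z0-9]*):(\d+)", curie)
    if match is None:
        raise ValueError(f"not an OBO-style identifier: {curie!r}")
    prefix, local_id = match.groups()
    return f"{OBO_BASE}{prefix}_{local_id}"


def articles_annotated_with(curie):
    """Build the query from the slide for an arbitrary OBO term."""
    topic = curie_to_obo_uri(curie)
    return (
        "SELECT DISTINCT ?pmid\n"
        "WHERE {\n"
        "  ?article a bibo:AcademicArticle ;\n"
        "    bibo:pmid ?pmid .\n"
        "  ?annotation a aot:ExactQualifier ;\n"
        "    ao:annotatesResource ?article ;\n"
        f"    ao:hasTopic <{topic}> .\n"
        "}\n"
    )
```

The same helper should work for other OBO-style identifiers that follow the PREFIX:digits pattern, for instance Gene Ontology terms, assuming the dataset annotates them with their OBO PURLs.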

Bio2RDF Integration

[Figure: diagram with the labels Annotations, Content, and Metadata & References]

Consuming the data: Web services

Retrieval, followed by the service that provides it:

• A list of topics and their related vocabularies
  http://biotea.idiginfo.org/api/topics
• A list of terms and their related topics
  http://biotea.idiginfo.org/api/terms
• A list of vocabularies and their prefixes
  http://biotea.idiginfo.org/vocabularies
• All topics related to a term
  e.g., http://biotea.idiginfo.org/api/topics?term=cancer
• All vocabularies related to a term
  e.g., http://biotea.idiginfo.org/api/vocabularies?term=cancer
• All terms that start with a specific string (for autocompletion)
  e.g., http://biotea.idiginfo.org/api/terms?prefix=canc
• All topics related to a vocabulary
  e.g., http://biotea.idiginfo.org/api/topics?vocabulary=po
• RDF of articles that include a term
  e.g., http://biotea.idiginfo.org/api/articles?term=cancer
• Count of RDF of articles that include a term
  e.g., http://biotea.idiginfo.org/api/articles?term=cancer&count=true
• RDF of articles that include a vocabulary
  e.g., http://biotea.idiginfo.org/api/articles?vocabulary=po
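The service URLs above follow a regular pattern: a resource name plus optional query parameters. A minimal sketch of a URL builder for the biotea.idiginfo.org services listed on this slide could look as follows. The helper name is hypothetical, it assumes every resource lives under /api (the vocabularies list is shown without that segment on the slide), and it only builds the URL without checking whether the service is reachable.

```python
from urllib.parse import urlencode

API_BASE = "http://biotea.idiginfo.org/api"

# Resource names that appear in the service list on the slide.
RESOURCES = {"topics", "terms", "vocabularies", "articles"}


def service_url(resource, **params):
    """Build a service URL, e.g. service_url("topics", term="cancer")."""
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource!r}")
    url = f"{API_BASE}/{resource}"
    # Parameters seen on the slide: term, prefix, vocabulary, count.
    return f"{url}?{urlencode(params)}" if params else url
```

For instance, `service_url("articles", term="cancer", count="true")` reproduces the counting example from the list, and `service_url("terms", prefix="canc")` the autocompletion one.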

Consuming the data: a dashboard for semantic bio-publications

[Figure: diagram with the labels Metadata + Content + References, Automatically Annotated RDF, SPARQL, and Semantically enriched publication]
Consuming the data: first prototype

[Figure: annotated screenshot of the prototype. Callouts: Cloud of Bio-annotations (term + # of bio-entities); Title & authors; Abstract; Links; Paragraphs containing the annotation selected by the user; Graphical tools. Visible label: Catalase]

Consuming the data: A first prototype

• Content
  • Tables and images → Links
  • Inline tables → Format is lost
  • Supplementary material
  • Most of them follow one DTD but …
• References
  • At least 4 different styles
  • Sometimes they are just plain text
• Annotators
  • Not always available
  • Stop words are tricky


• Annotation is context dependent
• Delivering the expressivity of the data set to the end user is a complex issue
• Where are the facts? How to validate the facts?
• Maintaining the triple store has a learning curve of its own
• Building SW infrastructure is HARD

Currently working on: Literature Discovery Process

• Search
  • Usually string-based search mechanisms
  • Little cognitive support
• Retrieval
  • Simple list of DB entries
• Interacting with the document
  • Straight into the PDF
  • Zero cognitive support
• Data availability

• Search
• Retrieval
  • How, why and where are a set of documents similar?


Future Work

• RDF
  • URI standardization following similar patterns to identifiers.org and Bio2RDF
  • Integration into Bio2RDF
  • Dataset identification and summary (VoID)
  • Improve data for references
• User Experience
  • Web services for data analysis
  • RDF browser
  • More visualization tools
  • Supporting and taking advantage of the structure of the document
• Collaborative element

Future Work

• Application in Clinical Psychology, the MSRC case
  • From PDF to XML to RDF to Enriched Metadata for the PDF
  • The PDF is gently introduced in the Web of Data (WoD)
  • Once the metadata has been enriched then
    • Rich interaction supporting: SEARCH-RETRIEVAL-INTERACTION WITH THE DOCUMENT (PDF)

Conclusions

• We provide
  • the transformation into RDF from the original PMC files
  • the annotation of the RDF
  • an API which makes that data available.
• New vocabularies as well as annotators can easily be plugged in
• Our approach is useful for both open and non-open access datasets
  • Publishers may decide what to expose via RDF and what content to make available
• Our approach is also applicable for PDF-only environments

Acknowledgments

• The MSRC consortium
• Greg Riccardi, FSU
• Oscar Corcho, UPM
• Olga Giraldo, UPM
• Bob Morris, Harvard University
• Michel Dumontier, Carleton University
• Dietrich Rebholz-Schuhmann, University of Zurich
• Diane Leiva, FSU
• US DoD Grant MOMRP Grant w81xwh-10-2-0181
• All of those who gave us feedback about the RDFization and the quality of our RDF datasets

Contacts
• Alexander García: agarciac@gmail.com
• L. Jael García Castro: leylajael@gmail.com

Thanks for your attention
