Semantic Data Normalization For Efficient Clinical Trial Research

September 8th, 2016

Outline
• The specifics of clinical data
• What is RDF and how can we use it together with text analytics (TA)?
• Semantic annotations and their limitations
• What is semantic data normalization?
• Current state and next steps

Clinical Data
• Unstructured (or semi-structured)
• Abundant
• Redundant
• Ambiguous
• Aggregated
To transform your clinical data into information, and even knowledge, you have to analyze it!
… but before that, you have to make it ready for analysis!

What is RDF
The RDF data model resolves all syntax-level ambiguities
It lets you express all data in one common data model
ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.
AC   P12544;
DT   01-OCT-1989 (Rel. 12, Created)
DT   01-OCT-1989 (Rel. 12, Last sequence update)
DT   15-JUN-2002 (Rel. 41, Last annotation update)
DE   Granzyme A precursor (EC 3.4.21.78) (Cytotoxic T-lymphocyte proteinase
DE   1) (Hanukkah factor) (H factor) (HF) (Granzyme 1) (CTL tryptase)
DE   (Fragmentin 1).
GN   GZMA OR CTLA3 OR HFSP.
OS   Homo sapiens (Human).
<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="In-Process">
    <PMID Version="1">21500419</PMID>
    <DateCreated>
      <Year>2011</Year>
      <Month>04</Month>
      <Day>15</Day>
    </DateCreated>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Electronic">1520-6882</ISSN>
        <JournalIssue CitedMedium="Internet">
          <Volume>82</Volume>
          <Issue>20</Issue>
          <PubDate>
            <Year>2010</Year>
            <Month>Oct</Month>
            <Day>15</Day>
          </PubDate>
        </JournalIssue>
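The point of the two records above (a flat-file protein entry and an XML citation) can be illustrated with a minimal Python sketch that reduces both syntaxes to the same subject/predicate/object shape. The `http://example.org/` namespace, the predicate names, and both function names are assumptions made for illustration; the slides do not name a vocabulary. The XML sample is trimmed and has closing tags added so that it parses.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace; the slides do not specify one.
EX = "http://example.org/"

FLAT = """ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.
AC   P12544;
OS   Homo sapiens (Human)."""

# Trimmed from the slide; closing tags added so the fragment is well-formed.
XML = """<PubmedArticle><MedlineCitation Owner="NLM" Status="In-Process">
<PMID Version="1">21500419</PMID>
<Article PubModel="Print"><Journal>
<ISSN IssnType="Electronic">1520-6882</ISSN>
<JournalIssue CitedMedium="Internet"><Volume>82</Volume></JournalIssue>
</Journal></Article></MedlineCitation></PubmedArticle>"""


def flatfile_to_triples(text):
    """Turn the ID, AC and OS lines of a flat-file record into triples."""
    fields = {}
    for line in text.splitlines():
        code, _, value = line.partition(" ")  # two-letter line code, then data
        fields.setdefault(code, []).append(value.strip())
    subject = EX + "protein/" + fields["AC"][0].rstrip(";")
    triples = [(subject, EX + "entryName", fields["ID"][0].split()[0])]
    for organism in fields.get("OS", []):
        triples.append((subject, EX + "organism", organism.rstrip(".")))
    return triples


def pubmed_xml_to_triples(xml_text):
    """Turn a few elements of a PubMed-style XML record into triples."""
    root = ET.fromstring(xml_text)
    subject = EX + "pubmed/" + root.findtext(".//PMID")
    return [
        (subject, EX + "issn", root.findtext(".//ISSN")),
        (subject, EX + "volume", root.findtext(".//Volume")),
    ]


# Both sources now share one shape and can live in one collection.
triples = flatfile_to_triples(FLAT) + pubmed_xml_to_triples(XML)
for triple in triples:
    print(triple)
```

Once both sources are triples, the original syntax no longer matters to whatever consumes them; what remains to be resolved is the meaning of the identifiers and values.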

Linked Data
How well interlinked is the linked data cloud?
• Many interesting queries are difficult to express in SPARQL
• String functions cannot be indexed
• Identifiers are often misplaced
[Figure: identifiers and labels found for one UniProt protein across linked datasets: P29965, uniprot:P29965, cpath:CPATH-94138, cpath:CPATH-LOCAL-8467065, cpath:CPATH-LOCAL-8749236; entry names CD40L_HUMAN, TNF5_HUMAN, CD4L_HUMAN]

Semantic Annotations
[Figure: document pmid:17714090 (Ian A Yang; Clinical and experimental pharmacology …) linked to annotated concepts. Identifiers shown: umls:C0035204, umls:C0006261, umls:C000496 (Asthma). Labels shown: COPD, Chronic Obstructive Airway Diseases, Bronchial Diseases, Respiration Disorders]

Semantic Annotations
• Good for:
– Generating machine-readable metadata
– Semantic indexing of large sets of documents
– Providing additional background knowledge
• Limitations:
– Incomplete knowledge extraction
– Does not fully capture the context

Semantic Data Normalization
• What is it?
– A text analytics approach that aims to capture the full context of the information and to provide clear references to concepts/objects, so that machines can interpret it easily.
• How do we do it?
– Work at the sentence level
– Extract the key phrases from the sentence
– Identify the main concept
– Identify all the qualifiers and negations
– Model the extracted data as RDF

• Condition text:
– “Advanced Biliary Tract Adenocarcinoma” (Study ID = NCT01506973)
• Text Analysis
– One phrase is identified in the Condition text
– Advanced Biliary Tract Adenocarcinoma
• Data Schema
– One annotation object is created
– Main concept is “Adenocarcinoma”
– Qualifier concepts are “Advanced” and “Biliary tract”
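The example above can be sketched in Python as a toy, dictionary-based lookup. This is only an illustration of splitting a phrase into a main concept and its qualifiers, not the text analytics pipeline behind the slides. `LEXICON` and `annotate_phrase` are hypothetical names, and the assignment of the two qualifier identifiers to "Advanced" and "Biliary tract" is assumed from the order in which they appear in the deck.

```python
# Toy lexicon: lower-cased surface form -> (concept identifier, role).
# The qualifier-to-identifier pairing is an assumption based on slide order.
LEXICON = {
    "adenocarcinoma": ("C0001418", "disease"),
    "advanced": ("C0205179", "qualifier"),
    "biliary tract": ("C0005423", "qualifier"),
}


def annotate_phrase(phrase):
    """Split one condition phrase into a main concept and its qualifiers."""
    text = phrase.lower()
    annotation = {"phrase": phrase, "main_concept": None, "qualifiers": []}
    # Try longer surface forms first so multi-word terms are not
    # shadowed by shorter entries.
    for surface in sorted(LEXICON, key=len, reverse=True):
        if surface not in text:
            continue
        concept_id, role = LEXICON[surface]
        if role == "disease" and annotation["main_concept"] is None:
            annotation["main_concept"] = concept_id
        elif role == "qualifier":
            annotation["qualifiers"].append(concept_id)
        # Blank out the matched span so it cannot match again.
        text = text.replace(surface, " ")
    return annotation


annotation = annotate_phrase("Advanced Biliary Tract Adenocarcinoma")
print(annotation)
```

A real system would need tokenization, term variants, and disambiguation; plain substring matching is used here only to keep the sketch short.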

NCT01506973
    rdf:type ClinicalTrial
    ct:conditionText “Advanced Biliary Tract Adenocarcinoma”
    ct:conditionAnnotation ConditionAnnotationID
ConditionAnnotationID
    ca:hasDisease C0001418
    ca:hasPhrase “Advanced Biliary Tract Adenocarcinoma”
    ca:hasQualifiers QualifierGroupID
QualifierGroupID
    cg:hasQualifiers C0205179
    cg:hasQualifiers C0005423
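The annotation graph for this study can be written out as triples with a few lines of standard-library Python. This is a sketch under stated assumptions: every namespace URI below (`BASE`, `CT`, `CA`, `CG`, `UMLS`) is a placeholder, because the slides show only the prefixes, and the node names for the annotation and the qualifier group are invented for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Placeholder namespaces: the slides show prefixes only, not URIs.
BASE = "http://example.org/study/"
CT = "http://example.org/ct/"
CA = "http://example.org/ca/"
CG = "http://example.org/cg/"
UMLS = "http://example.org/umls/"


def iri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def condition_annotation_triples(study_id, phrase, disease, qualifiers):
    """Build the study -> annotation -> qualifier group graph."""
    study = iri(BASE + study_id)
    annotation = iri(BASE + study_id + "/conditionAnnotation/1")
    group = iri(BASE + study_id + "/qualifierGroup/1")
    triples = [
        (study, iri(RDF_TYPE), iri(CT + "ClinicalTrial")),
        (study, iri(CT + "conditionText"), literal(phrase)),
        (study, iri(CT + "conditionAnnotation"), annotation),
        (annotation, iri(CA + "hasDisease"), iri(UMLS + disease)),
        (annotation, iri(CA + "hasPhrase"), literal(phrase)),
        (annotation, iri(CA + "hasQualifiers"), group),
    ]
    # One triple per qualifier concept attached to the group node.
    triples += [(group, iri(CG + "hasQualifiers"), iri(UMLS + q)) for q in qualifiers]
    return triples


def to_ntriples(triples):
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = condition_annotation_triples(
    "NCT01506973",
    "Advanced Biliary Tract Adenocarcinoma",
    "C0001418",
    ["C0205179", "C0005423"],
)
print(to_ntriples(triples))
```

Keeping the qualifiers on a separate group node, as the slide does, lets one annotation carry any number of qualifiers without changing the shape of the annotation itself.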

Demo Example
• Study Conditions
– Multiple phrases in a text
– Pre-coordinated vs. post-coordinated concepts
– Scoring of matching concepts
• Study Interventions
– Drug, route, form
– Drug dosage
• Adverse Events
– Normalization of adverse events (AE)
– Post-coordinated concepts
• Eligibility Criteria
– Semantic sectioning and categorization
– Negations
– Diseases, findings, treatments, age and gender

Intervention Annotation Model - Drugs
[Diagram: annotation graph rooted at study NCT01506973.
Properties shown: ct:hasIntervention, in:drugAnnotation, da:hasDrug, da:hasAdministrationRoute, da:hasAdministrationForm, da:hasDosage, do:hasSingleDose, do:hasFrequency, do:hasPeriod.
Nodes shown: NCT01506973_1_2, DrugAnnotationID, DrugDosageID, SingleDoseID, FrequencyID, PeriodID, 111418, SCTID:111418, SCTID:121681.
Value fields shown on the dosage nodes: Value, Unit, Denominator.]
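One way to read the dosage part of this model is as a small set of typed records: a single dose, a frequency, and a period, each carrying a value and a unit. The Python sketch below is an assumption about how those pieces could fit together, not the schema used in the deck. All class and field names are hypothetical, and the numbers in the example are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str


@dataclass(frozen=True)
class Frequency:
    value: float              # number of administrations ...
    denominator_value: float  # ... per this many ...
    denominator_unit: str     # ... units of time


@dataclass(frozen=True)
class DrugDosage:
    single_dose: Quantity
    frequency: Optional[Frequency] = None
    period: Optional[Quantity] = None

    def total_dose(self) -> Optional[Quantity]:
        """Total amount over the period, when it can be computed.

        Returns None if frequency or period is missing, or if their
        time units differ (no unit conversion is attempted here).
        """
        if self.frequency is None or self.period is None:
            return None
        if self.frequency.denominator_unit != self.period.unit:
            return None
        per_unit = self.frequency.value / self.frequency.denominator_value
        total = self.single_dose.value * per_unit * self.period.value
        return Quantity(total, self.single_dose.unit)


# Invented example values: 100 mg, twice per 1 day, for 14 days.
dosage = DrugDosage(
    single_dose=Quantity(100.0, "mg"),
    frequency=Frequency(2.0, 1.0, "day"),
    period=Quantity(14.0, "day"),
)
total = dosage.total_dose()
print(total)
```

Storing value, unit, and denominator separately, rather than a free-text string, is what makes comparisons and aggregations across studies possible.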

Criteria Annotation Model
NCT01506973
    ct:hasCriteriaSection CriteriaSection
CriteriaSection
    rdf:type “Inclusion”/“Exclusion”/“Not defined”
    cs:hasText “… No extensive intraductal components on core biopsy, defined as intraductal carcinoma. Patients must not have recurrent invasive breast cancer. …”
    cs:hasCriterion Criterion
Criterion
    cr:hasText “Patients must not have recurrent invasive breast cancer.”
    cr:hasAnnotation AnnotationId
AnnotationId
    rdf:type “Disease”/“Drug”/…
    sa:Negation “True”/“False”/…
    Property 1, Property 2, … Property N
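To make the sectioning and negation idea concrete, here is a minimal Python sketch that splits a criteria section into individual criteria and flags negation with a fixed cue list. It is a deliberately naive stand-in: the cue list, the sentence splitting on periods, and the function names are all assumptions, and the third sentence in the example is invented to show a non-negated case.

```python
import re

# Small, hypothetical cue list; real negation detection also needs scope.
NEGATION_CUES = re.compile(r"\b(no|not|without|never)\b", re.IGNORECASE)


def split_criteria(section_text):
    """Split a section into one criterion per sentence (naive period split)."""
    return [part.strip() + "." for part in section_text.split(".") if part.strip()]


def annotate_criterion(text, section_type):
    """Record the criterion text, its section type and a negation flag."""
    return {
        "text": text,
        "section_type": section_type,
        "negation": bool(NEGATION_CUES.search(text)),
    }


SECTION = (
    "No extensive intraductal components on core biopsy, defined as "
    "intraductal carcinoma. Patients must not have recurrent invasive "
    "breast cancer. Patients must be at least 18 years old."  # last one invented
)

criteria = [annotate_criterion(c, "Inclusion") for c in split_criteria(SECTION)]
for criterion in criteria:
    print(criterion["negation"], criterion["text"])
```

The example shows why negation has to be modeled explicitly: the second sentence mentions a disease, yet a study listing it under inclusion criteria is selecting patients who do not have it.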

Current Status
• Work with ClinicalTrials.gov data as a public showcase
– > 215K clinical studies
– > 76 million RDF statements
• Coverage
– Conditions (197,154 objects)
    – Diseases, Findings, Body locations, Qualifiers
– Interventions (rdf:type = ‘Drug’ and rdf:type = ‘Biologics’) (381,590 objects)
    – Drugs, Dosages, Administration form, Administration route, Population group
– Adverse Events (1,226,754 objects)
    – Diseases, Findings, Body locations, Qualifiers
– Criteria (semantic sectioning and categorization, negations) (7,216,361 objects)
    – Diseases, Findings, Drugs, Population groups
• In total, more than 80 million RDF triples

How Can I Use This?
• Directly mine the public, enhanced version of ClinicalTrials.gov
• Apply the same approach to your internal clinical trials data
• Once the data is semantically normalized, you can “slice and dice” it as your use case requires
• Examples
– Top-down data exploration
– Linked data browsing
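As a rough illustration of "slice and dice" over normalized data, the Python sketch below narrows a set of studies step by step, first by main disease concept and then by qualifiers. The record layout and function names are assumptions, and the two `EXAMPLE-` study identifiers are invented; only NCT01506973 and the concept identifiers come from the deck.

```python
from collections import defaultdict

# Normalized condition annotations; the EXAMPLE-* records are invented.
ANNOTATIONS = [
    {"study": "NCT01506973", "disease": "C0001418",
     "qualifiers": {"C0205179", "C0005423"}},
    {"study": "EXAMPLE-0002", "disease": "C0001418",
     "qualifiers": {"C0005423"}},
    {"study": "EXAMPLE-0003", "disease": "C0001418",
     "qualifiers": set()},
]


def studies_by_disease(annotations):
    """Group study identifiers under their main disease concept."""
    index = defaultdict(set)
    for annotation in annotations:
        index[annotation["disease"]].add(annotation["study"])
    return index


def studies_with(annotations, disease, required_qualifiers=()):
    """Studies on a disease whose annotation has all required qualifiers."""
    required = set(required_qualifiers)
    return sorted(
        a["study"]
        for a in annotations
        if a["disease"] == disease and required <= a["qualifiers"]
    )


# Top-down exploration: start broad, then add qualifiers to narrow down.
broad = studies_with(ANNOTATIONS, "C0001418")
narrower = studies_with(ANNOTATIONS, "C0001418", ["C0005423"])
narrowest = studies_with(ANNOTATIONS, "C0001418", ["C0005423", "C0205179"])
print(len(broad), len(narrower), len(narrowest))
```

Because the main concept and its qualifiers are stored separately, the broad query still finds studies whose condition text was phrased with extra qualifiers, which a plain string match on the condition text would miss.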

Next Steps
• Release an RDFized version of ClinicalTrials.gov
– Pre-loaded in GraphDB Free
– Pre-loaded on Ontotext S4 Cloud
– As an RDF serialization distribution
• Release all semantically structured information under a license that is free for non-commercial use
• Extend the data schema to support not only concepts but also tokens that cannot be normalized to ontology instances

Thank You!
You can contact me by e-mail:
todor.primov@ontotext.com
