Semantic Data Normalization for Efficient Clinical Trial Research
September 8th, 2016
Outline
• The specifics of clinical data
• What is RDF, and how can we use it together with TA?
• Semantic annotations and their limitations
• What is semantic data normalization?
• Current state and next steps
Clinical Data
• Unstructured (Semi-Structured)
• Abundant
• Redundant
• Ambiguous
• Aggregated
To transform your clinical data into information, and even knowledge, you have to analyze it!
… but before that, you have to make it ready for analysis!
What is RDF?
The RDF data model resolves all syntax-level ambiguities
It helps you express all data in a common data model
ID   GRAA_HUMAN STANDARD; PRT; 262 AA.
AC   P12544;
DT   01-OCT-1989 (Rel. 12, Created)
DT   01-OCT-1989 (Rel. 12, Last sequence update)
DT   15-JUN-2002 (Rel. 41, Last annotation update)
DE   Granzyme A precursor (EC 3.4.21.78) (Cytotoxic T-lymphocyte proteinase
DE   1) (Hanukkah factor) (H factor) (HF) (Granzyme 1) (CTL tryptase)
DE   (Fragmentin 1).
GN   GZMA OR CTLA3 OR HFSP.
OS   Homo sapiens (Human).
<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="In-Process">
    <PMID Version="1">21500419</PMID>
    <DateCreated>
      <Year>2011</Year>
      <Month>04</Month>
      <Day>15</Day>
    </DateCreated>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Electronic">1520-6882</ISSN>
        <JournalIssue CitedMedium="Internet">
          <Volume>82</Volume>
          <Issue>20</Issue>
          <PubDate>
            <Year>2010</Year>
            <Month>Oct</Month>
            <Day>15</Day>
          </PubDate>
        </JournalIssue>
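A rough illustration of the "common data model" idea: the sketch below flattens a few fields from the two records above into subject-predicate-object triples, using plain Python tuples. The `ex:` predicate names and the `uniprot:`/`pmid:` subject prefixes are assumptions made for this example, not a published vocabulary; only the field values come from the records.

```python
# Field values are copied from the two records above.
# ASSUMPTION: the "ex:" predicates are hypothetical names for illustration.
uniprot_record = {
    "ID": "GRAA_HUMAN",
    "AC": "P12544",
    "GN": ["GZMA", "CTLA3", "HFSP"],
    "OS": "Homo sapiens",
}
pubmed_record = {
    "PMID": "21500419",
    "ISSN": "1520-6882",
    "Volume": "82",
    "Issue": "20",
}


def uniprot_to_triples(rec):
    """Turn a parsed flat-file record into (subject, predicate, object) tuples."""
    subject = f"uniprot:{rec['AC']}"
    triples = [
        (subject, "ex:mnemonic", rec["ID"]),
        (subject, "ex:organism", rec["OS"]),
    ]
    # One triple per gene name instead of one "A OR B OR C" string.
    triples += [(subject, "ex:geneName", gene) for gene in rec["GN"]]
    return triples


def pubmed_to_triples(rec):
    """Turn a parsed XML citation into the same triple shape."""
    subject = f"pmid:{rec['PMID']}"
    return [
        (subject, "ex:issn", rec["ISSN"]),
        (subject, "ex:volume", rec["Volume"]),
        (subject, "ex:issue", rec["Issue"]),
    ]


# Both sources now share one shape and can be stored and queried together.
graph = uniprot_to_triples(uniprot_record) + pubmed_to_triples(pubmed_record)
for s, p, o in graph:
    print(s, p, o)
```

Once both sources are in the same shape, the original syntax (flat file or XML) no longer matters to whoever queries the data.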
Linked Data
How well interlinked is the linked data cloud?
• Many interesting queries are difficult to express in SPARQL
• String functions cannot be indexed
• Identifiers are often misplaced
[Figure: example identifiers from the linked data cloud: P29965 (UNIPROT), CD40L_HUMAN, cpath:CPATH-94138, cpath:CPATH-LOCAL-8467065, cpath:CPATH-LOCAL-8749236, uniprot:P29965, CD40L_HUMAN, TNF5_HUMAN, CD4L_HUMAN]
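One possible way to cope with misplaced identifiers is to resolve them once, at load time, so that later queries compare exact IDs instead of relying on string functions. The sketch below is a minimal illustration. Treating the entry names as aliases of `uniprot:P29965` is an assumed reading of the figure, and the `ALIASES` table and both function names are hypothetical.

```python
# ASSUMPTION: this alias table is an illustrative reading of the slide,
# not a curated mapping.
ALIASES = {
    "P29965": "uniprot:P29965",
    "CD40L_HUMAN": "uniprot:P29965",
    "TNF5_HUMAN": "uniprot:P29965",
    "CD4L_HUMAN": "uniprot:P29965",
}


def canonical_id(raw):
    """Return the canonical identifier for a raw label, or None if unknown."""
    key = raw.strip()
    if key.lower().startswith("uniprot:"):
        key = key.split(":", 1)[1]  # drop the prefix, keep the accession
    return ALIASES.get(key.upper())


def group_by_canonical(raw_ids):
    """Group raw labels by canonical ID; unresolved labels land under None."""
    groups = {}
    for raw in raw_ids:
        groups.setdefault(canonical_id(raw), []).append(raw)
    return groups


groups = group_by_canonical(
    ["P29965", "uniprot:P29965", "TNF5_HUMAN", "cpath:CPATH-94138"]
)
print(groups)
```

Labels that cannot be resolved are kept under `None` rather than dropped, so they can be reviewed by hand.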
Semantic Annotations
[Figure: annotation graph for the article pmid:17714090 (Ian A Yang, Clinical and experimental pharmacology …). Concept labels: COPD, Bronchial Diseases, Respiration Disorders, Chronic Obstructive Airway Diseases, Asthma. Concept identifiers: umls:C0035204, umls:C0006261, umls:C000496]
Semantic Annotations
• Good for:
– Generation of machine-readable metadata
– Semantic indexing of large sets of documents
– Providing additional background knowledge
• Limitations:
– Incomplete knowledge extraction
– Does not fully capture the context
Semantic Data Normalization
• What is it?
– A text analytics approach that aims to capture the full context of the information and to provide clear references to concepts/objects, so that machines can interpret it easily.
• How do we do it?
– Work at the sentence level
– Extract the key phrases from the sentence
– Identify the main concept
– Identify all the qualifiers and negations
– Model the extracted data as RDF
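The steps above can be sketched with a toy, dictionary-based matcher. This is only an illustration of the control flow, not the actual text analytics pipeline: the three lexicons are tiny assumptions made for this example, and `find_terms` and `normalize_sentence` are hypothetical names. The sketch stops before the RDF modeling step.

```python
import re

# ASSUMPTION: toy lexicons for illustration; a real system would use
# ontologies and a full text analytics pipeline.
MAIN_CONCEPTS = {"adenocarcinoma", "breast cancer"}
QUALIFIERS = {"advanced", "biliary tract", "recurrent", "invasive"}
NEGATION_RE = re.compile(r"\b(no|not|without)\b")


def find_terms(text, lexicon):
    """Return the lexicon entries found in text, in order of appearance."""
    hits = []
    for term in lexicon:
        m = re.search(r"\b" + re.escape(term) + r"\b", text)
        if m:
            hits.append((m.start(), term))
    return [term for _, term in sorted(hits)]


def normalize_sentence(sentence):
    """Work on one sentence: main concept, qualifiers, negation flag."""
    text = sentence.lower()
    mains = find_terms(text, MAIN_CONCEPTS)
    return {
        "phrase": sentence.strip(),
        "main_concept": mains[0] if mains else None,
        "qualifiers": find_terms(text, QUALIFIERS),
        "negated": bool(NEGATION_RE.search(text)),
    }


condition = normalize_sentence("Advanced Biliary Tract Adenocarcinoma")
criterion = normalize_sentence(
    "Patients must not have recurrent invasive breast cancer."
)
print(condition)
print(criterion)
```

The negation flag is what separates "has recurrent invasive breast cancer" from "must not have" it, which a plain concept annotation would miss.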
Semantic Data Normalization
• Condition text:
– “Advanced Biliary Tract Adenocarcinoma” (Study ID = NCT01506973)
• Text Analysis
– One phrase is identified in the Condition text
– Advanced Biliary Tract Adenocarcinoma
• Data Schema
– One annotation object is created
– Main concept is “Adenocarcinoma”
– Qualifier concepts are “Advanced” and “Biliary tract”
Semantic Data Normalization
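NCT01506973 rdf:type ClinicalTrial
NCT01506973 ct:conditionText “Advanced Biliary Tract Adenocarcinoma”
NCT01506973 ct:conditionAnnotation ConditionAnnotationID
ConditionAnnotationID ca:hasDisease C0001418
ConditionAnnotationID ca:hasPhrase “Advanced Biliary Tract Adenocarcinoma”
ConditionAnnotationID ca:hasQualifiers QualifierGroupID
QualifierGroupID cg:hasQualifiers C0205179
QualifierGroupID cg:hasQualifiers C0005423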
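The condition annotation model above can be written out as triples. The sketch below is one possible reading: the predicate names and concept codes come from the slide, while the node naming scheme (`_cond_ann_1`, `_qualifiers`) and the function name are assumptions made for illustration.

```python
def condition_annotation_triples(trial_id, text, disease, qualifiers):
    """Build the triples for one annotated condition phrase."""
    # ASSUMPTION: these node identifiers are a hypothetical naming scheme.
    annotation = f"{trial_id}_cond_ann_1"
    group = f"{annotation}_qualifiers"
    triples = [
        (trial_id, "rdf:type", "ClinicalTrial"),
        (trial_id, "ct:conditionText", text),
        (trial_id, "ct:conditionAnnotation", annotation),
        (annotation, "ca:hasDisease", disease),
        (annotation, "ca:hasPhrase", text),
        (annotation, "ca:hasQualifiers", group),
    ]
    # One triple per qualifier concept in the group.
    triples += [(group, "cg:hasQualifiers", q) for q in qualifiers]
    return triples


triples = condition_annotation_triples(
    "NCT01506973",
    "Advanced Biliary Tract Adenocarcinoma",
    "C0001418",                # main concept
    ["C0205179", "C0005423"],  # qualifier concepts
)
for s, p, o in triples:
    print(s, p, o)
```

Keeping the qualifiers in their own group node lets one phrase carry any number of qualifiers without changing the schema.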
Demo Example
• Study Conditions
– Multiple phrases in a text
– Pre-coordinated concepts vs. post-coordinated
– Scoring of matching concepts
• Study Interventions
– Drug, route, form
– Drug dosage
• Adverse Events
– Normalization of AE
– Post-coordinated concepts
• Eligibility Criteria
– Semantic sectioning and categorization
– Negations
– Diseases, findings, treatments, age and gender
Intervention Annotation Model - Drugs
[Figure: intervention annotation graph for NCT01506973 (rdf:type ClinicalTrial).
Nodes: NCT01506973_1_2, DrugAnnotationID, 111418, SCTID:111418, SCTID:121681, DrugDosageID, SingleDoseID, PeriodID, FrequencyID, with Value, Unit, Denominator Value and Denominator Unit leaves.
Edges: ct:hasIntervention, in:drugAnnotation, da:hasDrug, da:hasAdministrationRoute, da:hasAdministrationForm, da:hasDosage, do:hasSingleDose, do:hasPeriod, do:hasFrequency]
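The dosage part of the model can be pictured as a small set of structured values. The sketch below is an assumed reading of the figure: it treats single dose and period as value/unit pairs and frequency as a value over a denominator. The class names, field names and the example numbers are all hypothetical, since the slide gives no concrete dosage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Quantity:
    value: float
    unit: str


@dataclass
class Frequency:
    value: float              # number of administrations
    denominator_value: float  # per this many ...
    denominator_unit: str     # ... units of time, e.g. "day"


@dataclass
class DrugDosage:
    single_dose: Optional[Quantity] = None
    frequency: Optional[Frequency] = None
    period: Optional[Quantity] = None

    def amount_per_denominator_unit(self) -> Optional[Quantity]:
        """Dose per time unit, or None if dose or frequency is missing."""
        if self.single_dose is None or self.frequency is None:
            return None
        amount = (self.single_dose.value * self.frequency.value
                  / self.frequency.denominator_value)
        unit = f"{self.single_dose.unit}/{self.frequency.denominator_unit}"
        return Quantity(amount, unit)


# ASSUMPTION: invented numbers, e.g. "500 mg twice a day for 14 days".
dosage = DrugDosage(
    single_dose=Quantity(500, "mg"),
    frequency=Frequency(2, 1, "day"),
    period=Quantity(14, "day"),
)
print(dosage.amount_per_denominator_unit())
```

Splitting the dosage into separate numeric fields is what makes it comparable across trials, which a free-text dosage string is not.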
Criteria Annotation Model
NCT01506973 rdf:type ClinicalTrial
NCT01506973 ct:hasCriteriaSection CriteriaSection
CriteriaSection rdf:type “Inclusion”/“Exclusion”/“Not defined”
CriteriaSection cs:hasText “… No extensive intraductal components on core biopsy, defined as intraductal carcinoma. Patients must not have recurrent invasive breast cancer. …”
CriteriaSection cs:hasCriterion Criterion
Criterion cr:hasText “Patients must not have recurrent invasive breast cancer.”
Criterion cr:hasAnnotation AnnotationId
AnnotationId rdf:type “Disease”/“Drug”/…
AnnotationId sa:Negation “True”/“False”/…
AnnotationId Property 1, Property 2, … Property N
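Semantic sectioning of the criteria text could look roughly like this. The sketch assumes, purely for illustration, that sections are introduced by heading lines such as "Inclusion Criteria:", and that anything before the first heading is typed "Not defined". The function name, the heading pattern and the sample layout are assumptions; real eligibility text is far less regular.

```python
import re

# ASSUMPTION: sections start with a heading line like "Inclusion Criteria:".
HEADING_RE = re.compile(
    r"^\s*(inclusion|exclusion)\s+criteria\s*:?\s*$", re.IGNORECASE
)


def section_criteria(text):
    """Split criteria text into typed sections, one criterion per line."""
    sections = []
    current = {"type": "Not defined", "criteria": []}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        heading = HEADING_RE.match(line)
        if heading:
            if current["criteria"]:
                sections.append(current)
            current = {"type": heading.group(1).capitalize(), "criteria": []}
        else:
            # Drop a leading bullet marker, keep the criterion text.
            current["criteria"].append(line.lstrip("-• ").strip())
    if current["criteria"]:
        sections.append(current)
    return sections


# The two sentences are from the slide; placing them under an
# "Inclusion Criteria:" heading is a hypothetical layout.
sample = """
Inclusion Criteria:
- No extensive intraductal components on core biopsy, defined as intraductal carcinoma.
- Patients must not have recurrent invasive breast cancer.
"""
sections = section_criteria(sample)
for section in sections:
    print(section["type"], len(section["criteria"]))
```

Each resulting criterion can then be annotated on its own, including its negation flag.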
Current Status
• Work with ClinicalTrials.gov data as a public showcase
– > 215K clinical studies
– > 76 million RDF statements
• Coverage
– Conditions (197,154 objects)
  – Diseases, Findings, Body locations, Qualifiers
– Interventions (rdf:type = ‘Drug’ and rdf:type = ‘Biologics’) (381,590 objects)
  – Drugs, Dosages, Administration form, Administration route, Population group
– Adverse Events (1,226,754 objects)
  – Diseases, Findings, Body locations, Qualifiers
– Criteria (semantic sectioning and categorization, negations) (7,216,361 objects)
  – Diseases, Findings, Drugs, Population groups
• In total, more than 80 million RDF triples
How Can I Use This?
• Directly mine the public, enhanced CT.gov version
• Apply the same approach to your internal clinical trials data
• Once the data is semantically normalized, you can “slice and dice” it as your use case requires
• Examples
– Top-down data exploration
– Linked data browsing
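As a minimal sketch of "slice and dice", the snippet below filters trials by disease concept and qualifier using simple pattern matching over in-memory triples. In practice this would more likely be a SPARQL query against a triple store. The function names and node identifiers are assumptions, and the second trial, `NCT_EXAMPLE`, is a made-up record added only for contrast.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def trials_with_disease_and_qualifier(triples, disease, qualifier):
    """Find trials whose condition annotation has both concepts."""
    hits = set()
    for trial, _, annotation in match(triples, p="ct:conditionAnnotation"):
        if not match(triples, s=annotation, p="ca:hasDisease", o=disease):
            continue
        for _, _, group in match(triples, s=annotation, p="ca:hasQualifiers"):
            if match(triples, s=group, p="cg:hasQualifiers", o=qualifier):
                hits.add(trial)
    return sorted(hits)


data = [
    ("NCT01506973", "ct:conditionAnnotation", "ann1"),
    ("ann1", "ca:hasDisease", "C0001418"),
    ("ann1", "ca:hasQualifiers", "group1"),
    ("group1", "cg:hasQualifiers", "C0205179"),
    ("group1", "cg:hasQualifiers", "C0005423"),
    # ASSUMPTION: hypothetical second trial, same disease, no qualifiers.
    ("NCT_EXAMPLE", "ct:conditionAnnotation", "ann2"),
    ("ann2", "ca:hasDisease", "C0001418"),
]

result = trials_with_disease_and_qualifier(data, "C0001418", "C0005423")
print(result)
```

Because the disease and its qualifiers are separate concepts, the same data answers both "any adenocarcinoma" and "adenocarcinoma of the biliary tract" without re-annotating the text.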
Next Steps
• Release an RDFized version of ClinicalTrials.gov
– Pre-loaded in GraphDB Free
– Pre-loaded on Ontotext S4 Cloud
– As an RDF serialization distribution
• Release all semantically structured information under a license that is free for non-commercial use
• Extend the data schema to support not only concepts but also tokens that cannot be normalized to ontology instances
Thank You!
You can contact me by e-mail:
todor.primov@ontotext.com
