ChemExtractor:
Enhanced Rule-Based Capture and
Identification of PDF Based Property Data
Stuart J. Chalk, Department of Chemistry
University of North Florida
schalk@unf.edu
253rd ACS Meeting April 2017
Outline
 Motivation
 Research Approach
 Analyzing Tabular Data
 Types of Data
 Regular Expressions (Regex)
 Rules and Rulesets
 Examples of Data Extraction
 Contextualizing Data
 Data Storage (MySQL)
 Data Representation (SciData)
 Conclusion
Funding for this project provided by
Motivation
 The Landolt-Börnstein Database is 450+ volumes of
curated chemical property data (18?? to date)
 With the move to data-driven science it is imperative we
leverage the time invested in, and scientific quality of, the
curation of this data
 This high value data is locked in PDF files 
 This data can be made more useful if it is extracted with
its metadata (chemical system, original reference,
physical property with unit, etc.)
 Optical Character Recognition (OCR) is a standard
process to extract text from scanned images
 Accurate extraction of text using OCR in a page layout
format not only captures the text but also the inferred
relationships between tabular data
 Utilize regular expression (regex) analysis of tabular
text to capture data and its position relative to other
data
 Contextualize captured data with metadata encoded
based on layout of table and string format
Research Approach
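As a rough illustration of capturing data together with its position, the sketch below tokenizes two layout-preserved text lines, records the column where each token starts, and pairs each value with the nearest header. The sample lines and the function names are made up for illustration; they are not taken from the Landolt-Börnstein files or from ChemExtractor itself.

```python
import re

# Hypothetical layout-preserved lines, as page-layout OCR output might look.
HEADER = "T/K       p/kPa     x1"
ROW = "298.15    101.3     0.25"

TOKEN = re.compile(r"\S+")


def tokens_with_columns(line):
    """Return (text, start_column) for each whitespace-delimited token."""
    return [(m.group(), m.start()) for m in TOKEN.finditer(line)]


def align(header, row):
    """Pair each row value with the header whose start column is nearest."""
    heads = tokens_with_columns(header)
    pairs = {}
    for value, col in tokens_with_columns(row):
        name, _ = min(heads, key=lambda h: abs(h[1] - col))
        pairs[name] = value
    return pairs


print(align(HEADER, ROW))
```

Matching on the nearest start column is only one possible heuristic; real tables with centered or right-aligned columns would likely need a different measure.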
Analyzing Tabular Data
[Figure: annotated table page with labeled regions: Chemical Metadata, Series of Property Data, Condition, Series Condition, Property Data, Reference]
 Properties and Units
 Conditions, data, supplemental data
 Equation coefficients and variables
 Chemical properties (MW, BP, MP)
 Annotations
 Table headers, column headers, data notes
 Chemical Metadata
 Formula, name, CASRN
 Context Metadata
 Table #, refcodes, property headers, component #
Types of Data
 Implemented in every programming language
 Relatively uniform implementation
 Uses syntax to create regex string that matches
and/or captures characters in string
 Groups of characters – ., [A-Z], [a-z], [0-9]
 Character classes – \w, \d, \s, \h
 Repetition – ? (optional), + (1 or more), * (0 or more)
 Capture group – ()
Regular Expressions (Regex)
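A minimal sketch of these syntax elements in Python's built-in `re` module follows. The pattern and the sample line are invented for illustration. One caveat on the "relatively uniform" point: the horizontal-whitespace class `\h` is available in PCRE-style engines but not in Python's `re`, so `[ \t]` stands in for it here.

```python
import re

# [A-Z] and [a-z] are character groups, \d is a character class,
# ? + * control repetition, and parentheses capture.
# Python's re has no \h, so [ \t] is used for horizontal whitespace.
PATTERN = re.compile(r"^([A-Z][a-z]?)[ \t]+(\d+\.?\d*)[ \t]*([a-z]*)$")


def parse(line):
    """Return (symbol, number, unit) captured from the line, or None."""
    m = PATTERN.match(line)
    return m.groups() if m else None


print(parse("Na  22.99 u"))
print(parse("no match here"))
```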
Regex Example 1 (http://regex101.com)
Regex Example 2 (http://regex101.com)
Regex Example 3 (http://regex101.com)
 We write rules for specific line formats constructed using
 Rule templates – define basic structure of a line, i.e.
how many blocks of text need to be captured
 Rule snippets – define small regex strings that capture a particular
format of text
 Rulesets are lists of rules with associated actions, applied to
sequentially process text line by line
Rules and Rulesets
^@B1@\h+@B2@\h*@B3@\h*@B4@\h*@B5@\h*@B6@$
(?:[A-Z][a-z]{0,2}\d{0,3}\h?)+(?:.?\d*.?\d*H2O)
Rules and Rulesets
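One way the template, snippet, and ruleset ideas could fit together is sketched below: each `@Bn@` placeholder in a template is replaced by a capture group wrapping a named snippet, and the resulting rules are tried in order on every line. The snippet library, rule names, and helper functions are hypothetical, not ChemExtractor's actual ones, and `[ \t]` replaces the horizontal-whitespace class because Python's `re` does not support `\h`.

```python
import re

# Hypothetical snippet library: name -> small regex for one text format.
SNIPPETS = {
    "formula": r"(?:[A-Z][a-z]{0,2}\d{0,3})+",
    "number": r"-?\d+(?:\.\d+)?",
    "refcode": r"\d{2}[A-Z]\d",  # assumed shape, e.g. 81V1
}

# Three-block template; @Bn@ marks where block n is captured.
TEMPLATE = r"^@B1@[ \t]+@B2@[ \t]*@B3@$"


def build_rule(template, blocks):
    """Replace each @Bn@ placeholder with a capture group around a snippet."""
    def fill(m):
        return "(" + SNIPPETS[blocks[int(m.group(1)) - 1]] + ")"
    return re.compile(re.sub(r"@B(\d+)@", fill, template))


# A ruleset: ordered (rule name, compiled rule) pairs.
RULESET = [
    ("formula & value & refcode",
     build_rule(TEMPLATE, ["formula", "number", "refcode"])),
    ("value & value & refcode",
     build_rule(TEMPLATE, ["number", "number", "refcode"])),
]


def process(lines):
    """Return (rule name, captured blocks) for the first rule matching each line."""
    results = []
    for line in lines:
        for name, rule in RULESET:
            m = rule.match(line)
            if m:
                results.append((name, m.groups()))
                break
        else:
            results.append((None, ()))
    return results


print(process(["C2H6O 46.07 81V1", "298.15 101.3 82A2", "Table 3"]))
```

Keeping snippets separate from templates means a new line format usually needs only a new combination of existing pieces rather than a hand-written regex.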
Examples of Data Extraction
 This data…
 … is detected by rule ‘Temperature (K) & refcode’
Contextualizing the Data
T/K 298.15 () () () () () () () () 81V1 ()
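A hedged sketch of how a rule like 'Temperature (K) & refcode' might turn such a line into contextualized data is given below. The regex, the refcode shape (two digits, a capital letter, a digit, inferred from the single example 81V1), the symbol-to-property lookup, and the `table_no` argument are all assumptions made for illustration.

```python
import re

# Assumed form of the rule: property symbol / unit, a numeric value,
# then a refcode somewhere later on the line.
RULE = re.compile(
    r"^(?P<symbol>[A-Z])/(?P<unit>[A-Za-z]+)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\b.*?\b(?P<refcode>\d{2}[A-Z]\d)\b"
)

# Hypothetical lookup from property symbol to property name.
PROPERTY_NAMES = {"T": "temperature", "p": "pressure"}


def contextualize(line, table_no):
    """Return a metadata record for a matching line, or None if no match."""
    m = RULE.match(line)
    if m is None:
        return None
    return {
        "table": table_no,
        "property": PROPERTY_NAMES.get(m.group("symbol"), "unknown"),
        "unit": m.group("unit"),
        "value": float(m.group("value")),
        "refcode": m.group("refcode"),
    }


print(contextualize("T/K 298.15 () () () () () () () () 81V1 ()", table_no=1))
```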
Data Storage (MySQL)
Data Presentation
Data Representation (SciData)
 Higher accuracy capture of property data and equation data
(350,000 property data points, 10,000 equations)
 Integrated chemical and reference metadata
 Can be applied to other PDF-based curated datasets
 Open website for community use this summer
 Implementation for research article capture of chemical
property data
Conclusion
 schalk@unf.edu
 Phone: 904-620-1938
 Skype: stuartchalk
 LinkedIn: https://www.linkedin.com/in/stuchalk
 ORCID: http://orcid.org/0000-0002-0703-7776
Questions?
