ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Property Data

The document discusses the ChemExtractor project, which enhances the extraction and identification of chemical property data from PDF files using rule-based methods and regular expressions. It highlights the motivation behind the project, such as the need to leverage curatorial investments in chemical data, and outlines the research approach, including data analysis and the contextualization of captured data with metadata. The project aims to facilitate higher accuracy in capturing significant datasets and is intended for community use.

ChemExtractor:
Enhanced Rule-Based Capture and
Identification of PDF Based Property Data
Stuart J. Chalk, Department of Chemistry
University of North Florida
schalk@unf.edu
253rd ACS Meeting April 2017

Outline
• Motivation
• Research Approach
• Analyzing Tabular Data
• Types of Data
• Regular Expressions (Regex)
• Rules and Rulesets
• Examples of Data Extraction
• Contextualizing Data
• Data Storage (MySQL)
• Data Representation (SciData)
• Conclusion
Funding for this project provided by

Motivation
• The Landolt-Börnstein Database comprises 450+ volumes of curated chemical property data (18?? to date)
• With the move to data-driven science, it is imperative that we leverage the time invested in, and the scientific quality of, the curation of this data
• This high-value data is locked in PDF files
• This data can be made more useful if it is extracted with its metadata (chemical system, original reference, physical property with unit, etc.)

Research Approach
• Optical Character Recognition (OCR) is a standard process to extract text from scanned images
• Accurate OCR extraction in a page-layout format captures not only the text but also the implied relationships among the tabular data
• Utilize regular expression (regex) analysis of tabular text to capture data and its position relative to other data
• Contextualize captured data with metadata encoded based on the layout of the table and the string format
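As a rough illustration of capturing data together with its position relative to other data, the sketch below records the start column of each token in a layout-preserved line and pairs each value with the nearest column header. The sample lines, the function names, and the nearest-column heuristic are assumptions made for this example and are not taken from ChemExtractor itself.

```python
import re


def tokens_with_columns(line):
    """Return (text, start_column) for each run of non-space characters."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", line)]


def nearest_header(column, headers):
    """Pick the header whose start column is closest to the given column."""
    return min(headers, key=lambda h: abs(h[1] - column))[0]


# Hypothetical OCR output in which spaces preserve the page layout.
header = "T/K      p/kPa    Ref."
row = "298.15   101.3    81V1"

header_cols = tokens_with_columns(header)

# Pair every value in the row with the header it sits under.
paired = {nearest_header(col, header_cols): text
          for text, col in tokens_with_columns(row)}
print(paired)
```

Because each value keeps its column offset, a later step can decide which header, and therefore which property or unit, the value belongs to.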

Analyzing Tabular Data
[Figure labels: Chemical Metadata; Series of Property Data; Condition; Series Condition; Property Data; Reference]

Types of Data
• Properties and Units
• Conditions, data, supplemental data
• Equation coefficients and variables
• Chemical properties (MW, BP, MP)
• Annotations
• Table headers, column headers, data notes
• Chemical Metadata
• Formula, name, CASRN
• Context Metadata
• Table #, refcodes, property headers, component #

Regular Expressions (Regex)
• Available in nearly every programming language
• Relatively uniform implementation
• Uses a pattern syntax to build a regex string that matches and/or captures characters in a string
• Groups of characters – ., [A-Z], [a-z], [0-9]
• Character classes – \w, \d, \s, \h
• Repetition – ? (optional), + (1 or more), * (0 or more)
• Capture group – ()
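A minimal sketch of these building blocks in Python's `re` module follows. The pattern and the sample lines are invented for illustration. Python's `re` does not support the horizontal-whitespace class written `\h` in PCRE-style engines, so `[ \t]` is used here as a stand-in.

```python
import re

# ([A-Z][a-z]?\d*)  character groups plus a digit class: symbol and count
# [ \t]+            one or more spaces or tabs (stand-in for PCRE \h+)
# (\d+\.?\d*)       a number with an optional decimal part
# (\w+)?            an optional trailing word, captured if present
pattern = re.compile(r"^([A-Z][a-z]?\d*)[ \t]+(\d+\.?\d*)[ \t]*(\w+)?$")

with_unit = pattern.match("Cl2\t70.906 u")
without_unit = pattern.match("Cl2  70.906")

print(with_unit.groups())
print(without_unit.groups())
```

Each pair of parentheses yields one captured block, and an optional group that does not take part in the match is reported as `None`.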

Regex Example 1 (http://regex101.com)

Regex Example 2 (http://regex101.com)

Regex Example 3 (http://regex101.com)

Rules and Rulesets
• We write rules for specific line formats, constructed using:
• Rule templates – define the basic structure of a line, i.e. how many blocks of text need to be captured
• Rule snippets – define small regex strings that capture a particular format of text
• Rulesets are lists of rules with associated actions that sequentially process text line by line
^@B1@\h+@B2@\h*@B3@\h*@B4@\h*@B5@\h*@B6@$
(?:[A-Z][a-z]{0,2}\d{0,3}\h?)+(?:.?\d*.?\d*H2O)$
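One way a template with `@B1@`-style placeholders could be combined with snippets is sketched below. The snippet names, the `build_rule` helper, and the shortened three-block template are hypothetical. The hydrate pattern escapes its dots and uses `[ \t]` in place of the horizontal-whitespace class, both of which are assumptions about the pattern shown on the slide.

```python
import re

# Hypothetical snippets: each one captures a single block of text.
SNIPPETS = {
    "property": r"([A-Za-z]+/[A-Za-z]+)",  # e.g. T/K
    "number": r"(\d+\.?\d*)",              # e.g. 298.15
    "refcode": r"(\d{2}[A-Z]\d+)",         # e.g. 81V1
}

H = r"[ \t]"  # Python's re has no \h, so this stands in for it


def build_rule(template, blocks):
    """Replace \\h and each @Bn@ placeholder, then compile the rule."""
    pattern = template.replace(r"\h", H)
    for i, name in enumerate(blocks, start=1):
        pattern = pattern.replace(f"@B{i}@", SNIPPETS[name])
    return re.compile(pattern)


rule = build_rule(r"^@B1@\h+@B2@\h+@B3@$", ["property", "number", "refcode"])
print(rule.match("T/K 298.15 81V1").groups())

# Formula-with-hydrate pattern modeled on the slide (dots escaped by assumption).
HYDRATE = re.compile(r"^(?:[A-Z][a-z]{0,2}\d{0,3}[ \t]?)+(?:\.?\d*\.?\d*H2O)$")
print(bool(HYDRATE.match("CuSO4.5H2O")), bool(HYDRATE.match("NaCl")))
```

Keeping templates and snippets separate means a new line format only needs a new template, while the snippets for numbers, properties, and reference codes are reused.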

Contextualizing the Data
• This data…
• … is detected by rule ‘Temperature (K) & refcode’
T/K 298.15 () () () () () () () () 81V1 ()
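The sketch below shows how a ruleset might attach context to captured values by running named rules with actions over the text line by line. The two rules, their regex patterns, the context keys, and the input lines are hypothetical. Only the rule name 'Temperature (K) & refcode' comes from the slide.

```python
import re

# Hypothetical ruleset: (rule name, compiled regex, action on a match).
RULESET = [
    ("Chemical formula",
     re.compile(r"^((?:[A-Z][a-z]?\d*)+)$"),
     lambda m, ctx: ctx.update(formula=m.group(1))),
    ("Temperature (K) & refcode",
     re.compile(r"^T/K[ \t]+(\d+\.?\d*)[ \t]+(\d{2}[A-Z]\d+)$"),
     lambda m, ctx: ctx.update(temperature_K=float(m.group(1)),
                               refcode=m.group(2))),
]


def process(lines):
    """Apply the first matching rule to each line and collect the context."""
    context, detected = {}, []
    for line in lines:
        for name, regex, action in RULESET:
            match = regex.match(line.strip())
            if match:
                action(match, context)
                detected.append(name)
                break  # first matching rule wins; unmatched lines are skipped
    return context, detected


context, detected = process(["C6H6", "T/K 298.15 81V1"])
print(detected)
print(context)
```

Because the rule name is recorded alongside the captured values, each number carries its meaning (here a temperature in kelvin plus a reference code) rather than being stored as a bare string.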

Conclusion
• Higher-accuracy capture of property data and equation data (350,000 property data points, 10,000 equations)
• Integrated chemical and reference metadata
• Can be applied to other PDF-based curated datasets
• Open website for community use this summer
• Implementation for research article capture of chemical property data

Questions?
• schalk@unf.edu
• Phone: 904-620-1938
• Skype: stuartchalk
• LinkedIn: https://www.linkedin.com/in/stuchalk
• ORCID: http://orcid.org/0000-0002-0703-7776

Slides with no extractable text beyond the title:
12-15. Rules and Rulesets
16-19. Examples of Data Extraction
21-22. Data Storage (MySQL)
23-24. Data Presentation
25. Data Representation (SciData)