Integrated Data Platform at Bayer
Turning bits into insights
Wolfgang Thielemann
Agenda
What platform did we build?
What does it look like?
Why did we build it?
Architecture and data enrichment
Challenges
Plans for the future
2 /// AI-SDV 2022 // Integrated Data Platform at Bayer
What platform did we build?
1
Our platform semantically integrates Terabytes
of external scientific textual data to support
insight generation along the R&D value chain
Big data platform
This platform is…
• A semantically integrated and harmonized big data hub containing major external,
text-rich, and life-science related data sources
• Enriched with FAIR meta-data generated by extracting the key information (e.g., molecular
targets, medical conditions, active ingredients, technologies etc.) using NLP
• An analysis-ready platform for end-users (GUI access) and data scientists (API access)
The users
• Scientific end users
• Data scientists
• Developers of digital products
End-user GUIs: more power & precision for scientific search
• Users: Project leaders, R&D scientists, Tech scouts & Co
• Find relevant information: Alerts, Analysis, Filter & Review
Expert APIs: provide structured data for insight generation
• Users: Data scientists, Computational scientists, Information professionals, Bioinformaticians
• Generate insights: Find new targets & treatments, Support pipeline decisions, Build predictive models
What does it look like?
2
Example: Liver cancer
• Google-like search interface
• Interactive analysis and filtering
• Result overview
• Record view
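The search and filtering views listed above map naturally onto a full-text query combined with facet aggregations. The following is a minimal sketch of such a request body in Elasticsearch query DSL, built as a plain Python dictionary. The field names (`text`, `disease`, `source_type`) and the function `build_search_body` are assumptions for illustration; the talk does not describe the platform's actual index schema or API.

```python
import json


def build_search_body(query_text, source_types=None, facet_size=10):
    """Build an Elasticsearch-style request body: full-text match plus facets.

    All field names are hypothetical; the real schema is not given in the talk.
    """
    must = [{"match": {"text": query_text}}]
    filters = []
    if source_types:
        # Restrict hits to the selected source types (e.g., patents).
        filters.append({"terms": {"source_type": list(source_types)}})
    return {
        "query": {"bool": {"must": must, "filter": filters}},
        # Terms aggregations could drive the interactive filter panels.
        "aggs": {
            "by_disease": {"terms": {"field": "disease", "size": facet_size}},
            "by_source": {"terms": {"field": "source_type", "size": facet_size}},
        },
        "size": 20,
    }


body = build_search_body("liver cancer", source_types=["patent", "clinical_trial"])
print(json.dumps(body, indent=2))
```

The dictionary can be sent to a search endpoint by any HTTP or Elasticsearch client; keeping it as plain data makes the query easy to inspect and test.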
Why did we build it?
3
Big Data Platform
6 Reasons why building it made and makes sense
Richness of data sources
Flexibility
Costs
Scalability
FAIR meta-data
Full transparency and control
1. Bandwidth and richness of data sources
[Figure: scientific sources in our platform vs. platforms limited to publicly available data]
2. Maximum flexibility to analyze the data and to integrate it into our
Bayer data ecosystem
Existing platforms often come with limited/pre-defined analysis options and
limited integrability
3. Full scalability
Our platform is built on a scalable cloud infrastructure for big data analysis
and allows you to analyze millions of records in one go.
4. Costs
This platform allowed us to save money and reduce complexity by replacing
various proprietary legacy platforms
5. One terminology across the entire content, with the option to
adjust it to our needs
Individual sources / platforms typically have their own standards and
terminologies
One terminology for the entire platform
6. Comprehensiveness and quality of meta-data
Since we built on 20 years of thesauri and NLP algorithms optimized to
Bayer’s needs, our terminologies cover the real-life use of science much
better than established terminologies
MeSH:
Proprietary disease thesaurus:
Architecture & Data enrichment
4
Platform architecture
DATA
• Conference Abstracts
• Literature Abstracts
• Literature Fulltexts
• Patents
• Patent Chemistry
• Clinical Trials
• Pipeline Information
• Market reports
• Company Websites
• Industry News
• Research Grants
• Tech Transfer Offers
PROCESS
• Automated Data Acquisition (Kafka Technology)
• Data Engineering: Normalization, Deduplication, Classification, etc. (Kafka Streams)
• Semantic Enrichment: Targets, Organisms, Sequences, Drugs, Active Ingredients, Companies/Organizations, Analytics, etc.
• Index, Search, and API Services (Elastic)
DELIVER
• End User Products: Cross-search GUI, Advanced literature GUI, Advanced patent GUI
• APIs & Data Science
• System/Application Integrations: other proprietary platforms and workflows use this platform as source
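To make the flow through the processing layer concrete, here is a minimal in-memory sketch of the stages named in the architecture: normalization, de-duplication, semantic enrichment, and indexing. It is a stand-in under stated assumptions, not the platform's implementation: plain Python functions replace Kafka, Kafka Streams, and Elastic, and the thesaurus entries, record fields, and function names are all hypothetical.

```python
import unicodedata

# Hypothetical mini-thesaurus: surface form -> preferred concept label.
THESAURUS = {
    "liver cancer": "Liver neoplasm",
    "hepatocellular carcinoma": "Liver neoplasm",
}


def normalize(record):
    """Data engineering step: Unicode-normalize and trim every text field."""
    return {k: unicodedata.normalize("NFKC", v).strip() for k, v in record.items()}


def enrich(record):
    """Semantic enrichment step: tag the record with concepts found in its text."""
    text = record["text"].lower()
    concepts = sorted({c for term, c in THESAURUS.items() if term in text})
    return {**record, "concepts": concepts}


def run_pipeline(raw_records):
    """Normalize, de-duplicate by id, enrich, and store records in a dict 'index'."""
    index = {}
    for raw in raw_records:
        rec = normalize(raw)
        if rec["id"] in index:  # naive de-duplication on the record id
            continue
        index[rec["id"]] = enrich(rec)
    return index


raw = [
    {"id": "a1", "text": " Liver cancer trial results "},
    {"id": "a1", "text": "Liver cancer trial results"},
    {"id": "b2", "text": "Hepatocellular carcinoma patent"},
]
index = run_pipeline(raw)
print(sorted(index))
```

In a streaming setup each function would typically become its own processing step between topics, so that stages can be scaled and replayed independently.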
Semantic data integration at large
Resolve all flavours of heterogeneity to make textual data FAIR
• Structural heterogeneity: same facts expressed in different schemata; missing / additional attributes
• Technical heterogeneity: data formats (JSON vs. XML), communication protocols (REST vs. ODBC), query languages (SQL vs. SPARQL)
• Data model heterogeneity: relational vs. semi-structured, tuples vs. graphs, …
• Syntactic heterogeneity: different presentation of the same fact (Unicode or ASCII, EUR or €, …)
• Semantic heterogeneity:
➢ Same concepts are named differently (e.g., lung cancer, pulmonary carcinoma, neoplasm of the lung, …)
➢ Different concepts are named the same (e.g., GSK)
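Two of these heterogeneity types lend themselves to a short illustration. The sketch below resolves syntactic variation (Unicode forms, currency notation, whitespace) and then maps synonyms to one preferred concept. The synonym table, the currency map, and the function names are assumptions made for this example only.

```python
import unicodedata

# Hypothetical synonym table: each surface form maps to one preferred concept.
SYNONYMS = {
    "lung cancer": "Lung cancer",
    "pulmonary carcinoma": "Lung cancer",
    "neoplasm of the lung": "Lung cancer",
}
# Hypothetical currency normalization map.
CURRENCY = {"€": "EUR"}


def normalize_syntax(text):
    """Resolve syntactic heterogeneity: one Unicode form, one currency notation."""
    text = unicodedata.normalize("NFKC", text)
    for symbol, code in CURRENCY.items():
        text = text.replace(symbol, code)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def to_concept(term):
    """Resolve semantic heterogeneity: map a synonym to its preferred concept."""
    return SYNONYMS.get(normalize_syntax(term).lower())


print(to_concept("Pulmonary  carcinoma"))
print(normalize_syntax("5 €"))
```

A plain lookup like this cannot resolve terms that name different concepts in different contexts; an ambiguous term such as GSK would need the surrounding text to decide which concept is meant.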
Challenges
5
Challenges: Data ingestion
Heterogeneous formats
Heterogeneous update schedules: hourly, daily, weekly, monthly
Challenges: Data ingestion
Changes in record structure
Changes in volume over time
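Both kinds of change can be caught with simple checks at ingestion time. The following is a hedged sketch: it compares a record's fields against the set expected for a source and flags deliveries whose size deviates strongly from the previous one. The function names, the field names, and the 50% tolerance are illustrative assumptions, not values from the talk.

```python
def detect_structure_changes(expected_fields, record):
    """Compare a record's top-level fields with the fields expected for its source."""
    actual = set(record)
    return {
        "missing": sorted(expected_fields - actual),
        "unexpected": sorted(actual - expected_fields),
    }


def volume_changed(previous_count, current_count, tolerance=0.5):
    """Flag a delivery whose record count deviates by more than `tolerance` (relative)."""
    if previous_count == 0:
        return current_count > 0
    return abs(current_count - previous_count) / previous_count > tolerance


expected = {"id", "title", "abstract"}
report = detect_structure_changes(expected, {"id": "x", "title": "t", "summary": "s"})
print(report)
print(volume_changed(1000, 400))
```

Checks like these do not fix a changed feed, but they turn a silent change into an alert before incomplete data reaches the index.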
Challenges: Data ingestion
De-duplication
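One common way to approach de-duplication across sources is to compute a fingerprint from normalized fields and keep only the first record per fingerprint. The sketch below uses a normalized title as the key; this choice, the record fields, and the function names are assumptions for illustration, and real records would likely need more fields and fuzzier matching.

```python
import hashlib
import re


def fingerprint(record):
    """Hash a normalized title so trivially different copies collide."""
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return hashlib.sha256(title.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record per fingerprint, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        key = fingerprint(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


records = [
    {"source": "A", "title": "De-duplication of abstracts"},
    {"source": "B", "title": "de-duplication of  abstracts."},
    {"source": "C", "title": "Another record"},
]
unique = deduplicate(records)
print([r["source"] for r in unique])
```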
Challenges: Semantic enrichment
Lack of a universally accepted identifier for an entity class
• Human gene: NCBI Gene ID
• Chemical compound: INN name, IUPAC, CAS-Nr, PubChem CID, Canonical SMILES
• Disease: MeSH ID, UMLS ID, Snomed ID, NCIT ID, Orphanet ID, Mondo ID, ICD-10 ID, MedDRA ID, DO ID, …
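When an entity class has many competing identifier schemes, one option is to anchor every scheme to a single internal concept ID and translate through it. The sketch below shows that idea with a tiny cross-reference table. The internal ID and all external identifiers are placeholders invented for this example, not real MeSH, UMLS, or ICD-10 codes, and the function names are hypothetical.

```python
# Hypothetical cross-reference table; every identifier below is a placeholder.
XREF = {
    "DISEASE:0001": {
        "MeSH": "MESH-PLACEHOLDER-1",
        "UMLS": "UMLS-PLACEHOLDER-1",
        "ICD-10": "ICD10-PLACEHOLDER-1",
    },
}


def build_reverse_index(xref):
    """Map every (scheme, external id) pair back to the internal concept id."""
    reverse = {}
    for internal_id, ids in xref.items():
        for scheme, external_id in ids.items():
            reverse[(scheme, external_id)] = internal_id
    return reverse


def translate(reverse, xref, scheme, external_id, target_scheme):
    """Translate an identifier from one scheme to another via the internal id."""
    internal_id = reverse.get((scheme, external_id))
    if internal_id is None:
        return None
    return xref[internal_id].get(target_scheme)


reverse = build_reverse_index(XREF)
print(translate(reverse, XREF, "MeSH", "MESH-PLACEHOLDER-1", "UMLS"))
```

Routing every translation through one internal ID keeps the number of mappings linear in the number of schemes instead of requiring a table for each pair of schemes.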
Challenges: Semantic enrichment
Identification of different entities requires different technologies:
➢ Terminology-based NLP (e.g., disease names)
➢ ML-based NLP (e.g., for ambiguous acronyms like cell lines, gene acronyms etc.)
➢ Rule/pattern-based extraction (e.g., IUPAC chemical names, gene mutations)
“A lamp-snp assay detecting c580y mutation in pfkelch13 gene from clinically dried blood spot samples”
➢ Image/graph processing (e.g., image2mol)
C1=CC=C(C(=C1)CC(=O)[O-])NC2=C(C=CC=C2Cl)Cl.[Na+]
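As an illustration of the rule/pattern-based approach, the sketch below extracts point mutations written as one-letter amino acid, position, one-letter amino acid (such as c580y in the example sentence). The regular expression and the function name are assumptions for this example; a production rule set would need to cover more notations and filter false positives such as similarly shaped cell line or product names.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# One-letter amino acid, position, one-letter amino acid, e.g. "C580Y".
MUTATION = re.compile(
    rf"\b([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])\b", re.IGNORECASE
)


def extract_mutations(text):
    """Return (reference residue, position, variant residue) tuples found in text."""
    return [
        (m.group(1).upper(), int(m.group(2)), m.group(3).upper())
        for m in MUTATION.finditer(text)
    ]


sentence = (
    "A lamp-snp assay detecting c580y mutation in pfkelch13 gene "
    "from clinically dried blood spot samples"
)
print(extract_mutations(sentence))
```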
Status quo & Plans for the future
6
Are we now living in a fairytale where everything is perfect?
There is still a lot to do…
➢Terminology is constantly evolving (new companies, new technologies etc.)
➢Development of scalable algorithms for complex entities
➢Finding the most relevant information in the ocean of data
➢Advanced visualization and analytics
➢Further standardization
➢…..
What can you do to help us in our endeavour?
Vendors / Publishers / Database producers
• Data quality
• FAIRification
• Using generally available standards & IDs
• Consistency
• Collecting scattered data
• Harmonization
Plans for the future
• SOURCES (e.g., drug labels, guidelines)
• USABILITY
• THESAURI
• Automatization (e.g., alerting)
• CHEMISTRY
• ANALYSES features
Thank you!
Special thanks to my colleagues on the team

AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insights Wolfgang Thielemann (Bayer, Germany)
