© 2018 KNIME AG. All Rights Reserved.
How Do You Build and Validate 1500
Models and What Can You Learn from
Them?
Greg Landrum*, Anna Martin,
Daria Goldmann
KNIME AG
2018 ICCS
@dr_greg_landrum
The Monster Model Factory
Who cares?
• I have >1500 datasets from ChEMBL that I would like
to build models for
• I want to actually use the models, so they need to
be deployed
• The whole process needs to be automated and
reproducible so that I can do it again when ChEMBL
is updated
• Maybe we can learn something interesting from the
models themselves
Back to the beginning
The model process
CRISP-DM (CRoss Industry Standard Process for Data Mining) is a standard process for data mining solutions. Source: wikipedia://CRISP-DM
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
Pipeline steps: Init, Load, Transform, Learn, Score, Evaluate, Deploy
The model process, multiple models
It’s not feasible to do this manually for a daunting number of models!
Image: https://commons.wikimedia.org/wiki/File:Jabberwocky.jpg
Automation: the model process factory
[Figure: several parallel copies of the Load, Transform, Learn, Score, Evaluate, Deploy pipeline]
Make each step a separate workflow.
Use KNIME to orchestrate calling those workflows.
KNIME blog post: https://goo.gl/LvESqB
White paper: https://goo.gl/d6UpUu
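The idea of one workflow per step, with an orchestrator calling each in turn, can be sketched outside KNIME as follows. This is a plain-Python analogy and not KNIME's API: the step functions, the toy data, and the `run_factory` helper are all hypothetical stand-ins for the separate workflows and the workflow that calls them.

```python
from typing import Callable, Dict, List

# Each "workflow" is modelled as a function that takes a state dict and
# returns an updated copy. All names and values here are illustrative.
Step = Callable[[Dict], Dict]


def load(state: Dict) -> Dict:
    # Stand-in for the Load workflow: attach some toy rows.
    return {**state, "rows": [1, 2, 3]}


def transform(state: Dict) -> Dict:
    # Stand-in for the Transform workflow: derive features from the rows.
    return {**state, "rows": [r * 2 for r in state["rows"]]}


def learn(state: Dict) -> Dict:
    # Stand-in for the Learn workflow: the "model" is just the mean.
    rows = state["rows"]
    return {**state, "model": sum(rows) / len(rows)}


def run_factory(assay_ids: List[str], steps: List[Step]) -> Dict[str, Dict]:
    """Call every step in order for every assay, like an orchestrating workflow."""
    results: Dict[str, Dict] = {}
    for assay_id in assay_ids:
        state: Dict = {"assay_id": assay_id}
        for step in steps:
            state = step(state)
        results[assay_id] = state
    return results


results = run_factory(["assay_A", "assay_B"], [load, transform, learn])
```

Keeping each step behind the same small interface is what lets the orchestrator stay unchanged when a step is swapped out or a new assay is added.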
Model Factory
The heart of the factory: Call Local Workflow [1]
• Executes another workflow in the same local repository
[1] Call Remote Workflow when run on the KNIME Server
Image: https://pixabay.com/en/heart-veins-arteries-anatomy-152594/
Details
Extracting the data
• Data source: ChEMBL 23
• Activity types: 'GI50', 'IC50', 'Ki', 'MIC', 'EC50', 'AC50', 'ED50', 'GI', 'Kd', 'CC50', 'LC50', 'MIC90', 'MIC50', 'ID50' -> 6.5 million points
• Define active: standard_value < 100 nM -> 1.3 million actives
• Define inactive: standard_value > 1 µM
• Define an interesting assay: at least 50 actives -> 1556 assays
• Final dataset size: 2.5 million data points, 1.5 million compounds
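The thresholds above can be written down as a small labelling rule. This is a minimal sketch, assuming the values have already been converted to nM; leaving measurements between 100 nM and 1 µM unlabelled is an assumption made here, since the slide does not say what happens to them. The function and constant names are hypothetical.

```python
from typing import Iterable, Optional

ACTIVE_BELOW_NM = 100.0      # active: standard_value < 100 nM
INACTIVE_ABOVE_NM = 1000.0   # inactive: standard_value > 1 uM (= 1000 nM)
MIN_ACTIVES = 50             # an "interesting" assay has at least 50 actives


def label_activity(standard_value_nm: float) -> Optional[str]:
    """Return 'active', 'inactive', or None for the in-between range."""
    if standard_value_nm < ACTIVE_BELOW_NM:
        return "active"
    if standard_value_nm > INACTIVE_ABOVE_NM:
        return "inactive"
    # Assumption: values from 100 nM to 1 uM are left unlabelled.
    return None


def is_interesting_assay(values_nm: Iterable[float]) -> bool:
    """True when the assay contains at least MIN_ACTIVES active measurements."""
    n_actives = sum(1 for v in values_nm if label_activity(v) == "active")
    return n_actives >= MIN_ACTIVES
```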
Finding more inactives
• The ChEMBL datasets almost all have an
unrealistically high ratio of actives to inactives
• “Fix” that by adding enough assumed inactives to
each dataset to get a 1:10 active:inactive ratio
• Pick those assumed inactives to be roughly similar
to the actives: Tanimoto similarity of between 0.35
and 0.6 using RDKit Morgan 2 fingerprints
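A rough sketch of the selection rule follows. It uses plain Python sets of on-bit indices in place of real RDKit Morgan 2 fingerprints so that it runs without dependencies. Two things are assumptions here and not taken from the slides: a candidate's similarity "to the actives" is read as its similarity to the most similar active, and candidates are taken in the order given. All function names are hypothetical.

```python
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def n_assumed_inactives_needed(n_actives: int, n_inactives: int, ratio: int = 10) -> int:
    """How many assumed inactives bring the set to a 1:ratio active:inactive ratio."""
    return max(0, ratio * n_actives - n_inactives)


def pick_assumed_inactives(
    actives: List[Set[int]],
    candidates: Dict[str, Set[int]],
    n_needed: int,
    lo: float = 0.35,
    hi: float = 0.6,
) -> List[str]:
    """Pick candidates whose best similarity to any active lies in [lo, hi].

    Expects a non-empty list of actives.
    """
    picked: List[str] = []
    for name, fp in candidates.items():
        if len(picked) >= n_needed:
            break
        best = max(tanimoto(fp, active) for active in actives)
        if lo <= best <= hi:
            picked.append(name)
    return picked


actives = [{1, 2, 3, 4}]
candidates = {
    "too_close": {1, 2, 3, 4, 5},   # similarity 0.8, rejected
    "similar": {1, 2, 3, 5, 6},     # similarity 0.5, accepted
    "unrelated": {7, 8},            # similarity 0.0, rejected
}
chosen = pick_assumed_inactives(actives, candidates, n_needed=5)
```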
Transform
• Convert SMILES from the database into chemical structures
• Clean up the chemical structures
• Generate five chemical fingerprints for each structure:
– Morgan 3 counts (ECFC6), 4K “bits”
– Morgan 3 (ECFP6), 4K bits
– Morgan 2 (ECFP4), 2K bits
– RDKit FP, length 1-5, 2K bits
– Atom pairs, distances 1-20, 4K bits
http://www.rdkit.org
Learn and Score
10 different stratified random training/holdout splits are generated for each assay.
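One way to produce such splits is sketched below with the standard library only. The 20% holdout fraction and the use of the repeat number as random seed are assumptions for illustration; the slides give neither. `stratified_split` is a hypothetical helper, not a KNIME node.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], holdout_fraction: float, seed: int
) -> Tuple[List[int], List[int]]:
    """Split row indices so each class keeps about the same holdout fraction."""
    rng = random.Random(seed)
    train: List[int] = []
    holdout: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_hold = round(len(idx) * holdout_fraction)
        holdout.extend(idx[:n_hold])
        train.extend(idx[n_hold:])
    return sorted(train), sorted(holdout)


# Toy assay with the 1:10 active:inactive ratio described earlier.
labels = [1] * 10 + [0] * 100

# Ten repeats, each with its own seed so the splits are reproducible.
splits = [stratified_split(labels, holdout_fraction=0.2, seed=s) for s in range(10)]
```

Fixing the seed per repeat keeps the splits reproducible, which matters if the whole process is to be rerun on a later ChEMBL release.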
Learn
Learning:
• Fingerprint Bayes (NB)
• Logistic Regression (LR)
• Random Forest (RF): 200 trees, max depth=10, min_leaf_size=3, min_node_size=6
• Gradient Boosting (H2O): 100 trees, max_depth = 5, learning_rate = 0.05
Model Selection:
• Pick the best model based on enrichment factor at 5% (EF5)
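For reference, the enrichment factor at 5% compares the hit rate among the top-scored 5% of compounds with the hit rate in the whole set. The sketch below uses that common definition; how the top-5% cutoff is rounded and how tied scores are ordered are assumptions, and the talk does not say which convention was used.

```python
from typing import Sequence


def enrichment_factor(
    scores: Sequence[float], labels: Sequence[int], fraction: float = 0.05
) -> float:
    """EF = (active rate in the top `fraction` by score) / (active rate overall)."""
    n = len(scores)
    n_actives = sum(labels)
    if n == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Assumption: the cutoff is rounded to the nearest whole compound, minimum 1.
    n_top = max(1, round(n * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(label for _, label in ranked[:n_top])
    return (hits_in_top / n_top) / (n_actives / n)


# 100 compounds, 10 actives; the actives get the highest scores.
labels = [1] * 10 + [0] * 90
good_scores = list(range(100, 0, -1))
bad_scores = list(range(1, 101))

ef5_good = enrichment_factor(good_scores, labels)
ef5_bad = enrichment_factor(bad_scores, labels)
```

With one active per ten compounds in this toy set, the top 5% can at best consist only of actives, which caps EF5 at 10.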
Where did these parameters come from?
Parameter Optimization
• Full parameter optimization done for each
method+fingerprint on 70 assays
• Results used to pick “standard” parameter
sets:
– Random Forest: 200 trees, max depth=10,
min_leaf_size=3, min_node_size=6
– Gradient Boosting: 100 trees, max_depth = 5,
learning_rate = 0.05
The optimization and model selection workflow is presented in detail in Daria’s KNIME blog post:
https://www.knime.com/blog/stuck-in-the-nine-circles-of-hell-try-parameter-optimization-a-cup-of-tea
The workflow is available in the EXAMPLES folder inside KNIME:
04_Analytics/11_Optimization/08_Model_Optimization_and_Selection
Making it all run
Execution
• In total >310K models were built [1]
[1] ~1550 assays * 4 methods * 5 FPs * 10 repeats
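The footnote's arithmetic is easy to check. The figure of 1556 is the assay count quoted on the "Extracting the data" slide; using it here assumes every one of those assays went through the full run, which the slides do not state.

```python
n_methods = 4        # NB, LR, RF, gradient boosting
n_fingerprints = 5   # the five fingerprints from the Transform step
n_repeats = 10       # training/holdout splits per assay

models_per_assay = n_methods * n_fingerprints * n_repeats

approx_total = 1550 * models_per_assay   # the "~1550 assays" from the footnote
full_total = 1556 * models_per_assay     # assumes all 1556 assays were modelled

print(models_per_assay, approx_total, full_total)
```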
• KNIME Analytics Platform: build/test workflows
• KNIME Server: run model factory
• Distributed executors (65-70, load-balanced): run individual assays
Are the models any good?
Performance on validation sets
• AUC: mean=0.958, s=0.070
• Cohen’s kappa: mean=0.690, s=0.382
Yeah! Uh oh…
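As a reminder of what the two numbers measure, here is a dependency-free sketch of both metrics. AUC depends only on how the scores rank actives against inactives, while Cohen's kappa is computed from thresholded class predictions, so the two can disagree. The toy data and the 0.5 threshold are illustrative assumptions and not values from the talk.

```python
from typing import List, Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random active outscores a random inactive (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def cohens_kappa(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Agreement between predicted and actual classes, corrected for chance."""
    pred: List[int] = list(predicted)
    act: List[int] = list(actual)
    n = len(act)
    observed = sum(p == a for p, a in zip(pred, act)) / n
    expected = sum(
        (pred.count(cls) / n) * (act.count(cls) / n) for cls in set(pred) | set(act)
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one class is present")
    return (observed - expected) / (1.0 - expected)


# Toy case: the ranking is perfect, but no score reaches the 0.5 threshold,
# so every compound is predicted inactive.
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.4, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]

auc = roc_auc(scores, actual)             # ranking quality only
kappa = cohens_kappa(predicted, actual)   # quality of the thresholded calls
```

In this toy case the AUC is perfect while kappa is zero, which shows that a high mean AUC does not by itself guarantee useful class predictions.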
An experiment to check model generalizability
• Pick assays where standard_type is Ki
• Group them by target ID
• Limit to targets where Ki was measured in at least 5
assays -> 11 targets, 73 assays
• Use the model built on one assay from a target ID to
predict activity across the other assays.
• The targets:
TargetID      Name                                           Num Assays
CHEMBL205     Carbonic anhydrase II                          7
CHEMBL224     Serotonin 2a (5-HT2a) receptor                 8
CHEMBL234     Dopamine D3 receptor                           10
CHEMBL243     Human immunodeficiency virus type 1 protease   6
CHEMBL244     Coagulation factor X                           5
CHEMBL253     Cannabinoid CB2 receptor                       7
CHEMBL281     Carbonic anhydrase IV                          5
CHEMBL3371    Serotonin 6 (5-HT6) receptor                   8
CHEMBL344     Melanin-concentrating hormone receptor 1       5
CHEMBL4550    5-lipoxygenase activating protein              5
CHEMBL4908    Trace amine-associated receptor 1              7
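The size of this experiment follows from the table. The sketch below enumerates train/test pairs under the assumption that every assay takes a turn as the training assay and is tested against each of the other assays for the same target; the slides do not give the resulting number of comparisons, so the pair count is derived here and not quoted. The assay IDs in the small example are placeholders.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

# Number of Ki assays per target, copied from the table above.
assays_per_target: Dict[str, int] = {
    "CHEMBL205": 7, "CHEMBL224": 8, "CHEMBL234": 10, "CHEMBL243": 6,
    "CHEMBL244": 5, "CHEMBL253": 7, "CHEMBL281": 5, "CHEMBL3371": 8,
    "CHEMBL344": 5, "CHEMBL4550": 5, "CHEMBL4908": 7,
}


def cross_assay_pairs(assay_ids: Sequence[str]) -> List[Tuple[str, str]]:
    """All ordered (train_assay, test_assay) pairs within one target."""
    return list(permutations(assay_ids, 2))


total_targets = len(assays_per_target)
total_assays = sum(assays_per_target.values())
# Assumption: every assay is used once as the training assay.
total_pairs = sum(n * (n - 1) for n in assays_per_target.values())

# Placeholder assay IDs, for illustration only.
example_pairs = cross_assay_pairs(["assay_1", "assay_2", "assay_3"])
```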
[Figure panels: Carbonic Anhydrase IV, Carbonic Anhydrase II, HIV Protease, Factor X, 5-HT6, TAAR1]
An Example
Target: CHEMBL3371 (5-HT6)
Train on Assay ID: 448716
Test with Assay ID    AUROC    EF5
1366806               0.38     0
659849                0.99     8.8
1528679               0.83     0.4
Intermediate conclusion
• Many/most of the models have likely overfit the
training data
• Alternative interpretation: we’ve actually built
models to predict whether or not a compound is
taken from a particular paper
• Unfortunately these are functionally the same if you
want to predict activity
Look for frequent algorithm + fingerprint combinations
• For each of the ~1550 assays * 4 learning algorithms
* 10 repeats, look at which fingerprint performed
best (as measured by EF5)
Which method/FP pair is best for each assay?
• For each of the ~1550 assays * 10 repeats, look at which algorithm + fingerprint performed best (as measured by EF5 [1], AUC [2], and algorithm complexity [3])
[1] Rounded to 1 decimal place
[2] Rounded to 2 decimal places
[3] Random Forest > Gradient Boosting > Fingerprint Bayes > Logistic Regression
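A selection rule of this kind can be expressed as a sort key. The sketch below assumes the three criteria are applied in the order listed (rounded EF5 first, then rounded AUC, then the algorithm ranking as a tie-breaker) and that the ranking is read left to right as "preferred over". Both readings are assumptions; the slide does not spell them out, and reversing `PREFERENCE` gives the other reading. The result records are made up for illustration.

```python
from typing import Dict, List

# Assumption: leftmost algorithm wins a tie. Reverse the list for the opposite reading.
PREFERENCE: List[str] = [
    "Random Forest",
    "Gradient Boosting",
    "Fingerprint Bayes",
    "Logistic Regression",
]


def selection_key(result: Dict) -> tuple:
    """Compare by rounded EF5, then rounded AUC, then algorithm preference."""
    return (
        round(result["ef5"], 1),
        round(result["auc"], 2),
        -PREFERENCE.index(result["method"]),
    )


def pick_best(results: List[Dict]) -> Dict:
    return max(results, key=selection_key)


# Made-up results for one assay and one repeat.
results = [
    {"method": "Logistic Regression", "fp": "ECFP4", "ef5": 8.84, "auc": 0.951},
    {"method": "Random Forest", "fp": "ECFP6", "ef5": 8.76, "auc": 0.949},
    {"method": "Fingerprint Bayes", "fp": "Atom pairs", "ef5": 7.9, "auc": 0.990},
]
best = pick_best(results)
```

Rounding before comparing means that small metric differences are treated as ties and settled by the next criterion.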
Wrapping up
• We have automated the construction and evaluation
of >1500 models for bioassays using data pulled
from ChEMBL
• We’ve got some strong evidence that the models
themselves are significantly overfit
• We were able to start to draw some general
conclusions about fingerprints and methods
There’s still a lot left to do
• Verify the repeatability of the process by updating
when the next version of ChEMBL is released
• Put more thought into combining assays to get
around the “one series per paper” problem
• Look into doing the full optimization run
• Come up with a good way of presenting the
predictions
More details…
• Model process factory blog post: https://goo.gl/LvESqB
• Model process factory white paper: https://goo.gl/d6UpUu
• Model process factory workflow: knime://EXAMPLES/50_Applications/26_Model_Process_Management
• Daria’s blog post on the model optimization workflow: https://www.knime.com/blog/stuck-in-the-nine-circles-of-hell-try-parameter-optimization-a-cup-of-tea
• Accompanying workflow: knime://EXAMPLES/04_Analytics/11_Optimization/08_Model_Optimization_and_Selection
• When we’re done cleaning up, there will be a blog post/sample workflow for the monster model factory too.
7th RDKit UGM: 19 - 21 September
• Hosted by Andreas Bender, Cambridge University
• Free registration: https://goo.gl/VVvHUH (or get it on http://www.rdkit.org)
KNIME Fall Summit 2018
November 6 – 9 at AT&T Executive Education and
Conference Center, Austin, Texas
• Tuesday & Wednesday: One-day courses
• Thursday & Friday: Summit sessions
Use the code
ICCS-2018
for 10% off tickets.
Register at:
knime.com/fall-summit2018
The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by
KNIME.com AG under license from KNIME GmbH, and are registered in the United States.
KNIME® is also registered in Germany.

More Related Content

What's hot (20)

PDF
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
InfluxData
 
PDF
Case Studies in advanced analytics with R
Wit Jakuczun
 
PDF
Know your R usage workflow to handle reproducibility challenges
Wit Jakuczun
 
PPTX
Developing Your Own Flux Packages by David McKay | Head of Developer Relation...
InfluxData
 
PPTX
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Revolution Analytics
 
PDF
Clearing Airflow Obstructions
Tatiana Al-Chueyr
 
PDF
Performance Co-Pilot
YOSHIKAWA Ryota
 
PPTX
HDF Kita Lab: JupyterLab + HDF Service
The HDF-EOS Tools and Information Center
 
PDF
Managing large (and small) R based solutions with R Suite
Wit Jakuczun
 
PPTX
Raster Algebra mit Oracle Spatial und uDig
Karin Patenge
 
PDF
RIPE Atlas
RIPE NCC
 
PDF
OPTIMIZING THE TICK STACK
InfluxData
 
PPTX
OpenACC Highlights - February
NVIDIA
 
PDF
Migrating PostgreSQL to the Cloud
Mike Fowler
 
PPTX
Helix Nebula the Science Cloud: Pre-Commercial Procurement pilot
Helix Nebula The Science Cloud
 
PDF
Deploying MariaDB for HA on Google Cloud Platform
MariaDB plc
 
PDF
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Sonja Schweigert
 
PDF
Make your PySpark Data Fly with Arrow!
Databricks
 
PDF
Scossu gdi iiif_r+d_report_2019
Stefano Cossu
 
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
InfluxData
 
Case Studies in advanced analytics with R
Wit Jakuczun
 
Know your R usage workflow to handle reproducibility challenges
Wit Jakuczun
 
Developing Your Own Flux Packages by David McKay | Head of Developer Relation...
InfluxData
 
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Revolution Analytics
 
Clearing Airflow Obstructions
Tatiana Al-Chueyr
 
Performance Co-Pilot
YOSHIKAWA Ryota
 
HDF Kita Lab: JupyterLab + HDF Service
The HDF-EOS Tools and Information Center
 
Managing large (and small) R based solutions with R Suite
Wit Jakuczun
 
Raster Algebra mit Oracle Spatial und uDig
Karin Patenge
 
RIPE Atlas
RIPE NCC
 
OPTIMIZING THE TICK STACK
InfluxData
 
OpenACC Highlights - February
NVIDIA
 
Migrating PostgreSQL to the Cloud
Mike Fowler
 
Helix Nebula the Science Cloud: Pre-Commercial Procurement pilot
Helix Nebula The Science Cloud
 
Deploying MariaDB for HA on Google Cloud Platform
MariaDB plc
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Sonja Schweigert
 
Make your PySpark Data Fly with Arrow!
Databricks
 
Scossu gdi iiif_r+d_report_2019
Stefano Cossu
 

Similar to How Do You Build and Validate 1500 Models and What Can You Learn from Them? (20)

PDF
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIMESlides
 
PDF
From raw data to deployment
KNIMESlides
 
PPTX
From Raw Data to Deployment
KNIMESlides
 
PDF
Python tutorial for ML
Bin Han
 
PDF
KNIME Data Science Learnathon: From Raw Data To Deployment
KNIMESlides
 
PDF
Interactive and reproducible data analysis with the open-source KNIME Analyti...
Greg Landrum
 
PDF
ODSC data science to DataOps
Christopher Bergh
 
PDF
Fri benghiat gil-odsc-data-kitchen-data science to dataops
DataKitchen
 
PPTX
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
Alok Singh
 
PPTX
Webinar: Deep Learning Pipelines Beyond the Learning
Mesosphere Inc.
 
PDF
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Databricks
 
PDF
Master the RETE algorithm
Masahiko Umeno
 
PDF
Your Flight is Boarding Now!
MeetupDataScienceRoma
 
PDF
Visionaize - Upstream-Midstream-Downstream Use Cases.pdf
SumantaBasu12
 
PDF
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
NETWAYS
 
PDF
Sharing and Deploying Data Science with KNIME Server
KNIMESlides
 
PDF
Industrial Algorithms
Alkis Vazacopoulos
 
PDF
Kbc Petro-SIM
Maicon Cruz
 
PDF
The Digital Twin For Production Optimization
Yokogawa1
 
PPTX
Charles sonigo - Demuxed 2018 - How to be data-driven when you aren't Netflix...
Charles Sonigo
 
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIMESlides
 
From raw data to deployment
KNIMESlides
 
From Raw Data to Deployment
KNIMESlides
 
Python tutorial for ML
Bin Han
 
KNIME Data Science Learnathon: From Raw Data To Deployment
KNIMESlides
 
Interactive and reproducible data analysis with the open-source KNIME Analyti...
Greg Landrum
 
ODSC data science to DataOps
Christopher Bergh
 
Fri benghiat gil-odsc-data-kitchen-data science to dataops
DataKitchen
 
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
Alok Singh
 
Webinar: Deep Learning Pipelines Beyond the Learning
Mesosphere Inc.
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Databricks
 
Master the RETE algorithm
Masahiko Umeno
 
Your Flight is Boarding Now!
MeetupDataScienceRoma
 
Visionaize - Upstream-Midstream-Downstream Use Cases.pdf
SumantaBasu12
 
OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolki...
NETWAYS
 
Sharing and Deploying Data Science with KNIME Server
KNIMESlides
 
Industrial Algorithms
Alkis Vazacopoulos
 
Kbc Petro-SIM
Maicon Cruz
 
The Digital Twin For Production Optimization
Yokogawa1
 
Charles sonigo - Demuxed 2018 - How to be data-driven when you aren't Netflix...
Charles Sonigo
 
Ad

More from Greg Landrum (13)

PDF
Chemical registration
Greg Landrum
 
PDF
Mike Lynch Award Lecture, ICCS 2022
Greg Landrum
 
PDF
Google BigQuery for analysis of scientific datasets: Interactive exploration ...
Greg Landrum
 
PDF
Building useful models for imbalanced datasets (without resampling)
Greg Landrum
 
PDF
Building useful models for imbalanced datasets (without resampling)
Greg Landrum
 
PDF
Is one enough? Data warehousing for biomedical research
Greg Landrum
 
PDF
Some "challenges" on the open-source/open-data front
Greg Landrum
 
PDF
Large scale classification of chemical reactions from patent data
Greg Landrum
 
PDF
Machine learning in the life sciences with knime
Greg Landrum
 
PDF
Open-source from/in the enterprise: the RDKit
Greg Landrum
 
PDF
Open-source tools for querying and organizing large reaction databases
Greg Landrum
 
PDF
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Greg Landrum
 
PDF
Reproducibility in cheminformatics and computational chemistry research: cert...
Greg Landrum
 
Chemical registration
Greg Landrum
 
Mike Lynch Award Lecture, ICCS 2022
Greg Landrum
 
Google BigQuery for analysis of scientific datasets: Interactive exploration ...
Greg Landrum
 
Building useful models for imbalanced datasets (without resampling)
Greg Landrum
 
Building useful models for imbalanced datasets (without resampling)
Greg Landrum
 
Is one enough? Data warehousing for biomedical research
Greg Landrum
 
Some "challenges" on the open-source/open-data front
Greg Landrum
 
Large scale classification of chemical reactions from patent data
Greg Landrum
 
Machine learning in the life sciences with knime
Greg Landrum
 
Open-source from/in the enterprise: the RDKit
Greg Landrum
 
Open-source tools for querying and organizing large reaction databases
Greg Landrum
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Greg Landrum
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Greg Landrum
 
Ad

Recently uploaded (20)

PDF
The-Origin- of -Metazoa-vertebrates .ppt
S.B.P.G. COLLEGE BARAGAON VARANASI
 
PPTX
GB1 Q1 04 Life in a Cell (1).pptx GRADE 11
JADE ACOSTA
 
PDF
Primordial Black Holes and the First Stars
Sérgio Sacani
 
PDF
RODENT PEST MANAGEMENT-converted-compressed.pdf
S.B.P.G. COLLEGE BARAGAON VARANASI
 
PPTX
Pratik inorganic chemistry silicon based ppt
akshaythaker18
 
PPTX
Animal Reproductive Behaviors Quiz Presentation in Maroon Brown Flat Graphic ...
LynetteGaniron1
 
PDF
A proposed mechanism for the formation of protocell-like structures on Titan
Sérgio Sacani
 
PDF
GK_GS One Liner For Competitive Exam.pdf
abhi01nm
 
PDF
NRRM 330 Dynamic Equlibrium Presentation
Rowan Sales
 
PDF
Pharma Part 1.pdf #pharmacology #pharmacology
hikmatyt01
 
PDF
Annual report 2024 - Inria - English version.pdf
Inria
 
PDF
Carbon-richDustInjectedintotheInterstellarMediumbyGalacticWCBinaries Survives...
Sérgio Sacani
 
PPTX
Envenomation AND ANIMAL BITES DETAILS.pptx
HARISH543351
 
PPTX
formations-of-rock-layers-grade 11_.pptx
GraceSarte
 
PPTX
How to write a research paper July 3 2025.pptx
suneeta panicker
 
PPTX
Different formulation of fungicides.pptx
MrRABIRANJAN
 
PDF
Step-by-Step Guide: How mRNA Vaccines Works
TECNIC
 
PPT
Cell cycle,cell cycle checkpoint and control
DrMukeshRameshPimpli
 
PPTX
Immunopharmaceuticals and microbial Application
xxkaira1
 
PPTX
Cooking Oil Tester How to Measure Quality of Frying Oil.pptx
M-Kube Enterprise
 
The-Origin- of -Metazoa-vertebrates .ppt
S.B.P.G. COLLEGE BARAGAON VARANASI
 
GB1 Q1 04 Life in a Cell (1).pptx GRADE 11
JADE ACOSTA
 
Primordial Black Holes and the First Stars
Sérgio Sacani
 
RODENT PEST MANAGEMENT-converted-compressed.pdf
S.B.P.G. COLLEGE BARAGAON VARANASI
 
Pratik inorganic chemistry silicon based ppt
akshaythaker18
 
Animal Reproductive Behaviors Quiz Presentation in Maroon Brown Flat Graphic ...
LynetteGaniron1
 
A proposed mechanism for the formation of protocell-like structures on Titan
Sérgio Sacani
 
GK_GS One Liner For Competitive Exam.pdf
abhi01nm
 
NRRM 330 Dynamic Equlibrium Presentation
Rowan Sales
 
Pharma Part 1.pdf #pharmacology #pharmacology
hikmatyt01
 
Annual report 2024 - Inria - English version.pdf
Inria
 
Carbon-richDustInjectedintotheInterstellarMediumbyGalacticWCBinaries Survives...
Sérgio Sacani
 
Envenomation AND ANIMAL BITES DETAILS.pptx
HARISH543351
 
formations-of-rock-layers-grade 11_.pptx
GraceSarte
 
How to write a research paper July 3 2025.pptx
suneeta panicker
 
Different formulation of fungicides.pptx
MrRABIRANJAN
 
Step-by-Step Guide: How mRNA Vaccines Works
TECNIC
 
Cell cycle,cell cycle checkpoint and control
DrMukeshRameshPimpli
 
Immunopharmaceuticals and microbial Application
xxkaira1
 
Cooking Oil Tester How to Measure Quality of Frying Oil.pptx
M-Kube Enterprise
 

How Do You Build and Validate 1500 Models and What Can You Learn from Them?

  • 1. © 2018 KNIME AG. All Rights Reserved. How Do You Build and Validate 1500 Models and What Can You Learn from Them? Greg Landrum*, Anna Martin, Daria Goldmann KNIME AG 2018 ICCS @dr_greg_landrum
  • 2. © 2018 KNIME AG. All Rights Reserved. The Monster Model Factory Greg Landrum*, Anna Martin, Daria Goldmann KNIME AG 2018 ICCS @dr_greg_landrum
  • 3. © 2018 KNIME AG. All Rights Reserved. 3 Who cares? • I have >1500 datasets from ChEMBL that I would like to build models for • I want to actually use the models, so they need to be deployed • The whole process needs to be automated and reproducible so that I can do it again when ChEMBL is updated • Maybe we can learn something interesting from the models themselves
  • 4. 4© 2018 KNIME AG. All Rights Reserved. Back to the beginning
  • 5. © 2018 KNIME AG. All Rights Reserved. 5 The model process Image from: https://upload.wikimedia.org/wikipedia/commons /b/b9/CRISP-DM_Process_Diagram.png CRISP-DM (CRoss Industry Standard Process for Data Mining) is a standard process for data mining solutions. wikipedia://CRISP-DM
  • 6. © 2018 KNIME AG. All Rights Reserved. 6 The model process Image from: https://upload.wikimedia.org/wiki pedia/commons/b/b9/CRISP- DM_Process_Diagram.png Init Load Transform Learn Score Evaluate Deploy
  • 7. © 2018 KNIME AG. All Rights Reserved. 7 The model process, multiple models …
  • 8. © 2018 KNIME AG. All Rights Reserved. 8 The model process, multiple models …
  • 9. © 2018 KNIME AG. All Rights Reserved. 9 The model process, multiple models … https://commons.wikimedia.org/wiki/File:Jabberwocky.jpg
  • 10. © 2018 KNIME AG. All Rights Reserved. 10 The model process, multiple models … It’s not feasible to manually do this for a daunting number of models! https://commons.wikimedia.org/wiki/File:Jabberwocky.jpg
  • 11. 11© 2018 KNIME AG. All Rights Reserved. https://www.publicdomainpictures.net/view-image.php?image=155188
  • 12. © 2018 KNIME AG. All Rights Reserved. 12 Automation: the model process factory
  • 13. © 2018 KNIME AG. All Rights Reserved. 13 Init Load Transform Learn Score Evaluate Deploy Automation: the model process factory Score EvaluateTransform DeployLoad Learn Score Learn Load Transform Evaluate Deploy Score EvaluateTransform DeployLoad Learn Score Learn Load Transform Evaluate Deploy Make each step a separate workflow. Use KNIME to orchestrate calling those workflows KNIME blog post: https://goo.gl/LvESqB White paper: https://goo.gl/d6UpUu
  • 14. © 2018 KNIME AG. All Rights Reserved. 14 Model Factory Init Load Transform Learn Score Evaluate Deploy
  • 15. © 2018 KNIME AG. All Rights Reserved. 15 The heart of the factory: Call Local Workflow1 • Executes another workflow in the same local repository https://pixabay.com/en/heart-veins-arteries-anatomy-152594/ 1 Call Remote Workflow when run on the KNIME Server
  • 16. © 2018 KNIME AG. All Rights Reserved. 16 Model Factory Init Load Transform Learn Score Evaluate Deploy
  • 17. © 2018 KNIME AG. All Rights Reserved. 17 Model Factory Init Load Transform Learn Score Evaluate Deploy
  • 18. 18© 2018 KNIME AG. All Rights Reserved. Details
  • 19. © 2018 KNIME AG. All Rights Reserved. 19 Extracting the data • Data source: ChEMBL 23 • Activity types: ('GI50', 'IC50', 'Ki', 'MIC', 'EC50', 'AC50', 'ED50', 'GI', 'Kd', 'CC50', 'LC50', 'MIC90', 'MIC50', 'ID50’) -> 6.5 million points • Define active: Standard_value < 100nM -> 1.3 million actives • Define inactive: Standard_value > 1uM • Define an interesting assay At least 50 actives -> 1556 assays • Final dataset size: 2.5 million data points, 1.5 million compounds Init Load Transform Learn Score Evaluate Deploy
  • 20. © 2018 KNIME AG. All Rights Reserved. 20 Init Load Transform Learn Score Evaluate DeployFinding more inactives • The ChEMBL datasets almost all have an unrealistically high ratio of actives to inactives • “Fix” that by adding enough assumed inactives to each dataset to get a 1:10 active:inactive ratio • Pick those assumed inactives to be roughly similar to the actives: Tanimoto similarity of between 0.35 and 0.6 using RDKit Morgan 2 fingerprints
  • 21. © 2018 KNIME AG. All Rights Reserved. 21 Extracting the data Init Load Transform Learn Score Evaluate Deploy
  • 22. © 2018 KNIME AG. All Rights Reserved. 22 Transform • Convert SMILES from database into chemical structures • Cleanup the chemical structures Init Load Transform Learn Score Evaluate Deploy http://www.rdkit.org
  • 23. © 2018 KNIME AG. All Rights Reserved. 23 Transform • Convert SMILES from database into chemical structures • Cleanup the chemical structures • Generate five chemical fingerprints for each structure Init Load Transform Learn Score Evaluate Deploy http://www.rdkit.org
  • 24. © 2018 KNIME AG. All Rights Reserved. 24 Transform • Convert SMILES from database into chemical structures • Cleanup the chemical structures • Generate five chemical fingerprints for each structure – Morgan 3 counts (ECFC6), 4K “bits” – Morgan 3 (ECFP6), 4K bits – Morgan 2 (ECFP4), 2K bits – RDKit FP, length 1-5, 2K bits – Atom pairs, distances 1-20, 4K bits Init Load Transform Learn Score Evaluate Deploy http://www.rdkit.org
  • 25. © 2018 KNIME AG. All Rights Reserved. 25 Learn and Score Init Load Transform Learn Score Evaluate Deploy 10 different stratified random training/holdout splits generated for each assay
  • 26. © 2018 KNIME AG. All Rights Reserved. 26 Learn Init Load Transform Learn Score Evaluate Deploy Learning: • Fingerprint Bayes (NB) • Logistic Regression (LR) • Random Forest (RF) 200 trees, max depth=10, min_leaf_size=3, min_node_size=6 • Gradient Boosting (H2O) 100 trees, max_depth = 5, learning_rate = 0.05 Model Selection: • Pick best model based on Enrichment factor at 5% (EF5)
  • 27. © 2018 KNIME AG. All Rights Reserved. 27 Learn Init Load Transform Learn Score Evaluate Deploy Where did these parameters come from? Learning: • Fingerprint Bayes (NB) • Logistic Regression (LR) • Random Forest (RF) 200 trees, max depth=10, min_leaf_size=3, min_node_size=6 • Gradient Boosting (H2O) 100 trees, max_depth = 5, learning_rate = 0.05
  • 28. © 2018 KNIME AG. All Rights Reserved. 28 Parameter Optimization Init Load Transform Learn Score Evaluate Deploy • Full parameter optimization done for each method+fingerprint on 70 assays • Results used to pick “standard” parameter sets: – Random Forest: 200 trees, max depth=10, min_leaf_size=3, min_node_size=6 – Gradient Boosting: 100 trees, max_depth = 5, learning_rate = 0.05
  • 29. © 2018 KNIME AG. All Rights Reserved. 29 Parameter Optimization Init Load Transform Learn Score Evaluate Deploy
  • 30. © 2018 KNIME AG. All Rights Reserved. 30 Parameter Optimization Init Load Transform Learn Score Evaluate Deploy The optimization and model selection workflow is presented in detail in Daria’s KNIME blog post: https://www.knime.com/blog/stuck-in-the-nine-circles-of-hell-try-parameter- optimization-a-cup-of-tea The workflow is available in the EXAMPLES folder inside KNIME: 04_Analytics/11_Optimization/08_Model_Optimization_and_Selection
  • 31. Making it all run
  • 32. Execution
      • In total >310K models were built (~1550 assays * 4 methods * 5 FPs * 10 repeats)
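The model count follows directly from the size of the experimental grid. A short sketch of that arithmetic is below. The method and fingerprint names are taken from the earlier slides, and `N_ASSAYS = 1550` uses the approximate figure from the slide, so the total is approximate as well.

```python
from itertools import product

METHODS = [
    "Fingerprint Bayes",
    "Logistic Regression",
    "Random Forest",
    "Gradient Boosting",
]
FINGERPRINTS = [
    "Morgan 3 counts (ECFC6)",
    "Morgan 3 (ECFP6)",
    "Morgan 2 (ECFP4)",
    "RDKit FP",
    "Atom pairs",
]
N_REPEATS = 10   # stratified training/holdout splits per assay
N_ASSAYS = 1550  # approximate; the slide says "~1550"

# Every (method, fingerprint, repeat) combination is one model for an assay.
per_assay = list(product(METHODS, FINGERPRINTS, range(N_REPEATS)))
total_models = N_ASSAYS * len(per_assay)
```

That is 200 models per assay and about 310,000 models in total, which is why the run is spread over distributed executors.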
  • 33. Execution
      [Diagram: KNIME Analytics Platform, KNIME Server, and distributed executors; labels: "Build/test workflows", "Run model factory", "Run individual assays"]
      • 65-70 load-balanced distributed executors
  • 34. Are the models any good?
  • 35. Performance on validation sets
      • AUC: mean=0.958 s=0.070
      • Cohen’s kappa: mean=0.690 s=0.382
  • 36. Performance on validation sets
      • AUC: mean=0.958 s=0.070
      • Cohen’s kappa: mean=0.690 s=0.382
      Yeah!
  • 37. Performance on validation sets
      • AUC: mean=0.958 s=0.070
      • Cohen’s kappa: mean=0.690 s=0.382
      Yeah!
      Uh oh…
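For reference, a minimal sketch of how the two reported metrics can be computed, using only the standard library. AUC depends only on how the compounds are ranked, while Cohen's kappa depends on the class assignments made at a decision threshold. The thresholds and toy scores below are made up for illustration and are not taken from the talk.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random active outscores a random inactive."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cohens_kappa(predicted, labels):
    """Cohen's kappa for binary predictions (1 = active, 0 = inactive)."""
    n = len(labels)
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal class frequencies.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: kappa is undefined, reported as 0 here
    return (observed - expected) / (1.0 - expected)


# Toy data: the two actives are ranked first, but every score is below 0.5.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.40, 0.30, 0.20, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02, 0.01]

auc = roc_auc(scores, labels)
kappa_at_050 = cohens_kappa([1 if s >= 0.5 else 0 for s in scores], labels)
kappa_at_025 = cohens_kappa([1 if s >= 0.25 else 0 for s in scores], labels)
```

In this toy case the ranking is perfect (AUC = 1), yet kappa is 0 at a threshold of 0.5 and 1 at a threshold of 0.25. The two numbers on the slide therefore measure different things and need not move together.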
  • 38. https://www.publicdomainpictures.net/view-image.php?image=155188
  • 39. An experiment to check model generalizability
      • Pick assays where standard_type is Ki
      • Group them by target ID
      • Limit to targets where Ki was measured in at least 5 assays -> 11 targets, 73 assays
      • Use the model built on one assay from a target ID to predict activity across the other assays for that target
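The selection steps above can be sketched as a small grouping routine. The record layout `(assay_id, target_id, standard_type)`, the function name, and the toy IDs are hypothetical. The real experiment works on ChEMBL data inside KNIME.

```python
from collections import defaultdict
from itertools import permutations


def cross_assay_pairs(assay_records, min_assays=5):
    """Return {target_id: [(train_assay, test_assay), ...]} for Ki assays.

    assay_records is an iterable of (assay_id, target_id, standard_type)
    tuples; this layout is an assumption made for the sketch.
    """
    by_target = defaultdict(list)
    for assay_id, target_id, standard_type in assay_records:
        if standard_type == "Ki":  # keep only Ki assays
            by_target[target_id].append(assay_id)

    pairs = {}
    for target_id, assay_ids in by_target.items():
        if len(assay_ids) >= min_assays:  # targets with at least 5 Ki assays
            # Train on one assay, test on each of the others.
            pairs[target_id] = list(permutations(assay_ids, 2))
    return pairs


# Made-up records: TARGET_A has five Ki assays, TARGET_B only four.
records = [(f"assay_a{i}", "TARGET_A", "Ki") for i in range(5)]
records += [(f"assay_b{i}", "TARGET_B", "Ki") for i in range(4)]
records += [("assay_a_ic50", "TARGET_A", "IC50")]

pairs = cross_assay_pairs(records)
```

Only `TARGET_A` survives the filter, with 5 * 4 = 20 ordered train/test pairs. Ordered pairs are used because a model trained on assay X and tested on assay Y is a different experiment from the reverse.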
  • 40. An experiment to check model generalizability
      • The targets:
          TargetID     Name                                           Num Assays
          CHEMBL205    Carbonic anhydrase II                          7
          CHEMBL224    Serotonin 2a (5-HT2a) receptor                 8
          CHEMBL234    Dopamine D3 receptor                           10
          CHEMBL243    Human immunodeficiency virus type 1 protease   6
          CHEMBL244    Coagulation factor X                           5
          CHEMBL253    Cannabinoid CB2 receptor                       7
          CHEMBL281    Carbonic anhydrase IV                          5
          CHEMBL3371   Serotonin 6 (5-HT6) receptor                   8
          CHEMBL344    Melanin-concentrating hormone receptor 1       5
          CHEMBL4550   5-lipoxygenase activating protein              5
          CHEMBL4908   Trace amine-associated receptor 1              7
  • 41. [Figure panels: Carbonic Anhydrase IV, Carbonic Anhydrase II, HIV Protease, Factor X, 5-HT6, TAAR1]
  • 42. [Figure panels: Carbonic Anhydrase IV, Carbonic Anhydrase II, HIV Protease, Factor X, 5-HT6, TAAR1]
  • 43. An Example
      Target: CHEMBL3371 (5-HT6)
      Train on Assay ID: 448716
      Test with Assay ID: 1366806
      AUROC: 0.38
      EF5: 0
  • 44. An Example [Figure panels: Assay_ID 448716, Assay_ID 1366806]
  • 45. An Example
      Target: CHEMBL3371 (5-HT6)
      Train on Assay ID: 448716
      Test with Assay ID: 659849
      AUROC: 0.99
      EF5: 8.8
  • 46. An Example [Figure panels: Assay_ID 448716, Assay_ID 659849]
  • 47. An Example
      Target: CHEMBL3371 (5-HT6)
      Train on Assay ID: 448716
      Test with Assay ID: 1528679
      AUROC: 0.83
      EF5: 0.4
  • 48. An Example [Figure panels: Assay_ID 448716, Assay_ID 1528679]
  • 49. Intermediate conclusion
      • Many/most of the models have likely overfit the training data
      • Alternative interpretation: we’ve actually built models to predict whether or not a compound is taken from a particular paper
      • Unfortunately these are functionally the same if you want to predict activity
  • 50. https://www.publicdomainpictures.net/view-image.php?image=155188
  • 51. Look for frequent algorithm + fingerprint combinations
      • For each of the ~1550 assays * 4 learning algorithms * 10 repeats, look at which fingerprint performed best (as measured by EF5)
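One way to read this counting step is as a group-by followed by a tally: group results by (assay, algorithm, repeat), pick the fingerprint with the highest EF5 in each group, and count how often each fingerprint wins. The sketch below assumes a flat row layout and uses made-up EF5 values. How ties were handled in the talk is not stated, so here the first row with the maximum wins.

```python
from collections import Counter, defaultdict


def count_best_fingerprints(rows):
    """Tally which fingerprint has the highest EF5 per (assay, algorithm, repeat).

    rows is an iterable of (assay, algorithm, repeat, fingerprint, ef5)
    tuples; this layout is an assumption made for the sketch.
    """
    groups = defaultdict(list)
    for assay, algorithm, repeat, fingerprint, ef5 in rows:
        groups[(assay, algorithm, repeat)].append((ef5, fingerprint))

    winners = Counter()
    for candidates in groups.values():
        # max() keeps the first candidate on ties (an arbitrary choice here).
        _, best_fp = max(candidates, key=lambda c: c[0])
        winners[best_fp] += 1
    return winners


# Made-up results for two assays, one algorithm, one repeat each.
rows = [
    ("assay_1", "RF", 0, "ECFP4", 6.0),
    ("assay_1", "RF", 0, "ECFP6", 7.5),
    ("assay_1", "RF", 0, "Atom pairs", 5.0),
    ("assay_2", "RF", 0, "ECFP4", 4.0),
    ("assay_2", "RF", 0, "ECFP6", 9.0),
    ("assay_2", "RF", 0, "Atom pairs", 9.5),
]

winners = count_best_fingerprints(rows)
```

For the toy rows, `ECFP6` and `Atom pairs` each win one group and `ECFP4` wins none.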
  • 52. Look for frequent algorithm + fingerprint combinations
      For each of the ~1550 assays * 4 learning algorithms * 10 repeats, look at which fingerprint performed best (as measured by EF5)
  • 53. Which method/FP pair is best for each assay?
      • For each of the ~1550 assays * 10 repeats, look at which algorithm + fingerprint performed best (as measured by EF5 [1], AUC [2], and algorithm complexity [3])
      [1] Rounded to 1 decimal place
      [2] Rounded to 2 decimal places
      [3] Random Forest > Gradient Boosting > Fingerprint Bayes > Logistic Regression
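The three-level comparison can be sketched as a sort key: EF5 rounded to one decimal, then AUC rounded to two decimals, then algorithm complexity as the final tie-breaker. The slide gives the ordering Random Forest > Gradient Boosting > Fingerprint Bayes > Logistic Regression but does not say which end is preferred on a tie, so the `prefer_simpler` flag below is an assumption, as are the toy numbers.

```python
# Complexity rank taken from the ordering on the slide (higher = more complex).
COMPLEXITY = {
    "Random Forest": 3,
    "Gradient Boosting": 2,
    "Fingerprint Bayes": 1,
    "Logistic Regression": 0,
}


def select_best(results, prefer_simpler=True):
    """Pick the best result by rounded EF5, then rounded AUC, then complexity.

    results is a list of dicts with keys "method", "fingerprint", "ef5", "auc".
    The tie-break direction is an assumption; flip prefer_simpler to reverse it.
    """
    def sort_key(result):
        complexity = COMPLEXITY[result["method"]]
        return (
            round(result["ef5"], 1),
            round(result["auc"], 2),
            -complexity if prefer_simpler else complexity,
        )

    return max(results, key=sort_key)


# Made-up results for a single assay and repeat.
results = [
    {"method": "Random Forest", "fingerprint": "ECFP6", "ef5": 8.81, "auc": 0.953},
    {"method": "Logistic Regression", "fingerprint": "ECFP4", "ef5": 8.84, "auc": 0.951},
    {"method": "Fingerprint Bayes", "fingerprint": "ECFC6", "ef5": 8.60, "auc": 0.990},
]

best = select_best(results)
```

Rounding makes the first two candidates identical on EF5 (8.8) and AUC (0.95), so the complexity rule decides between them. Without rounding, tiny metric differences would pick the winner and the complexity rule would almost never apply.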
  • 54. Which method/FP pair is best for each assay?
      Select best model using EF5, AUC, algorithm complexity
  • 55. Wrapping up
      • We have automated the construction and evaluation of >1500 models for bioassays using data pulled from ChEMBL
      • We’ve got some strong evidence that the models themselves are significantly overfit
      • We were able to start to draw some general conclusions about fingerprints and methods
  • 56. There’s still a lot left to do
      • Verify the repeatability of the process by updating when the next version of ChEMBL is released
      • Put some more thought into combining assays to get around the “one series per paper” problem
      • Look into doing the full optimization run
      • Come up with a good way of presenting the predictions
  • 57. More details…
      • Model process factory blog post: https://goo.gl/LvESqB
      • Model process factory white paper: https://goo.gl/d6UpUu
      • Model process factory workflow: knime://EXAMPLES/50_Applications/26_Model_Process_Management
      • Daria’s blog post on the model optimization workflow: https://www.knime.com/blog/stuck-in-the-nine-circles-of-hell-try-parameter-optimization-a-cup-of-tea
      • Accompanying workflow: knime://EXAMPLES/04_Analytics/11_Optimization/08_Model_Optimization_and_Selection
      • When we’re done cleaning up, there will be a blog post/sample workflow for the monster model factory too.
  • 58. 7th RDKit UGM: 19 - 21 September
      • Hosted by Andreas Bender, Cambridge University
      • Free registration: https://goo.gl/VVvHUH (or get it on http://www.rdkit.org)
  • 59. KNIME Fall Summit 2018
      November 6 – 9 at AT&T Executive Education and Conference Center, Austin, Texas
      • Tuesday & Wednesday: One-day courses
      • Thursday & Friday: Summit sessions
      Use the code ICCS-2018 for 10% off tickets. Register at: knime.com/fall-summit2018
  • 60. The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH, and are registered in the United States. KNIME® is also registered in Germany.