Bridging the Gap:
From Data Science to Production
Florian Wilhelm EuroPython 2018 @ Edinburgh, 2018-07-25
Special Interests
• Mathematical Modelling
• Recommendation Systems
• Data Science in Production
• Python Data Stack
Dr. Florian Wilhelm
Principal Data Scientist @ inovex
@FlorianWilhelm
FlorianWilhelm
florianwilhelm.info
IT-project house for digital transformation:
‣ Agile Development & Management
‣ Web · UI/UX · Replatforming · Microservices
‣ Mobile · Apps · Smart Devices · Robotics
‣ Big Data & Business Intelligence Platforms
‣ Data Science · Data Products · Search · Deep Learning
‣ Data Center Automation · DevOps · Cloud · Hosting
‣ Trainings & Coachings
Using technology to inspire our
clients. And ourselves.
inovex offices in
Karlsruhe · Cologne · Munich ·
Pforzheim · Hamburg · Stuttgart.
www.inovex.de
Agenda
Many facets of bringing Data Science to Production:
• Use-Case
• Quality Assurance
• Organisation
• Languages
• Deployment
Use-Case: High Level Perspective
What does your model pipeline look like?
Data → Model f(...) → Results
Use-Case: High Level Perspective
What is your Data Source?
Data
Variants:
• Database (PostgreSQL, Cassandra)
• Distributed Filesystem (HDFS)
• Stream (Kafka)
• ...
How is your data accessed?
What are the frequency and recency requirements?
Batch, Near-Realtime, Realtime, Stream?
Use-Case: High Level Perspective
What is a model?
Model
Model includes:
• Preprocessing (cleansing, imputation, scaling)
• Construction of derived features (e.g. exponential moving averages, EMAs)
• Machine Learning Algorithm (Random Forest, ANN)
• ...
Is the input of your model raw data or pregenerated features?
Does your model have a state?
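A minimal sketch of what "model" means on this slide, assuming a toy univariate series: imputation, scaling, an EMA as a derived feature, and a threshold rule standing in for the actual ML algorithm. All names (`ToyModel`, `alpha`, `threshold`) are hypothetical; the point is that preprocessing and state travel with the model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyModel:
    """Bundles preprocessing, a derived feature and a placeholder 'algorithm'."""
    mean: float = 0.0       # learned during fit, used for imputation and scaling
    scale: float = 1.0      # learned during fit
    alpha: float = 0.5      # EMA smoothing factor (assumption)
    threshold: float = 0.0  # stand-in for a trained ML algorithm
    ema: Optional[float] = None  # state that persists between predictions

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.mean = sum(observed) / len(observed)
        variance = sum((v - self.mean) ** 2 for v in observed) / len(observed)
        self.scale = variance ** 0.5 or 1.0  # avoid division by zero
        return self

    def predict(self, raw_value):
        # Preprocessing: impute missing input, then scale.
        value = self.mean if raw_value is None else raw_value
        scaled = (value - self.mean) / self.scale
        # Derived feature: EMA over everything seen so far (this is the state).
        if self.ema is None:
            self.ema = scaled
        else:
            self.ema = self.alpha * scaled + (1 - self.alpha) * self.ema
        # Placeholder for the actual algorithm (Random Forest, ANN, ...).
        return int(self.ema > self.threshold)


model = ToyModel().fit([1.0, 2.0, None, 3.0])
predictions = [model.predict(x) for x in [3.0, None, 1.0]]
```

Because the EMA is kept between calls, the same input can yield different outputs depending on history, which is exactly why the state question matters for deployment.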
Use-Case: High Level Perspective
How is your result stored?
Results
Variants:
• Database (PostgreSQL, Cassandra)
• Distributed Filesystem (HDFS)
• Stream (Kafka)
• On demand (REST API)
• ...
What are the frequency and recency requirements?
Batch, Near-Realtime, Realtime, Stream?
Use-Case: High Level Perspective
Our challenge
Data → Interface → Model → Interface → Results
Deployment of the model to Production
Use-Case Evaluation
Delivery   | Problem Class   | Volume & Velocity | Inference / Prediction | Technical Conditions
WebService | Classification  | 10 GB weekly      | Batch                  | Java-Stack + Python
Stream     | Regression      | 1 GB daily        | Near-Realtime          | On-Premise
Database   | Recommendation  | 10k events/s      | Realtime               | AWS Cloud
           | Explainability? |                   | Stream                 |
Characteristics of a Data Use-case
Note down your specific requirements before thinking about an architecture.
There is no one-size-fits-all solution!
Use-Case: High Level Perspective
Right from the Start
• State the requirements of your data use-case
• Identify and check data sources
• Define interfaces with other teams/departments
• Test the whole data flow and infrastructure early on with a dummy model
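The last point can be sketched as follows. This is an illustration under assumptions, not the speaker's setup: `DummyModel`, `run_pipeline` and the in-memory source and sink are hypothetical stand-ins for the real interfaces.

```python
import random


class DummyModel:
    """Stands in for the real model while the data flow is being wired up."""

    version = "0.0.1-dummy"  # hypothetical version label

    def __init__(self, seed=42):
        self._rng = random.Random(seed)

    def predict(self, records):
        # Return one score per record; no learning involved.
        return [self._rng.random() for _ in records]


def run_pipeline(read, model, write):
    """Read -> predict -> write; `read` and `write` wrap the real interfaces."""
    records = read()
    scores = model.predict(records)
    write([
        {"id": r["id"], "score": s, "model_version": model.version}
        for r, s in zip(records, scores)
    ])
    return len(records)


# In-memory stand-ins for the real source (e.g. a database) and sink.
source = [{"id": 1}, {"id": 2}, {"id": 3}]
sink = []
n_processed = run_pipeline(lambda: source, DummyModel(), sink.extend)
```

Once this runs end to end against the real source and sink, the dummy can be swapped for the actual model without touching the surrounding infrastructure.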
It's an Iterative Process
Quality Assurance for smooth iterations
CRISP-DM cycle (centred on the data): Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
http://clean-code-developer.de/
Clean Code
Clean code is code that is easy to understand and easy to change.
Resources:
• Software Design Patterns
• SOLID Principles
• The Pragmatic Programmer
• The Software Craftsman
https://www.linkedin.com/pulse/5-pro-tips-data-scientists-write-good-code-jason-byrne/
https://huddle.eurostarsoftwaretesting.com/4-ways-automation-and-ci-are-changing-testing-and-development/
Continuous Integration
• Version, package and manage your artefacts
• Provide tests (unit, system, ...)
• Automate as much as possible
• Embrace processes
Monitoring
KPI and Stats
• KPIs (CTR, Conversions)
• Number of requests
• Timeouts, delays
• Total number of predictions
• Runtimes
• ...
All monitoring needs to be linked to the currently running version of your model!
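One way to link monitoring to the model version is to attach the version to every log record. The sketch below assumes a hypothetical `MODEL_VERSION` constant and a plain callable as the model; it uses only the standard library.

```python
import json
import logging
import time

MODEL_VERSION = "1.2.3"  # hypothetical; would come from the packaged model

logger = logging.getLogger("model_monitoring")


def monitored_predict(predict, features):
    """Call `predict` and emit one structured log record per request."""
    start = time.perf_counter()
    prediction = predict(features)
    record = {
        "model_version": MODEL_VERSION,
        "runtime_ms": round((time.perf_counter() - start) * 1000, 3),
        "prediction": prediction,
    }
    logger.info(json.dumps(record))
    return prediction, record


prediction, record = monitored_predict(lambda f: sum(f) > 1.0, [0.4, 0.9])
```

Counts, runtimes and KPIs can then be aggregated per `model_version` in whatever monitoring system is in use.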
https://landing.google.com/sre/book/chapters/part3.html
Monitoring
Site Reliability Engineering: How Google Runs Production Systems
@Google
Lucas Javier Bernardi | Diagnosing Machine Learning Models: https://www.youtube.com/watch?v=ZD8LA3n6YvI
Monitoring
Model Stats
Monitor the results of your model's predictions
Example: Response Distribution Analysis
a) working model
b) confused model
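A response distribution analysis can be approximated by comparing a histogram of current prediction scores with a reference histogram. The sketch below uses made-up scores and the total variation distance as the comparison measure; both are assumptions, not necessarily what the referenced talk uses.

```python
def histogram(scores, bins=10):
    """Relative frequencies of scores in [0, 1] over equally wide bins."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    return [c / len(scores) for c in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0..1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up scores: a working classifier separates its classes ...
reference = [0.05, 0.11, 0.15, 0.21, 0.81, 0.85, 0.91, 0.95]
working = [0.08, 0.12, 0.18, 0.22, 0.78, 0.82, 0.92, 0.97]
# ... while a confused one piles up around 0.5.
confused = [0.45, 0.48, 0.51, 0.49, 0.52, 0.55, 0.47, 0.53]

ref_hist = histogram(reference)
drift_working = total_variation(ref_hist, histogram(working))
drift_confused = total_variation(ref_hist, histogram(confused))
```

A small distance suggests the model still behaves like the reference; a distance close to 1 is a signal worth alerting on.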
A/B Tests
Feedback for your model
• Always compare your “improved” model to the current baseline
• Allows comparing two models not only on offline metrics but also on online metrics and KPIs
• Hyperparameters can also be adjusted with online feedback, e.g. via a multi-armed bandit
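As an illustration of the multi-armed bandit idea, here is an epsilon-greedy sketch over two model variants. The strategy, the click-through rates and all names are assumptions chosen for the example; the slide does not prescribe a particular bandit algorithm.

```python
import random


class EpsilonGreedy:
    """Epsilon-greedy bandit over a fixed set of arms (e.g. model variants)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}  # running mean reward
        self._rng = random.Random(seed)

    def select(self):
        if self._rng.random() < self.epsilon:
            return self._rng.choice(list(self.counts))  # explore
        return max(self.values, key=self.values.get)    # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the mean reward for this arm.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Simulated click-through rates per variant (made-up numbers).
true_ctr = {"baseline": 0.05, "candidate": 0.15}
bandit = EpsilonGreedy(true_ctr, epsilon=0.1, seed=1)
feedback = random.Random(2)
for _ in range(5000):
    arm = bandit.select()
    bandit.update(arm, 1.0 if feedback.random() < true_ctr[arm] else 0.0)
```

With enough feedback the bandit typically shifts most traffic to the variant with the higher observed reward, while `epsilon` keeps a small share of traffic exploring.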
A/B Tests
Technical requirements
• Versioning of your models to allow linking them to test groups
• Deploying and serving several models at the same time independently (needed for fast rollback anyway!)
• Tracking the results of a given model up to the point of facing the customer
Serving
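A minimal sketch of these requirements, assuming hash-based assignment of users to test groups and a hypothetical registry `MODEL_VERSIONS`; the version numbers and dummy models are illustrative only.

```python
import hashlib

# Hypothetical registry: test group -> deployed model version.
MODEL_VERSIONS = {"A": "1.2.3", "B": "1.3.0"}


def assign_group(user_id, share_b=0.5, salt="experiment-1"):
    """Deterministically map a user to group 'A' or 'B'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "B" if bucket < share_b else "A"


def serve(user_id, models):
    """Pick the model for this user and tag the result for later tracking."""
    group = assign_group(user_id)
    version = MODEL_VERSIONS[group]
    return {
        "user_id": user_id,
        "group": group,
        "model_version": version,
        "prediction": models[version](user_id),
    }


# Two dummy models served side by side.
models = {"1.2.3": lambda uid: 0.1, "1.3.0": lambda uid: 0.2}
results = [serve(uid, models) for uid in range(1000)]
share_in_b = sum(r["group"] == "B" for r in results) / len(results)
```

Hashing keeps a user in the same group across requests without storing the assignment, and the `model_version` tag lets downstream KPIs be attributed to the model that produced each result.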
Sebastian Neubauer - There should be one obvious way to bring python into production https://www.youtube.com/watch?v=hnQKsxKjCUo
Organisation of Teams
Wall of Confusion
Developers:
• Code
• Tests
• Releases
• Version Control
• Continuous Integration
• Features
Operations:
• Packaging
• Deployment
• Lifecycle
• Configuration
• Security
• Monitoring
Release v1.2.3
Different Cultures/Thinking
Wrong Approach!
• Especially dangerous separation for data products/features
• Speed and time to market are important, so “not my job” thinking hurts
• “I made a great model” vs. “We made a great data product”
https://en.wikipedia.org/wiki/DevOps
Organisation of Teams
Overcoming the Wall of Confusion
Continuous Delivery
Dev Ops
http://101.datascience.community/2016/11/28/data-scientists-data-engineers-software-engineers-the-difference-according-to-linkedin/
Heterogeneous Teams
How to bring Data Scientists into DevOps?
• Pure teams of Data Scientists often struggle to get anything into production
• As a minimum complement, software and data engineers are needed (2-3 engineers per Data Scientist)
• Optionally a Data Product Manager as well as a UI/UX expert if necessary
http://www.full-stackagile.com/2016/02/14/team-organisation-squads-chapters-tribes-and-guilds/
Organisation around Features
Responsibility with vertical teams
• Fully autonomous teams
• End-to-end responsibility for a feature
• Works well with Agile Methods like Scrum
• Faster delivery and less politics
Programming Languages
The Two Language Problem
Industry
• Java stack common, also Scala
• Statically typed
• Emphasis on robustness and edge cases
• Industrial standards for deployment
Science
• Often Python and R
• Dynamically typed, since it is easier to get the job done
• Emphasis on fancy methods and results
• “Runs on my machine”
Language Problem
Solution: Select one to rule them all!
• Having a single language reduces the complexity of deployment
• Implementation effort, since one ecosystem is abandoned completely
https://www.analyticsvidhya.com/blog/2017/09/machine-learning-models-as-apis-using-flask/
Language Problem
Solution: Python in production
• Especially easy for batch prediction use-cases
• If a web service is needed, Flask is a viable option
• Scale horizontally for prediction and use a single big-metal node for training a model
• Tap into the Hadoop world by using PySpark, PyHive etc.
• Consider isolated containers using Docker
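To show how small a prediction web service can be, here is a dependency-free sketch using a plain WSGI callable instead of Flask (a deliberate swap so the example runs with the standard library only; a Flask application is itself a WSGI application). The `/predict` route, the payload shape and the placeholder `predict` function are assumptions.

```python
import json
from io import BytesIO


def predict(features):
    """Placeholder model: replace with the real, versioned model."""
    return sum(features) / len(features)


def app(environ, start_response):
    """WSGI app answering POST /predict with a JSON prediction."""
    is_post = environ.get("REQUEST_METHOD") == "POST"
    if not is_post or environ.get("PATH_INFO") != "/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length))
    result = {"prediction": predict(payload["features"])}
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(result).encode("utf-8")]


def call(app, method, path, data=b""):
    """Tiny in-process client so the app can be exercised without a server."""
    status = {}
    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(data)),
        "wsgi.input": BytesIO(data),
    }
    chunks = app(environ, lambda s, headers: status.update(code=s))
    return status["code"], json.loads(b"".join(chunks))


status, response = call(app, "POST", "/predict", b'{"features": [1.0, 2.0, 3.0]}')
```

For a local trial the same `app` can be served with `wsgiref.simple_server.make_server`; since the service holds no request state, scaling horizontally amounts to running more copies behind a load balancer.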
Language Problem
Solution: PoCs in Python/R, rewrite in Java for production
• A lot of effort and slow
• Iterations and new features are hard to implement
• Reproducing bugs is cumbersome
• Pro: Everyone gets what they want
Worst-case Scenario
Language Problem
Solution: Exchangeable formats
• Works great in theory
• Limited functionality and flexibility
• No guarantee that the same model description will be interpreted the same way by two different implementations
• Preprocessing / feature generation not included
Language Problem
Solution: Frameworks
• Various language bindings allow developing in Python/R and running on the Java stack
• Check whether the framework also covers feature generation
• Ease of use at the cost of flexibility
Two Language Problem
Three concepts of dealing with it
1. Reimplement ✗
2. Frameworks
3. Single Language
Deployment
Deployment heavily depends on the chosen approach!
Still, some software engineering principles apply, such as Continuous Integration or even Continuous Delivery.
Sculley et al. (2015), Hidden Technical Debt in Machine Learning Systems
Technical Debt in ML Pipelines
Deployment
Deployment
General principles
• Versioning & packaging, defined processes, quality management
• Keep the development and production environment as similar as possible
• Automation is a must: avoid human error!
• Isolated and controllable environments are a great idea, e.g. Docker.
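One automated check for keeping development and production alike is to compare installed package versions against pinned ones. The sketch below uses `importlib.metadata` from the standard library; the `pins` mapping is hypothetical and would normally be parsed from a lock or requirements file.

```python
from importlib import metadata


def check_pins(pins):
    """Compare installed package versions against pinned ones.

    `pins` maps distribution names to expected versions.
    Returns a dict of mismatches: name -> (expected, installed).
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Hypothetical pin for a package that is certainly not installed.
problems = check_pins({"surely-not-an-installed-package": "1.0.0"})
```

Running such a check at service start-up or in CI turns "runs on my machine" into an explicit, testable condition.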
https://developers.google.com/machine-learning/rules-of-ml/
Google's Best Practices for ML Engineering
Best of Google's rules
Rule #1: Don’t be afraid to launch a product without machine learning.
Rule #2: First, design and implement metrics.
Rule #4: Keep the first model simple and get the infrastructure right.
Rule #5: Test the infrastructure independently from the machine learning.
Rule #9: Detect problems before exporting models.
Rule #11: Give feature columns owners and documentation.
Rule #13: Choose a simple, observable and attributable metric for your first objective.
Rule #14: Starting with an interpretable model makes debugging easier.
Rule #16: Plan to launch and iterate.
Rule #24: Measure the delta between models.
Rule #27: Try to quantify observed undesirable behavior (“measure first, optimize second”).
Rule #32: Re-use code between your training pipeline and your serving pipeline whenever possible.
Most of the problems are engineering problems!
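Rule #32 can be illustrated with a single feature function shared by training and serving. The field names (`price`, `country`) and the toy "training" below are hypothetical; only the structure matters.

```python
import math


def make_features(raw):
    """Single source of truth for feature generation (training AND serving)."""
    return [
        math.log1p(raw["price"]),
        1.0 if raw["country"] == "DE" else 0.0,
    ]


def train(rows):
    """Toy 'training': per-feature mean of the shared feature vectors."""
    vectors = [make_features(r) for r in rows]
    return [sum(col) / len(col) for col in zip(*vectors)]


def serve(raw, weights):
    """Toy 'serving': dot product using exactly the same features."""
    return sum(w * x for w, x in zip(weights, make_features(raw)))


rows = [{"price": 0.0, "country": "DE"}, {"price": 0.0, "country": "UK"}]
weights = train(rows)
score = serve({"price": 0.0, "country": "DE"}, weights)
```

Because both paths import `make_features`, a change to a feature definition cannot silently diverge between the training and serving pipelines.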
https://www.inovex.de/blog/data-science-in-production/
Example: Continuous Integration
devpi
https://pyscaffold.org/
Example: Python Package/Distribution
PyScaffold
• Easy and sane Python packaging
• Proper versioning of every commit
• Git integration, e.g. pre-commit
• Declarative configuration with setup.cfg
• Follows community standards
• Many extensions available
$> pip install pyscaffold
$> putup my_project
Key Learnings
Data Science to Production
• Dependent on your use-case, no one-size-fits-all!
• Think early on about QA
• DevOps culture & team responsibility
• Choose a framework or a single language to overcome the Two Language Problem
• Embrace processes & automation
Production is NOT an Afterthought!
Thank you!
Florian Wilhelm
Principal Data Scientist
inovex GmbH
Schanzenstraße 6-20
Kupferhütte 1.13
51063 Cologne, Germany
florian.wilhelm@inovex.de
