Machine Learning Systems for Engineers
Where Data Science Meets Engineering
Who Am I?
• I’m Cameron!
• LinkedIn - https://www.linkedin.com/in/cameron-joannidis/
• Twitter - @CamJo89
• Consult across a range of areas and have built many big data
and machine learning systems
• Buzz Word Bingo
• Big Data
• Machine Learning
• Functional Programming
Agenda
• Data
• Deployment
• Metrics
• Big Data Iteration Speed
Data
Example Use Case: Churn Prediction
We want to predict which users are likely to leave our service
soon so that we can try and give them reasons to stay
Training Data Creation
• Historical Data (need actual
churn events as examples)
• We know the labels at train time
• Produce Features to try and
predict the label
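A minimal sketch of how such a training set could be assembled, assuming a hypothetical event log of (user, day, event) tuples. Features are computed only from events before a cutoff day, and the label records whether a churn event occurred in the window after it. The event names, cutoff and window are illustrative assumptions, not part of the original deck.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, day, event_type). Values are made up.
EVENTS = [
    ("alice", 1, "login"), ("alice", 5, "login"), ("alice", 9, "purchase"),
    ("bob", 2, "login"), ("bob", 14, "churn"),
    ("carol", 3, "login"), ("carol", 8, "login"),
]

def build_training_rows(events, cutoff_day, label_window):
    """Features use only events before cutoff_day; the label looks at the
    label_window days starting at cutoff_day."""
    features = defaultdict(lambda: {"logins": 0, "purchases": 0})
    churned = set()
    for user, day, event in events:
        if day < cutoff_day:
            if event == "login":
                features[user]["logins"] += 1
            elif event == "purchase":
                features[user]["purchases"] += 1
        elif day < cutoff_day + label_window and event == "churn":
            churned.add(user)
    # One row per user seen before the cutoff: features plus the known label.
    return {
        user: {**feats, "churn": int(user in churned)}
        for user, feats in features.items()
    }

rows = build_training_rows(EVENTS, cutoff_day=10, label_window=7)
```

Keeping the feature window strictly before the label window avoids leaking information from the future into the features.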

Train Our Model
• Minimise our loss function to
best predict our labels (Churn/
No Churn)
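As an illustration of minimising a loss function, here is a sketch that fits a one-feature logistic regression by batch gradient descent on the log loss. The deck does not say which model or loss was used; logistic regression, the learning rate and the toy data are all assumptions chosen for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, bias, xs, ys):
    """Mean binary cross-entropy over the data set."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

def train(xs, ys, lr=0.1, epochs=500):
    """Plain batch gradient descent on the log loss."""
    weights, bias = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(xs) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(xs)
    return weights, bias

# Toy data (made up): one feature, larger values go with churn (label 1).
XS = [[0.1], [0.4], [0.5], [1.6], [1.8], [2.2]]
YS = [0, 0, 0, 1, 1, 1]
weights, bias = train(XS, YS)
```

Comparing `log_loss(weights, bias, XS, YS)` with the loss at the all-zero starting point shows how much the fit improved.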
Prediction Time
• Jason’s red feature value > 30
• Jason’s yellow feature value != 7
• We predict Jason will churn
Moving to Production
[Diagram: Training Pipeline and Scoring Pipeline]
Data Issues
• Data ingestion lags (systematic) or failures (random)
• Data is incorrect
Before We Change the System
• Fix the data source if that’s an option
• Measure the importance of the feature in the model to
quantify the cost/effort
Naive Best Effort
• Use most recent data for all
features
• Inconsistent customer view
• Retrain model with data lag in
mind
• Tightly couples the model to the data lag
Consistently Lagged
• Get a consistent snapshot at the
time of the most lagged data
source
• Predictions will be out of date
by the lag of the slowest data
source
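One way to picture a consistently lagged view: take the minimum of the per-source high-water marks as the snapshot time, then read every source as of that instant. The source names, timestamps and record layout below are hypothetical.

```python
from datetime import datetime

def consistent_snapshot_time(latest_by_source):
    """Newest instant for which every source has data: the minimum of the
    per-source high-water marks."""
    return min(latest_by_source.values())

def snapshot(records, as_of):
    """Keep, per (source, user), the latest value at or before as_of."""
    latest = {}
    for source, user, ts, value in records:
        if ts > as_of:
            continue  # newer than the slowest source; ignore for consistency
        key = (source, user)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return {key: value for key, (ts, value) in latest.items()}

# Hypothetical sources and values, for illustration only.
RECORDS = [
    ("billing", "jason", datetime(2024, 1, 1), 40),
    ("billing", "jason", datetime(2024, 1, 3), 55),
    ("usage", "jason", datetime(2024, 1, 1), 7),
    ("usage", "jason", datetime(2024, 1, 2), 9),
]
HIGH_WATER = {"billing": datetime(2024, 1, 3), "usage": datetime(2024, 1, 2)}

as_of = consistent_snapshot_time(HIGH_WATER)
view = snapshot(RECORDS, as_of)
```

Here the billing value from 3 January is deliberately ignored because the usage source has only caught up to 2 January.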
Imputation
• Fill in missing values with median
for numerical, mode for
categorical
• Every user's experience is the
median experience? Not useful.
• Contextual Imputing (e.g. Median
male height for men, median
female height for women)
• Lots of custom model-specific
code necessary
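Contextual imputing can be sketched as a group-wise median: a missing value is filled from the rows that share the same context, falling back to the overall median when the group has no observed values. The field names and heights below are made-up illustration data.

```python
from collections import defaultdict
from statistics import median

def contextual_median_impute(rows, value_key, group_key):
    """Replace missing (None) values with the median of the row's group,
    or the overall median if that group has no observed values."""
    observed = [r[value_key] for r in rows if r[value_key] is not None]
    overall = median(observed)  # raises StatisticsError if nothing is observed
    by_group = defaultdict(list)
    for r in rows:
        if r[value_key] is not None:
            by_group[r[group_key]].append(r[value_key])
    group_median = {g: median(vals) for g, vals in by_group.items()}
    return [
        {**r, value_key: group_median.get(r[group_key], overall)}
        if r[value_key] is None else dict(r)
        for r in rows
    ]

# Illustrative data only.
ROWS = [
    {"sex": "m", "height_cm": 180}, {"sex": "m", "height_cm": 176},
    {"sex": "m", "height_cm": None},
    {"sex": "f", "height_cm": 165}, {"sex": "f", "height_cm": 161},
    {"sex": "f", "height_cm": None},
]
filled = contextual_median_impute(ROWS, "height_cm", "sex")
```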
Graceful Model Degradation
• Model-specific fallback to give a
best guess from the distribution
given the current inputs
• Doesn't come out of the box in
most cases
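One possible shape for graceful degradation, offered as an assumption rather than the approach used in the talk: keep a chain of models ordered from most to least specific, use the first one whose inputs are all present, and fall back to the training base rate when nothing else applies. The scoring functions and coefficients are hypothetical stand-ins for trained models.

```python
def make_degrading_scorer(models, base_rate):
    """models: list of (required_feature_names, score_fn), most specific first.
    Returns a scorer that uses the first model whose inputs are all present
    and falls back to the training base rate otherwise."""
    def score(features):
        available = {k for k, v in features.items() if v is not None}
        for required, score_fn in models:
            if set(required) <= available:
                return score_fn(features), tuple(required)
        return base_rate, ()
    return score

# Hypothetical stand-ins for trained models; the coefficients are made up.
def full_model(f):
    return min(1.0, 0.01 * f["days_inactive"] + 0.2 * f["complaints"])

def usage_only_model(f):
    return min(1.0, 0.015 * f["days_inactive"])

score = make_degrading_scorer(
    models=[
        (["days_inactive", "complaints"], full_model),
        (["days_inactive"], usage_only_model),
    ],
    base_rate=0.12,  # assumed share of churners in the training data
)
```

Returning the set of features actually used alongside the score makes it possible to monitor how often the system is running in a degraded mode.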
Deployment
Data to Model Deployments
• Containerised model exposing
scoring API
• Clean and simple model
management semantics
• Send your data to your models
• Network shuffle costs can be
substantial for larger datasets
Model to Data Deployment
• Distributed processing
framework performs scoring (e.g.
Spark)
• Send your models to your data
• Efficient but less portable
• Model lifecycle more difficult to
manage
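A rough sketch of the model-to-data pattern in plain Python: the model is loaded once per partition and rows are streamed through it where they already live. `load_model` and the threshold rule are hypothetical placeholders; in PySpark a function of this shape could be passed to `RDD.mapPartitions`, but no Spark code is shown here.

```python
def load_model():
    """Hypothetical stand-in for deserialising a trained model (expensive)."""
    load_model.calls += 1
    return lambda row: 1.0 if row["days_inactive"] > 30 else 0.0

load_model.calls = 0  # counts how many times the model was loaded

def score_partition(rows):
    """Load the model once per partition, then stream rows through it."""
    model = load_model()
    for row in rows:
        yield {**row, "churn_score": model(row)}

# Two partitions of made-up data, as a distributed framework might hold them.
PARTITIONS = [
    [{"user": "a", "days_inactive": 45}, {"user": "b", "days_inactive": 3}],
    [{"user": "c", "days_inactive": 31}],
]
scored = [row for part in PARTITIONS for row in score_partition(part)]
```

Loading per partition rather than per row keeps the deserialisation cost proportional to the number of partitions, not the number of records.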

Future Solutions
• Spark on Kubernetes
• Models and Executors colocated in pods with data locality.
• Model lifecycle management through Kubernetes
• Rollouts
• Rollbacks
• Canary Testing
• A/B testing
• Managed cloud solutions
• Complexity hidden from users

Metrics
A Few ML System Metrics
• Data Distribution
• Effectiveness in Market
Data Distribution
Your training data will have some
distribution of labels
Data Distribution
• In production, your data distribution
may be significantly different
• This can happen over time as these
systems tend to be dynamic

Possible Causes
• Changes to the domain you're modelling
• Seasonality or external effects
• Changes to the customers themselves or the way the
customers are using your service
• Problems with the data collection pipelines (corrupted
data feeds etc)
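A shift between training and production distributions can be tracked with a simple statistic. The sketch below uses the population stability index (PSI), which is one common choice and not something the deck prescribes; it compares the share of each category (for example predicted labels) between two samples. The sample data is made up.

```python
import math
from collections import Counter

def label_distribution(labels):
    """Share of each label in a sample."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def population_stability_index(expected, actual, eps=1e-6):
    """PSI = sum((a - e) * ln(a / e)) over categories; eps guards empty bins."""
    psi = 0.0
    for key in set(expected) | set(actual):
        e = max(expected.get(key, 0.0), eps)
        a = max(actual.get(key, 0.0), eps)
        psi += (a - e) * math.log(a / e)
    return psi

# Illustrative samples: 10% churn at training time, 30% in production.
train_dist = label_distribution(["stay"] * 90 + ["churn"] * 10)
prod_dist = label_distribution(["stay"] * 70 + ["churn"] * 30)
psi = population_stability_index(train_dist, prod_dist)
```

A value of zero means the two distributions are identical; larger values indicate a larger shift and can be used to trigger an alert or a retrain.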
Effectiveness in Market
• Production is the first real test
• Need to capture metrics to measure the effect of the
model for its intended purpose
• Paves the road towards:
• Effective A/B testing
• Incremental model improvement
• Measurability of ROI
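To make the A/B testing point concrete, here is a sketch of a pooled two-proportion z-test comparing retention between a control group and a group targeted using the model. The test choice and the counts are assumptions for illustration.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (rate_b - rate_a, two-sided p-value) using a pooled z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal at |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, p_value

# Made-up counts: retained users out of 1000 in each arm.
uplift, p_value = two_proportion_z_test(800, 1000, 850, 1000)
```

The uplift, multiplied by the value of a retained customer, gives a first estimate of the model's return on investment.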
Big Data Iteration Speed
Training Models on Big Data is Slow
• Not all algorithms scale linearly as data/model
complexity increases
• Hit computation/memory bottlenecks
• Number of hypotheses we can test is reduced
• Generating new features can become prohibitively
expensive
Stratified Sampling
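The slide content for this technique did not survive as text, so the following is only a generic sketch of stratified sampling: draw the same fraction from each class so that a smaller training set keeps the original label proportions. The field names, fraction and seed are assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum so class proportions in
    the sample match the full data set (at least one row per stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Made-up data set: 900 non-churners and 100 churners.
DATA = [{"id": i, "label": "stay"} for i in range(900)]
DATA += [{"id": 900 + i, "label": "churn"} for i in range(100)]
small = stratified_sample(DATA, key="label", fraction=0.1)
```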
Sampling History
Customer Subset Sampling
Know Where to Spend Your Time
• Bad performance on training data = Bias Problem
• Improve features
• More complex model
• Train longer
• Good performance on training data and bad
performance on test set = Variance Problem
• Get more data for training
• Regularisation
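The rule of thumb above can be written as a small helper. The thresholds `target_error` and `gap_tolerance` are hypothetical and would need tuning per project.

```python
def diagnose(train_error, test_error, target_error, gap_tolerance=0.05):
    """Classify a model as having a bias problem, a variance problem, or neither.

    target_error and gap_tolerance are hypothetical thresholds.
    """
    if train_error > target_error:
        return "bias"      # poor fit even on the data it was trained on
    if test_error - train_error > gap_tolerance:
        return "variance"  # fits training data but does not generalise
    return "ok"

# Suggested next steps, taken from the bullets above.
NEXT_STEPS = {
    "bias": ["improve features", "more complex model", "train longer"],
    "variance": ["get more training data", "regularisation"],
    "ok": [],
}
```

For example, `NEXT_STEPS[diagnose(0.05, 0.25, target_error=0.10)]` points at the variance remedies.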
Choice of Framework / Technology
• Modelling in R/Python and rewriting in production in
Scala/Spark is an expensive process
• Choose a tech stack that allows engineers and data
scientists to work together and productionise things
quickly, which leads to faster feedback loops
What We’ve Covered
• Data issues can be a central issue for ML systems and
require a lot of up-front design thought
• There are several modes of deployment, each with
their own tradeoffs for different scenarios
• Production is not the end of the process for ML
models. Metrics are a fundamental part of enabling
improvement and growth.
• Ways to improve iteration speed on ML projects
Thank You
Questions?
