Drifting Away:
Testing ML Models in
Production
Chengyin Eng
Niall Turbitt
About
Chengyin Eng
Data Scientist @ Databricks
▪ Machine Learning Practice Team
▪ Experience
▪ Life Insurance
▪ Teaching ML in Production, Deep Learning, NLP, etc.
▪ MS in Computer Science, University of Massachusetts Amherst
▪ BA in Statistics & Environmental Studies, Mount Holyoke College, Massachusetts
About
Niall Turbitt
Senior Data Scientist @ Databricks
▪ EMEA ML Practice Team
▪ Experience
▪ Energy & Industrial Applications
▪ e-Commerce
▪ Recommender Systems & Personalisation
▪ MS in Statistics, University College Dublin
▪ BA in Mathematics & Economics, Trinity College Dublin
Outline
• Motivation
• Machine Learning System Life Cycle
• Why Monitor?
• Types of drift
• What to Monitor?
• How to Monitor?
• Demo
ML is everywhere, but often fails to reach production
85% of data science projects fail
Only 4% of companies succeed in deploying ML models to production
Source: https://www.datanami.com/2020/10/01/most-data-science-projects-fail-but-yours-doesnt-have-to/
Why do ML projects fail in production?
Neglected maintenance: lack of re-training and testing
Source: https://databricks.com/blog/2019/09/18/productionizing-machine-learning-from-deployment-to-drift-detection.html
This talk focuses on two questions:
• What are the statistical tests to use when monitoring models in production?
• What tools can I use to coordinate the monitoring of data and models?
What this talk is not
• A tutorial on model deployment strategies
• An exhaustive walkthrough of how to robustly test your production ML code
• A prescriptive list of when to update a model in production
Machine Learning System Life Cycle
ML system life cycle:
Business Problem → Define Success Criteria → Data Collection → Data Preprocessing / Feature Engineering → Model Training → Model Evaluation → Model Deployment → Model Monitoring
Why Monitor?
Model deployment is not the end: it is the beginning of model measurement and monitoring
▪ Data distributions and feature types can change over time due to:
▪ Upstream errors
▪ Market changes
▪ Human behaviour changes
▪ Result: potential model performance degradation
Models will degrade over time
Challenge: catching this when it happens
Types of drift
▪ Feature Drift: input feature distribution(s) deviate
▪ Label Drift: label distribution deviates
▪ Prediction Drift: model prediction distribution deviates
▪ Concept Drift: external factors cause the label to evolve
Feature, Label, and Prediction Drift
Sources:
https://dataz4s.com/statistics/chi-square-test/
https://towardsdatascience.com/machine-learning-in-production-why-you-should-care-about-data-and-concept-drift-d96d0bc907fb
Concept drift
Source: Krawczyk and Cano 2018. Online Ensemble Learning for Drifting and Noisy Data Streams
Drift types and actions to take
▪ Feature Drift: investigate the feature generation process; retrain using new data
▪ Label Drift: investigate the label generation process; retrain using new data
▪ Prediction Drift: investigate the model training process; assess the business impact of the change in predictions
▪ Concept Drift: investigate additional feature engineering; consider an alternative approach/solution; retrain/tune using new data
What to Monitor?
What should I monitor?
• Basic summary statistics of features and target
• Distributions of features and target
• Model performance metrics
• Business metrics
Monitoring tests on data
Numeric Features
▪ Summary statistics:
▪ Median / mean
▪ Minimum
▪ Maximum
▪ Percentage of missing values
▪ Statistical tests:
▪ Mean:
▪ Two-sample Kolmogorov-Smirnov (KS) test with Bonferroni correction
▪ Mann-Whitney (MW) test
▪ Variance:
▪ Levene test
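Before running formal tests, these summary statistics can be compared directly between the reference (training) data and the incoming production data. Below is a minimal sketch using pandas; the feature names and the two synthetic DataFrames are hypothetical stand-ins for your reference and incoming batches.

```python
import numpy as np
import pandas as pd

def summary_stats(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    """Median, mean, min, max, and % missing for each numeric column."""
    stats = df[numeric_cols].agg(["median", "mean", "min", "max"]).T
    stats["pct_missing"] = df[numeric_cols].isna().mean()
    return stats

rng = np.random.default_rng(42)
# Hypothetical reference (training-time) and incoming (production) batches
reference_df = pd.DataFrame({"age": rng.normal(40, 10, 1000),
                             "income": rng.lognormal(10, 1, 1000)})
incoming_df = pd.DataFrame({"age": rng.normal(45, 12, 1000),   # shifted on purpose
                            "income": rng.lognormal(10, 1, 1000)})

# Side-by-side view to spot large shifts before running the statistical tests
comparison = summary_stats(reference_df, ["age", "income"]).join(
    summary_stats(incoming_df, ["age", "income"]), lsuffix="_ref", rsuffix="_new")
print(comparison)
```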
Kolmogorov-Smirnov (KS) test with Bonferroni correction
Numeric Feature Test
Comparison of two continuous distributions
▪ Null hypothesis (H0): distributions x and y come from the same population
▪ If the KS statistic has a p-value lower than α, reject H0
▪ Bonferroni correction:
▪ Adjusts the α level to reduce false positives
▪ α_new = α_original / n, where n = total number of feature comparisons
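As a rough illustration, here is a minimal sketch of this check using scipy.stats; the feature samples are synthetic and the α level is an assumption, not a recommendation. It runs both the two-sample KS test and the Mann-Whitney test per feature and applies the Bonferroni-corrected threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical reference and incoming samples for two numeric features
reference = {"age": rng.normal(40, 10, 1000), "income": rng.lognormal(10, 1, 1000)}
incoming  = {"age": rng.normal(45, 10, 1000), "income": rng.lognormal(10, 1, 1000)}

alpha = 0.05
alpha_corrected = alpha / len(reference)   # Bonferroni: alpha_new = alpha_original / n comparisons

for feature in reference:
    ks_stat, ks_p = stats.ks_2samp(reference[feature], incoming[feature])
    mw_stat, mw_p = stats.mannwhitneyu(reference[feature], incoming[feature])
    drifted = ks_p < alpha_corrected
    print(f"{feature}: KS p={ks_p:.4f}, MW p={mw_p:.4f}, drift={'yes' if drifted else 'no'}")
```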
Levene test
Numeric Feature Test
Comparison of variances between two continuous distributions
▪ Null hypothesis (H0): σ₁² = σ₂² = … = σₙ²
▪ If the Levene statistic has a p-value lower than α, reject H0
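A minimal sketch of the Levene test with scipy.stats, assuming a synthetic numeric feature whose variance has inflated in production:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical samples of the same feature at training time vs. in production
reference = rng.normal(loc=40, scale=10, size=1000)
incoming  = rng.normal(loc=40, scale=18, size=1000)   # same mean, inflated variance

levene_stat, levene_p = stats.levene(reference, incoming, center="median")
alpha = 0.05
if levene_p < alpha:
    print(f"Variance shift detected (p={levene_p:.4f}); reject H0 of equal variances")
else:
    print(f"No significant variance shift (p={levene_p:.4f})")
```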
Monitoring tests on data
Categorical Features
▪ Summary statistics:
▪ Mode
▪ Number of unique levels
▪ Percentage of missing values
▪ Statistical test:
▪ One-way chi-squared test
One-way chi-squared test
Categorical Feature Test
Comparison of two categorical distributions
▪ Null hypothesis (H0): expected distribution = observed distribution
▪ If the chi-squared statistic has a p-value lower than α, reject H0
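A minimal sketch of the one-way chi-squared test with scipy.stats.chisquare, assuming hypothetical level counts for a categorical feature; the expected counts come from scaling the reference proportions to the incoming sample size:

```python
import pandas as pd
from scipy import stats

# Hypothetical level counts for a categorical feature (e.g. "plan_type")
reference_counts = pd.Series({"basic": 700, "premium": 250, "enterprise": 50})
incoming_counts  = pd.Series({"basic": 600, "premium": 280, "enterprise": 120})

# Expected counts: reference proportions scaled to the incoming sample size
expected = reference_counts / reference_counts.sum() * incoming_counts.sum()

chi2_stat, chi2_p = stats.chisquare(f_obs=incoming_counts, f_exp=expected)
alpha = 0.05
print(f"chi2={chi2_stat:.2f}, p={chi2_p:.4f}, drift={'yes' if chi2_p < alpha else 'no'}")
```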
Monitoring tests on models
• Relationship between target and features
• Numeric target: Pearson correlation coefficient
• Categorical target: contingency tables
• Model performance
• Regression models: MSE, error distribution plots, etc.
• Classification models: ROC curve, confusion matrix, F1-score, etc.
• Performance on data slices
• Time taken to train
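The sketch below illustrates a few of these model-level checks with scipy and scikit-learn on synthetic data; the features, labels, and scores are hypothetical stand-ins for a production batch.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import mean_squared_error, f1_score, roc_auc_score

rng = np.random.default_rng(2)

# Hypothetical regression batch: one numeric feature, true target, and predictions
feature = rng.normal(size=500)
y_true_reg = 3 * feature + rng.normal(scale=1.0, size=500)
y_pred_reg = 3 * feature + rng.normal(scale=1.5, size=500)

pearson_r, _ = stats.pearsonr(feature, y_true_reg)   # target-feature relationship
mse = mean_squared_error(y_true_reg, y_pred_reg)     # regression performance

# Hypothetical binary classification batch: true labels, scores, and hard predictions
y_true_clf = rng.integers(0, 2, size=500)
y_score = np.clip(0.6 * y_true_clf + 0.5 * rng.random(500), 0, 1)
y_pred_clf = (y_score > 0.5).astype(int)

print(f"Pearson r = {pearson_r:.3f}, MSE = {mse:.3f}")
print(f"F1 = {f1_score(y_true_clf, y_pred_clf):.3f}, AUC = {roc_auc_score(y_true_clf, y_score):.3f}")
```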
How to Monitor?
Demo: Measuring models in production
• Logging and Versioning
• MLflow (model)
• Delta (data)
• Statistical Tests
• SciPy
• statsmodels
• Visualizations
• seaborn
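As a minimal sketch of how these pieces fit together, the snippet below logs drift-test results to an MLflow run so they can be queried alongside the model; the feature samples and the data-version parameter are placeholders.

```python
import mlflow
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
reference = rng.normal(40, 10, 1000)   # training-time sample of a feature
incoming = rng.normal(44, 10, 1000)    # production sample of the same feature

with mlflow.start_run(run_name="drift_check"):
    ks_stat, ks_p = stats.ks_2samp(reference, incoming)
    # Log the test results as metrics so they can be queried/plotted per run
    mlflow.log_metrics({"age_ks_stat": float(ks_stat), "age_ks_pvalue": float(ks_p)})
    # Record which version of the data was scored, e.g. a Delta table version
    mlflow.log_param("incoming_data_version", "12")
```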
MLflow: an open-source platform for the ML lifecycle that helps with operationalizing ML
▪ Tracking: record and query experiments: code, metrics, parameters, artifacts, models
▪ Projects: packaging format for reproducible runs on any compute platform
▪ Models: general model format that standardizes deployment options
▪ Model Registry: centralized and collaborative model lifecycle management
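If a retrained model passes the drift and performance checks, it can be promoted through the Model Registry. Below is a minimal sketch with the MLflow client API; the run id and model name are placeholders.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged in a given run, then promote it once checks pass.
# run_id and "drift_demo_model" are placeholder values.
run_id = "<run-id-from-tracking>"
model_version = mlflow.register_model(f"runs:/{run_id}/model", "drift_demo_model")

client = MlflowClient()
client.transition_model_version_stage(
    name="drift_demo_model",
    version=model_version.version,
    stage="Production",
    archive_existing_versions=True,   # demote the previous production version
)
```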
Demo Notebook
http://bit.ly/dais_2021_drifting_away
Conclusion
• Model measurement and monitoring are crucial when
operationalizing ML models
• No one-size-fits-all approach
• Domain- and problem-specific considerations
• Reproducibility
• Enable rollbacks and maintain record of historic performance
Literature resources
• Paleyes et al. 2021. Challenges in Deploying Machine Learning
• Klaise et al. 2020. Monitoring and Explainability of Models in Production
• Rabanser et al. 2019. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
• Martin Fowler: Continuous Delivery for Machine Learning
Emerging open-source monitoring packages
• EvidentlyAI
• Data Drift Detector
• Alibi Detect
• scikit-multiflow
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.