Three lessons learned
from building a production
machine learning system
Michael Manapat
Stripe
@mlmanapat
Fraud
• Card numbers are stolen by hacking, malware, etc.
• “Dumps” are sold in “carding” forums
• Fraudsters use numbers in dumps to buy goods,
which they then resell
• Cardholders dispute transactions
• Merchant ends up bearing cost of fraud
• We train binary classifiers to predict fraud
• We use open source tools
• Scalding/Summingbird for feature generation
• scikit-learn for model training

(eventually: github.com/stripe/brushfire)
1. Don’t treat models as black boxes
Early ML at Stripe
• Focused on training with more and more data and
adding more and more features
• Didn’t think much about
• ML algorithms (e.g., tuning hyperparameters)
• The deeper reasons behind any particular set of
results
Substantial reduction in fraud rate
Product development
From a product standpoint:
• We were blocking high-risk charges and surfacing
just the decision
• We wanted to provide Stripe users insight into our
actions—reasons for scores
Score reasons
X = 5, Y = 3: score = 0.1
Which feature is “driving” the score more?
Decision tree (leaf score with training count in parentheses; left branch = True, right branch = False):

X < 10
├─ True:  Y < 5
│   ├─ True:  0.1 (20)
│   └─ False: 0.3 (30)
└─ False: X < 15
    ├─ True:  0.5 (10)
    └─ False: 0.9 (40)
Score reasons
Hold out X (X = ?, Y = 3): at any node that splits on the held-out feature, follow both branches and average the reachable leaves, weighted by their training counts (a code sketch follows below):

(20/70) * 0.1 + (10/70) * 0.5 + (40/70) * 0.9 = 0.61

Score Δ = |holdout - original| = |0.61 - 0.1| = 0.51
Now producing richer reasons with multiple predicates

The same tree shown with the held-out feature’s splits replaced by “?”:

X < ?
├─ Y < 5
│   ├─ 0.1 (20)
│   └─ 0.3 (30)
└─ X < ?
    ├─ 0.5 (10)
    └─ 0.9 (40)
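To make the computation concrete, here is a minimal Python sketch of holdout scoring on the example tree above. The tree structure, leaf values, and training counts come from the slides; the traversal rule (follow both branches of any split on the held-out feature and average the reachable leaves by training count) is the assumption that reproduces the 0.61 and 0.51 figures.

```python
# Minimal sketch of holdout-based "score reasons" on the example tree.
# Internal nodes are (feature, threshold, true_branch, false_branch); leaves are (value, count).
TREE = ("X", 10,
        ("Y", 5, (0.1, 20), (0.3, 30)),    # X < 10 is True
        ("X", 15, (0.5, 10), (0.9, 40)))   # X < 10 is False

def reachable_leaves(node, features, holdout=None):
    if len(node) == 2:                      # leaf: (value, count)
        return [node]
    feature, threshold, true_branch, false_branch = node
    if feature == holdout:                  # can't evaluate the split: take both sides
        return (reachable_leaves(true_branch, features, holdout)
                + reachable_leaves(false_branch, features, holdout))
    branch = true_branch if features[feature] < threshold else false_branch
    return reachable_leaves(branch, features, holdout)

def score(features, holdout=None):
    leaves = reachable_leaves(TREE, features, holdout)
    total = sum(count for _, count in leaves)
    return sum(value * count for value, count in leaves) / total

charge = {"X": 5, "Y": 3}
original = score(charge)                                   # 0.1
for feature in ("X", "Y"):
    holdout_score = score(charge, holdout=feature)
    print(feature, round(holdout_score, 2), round(abs(holdout_score - original), 2))
# X: holdout score ~0.61, delta ~0.51; Y: ~0.22, delta ~0.12 -> X "drives" this score more.
```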
Model introspection
If a model didn’t look good in validation, it wasn’t clear what
to do (besides trying more features/data)
What if we used our “score reasons” to debug model issues?
• Take all false positives (in validation data or in
production) and group by generated reason (see the
sketch after this list)
• Were a substantial fraction of the false positives
driven by a few features?
• Did all the comparisons in the explanation
predicates make sense? (Were they comparisons a
human might make for fraud?)
• Our models were overfit!
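A sketch of the grouping step described in the list above, assuming pandas and hypothetical column names (blocked decision, fraud label, top reason feature per charge):

```python
import pandas as pd

# Hypothetical validation/production output: one row per charge, with the model's
# decision, the eventual label, and the top "score reason" feature for that charge.
charges = pd.DataFrame({
    "blocked":    [True, True, True, False, True],
    "fraud":      [False, False, True, False, False],
    "top_reason": ["card_country", "amount", "card_country", "amount", "card_country"],
})

false_positives = charges[charges["blocked"] & ~charges["fraud"]]

# If a handful of features accounts for most false positives, inspect those features
# and check whether the predicates' comparisons make sense for fraud.
print(false_positives["top_reason"].value_counts(normalize=True))
```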
Actioning insights
• Hyperparameter optimization
• Feature selection

(Figure: precision vs. recall comparison.)
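Since the deck mentions scikit-learn, here is one hedged sketch of acting on the overfitting diagnosis: tune capacity-related hyperparameters with cross-validation. The estimator, parameter grid, and synthetic data are illustrative assumptions, not the actual Stripe setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced data standing in for charge features and fraud labels.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.97], random_state=0)

# Constraining depth and leaf size is one way to rein in the overfitting surfaced
# by the "score reasons" analysis; average precision summarizes precision/recall.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_depth": [4, 8, 16, None], "min_samples_leaf": [1, 10, 50]},
    scoring="average_precision",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```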
Summary
• Don’t treat models as black boxes
• Thinking about the learning process (vs. just
features and data) can yield significant payoffs
• Tooling for introspection can accelerate model
development/“debugging”
Julia Evans, Alyssa Frazee, Erik Osheim, Sam Ritchie,
Jocelyn Ross, Tom Switzer
2. Have a plan for counterfactual evaluation
• December 31st, 2013
• Train a binary classifier for disputes on data from Jan 1st to Sep 30th
• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)
• Based on validation data, pick a policy for actioning scores: block if score > 50
Questions (1)
• Business complains about high false positive rate:
what would happen if we changed the policy to
"block if score > 70"?
• What are the production precision and recall of the
model?
• December 31st, 2014: we repeat the exercise from a year earlier
• Train a model on data from Jan 1st to Sep 30th
• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)
• Validation results look ~OK (but not great)
• We put the model into production and the results are terrible
Questions (2)
• Why did the validation results for the new model
look so much worse?
• How do we know if the retrained model really is
better than the original model?
Counterfactual evaluation
• Our model changes reality (the world is different
because of its existence)
• We can answer some questions (around model
comparisons) with A/B tests
• For all these questions, we want an approximation
of the charge/outcome distribution that would exist
if there were no model
One approach
• Probabilistically reverse a small fraction of our block decisions
• The higher the score, the lower the probability that we let the charge through
• Weight samples by 1 / P(allow)
• Get information on the area we want to improve on
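A minimal sketch of this approach: a hypothetical propensity function that decreases with the score, a randomized override of block decisions, and logging of P(allow) so that allowed charges can later be weighted by 1 / P(allow). The exact shape of the propensity function is an assumption (chosen to roughly match the table below).

```python
import random

def p_allow(score, threshold=50):
    # Hypothetical propensity: always allow below the policy threshold; above it,
    # allow a small, score-decreasing fraction of would-be blocks.
    if score <= threshold:
        return 1.0
    return max(0.0005, 0.30 - 0.01 * (score - 55))

def decide(charge_id, score, threshold=50):
    p = p_allow(score, threshold)
    original_action = "Allow" if score <= threshold else "Block"
    selected_action = "Allow" if random.random() < p else "Block"
    # Log P(allow) alongside the decision; allowed samples are later weighted by 1 / p.
    return {"id": charge_id, "score": score, "p_allow": p,
            "original_action": original_action, "selected_action": selected_action}

print(decide(4, 65))  # e.g. a score-65 charge is let through with probability 0.20
```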
ID  Score  P(Allow)  Original Action  Selected Action  Outcome
1   10     1.0       Allow            Allow            OK
2   45     1.0       Allow            Allow            Fraud
3   55     0.30      Block            Block            -
4   65     0.20      Block            Allow            Fraud
5   100    0.0005    Block            Block            -
6   60     0.25      Block            Allow            OK
ID  Score  P(Allow)  Weight  Original Action  Selected Action  Outcome
1   10     1.0       1       Allow            Allow            OK
2   45     1.0       1       Allow            Allow            Fraud
4   65     0.20      5       Block            Allow            Fraud
6   60     0.25      4       Block            Allow            OK
Evaluating the "block if score > 50" policy
Precision = 5 / 9 = 0.56
Recall = 5 / 6 = 0.83
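A sketch that reproduces the numbers above from the reweighted rows (the allowed charges, with weight = 1 / P(allow)). The same logged data also lets us evaluate other candidate thresholds, which is what Questions (1) asked for.

```python
# Allowed charges from the table above: (score, weight, outcome).
rows = [(10, 1, "OK"), (45, 1, "Fraud"), (65, 5, "Fraud"), (60, 4, "OK")]

def evaluate(rows, threshold):
    # Weighted counts under the candidate policy "block if score > threshold".
    blocked       = sum(w for s, w, _ in rows if s > threshold)
    blocked_fraud = sum(w for s, w, o in rows if s > threshold and o == "Fraud")
    total_fraud   = sum(w for _, w, o in rows if o == "Fraud")
    return blocked_fraud / blocked, blocked_fraud / total_fraud

print(evaluate(rows, 50))  # ~ (0.56, 0.83): the 5/9 precision and 5/6 recall above
print(evaluate(rows, 60))  # what a stricter "block if score > 60" policy would have done
```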
• The propensity function controls the exploration/exploitation tradeoff
• Precision, recall, etc. are estimators
• Variance of the estimators decreases the more we
allow through
• Bootstrap to get error bars (pick rows from the table
uniformly at random with replacement; see the sketch
after this list)
• Li, Chen, Kleban, Gupta: "Counterfactual Estimation
and Optimization of Click Metrics for Search Engines"
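A sketch of the bootstrap suggested above: resample the weighted rows uniformly at random with replacement and recompute the estimator to get error bars. With only the four toy rows the interval is very wide; in practice the same procedure runs over all logged allowed charges.

```python
import random

rows = [(10, 1, "OK"), (45, 1, "Fraud"), (65, 5, "Fraud"), (60, 4, "OK")]

def precision(sample, threshold=50):
    blocked       = sum(w for s, w, _ in sample if s > threshold)
    blocked_fraud = sum(w for s, w, o in sample if s > threshold and o == "Fraud")
    return blocked_fraud / blocked if blocked else None  # skip degenerate resamples

estimates = []
for _ in range(10_000):
    resample = [random.choice(rows) for _ in rows]
    p = precision(resample)
    if p is not None:
        estimates.append(p)

estimates.sort()
lo, hi = estimates[int(0.025 * len(estimates))], estimates[int(0.975 * len(estimates))]
print(f"precision ~95% interval: [{lo:.2f}, {hi:.2f}]")
```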
Summary
• Have a plan for counterfactual evaluation before
you productionize your first model
• You can back yourself into a corner (with no data to
retrain on) if you address this later
• You should be monitoring the production
performance of your model anyway (cf. next lesson)
Alyssa Frazee, Julia Evans, Roban Kramer, Ryan
Wang
3. Invest in production monitoring for your models
Production vs. data stack
• Ruby/Mongo vs. Scala/Hadoop/Thrift
• Some issues
• Divergence between production and training
definitions
• Upstream changes to library code in production
feature generation can change feature definitions
• True vs. “True”
(Diagram: a domain-specific scoring service containing the business logic calls a “pure” model evaluation service; scoring requests are logged and fed to aggregation jobs, which compute all aggregates per model.)

Aggregation jobs keep track of
• Overall action rate and rate per Stripe user
• Score distributions
• Feature distributions (% null, p50/p90 for numerical values, etc.)
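A sketch of the kinds of aggregates those jobs compute from logged scoring requests. The pandas implementation and field names are assumptions; the monitored quantities are the ones listed above.

```python
import pandas as pd

# Hypothetical logged scoring requests: one row per request, with the model score,
# the resulting action, the Stripe user, and one column per feature.
logs = pd.DataFrame({
    "merchant_id": ["m1", "m1", "m2", "m3", "m2"],
    "score":       [12, 87, 55, 3, 91],
    "action":      ["allow", "block", "allow", "allow", "block"],
    "amount":      [10.0, 250.0, None, 15.0, 400.0],
})

report = {
    "block_rate_overall":  (logs["action"] == "block").mean(),
    "block_rate_per_user": logs.groupby("merchant_id")["action"]
                               .apply(lambda a: (a == "block").mean()),
    "score_p50":           logs["score"].quantile(0.5),
    "score_p90":           logs["score"].quantile(0.9),
    "amount_pct_null":     logs["amount"].isna().mean(),
    "amount_p50":          logs["amount"].quantile(0.5),
    "amount_p90":          logs["amount"].quantile(0.9),
}
for name, value in report.items():
    print(name, value, sep="\n")
```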
Summary
• Monitor the production inputs to and outputs of
your models
• Have dashboards that can be watched during deploys,
and alerting for significant anomalies
• Bake the monitoring into generic ML infrastructure
(so that each ML application isn’t redoing this)
Steve Mardenfeld, Tom Switzer
• Don’t treat models as black boxes
• Have a plan for counterfactual evaluation before
productionizing your first model
• Build production monitoring for action rates, score
distributions, and feature distributions (and bake
into ML infra)

Thanks
Stripe is hiring data scientists, engineers, and
engineering managers!
mlm@stripe.com | @mlmanapat
