Three lessons learned
from building a production
machine learning system
Michael Manapat
Stripe
@mlmanapat
Fraud
• Card numbers are stolen by hacking, malware, etc.
• “Dumps” are sold in “carding” forums
• Fraudsters use numbers in dumps to buy goods,
which they then resell
• Cardholders dispute transactions
• Merchant ends up bearing cost of fraud
• We train binary classifiers to predict fraud
• We use open source tools
• Scalding/Summingbird for feature generation
• scikit-learn for model training

(eventually: github.com/stripe/brushfire)
1. Don’t treat models as black boxes
Early ML at Stripe
• Focused on training with more and more data and
adding more and more features
• Didn’t think much about
• ML algorithms (e.g., tuning hyperparameters)
• The deeper reasons behind any particular set of
results
Substantial reduction in fraud rate
Product development
From a product standpoint:
• We were blocking high-risk charges and surfacing
just the decision
• We wanted to provide Stripe users insight into our
actions—reasons for scores
Score reasons
X = 5, Y = 3: score = 0.1
Which feature is “driving” the score more?
Decision tree (leaf score with training count in parentheses; left branch = True, right branch = False):

X < 10
├─ True:  Y < 5
│   ├─ True:  0.1 (20)
│   └─ False: 0.3 (30)
└─ False: X < 15
    ├─ True:  0.5 (10)
    └─ False: 0.9 (40)
Score reasons
Hold out X (X = ?, Y = 3): at any node that splits on the held-out feature, follow both branches and average the reachable leaves, weighted by their training counts (a code sketch follows below):

(20/70) * 0.1 + (10/70) * 0.5 + (40/70) * 0.9 = 0.61

Score Δ = |holdout - original| = |0.61 - 0.1| = 0.51
Now producing richer reasons with multiple predicates

The same tree shown with the held-out feature’s splits replaced by “?”:

X < ?
├─ Y < 5
│   ├─ 0.1 (20)
│   └─ 0.3 (30)
└─ X < ?
    ├─ 0.5 (10)
    └─ 0.9 (40)
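To make the computation concrete, here is a minimal Python sketch of holdout scoring on the example tree above. The tree structure, leaf values, and training counts come from the slides; the traversal rule (follow both branches of any split on the held-out feature and average the reachable leaves by training count) is the assumption that reproduces the 0.61 and 0.51 figures.

```python
# Minimal sketch of holdout-based "score reasons" on the example tree.
# Internal nodes are (feature, threshold, true_branch, false_branch); leaves are (value, count).
TREE = ("X", 10,
        ("Y", 5, (0.1, 20), (0.3, 30)),    # X < 10 is True
        ("X", 15, (0.5, 10), (0.9, 40)))   # X < 10 is False

def reachable_leaves(node, features, holdout=None):
    if len(node) == 2:                      # leaf: (value, count)
        return [node]
    feature, threshold, true_branch, false_branch = node
    if feature == holdout:                  # can't evaluate the split: take both sides
        return (reachable_leaves(true_branch, features, holdout)
                + reachable_leaves(false_branch, features, holdout))
    branch = true_branch if features[feature] < threshold else false_branch
    return reachable_leaves(branch, features, holdout)

def score(features, holdout=None):
    leaves = reachable_leaves(TREE, features, holdout)
    total = sum(count for _, count in leaves)
    return sum(value * count for value, count in leaves) / total

charge = {"X": 5, "Y": 3}
original = score(charge)                                   # 0.1
for feature in ("X", "Y"):
    holdout_score = score(charge, holdout=feature)
    print(feature, round(holdout_score, 2), round(abs(holdout_score - original), 2))
# X: holdout score ~0.61, delta ~0.51; Y: ~0.22, delta ~0.12 -> X "drives" this score more.
```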
Model introspection
If a model didn’t look good in validation, it wasn’t clear what
to do (besides trying more features/data)
What if we used our “score reasons” to debug model issues?
• Take all false positives (in validation data or in
production) and group by generated reason (see the
sketch after this list)
• Were a substantial fraction of the false positives
driven by a few features?
• Did all the comparisons in the explanation
predicates make sense? (Were they comparisons a
human might make for fraud?)
• Our models were overfit!
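A sketch of the grouping step described in the list above, assuming pandas and hypothetical column names (blocked decision, fraud label, top reason feature per charge):

```python
import pandas as pd

# Hypothetical validation/production output: one row per charge, with the model's
# decision, the eventual label, and the top "score reason" feature for that charge.
charges = pd.DataFrame({
    "blocked":    [True, True, True, False, True],
    "fraud":      [False, False, True, False, False],
    "top_reason": ["card_country", "amount", "card_country", "amount", "card_country"],
})

false_positives = charges[charges["blocked"] & ~charges["fraud"]]

# If a handful of features accounts for most false positives, inspect those features
# and check whether the predicates' comparisons make sense for fraud.
print(false_positives["top_reason"].value_counts(normalize=True))
```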
Actioning insights
• Hyperparameter optimization
• Feature selection

(Figure: precision vs. recall comparison.)
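Since the deck mentions scikit-learn, here is one hedged sketch of acting on the overfitting diagnosis: tune capacity-related hyperparameters with cross-validation. The estimator, parameter grid, and synthetic data are illustrative assumptions, not the actual Stripe setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced data standing in for charge features and fraud labels.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.97], random_state=0)

# Constraining depth and leaf size is one way to rein in the overfitting surfaced
# by the "score reasons" analysis; average precision summarizes precision/recall.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_depth": [4, 8, 16, None], "min_samples_leaf": [1, 10, 50]},
    scoring="average_precision",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```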
Summary
• Don’t treat models as black boxes
• Thinking about the learning process (vs. just
features and data) can yield significant payoffs
• Tooling for introspection can accelerate model
development/“debugging”
Julia Evans, Alyssa Frazee, Erik Osheim, Sam Ritchie,
Jocelyn Ross, Tom Switzer
2. Have a plan for counterfactual evaluation
• December 31st, 2013
• Train a binary classifier for disputes on data from Jan 1st to Sep 30th
• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)
• Based on validation data, pick a policy for actioning scores: block if score > 50
Questions (1)
• Business complains about high false positive rate:
what would happen if we changed the policy to
"block if score > 70"?
• What are the production precision and recall of the
model?
• December 31st, 2014: we repeat the exercise from a year earlier
• Train a model on data from Jan 1st to Sep 30th
• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)
• Validation results look ~OK (but not great)
• We put the model into production and the results are terrible
Questions (2)
• Why did the validation results for the new model
look so much worse?
• How do we know if the retrained model really is
better than the original model?
Counterfactual evaluation
• Our model changes reality (the world is different
because of its existence)
• We can answer some questions (around model
comparisons) with A/B tests
• For all these questions, we want an approximation
of the charge/outcome distribution that would exist
if there were no model
One approach
• Probabilistically reverse a small fraction of our block decisions
• The higher the score, the lower the probability that we let the charge through
• Weight samples by 1 / P(allow)
• Get information on the area we want to improve on
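A minimal sketch of this approach: a hypothetical propensity function that decreases with the score, a randomized override of block decisions, and logging of P(allow) so that allowed charges can later be weighted by 1 / P(allow). The exact shape of the propensity function is an assumption (chosen to roughly match the table below).

```python
import random

def p_allow(score, threshold=50):
    # Hypothetical propensity: always allow below the policy threshold; above it,
    # allow a small, score-decreasing fraction of would-be blocks.
    if score <= threshold:
        return 1.0
    return max(0.0005, 0.30 - 0.01 * (score - 55))

def decide(charge_id, score, threshold=50):
    p = p_allow(score, threshold)
    original_action = "Allow" if score <= threshold else "Block"
    selected_action = "Allow" if random.random() < p else "Block"
    # Log P(allow) alongside the decision; allowed samples are later weighted by 1 / p.
    return {"id": charge_id, "score": score, "p_allow": p,
            "original_action": original_action, "selected_action": selected_action}

print(decide(4, 65))  # e.g. a score-65 charge is let through with probability 0.20
```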
ID  Score  P(Allow)  Original Action  Selected Action  Outcome
1   10     1.0       Allow            Allow            OK
2   45     1.0       Allow            Allow            Fraud
3   55     0.30      Block            Block            -
4   65     0.20      Block            Allow            Fraud
5   100    0.0005    Block            Block            -
6   60     0.25      Block            Allow            OK
ID  Score  P(Allow)  Weight  Original Action  Selected Action  Outcome
1   10     1.0       1       Allow            Allow            OK
2   45     1.0       1       Allow            Allow            Fraud
4   65     0.20      5       Block            Allow            Fraud
6   60     0.25      4       Block            Allow            OK
Evaluating the "block if score > 50" policy
Precision = 5 / 9 = 0.56
Recall = 5 / 6 = 0.83
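A sketch that reproduces the numbers above from the reweighted rows (the allowed charges, with weight = 1 / P(allow)). The same logged data also lets us evaluate other candidate thresholds, which is what Questions (1) asked for.

```python
# Allowed charges from the table above: (score, weight, outcome).
rows = [(10, 1, "OK"), (45, 1, "Fraud"), (65, 5, "Fraud"), (60, 4, "OK")]

def evaluate(rows, threshold):
    # Weighted counts under the candidate policy "block if score > threshold".
    blocked       = sum(w for s, w, _ in rows if s > threshold)
    blocked_fraud = sum(w for s, w, o in rows if s > threshold and o == "Fraud")
    total_fraud   = sum(w for _, w, o in rows if o == "Fraud")
    return blocked_fraud / blocked, blocked_fraud / total_fraud

print(evaluate(rows, 50))  # ~ (0.56, 0.83): the 5/9 precision and 5/6 recall above
print(evaluate(rows, 60))  # what a stricter "block if score > 60" policy would have done
```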
• The propensity function controls the exploration/exploitation tradeoff
• Precision, recall, etc. are estimators
• Variance of the estimators decreases the more we
allow through
• Bootstrap to get error bars (pick rows from the table
uniformly at random with replacement; see the sketch
after this list)
• Li, Chen, Kleban, Gupta: "Counterfactual Estimation
and Optimization of Click Metrics for Search Engines"
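A sketch of the bootstrap suggested above: resample the weighted rows uniformly at random with replacement and recompute the estimator to get error bars. With only the four toy rows the interval is very wide; in practice the same procedure runs over all logged allowed charges.

```python
import random

rows = [(10, 1, "OK"), (45, 1, "Fraud"), (65, 5, "Fraud"), (60, 4, "OK")]

def precision(sample, threshold=50):
    blocked       = sum(w for s, w, _ in sample if s > threshold)
    blocked_fraud = sum(w for s, w, o in sample if s > threshold and o == "Fraud")
    return blocked_fraud / blocked if blocked else None  # skip degenerate resamples

estimates = []
for _ in range(10_000):
    resample = [random.choice(rows) for _ in rows]
    p = precision(resample)
    if p is not None:
        estimates.append(p)

estimates.sort()
lo, hi = estimates[int(0.025 * len(estimates))], estimates[int(0.975 * len(estimates))]
print(f"precision ~95% interval: [{lo:.2f}, {hi:.2f}]")
```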
Summary
• Have a plan for counterfactual evaluation before
you productionize your first model
• You can back yourself into a corner (with no data to
retrain on) if you address this later
• You should be monitoring the production
performance of your model anyway (cf. next lesson)
Alyssa Frazee, Julia Evans, Roban Kramer, Ryan
Wang
3. Invest in production monitoring for your models
Production vs. data stack
• Ruby/Mongo vs. Scala/Hadoop/Thrift
• Some issues
• Divergence between production and training
definitions
• Upstream changes to library code in production
feature generation can change feature definitions
• True vs. “True”
(Diagram: a domain-specific scoring service containing the business logic calls a “pure” model evaluation service; scoring requests are logged and fed to aggregation jobs, which compute all aggregates per model.)

Aggregation jobs keep track of
• Overall action rate and rate per Stripe user
• Score distributions
• Feature distributions (% null, p50/p90 for numerical values, etc.)
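A sketch of the kinds of aggregates those jobs compute from logged scoring requests. The pandas implementation and field names are assumptions; the monitored quantities are the ones listed above.

```python
import pandas as pd

# Hypothetical logged scoring requests: one row per request, with the model score,
# the resulting action, the Stripe user, and one column per feature.
logs = pd.DataFrame({
    "merchant_id": ["m1", "m1", "m2", "m3", "m2"],
    "score":       [12, 87, 55, 3, 91],
    "action":      ["allow", "block", "allow", "allow", "block"],
    "amount":      [10.0, 250.0, None, 15.0, 400.0],
})

report = {
    "block_rate_overall":  (logs["action"] == "block").mean(),
    "block_rate_per_user": logs.groupby("merchant_id")["action"]
                               .apply(lambda a: (a == "block").mean()),
    "score_p50":           logs["score"].quantile(0.5),
    "score_p90":           logs["score"].quantile(0.9),
    "amount_pct_null":     logs["amount"].isna().mean(),
    "amount_p50":          logs["amount"].quantile(0.5),
    "amount_p90":          logs["amount"].quantile(0.9),
}
for name, value in report.items():
    print(name, value, sep="\n")
```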
Summary
• Monitor the production inputs to and outputs of
your models
• Have dashboards that can be watched during deploys,
and alerting for significant anomalies
• Bake the monitoring into generic ML infrastructure
(so that each ML application isn’t redoing this)
Steve Mardenfeld, Tom Switzer
• Don’t treat models as black boxes
• Have a plan for counterfactual evaluation before
productionizing your first model
• Build production monitoring for action rates, score
distributions, and feature distributions (and bake
into ML infra)

Thanks
Stripe is hiring data scientists, engineers, and
engineering managers!
mlm@stripe.com | @mlmanapat
