May 15, 2014
Rong Yan
Machine Learning @ Square
Birth of Square
Payment
Stand
Reader
Payment Device	

Payment Aggregation	

Risk Model
Payment
Commerce Cash
Market
Our Mission
Make commerce easy.
Payment
Data
Commerce
The Next Big Thing
Scale
3M+ Readers
$15B+ Annualized
Offline and Online
Amount
Location
Item Desc.
Card #

Credit Score
Friends
Activity History
Inventory

Sales Volume
Haircut
Price
Turn Data into Business Value
Fraud Detection
Business Insight
Customer Relation
Information Discovery
Fraud Detection @ Square
Fraud Detection in the payment flow
Bank
Clears for settlement
Suspect
~2000 sellers
Risk Ops: Transaction review
150,000 active sellers per day
Risk ML: Fraud Detection
Payments
near-real-time
ML Architecture
Merchant
Devices
Bank
Accounts
Machine Learning (300+ features)
Suspicions
Feature Generation
Card not present: Yes
PAN Diversity: 0.05
Use iPhone: No
Decision Tree Model
Easy to interpret
Dimension reduction
Very powerful in ensemble
[Tree diagram: splits on Decline Rate >= 0.1, Amount <= $10000, and Business Type = Auto repair, each with Yes / No branches; leaf scores 0.9 and 0.6]
Random Forests: Decision Tree Ensemble
Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1): 5–32.
Tree 1: splits on Decline Rate <= 0.1, Amount <= $10000, Business Type = Auto repair; leaves 0.9, 0.6; output: Bad, 0.9
Tree 2: splits on Success Rate <= 0.2, Age >= 20, Amount <= $1000; leaves 0.4, 0.7; output: Good, 0.4
Tree N: splits on Decline Rate <= 0.3, Amount <= $20000, Age <= 22; leaves 0.8, 0.6; output: Bad, 0.6
Mode for classification = Bad
Average for regression = 0.63
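The two aggregation rules on this slide can be checked with a few lines of Python. This is only a sketch: the per-tree labels and scores are copied from the slide, while the function names `forest_classify` and `forest_regress` are made up for illustration.

```python
from collections import Counter
from statistics import mean

# Per-tree outputs copied from the slide: (label, score) for Tree 1, Tree 2, Tree N.
tree_outputs = [("Bad", 0.9), ("Good", 0.4), ("Bad", 0.6)]

def forest_classify(outputs):
    """Classification: majority vote (mode) over the per-tree labels."""
    labels = [label for label, _ in outputs]
    return Counter(labels).most_common(1)[0][0]

def forest_regress(outputs):
    """Regression: plain average of the per-tree scores."""
    return mean(score for _, score in outputs)

print(forest_classify(tree_outputs))            # Bad
print(round(forest_regress(tree_outputs), 2))   # 0.63
```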

Random Forests - Build each Tree
Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1): 5–32.
All data
Samples
Features
Dollar Amount
Connected with bad user
Business Type
Decline Rate
Time of Day
Location
Randomly select sqrt(n) features
Best split: feature and value
Decline Rate <= 0.1
No / Yes
0.4 / 0.6
Grow Tree on each branch
When sample size is small STOP
Repeat these steps multiple times to create a forest
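The tree-building steps above (sample the data, pick sqrt(n) random features, take the best split, grow each branch, stop when the sample is small, repeat) can be sketched in plain Python. This is a minimal illustration and not Square's implementation: the Gini criterion, the `min_size` cutoff, the synthetic data, and every function name are assumptions.

```python
import math
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels, features):
    """Best (score, feature, value) among the candidate features, by weighted Gini."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(rows, labels, rng, min_size=5):
    """Grow one tree; stop when the node is small (assumed cutoff) or pure."""
    if len(rows) <= min_size or len(set(labels)) == 1:
        return sum(labels) / len(labels)            # leaf: fraction of positives
    n = len(rows[0])
    features = rng.sample(range(n), max(1, int(math.sqrt(n))))  # sqrt(n) features
    split = best_split(rows, labels, features)
    if split is None:
        return sum(labels) / len(labels)
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], rng, min_size),
            grow_tree([r for r, _ in right], [y for _, y in right], rng, min_size))

def build_forest(rows, labels, n_trees=15, seed=0):
    """Repeat: draw a bootstrap sample of the data, then grow a tree on it."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest

def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf score."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def predict(forest, row):
    """Average the per-tree scores."""
    return sum(predict_tree(t, row) for t in forest) / len(forest)

# Synthetic data (hypothetical): 4 numeric features, label depends only on feature 0.
data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(4)] for _ in range(120)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]
forest = build_forest(rows, labels)
print(predict(forest, [0.95, 0.5, 0.5, 0.5]), predict(forest, [0.05, 0.5, 0.5, 0.5]))
```

In this sketch the random feature subset is redrawn at every split, which follows Breiman's formulation; the slide does not say whether Square draws it once per tree or once per split.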
Boosting Trees
Tree 1
Tree 2: Help Tree 1
Tree 3: Help Tree 1, 2
Tree 4: Help Tree 1, 2, 3
Stop when no help needed (0 weights on all samples)
Tree outputs add up: 8.0 + (-2.0) + 1.0 + 0.5 = 7.5
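A rough sketch of the additive idea, assuming squared-error loss so that "help" means fitting the residual left by the earlier trees. The slides do not state the loss Square used; the one-split stumps, the toy data, and all names here are illustrative only.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump for 1-D inputs (squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:               # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=4, tol=1e-9):
    """Each new tree is fit to what the previous trees still get wrong."""
    trees = []
    residuals = list(ys)
    for _ in range(n_trees):
        if all(abs(r) < tol for r in residuals):  # no help needed: stop
            break
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        residuals = [r - tree(x) for x, r in zip(xs, residuals)]
    return trees

def boosted_predict(trees, x):
    """The ensemble output is the sum of the individual tree outputs."""
    return sum(tree(x) for tree in trees)

# Toy data (hypothetical): a three-level step function.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 3.0, 3.0, 8.0, 8.0]
trees = boost(xs, ys)
print([round(boosted_predict(trees, x), 3) for x in xs])
```

The final prediction is a sum of per-tree contributions, in the same way the slide adds 8.0, -2.0, 1.0 and 0.5 to reach 7.5.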
Boosting Trees - Algorithm
Objective function:
Loss
Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine." 1999
Results - Precision
Precision at a fixed recall level
Model | April | May | June
Random Forest | 76% | 77% | 80%
Boosting Trees | 85% | 82% | 88%
Relative gain | +11.8% | +6.5% | +10%
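For reference, "precision at a fixed recall level" can be computed by ranking payments by model score and lowering the threshold until the target recall is reached. The sketch below is one common way to do it; the function name and toy data are hypothetical, and ties between scores are not treated specially.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the highest score threshold whose recall reaches target_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("no positive samples")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    for _, label in ranked:
        if label:
            tp += 1          # flagged and actually fraud
        else:
            fp += 1          # flagged but legitimate
        if tp / total_pos >= target_recall:
            return tp / (tp + fp)
    raise ValueError("target recall not reachable")

# Toy data (hypothetical): 1 = fraud, 0 = legitimate.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 1, 0, 0]
print(precision_at_recall(scores, labels, 1.0))   # 0.75
```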
Results - Fraud Detection Recall
[Plot: axes "# Payments to Reject" and "Fraud $ Prevented"; regions labeled Easy, Medium, Hard]
Data Sampling
Highly biased in label distribution
- Fewer than 1 in 1000 samples are positive
Weighted training
- Higher weights on positive samples => oscillation
- Lower weights on negative samples => no real gain
Solution
- Keep negative:positive ratio between 3:1 and 10:1
- Scale the final model if calibration is needed
Less data requires fewer resources to train
Observed +10% improvement going from a 20:1 to a 3:1 ratio
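A minimal sketch of the sampling recipe above: keep every positive, downsample negatives to a fixed ratio, then rescale scores back to the original class prior. The slide does not say how the final model was scaled; the prior-correction formula used here is the standard one for negative undersampling and is an assumption, as are the function names and the toy data.

```python
import random

def downsample_negatives(rows, labels, ratio=3, seed=0):
    """Keep all positives and at most ratio * (#positives) randomly chosen negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = rng.sample(neg, min(len(neg), ratio * len(pos)))
    beta = len(keep_neg) / len(neg)              # fraction of negatives kept
    keep = sorted(pos + keep_neg)
    return [rows[i] for i in keep], [labels[i] for i in keep], beta

def calibrate(p_sampled, beta):
    """Map a probability learned on downsampled data back to the original prior."""
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1.0)

# Toy data (hypothetical): 10 positives in 10,000 rows, i.e. 1 in 1000.
rows = list(range(10000))
labels = [1 if i < 10 else 0 for i in rows]
sub_rows, sub_labels, beta = downsample_negatives(rows, labels, ratio=3)
print(len(sub_labels), sum(sub_labels))          # 40 10
print(round(calibrate(0.25, beta), 6))           # 0.001
```

In the toy data, a score of 0.25 on the 3:1 sample (the sampled base rate) maps back to 0.001, the base rate in the full data.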
Productionalize Machine Learning
Startup Architecture
‣ Ruby-on-Rails + MySQL
‣ MySQL replication
‣ Tied to production schema
‣ Hard to do complex analysis
Scale it up: SOA + Data Warehouse
‣ Java services
‣ APIs
‣ HDFS
Scale it up: Data Transport
‣ Append-only feeds
‣ Kafka
‣ Replication
‣ Protocol buffers
Payments
Highly Available
Merchant
Devices
Bank
Accounts
Suspicions
Parallel Environments and Data Integrity
Blue
Green
VIP
upstream
Other ML @ Square
Square Random Forest
Learning Management
Recommendation
Square Random Forest
RF Learner | Implementation | Time (Train / Test)
RiskML Random Forest (Built on Scikit-Learn) | C / Cython / Python (Open Source + Square Code) | 72 minutes
WiseRF | C++ (Proprietary) | 23 minutes
Square Random Forest | Java (Square Code) | 15 minutes
Note: time reported on 3M training and 15M testing data
Learning Management System
‣ Support non-sophisticated users
‣ Fast ad-hoc analytics
‣ Accessible to everyone for easy model generation and evaluation
‣ Tracks results to ensure different models can be compared
Square Market
Recommendation
10x conversion rate vs. random baseline
ML @ Square
rongyan@squareup.com

Square's Machine Learning Infrastructure and Applications - Rong Yan