Joseph Malicki, Inaz Alaei-Novin
Yelp Ad Targeting at Scale
with Apache Spark
Background - Yelp
Ad Targeting Intro
Model Training
Tools
Deployment to Production
Wrap-up
About us
• Joseph Malicki, Inaz Alaei-Novin
• Data mining engineers at Yelp
• Ad delivery team
Yelp’s Mission
Connecting people with great local businesses.
Yelp Stats
As of Q1 2017
• 127M monthly unique desktop users
• 76% of searches via mobile app
• 99M monthly unique mobile users
Yelp Ads
Yelp Ad Targeting
• Majority of Yelp ads are cost-per-click
- Yelp only gets paid if user clicks on an ad
• Native advertisements
- Advertisers and content within Yelp platform
Cost-per-Click Ad Auction
Maximize expected revenue:
1. Order by advertiser bid × predicted click-through-rate (pCTR)
2. Pay second price
Expected[Revenue] = Bid × Expected[CTR]
Because of multiplication, predicted CTR must be well-calibrated, not only well-ordered
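The ranking and pricing steps above can be sketched in a few lines of Python. This is a minimal illustration, not Yelp's implementation: the `Candidate` class, the example advertisers, and the specific second-price formula (the smallest bid that would still keep the winner ranked first) are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    advertiser: str
    bid: float   # maximum cost-per-click the advertiser will pay
    pctr: float  # predicted click-through rate, assumed to be in (0, 1]


def run_auction(candidates):
    """Rank by bid * pCTR and price the winner with a second-price rule.

    Returns (winner, price_per_click), or None when there are no candidates.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c.bid * c.pctr, reverse=True)
    winner = ranked[0]
    if len(ranked) == 1:
        # Assumption: with no competitor the winner pays nothing;
        # a real system would likely apply a reserve price.
        return winner, 0.0
    runner_up = ranked[1]
    # Assumed pricing rule: smallest bid that still keeps the winner ahead.
    price = runner_up.bid * runner_up.pctr / winner.pctr
    return winner, price


candidates = [
    Candidate("pizzeria", bid=2.00, pctr=0.010),  # bid * pCTR = 0.020
    Candidate("plumber", bid=5.00, pctr=0.003),   # bid * pCTR = 0.015
    Candidate("dentist", bid=1.00, pctr=0.012),   # bid * pCTR = 0.012
]
winner, price = run_auction(candidates)
print(winner.advertiser, round(price, 2))
```

Note that the highest bidder does not win here: the ordering depends on the product, which is why a miscalibrated pCTR changes both who wins and what they pay.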
Yelp Ad Targeting
Ad delivery
Context
• Each request different
• Context matters
• Location
• Search query
• User attributes
• etc.
How to Generalize?
Use machine learning to estimate CTR and show relevant ads
CTR prediction system overview
[Diagram labels: Logs, Online, Offline, Ad delivery, Training pipeline, Model]
Offline Training at Yelp
[Pipeline diagram: Ad event logs, Sampling, Feature Extraction, Model Training, Evaluating, New Features]
Ad Event Logs
[Pipeline diagram, current stage: ad event logs (JSON)]
Sampling
[Pipeline diagram, current stage: Sampling; labels: JSON, train samples, test samples]
• Sampling is 4X faster with Spark!
Feature Extraction
Fields of each extracted sample:
• Ad event id
• Number of features
• Label
• Feature indices and values
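One way to picture a sample carrying these four fields is a sparse, one-line-per-event record. The sketch below is only an illustration: the space-separated `index:value` layout and the `encode_sample` / `decode_sample` helpers are hypothetical, since the slides do not specify the actual on-disk format.

```python
def encode_sample(ad_event_id, label, features, num_features):
    """Serialize one sample as: id, feature count, label, then index:value pairs.

    `features` maps feature index to value; zero values are dropped (sparse).
    Assumes the ad event id contains no spaces.
    """
    pairs = sorted((i, v) for i, v in features.items() if v != 0.0)
    if any(i < 0 or i >= num_features for i, _ in pairs):
        raise ValueError("feature index out of range")
    fields = [ad_event_id, str(num_features), str(label)]
    fields += [f"{i}:{v}" for i, v in pairs]
    return " ".join(fields)


def decode_sample(line):
    """Inverse of encode_sample."""
    ad_event_id, num_features, label, *pairs = line.split(" ")
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return ad_event_id, int(label), features, int(num_features)


line = encode_sample("evt-001", 1, {7: 0.5, 2: 1.0, 5: 0.0}, num_features=10)
print(line)  # evt-001 10 1 2:1.0 7:0.5
```

Storing only the non-zero indices keeps each record small when most features are absent for a given ad event.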
Model Training
[Pipeline diagram, current stage: Model Training; labels: train samples, test samples, model]
Evaluation
[Pipeline diagram, current stage: Evaluation]
• Compare MXE with status quo, and use visualizations
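A small sketch of the comparison, assuming MXE stands for mean cross-entropy (log loss) over the test samples; the slides do not spell the acronym out. The labels and the two sets of predictions are made-up illustration data.

```python
import math


def mean_cross_entropy(labels, predictions, eps=1e-12):
    """Average negative log-likelihood of binary labels under the predictions."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


labels = [1, 0, 0, 0]                 # 1 = clicked, 0 = not clicked
status_quo = [0.5, 0.5, 0.5, 0.5]     # hypothetical current model
challenger = [0.7, 0.2, 0.1, 0.3]     # hypothetical new model
mxe_sq = mean_cross_entropy(labels, status_quo)
mxe_ch = mean_cross_entropy(labels, challenger)
print(f"status quo {mxe_sq:.4f}, challenger {mxe_ch:.4f}")
```

Lower is better, so a challenger would be preferred when its MXE on the same test samples is lower than the status quo's.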
Feature contribution in a Model
• Feature contribution (i) = σ_i × ω_i
• Standard deviation × model coefficient
Compare Feature Importance in Multiple Models
• Feature contribution (i) = μ_i × ω_i
• Feature mean × model coefficient
• Use colStats from pyspark.mllib.stat.Statistics to compute column summary statistics
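The two contribution measures can be illustrated without a cluster. The sketch below uses Python's standard `statistics` module as a stand-in for the column statistics that `colStats` would provide on an RDD; the tiny dense matrix, the coefficients, and the use of the population standard deviation are assumptions made for the example.

```python
import statistics


def feature_contributions(rows, coefficients):
    """Return per-feature mean * coefficient and std * coefficient."""
    columns = list(zip(*rows))  # transpose rows into per-feature columns
    result = []
    for column, weight in zip(columns, coefficients):
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)  # population standard deviation
        result.append({"mean_x_coef": mean * weight, "std_x_coef": std * weight})
    return result


rows = [
    [1.0, 10.0],
    [3.0, 10.0],
    [1.0, 10.0],
    [3.0, 10.0],
]
coefficients = [0.5, -0.2]  # hypothetical model coefficients
contributions = feature_contributions(rows, coefficients)
for i, c in enumerate(contributions):
    print(i, c)
```

The second feature is constant, so its standard-deviation-based contribution is zero even though its mean-based contribution is not: the two measures answer different questions.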
Compare Feature Contributions in Models
Compare feature contribution in 2 models:
- How much would the status quo MXE change if we changed the coefficient of one feature from its status quo value to its challenger value?
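The question above can be answered one feature at a time. The following sketch assumes both models are logistic regressions over the same feature order with a shared bias, and again reads MXE as mean cross-entropy; the function names and the toy data are hypothetical.

```python
import math


def predict(weights, bias, x):
    """Logistic regression prediction for one feature vector."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def mxe(weights, bias, rows, labels):
    """Mean cross-entropy of the model on the given samples."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict(weights, bias, x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(rows)


def mxe_change_per_feature(sq_weights, ch_weights, bias, rows, labels):
    """MXE delta when a single status quo coefficient is swapped for the challenger's."""
    base = mxe(sq_weights, bias, rows, labels)
    changes = []
    for i, challenger_weight in enumerate(ch_weights):
        swapped = list(sq_weights)
        swapped[i] = challenger_weight
        changes.append(mxe(swapped, bias, rows, labels) - base)
    return changes


rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
labels = [1, 0, 1, 0]
changes = mxe_change_per_feature([0.0, 0.0], [2.0, 0.0], 0.0, rows, labels)
print(changes)
```

A negative entry means the challenger's coefficient for that feature would lower the status quo MXE on its own; an entry of zero means the two models agree on that coefficient.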
New Features
[Pipeline diagram, current stage: New Features]
Visualizations
• Use RDD's histogram method and some RDD mappings to generate the plots
• Example: pair of features against oCTR or ad events
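As a rough picture of the data behind such a plot, the sketch below buckets a single feature and computes the click rate per bucket in plain Python. It is a stand-in, not the Spark job: the bucket convention follows `RDD.histogram` (every bucket is open on the right except the last), oCTR is assumed to mean observed CTR, and the event tuples are invented.

```python
def octr_by_bucket(events, edges):
    """Count ad events and compute observed CTR per feature bucket.

    events: iterable of (feature_value, clicked) pairs.
    edges: sorted bucket boundaries, e.g. [0, 1, 2, 3] for three buckets.
    """
    n = len(edges) - 1
    impressions = [0] * n
    clicks = [0] * n
    for value, clicked in events:
        if value < edges[0] or value > edges[-1]:
            continue  # outside the histogram range
        index = n - 1  # the last bucket is closed on the right
        for b in range(n):
            if value < edges[b + 1]:
                index = b
                break
        impressions[index] += 1
        clicks[index] += int(clicked)
    return [
        (edges[b], edges[b + 1], impressions[b],
         clicks[b] / impressions[b] if impressions[b] else None)
        for b in range(n)
    ]


events = [(0.5, False), (1.5, True), (1.7, False), (3.0, True)]
table = octr_by_bucket(events, [0, 1, 2, 3])
for row in table:
    print(row)
```

Each row holds the bucket bounds, the number of ad events, and the observed CTR, which are the two quantities the slide plots features against.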
Training Pipeline
[Pipeline diagram: JSON, train samples, test samples, model]
Spark related tools
• Zeppelin Notebook
• mrjob
Zeppelin Notebook
• Web-based notebook
• Interactive data analytics
• Supports multiple languages
• Supports Spark
• At Yelp we use it for:
◦ Ad-hoc analysis
◦ Testing new training algorithms
◦ Debugging
mrjob
• One of Yelp's contributions to open source!
• Lets you write multi-step MapReduce jobs in Python
• Test on your local machine
• Run on a Hadoop cluster
• Run in the cloud using EMR
• Run in the cloud using Google Cloud Dataproc
• Easily run Spark jobs on EMR or your own Hadoop cluster
Production concerns
Offline Batch
• Overnight or developer-initiated jobs
• Millions to billions of datapoints
• Batch-oriented (hours)
• Apache Spark
Online Ad Serving
• User hits button on app, needs quick response
• Smaller number of locally and contextually relevant candidates
• Real-time (milliseconds)
• Java servlet
Shared code (libraries)
Monitoring
• If the CTR prediction model stops being accurate, it could lead to loss of revenue
• How do we know models are working properly?
• Need to check that model predictions are accurate over time
Monitoring
• Large batch jobs check actual user ad click-through-rate against predicted CTR
• Model accuracy is far more sensitive than overall metrics: traffic mix is accounted for
• Spark Streaming allows real-time alerts
- A practical approach to building a streaming processing pipeline for an online advertising platform - Spark Summit 2017
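The core of such a check is a comparison between the mean predicted CTR and the observed click rate over the same ad events. The sketch below is a simplified, single-machine illustration; the function name, the 20% tolerance, and the sample batches are assumptions, not values from the talk.

```python
def check_ctr_drift(predictions, clicks, tolerance=0.2):
    """Compare observed CTR with mean predicted CTR for one batch of ad events.

    predictions: predicted click probabilities, one per ad event.
    clicks: 1 if the ad was clicked, else 0, in the same order.
    tolerance: hypothetical allowed relative deviation before alerting.
    """
    if not predictions or len(predictions) != len(clicks):
        raise ValueError("need equal-length, non-empty inputs")
    predicted_ctr = sum(predictions) / len(predictions)
    observed_ctr = sum(clicks) / len(clicks)
    ratio = observed_ctr / predicted_ctr if predicted_ctr > 0 else float("inf")
    return {
        "predicted_ctr": predicted_ctr,
        "observed_ctr": observed_ctr,
        "ratio": ratio,
        "alert": abs(ratio - 1.0) > tolerance,
    }


healthy = check_ctr_drift([0.1] * 10, [1] + [0] * 9)
broken = check_ctr_drift([0.1] * 10, [0] * 10)
print(healthy["alert"], broken["alert"])
```

Because the prediction already accounts for the context of each request, a shift in traffic mix moves predicted and observed CTR together, while a genuine model or logging problem pulls them apart.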
Monitoring - Examples
• Misspelled header in API call refactor
• Change in HTTPS caching behavior affects CTR
[Plots: Normal vs. Problem]
Monitoring - Calibration Plot
• Recall: the ad auction orders by advertiser bid × predicted click-through-rate (pCTR)
• Because of multiplication, predicted probabilities need to be well-calibrated
• Goal: $P(\text{clicked} \mid \widehat{\mathrm{CTR}} = y) = y$
Monitoring - Calibration Plot
[Plot: x-axis Predicted CTR; y-axes Fraction of ads and Observed - Predicted CTR]
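The quantities on such a plot can be computed by bucketing ad events on predicted CTR. The following sketch is an assumption-laden illustration: it uses equal-width bins over [0, 1] and a hypothetical `calibration_table` helper, whereas a production job would more likely use finer or log-spaced bins because real CTRs are small.

```python
def calibration_table(predictions, clicks, num_bins=10):
    """Per predicted-CTR bin: fraction of ads and observed minus predicted CTR."""
    bins = [{"count": 0, "pred_sum": 0.0, "clicks": 0} for _ in range(num_bins)]
    for p, c in zip(predictions, clicks):
        index = min(int(p * num_bins), num_bins - 1)  # p == 1.0 goes in the last bin
        bins[index]["count"] += 1
        bins[index]["pred_sum"] += p
        bins[index]["clicks"] += c
    total = len(predictions)
    rows = []
    for i, b in enumerate(bins):
        if b["count"] == 0:
            continue  # skip empty bins
        mean_predicted = b["pred_sum"] / b["count"]
        observed = b["clicks"] / b["count"]
        rows.append({
            "bin": (i / num_bins, (i + 1) / num_bins),
            "fraction_of_ads": b["count"] / total,
            "mean_predicted": mean_predicted,
            "observed_minus_predicted": observed - mean_predicted,
        })
    return rows


predictions = [0.05, 0.05, 0.05, 0.05, 0.25, 0.25, 0.25, 0.25]
clicks = [0, 0, 0, 0, 1, 0, 0, 0]
rows = calibration_table(predictions, clicks)
for row in rows:
    print(row)
```

A well-calibrated model keeps the observed-minus-predicted value near zero in every bin that holds a meaningful fraction of ads.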
Monitoring - Calibration Plot
• Logistic regression loss is a proper scoring rule
- Generates models that are well-calibrated on average
• Feature engineering problems can cause poor calibration
• Probability distribution drifting over time will cause loss of calibration
- e.g. changes to user interface affecting behavior
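To see why a proper scoring rule favors calibrated predictions, consider a group of ad events whose true click probability is p, and a model that predicts q for all of them. Under that idealized assumption (the symbols p, q, and L are introduced here for illustration), the expected logistic loss is minimized exactly when the prediction matches the true probability:

```latex
\begin{align*}
L(q) &= -p \log q - (1 - p)\log(1 - q) \\
\frac{dL}{dq} &= -\frac{p}{q} + \frac{1 - p}{1 - q} = \frac{q - p}{q(1 - q)} \\
\frac{dL}{dq} = 0 &\iff q = p
\end{align*}
```

The derivative is negative for q < p and positive for q > p, so q = p is the minimum. If user behavior shifts p after training, the stored q no longer sits at that minimum, which is the loss of calibration described above.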
Spark at Yelp
• Spark increasingly used throughout Yelp
• Streaming
• Iteration
• Easy specification of job flows
• Want to work with Spark? We're hiring - stop by the Yelp booth in the exhibition area, until 4:30pm
www.yelp.com/careers/
We're Hiring!
@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp
Thank You.
Questions?