Using GraphX/Pregel on Browsing
History to Discover Purchase Intent
Zhang, Lisa
Rubicon Project Buyer Cloud
Problem
• Identify possible new customers for
our advertisers using intent data,
one source of which is browsing history
(e.g. travel-site-101.com, spark-summit.org)
Challenges
Sites are numerous
and ever-changing
Need to build one
model per advertiser
Positive training cases
are sparse
Models run frequently:
every few hours
Offline Evaluation Metrics
• AUC: area under ROC
curve
• Precision at top 5% of
score: model used to
identify top users only
• Baseline: Previous
solution prior to Spark
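The two offline metrics above can be sketched in plain Scala. This is only an illustration under stated assumptions: the object and method names (`OfflineMetrics`, `precisionAtTop`, `auc`) are hypothetical, each user is represented as a `(score, converted)` pair, and AUC is computed by brute-force pair counting, which is fine for small samples but is not how one would compute it at scale.

```scala
object OfflineMetrics {
  /** Fraction of actual converters among the top `fraction` of users ranked by score. */
  def precisionAtTop(scored: Seq[(Double, Boolean)], fraction: Double): Double = {
    // Always inspect at least one user, even for tiny samples.
    val k = math.max(1, math.ceil(scored.size * fraction).toInt)
    val top = scored.sortBy(-_._1).take(k)
    top.count(_._2).toDouble / k
  }

  /** AUC as the probability that a random positive outscores a random negative (ties count half). */
  def auc(scored: Seq[(Double, Boolean)]): Double = {
    val pos = scored.collect { case (s, true) => s }
    val neg = scored.collect { case (s, false) => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one positive and one negative")
    val wins = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

In the deck's setting, `precisionAtTop(scored, 0.05)` would correspond to "precision at top 5% of score".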
Linear Dimensionality Reduction
INPUT → SVD (Dimension Reduction) → GBT (Classification) → OUTPUT
per advertiser
Evaluation
SVD: Top Sites
Home Improvement Advertiser
deal-site-101.com
chat-site-001.com
ecommerce-site-001.com
chat-site-002.com
invitation-site-001.com
classified-site-001.com
Telecom Advertiser
developer-forum-001.com
chat-site-001.com
invitation-site-001.com
deal-site-101.com
college-site-001.com
chat-site-002.com
The Issue with SVDs
• Dominated by the same signal across all
advertisers
• Identify online buyers, but not those
specific to each advertiser
• Not appropriate for our use case
SVD per Advertiser?
INPUT → SVD (Dimension Reduction) → GBT (Classification) → OUTPUT
per advertiser
Non-linear Approaches?
Too Complex:
Cannot run frequently;
we become slow to learn
about new sites
Too Simple:
Possibly same
problem as SVD
Trade-off: Speed vs. Complexity
Can We Simplify?
Intuition:
Given a known positive training case,
target other users whose site history is
similar to that user's.
One natural way to do this is to treat sites as a graph.
Sites as Graphs
• Easy to interpret
• Easy to
visualize
• Graph algos
well studied
Spark GraphX
• Spark’s API for parallel graph computations
• Comes with some common graph
algorithms
• API for developing new graph algorithms,
e.g. via Pregel
Pregel API
• Pass messages from vertices to other, typically
adjacent, vertices: “Think like a vertex”
• Define an algorithm by stating:
    how to send messages
    how to merge multiple messages
    how to update a vertex with the merged message
  then repeat
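The send / merge / update / repeat loop can be sketched without Spark as a small in-memory function. This is not the GraphX implementation: `MiniPregel`, its `Edge` case class and the `run` signature are hypothetical names for illustration, and unlike GraphX this sketch sends along every edge in every round rather than only along edges of vertices that just received a message.

```scala
object MiniPregel {
  // Hypothetical in-memory edge; not the GraphX Edge class.
  final case class Edge(src: Long, dst: Long, weight: Double)

  def run[V, M](
      vertices: Map[Long, V],
      edges: Seq[Edge],
      initialMsg: M,
      maxIterations: Int)(
      update: (Long, V, M) => V,
      send: (Edge, V, V) => Iterator[(Long, M)],
      merge: (M, M) => M): Map[Long, V] = {
    // Round 0: every vertex is updated with the initial message.
    var state = vertices.map { case (id, v) => id -> update(id, v, initialMsg) }
    var iteration = 0
    var anyMessages = true
    while (anyMessages && iteration < maxIterations) {
      // 1. Send: each edge may emit messages addressed to vertex ids.
      val sent: Seq[(Long, M)] =
        edges.flatMap(e => send(e, state(e.src), state(e.dst)))
      // 2. Merge: combine all messages addressed to the same vertex.
      val inbox: Map[Long, M] =
        sent.groupBy(_._1).map { case (id, msgs) => id -> msgs.map(_._2).reduce(merge) }
      // 3. Update: vertices with mail compute a new value; others keep theirs.
      state = state.map { case (id, v) => id -> inbox.get(id).fold(v)(m => update(id, v, m)) }
      anyMessages = inbox.nonEmpty
      iteration += 1
    }
    state
  }
}

object MiniPregelDemo {
  // Vertex 1 passes half of its value to vertex 2 in a single round.
  val result: Map[Long, Double] = MiniPregel.run(
    Map(1L -> 1.0, 2L -> 0.0),
    Seq(MiniPregel.Edge(1L, 2L, 0.5)),
    initialMsg = 0.0,
    maxIterations = 1)(
    (_, w, delta) => w + delta,
    (e, srcW, _) => Iterator(e.dst -> srcW * e.weight),
    _ + _)
}
```

The three function arguments mirror the three things the slide says an algorithm must state; everything else is bookkeeping.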
Propagation Based Approach
• Pass positive
(converter)
information
across edges
• Give credit to
“similar” sites
Example Scenario
travel-site-101.com: 1 converter / 40,000 visitors
book-my-travel-103.com: 0 converters / 48,000 visitors
canoe-travel-102.com: 0 converters / 41,000 visitors
Sending Messages
travel-site-101.com: ω = 1/40,000
travel-site-101.com → canoe-travel-102.com: Δω = ω * edge_weight
travel-site-101.com → book-my-travel-103.com: Δω = ω * edge_weight
Receiving Messages
Messages received from neighbouring sites: Δω1, Δω2, …, Δωn
ωnew = ωold + λ · Σ Δωi
Sites shown: hawaii-999.com, canoe-travel-102.com, travel-site-101.com, …
Weights After One Iteration
book-my-travel-103.com
canoe-travel-102.com
2.5 x 10^(-5)
1.2 x 10^(-5)
0.8 x 10^(-5)
travel-site-101.com
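One propagation round on the three-site example can be worked through in plain Scala. The slides do not give λ, the edge weights, or which neighbour ends up with which value, so the numbers below are assumptions chosen only to reproduce the magnitudes shown: λ = 1.0, an edge weight of 0.48 to book-my-travel-103.com and 0.32 to canoe-travel-102.com. `PropagationExample`, `SiteEdge` and `propagateOnce` are hypothetical names.

```scala
object PropagationExample {
  // Undirected weighted edge between two sites (weights are assumed, not from the slides).
  final case class SiteEdge(a: String, b: String, weight: Double)

  /** One round of: omega_new = omega_old + lambda * sum(delta_omega_i). */
  def propagateOnce(
      omega: Map[String, Double],
      edges: Seq[SiteEdge],
      lambda: Double): Map[String, Double] = {
    // Each edge carries delta = omega(sender) * weight in both directions.
    val deltas = edges.flatMap { e =>
      Seq(e.b -> omega(e.a) * e.weight, e.a -> omega(e.b) * e.weight)
    }
    val summed = deltas.groupBy(_._1).map { case (site, ds) => site -> ds.map(_._2).sum }
    omega.map { case (site, w) => site -> (w + lambda * summed.getOrElse(site, 0.0)) }
  }

  // Initial weights: converters / visitors per site.
  val initial: Map[String, Double] = Map(
    "travel-site-101.com"    -> 1.0 / 40000, // 1 converter
    "book-my-travel-103.com" -> 0.0,         // 0 converters
    "canoe-travel-102.com"   -> 0.0)         // 0 converters

  // Assumed edge weights, for illustration only.
  val edges: Seq[SiteEdge] = Seq(
    SiteEdge("travel-site-101.com", "book-my-travel-103.com", 0.48),
    SiteEdge("travel-site-101.com", "canoe-travel-102.com", 0.32))

  val afterOne: Map[String, Double] = propagateOnce(initial, edges, lambda = 1.0)
}
```

Under these assumptions travel-site-101.com keeps its weight of 2.5 x 10^(-5), because its neighbours start at zero and send it nothing, while the two neighbours receive roughly 1.2 x 10^(-5) and 0.8 x 10^(-5).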
Simplified Code
import org.apache.spark.graphx._

type MT = Double; type ED = Double; type VD = Double
val lambda = …; val maxIterations = …
val initialMsg: MT = 0.0

// Add the scaled sum of incoming messages to the vertex weight.
def updateVertex(id: VertexId, w: VD, delta_w: MT): VD =
  w + lambda * delta_w

// Send each endpoint the other endpoint's weight times the edge weight.
def sendMessage(edge: EdgeTriplet[VD, ED]): Iterator[(VertexId, MT)] =
  Iterator((edge.srcId, edge.attr * edge.dstAttr),
           (edge.dstId, edge.attr * edge.srcAttr))

// Sum all messages arriving at the same vertex.
def mergeMsgs(w1: MT, w2: MT): MT = w1 + w2

val graph: Graph[VD, ED] = …
graph.pregel(initialMsg, maxIterations, EdgeDirection.Out)(
  updateVertex, sendMessage, mergeMsgs)
Model Output & Application
• Model output is a
mapping of sites to
final scores
• To apply the model,
aggregate scores of
sites visited by user
SITE SCORE
travel-site-101.com 0.5
canoe-travel-102.com 0.4
sport-team-101.com 0.1
… …
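Applying the model needs only a lookup table, which a few lines of plain Scala can illustrate. The slides say scores are "aggregated" without naming the function, so summing is an assumption here, as are counting each site once per user and scoring unknown sites as 0.0. `UserScoring` and `scoreUser` are hypothetical names.

```scala
object UserScoring {
  /** Sum the model scores of the distinct sites a user visited (unknown sites score 0.0). */
  def scoreUser(siteScores: Map[String, Double], visited: Iterable[String]): Double =
    visited.toSeq.distinct.map(site => siteScores.getOrElse(site, 0.0)).sum

  // Site scores taken from the table on the slide.
  val siteScores: Map[String, Double] = Map(
    "travel-site-101.com"  -> 0.5,
    "canoe-travel-102.com" -> 0.4,
    "sport-team-101.com"   -> 0.1)
}
```

For example, a user who visited travel-site-101.com and sport-team-101.com would score about 0.6 under this summing assumption.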
Other Factors
• Edge Weights: Cosine Similarity, Jaccard Index,
Conditional Probability
• Edge/Vertex Removal: Remove sites and edges in
the long tail
• Hyperparameter Tuning: lambda, numIterations
and others, chosen through testing (there is no convergence)
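The three edge-weight candidates can be sketched over sets of visitor ids. The slides do not say how the similarities were computed, so this assumes each site is represented by the set of users who visited it (binary visit vectors); `EdgeWeights` and its method names are hypothetical.

```scala
object EdgeWeights {
  /** Jaccard index: shared visitors over all visitors of either site. */
  def jaccard(a: Set[Long], b: Set[Long]): Double = {
    val union = (a | b).size
    if (union == 0) 0.0 else (a & b).size.toDouble / union
  }

  /** Cosine similarity of the two binary visit vectors. */
  def cosine(a: Set[Long], b: Set[Long]): Double =
    if (a.isEmpty || b.isEmpty) 0.0
    else (a & b).size / math.sqrt(a.size.toDouble * b.size)

  /** Conditional probability that a visitor of site `a` also visited site `b`. */
  def conditional(a: Set[Long], b: Set[Long]): Double =
    if (a.isEmpty) 0.0 else (a & b).size.toDouble / a.size
}
```

Note that Jaccard and cosine are symmetric, while conditional probability is not, so choosing it would give the two directions of an edge different weights.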
Evaluation
Propagation: Top Sites
Home Improvement Advertiser
label-maker-101.com
laptop-bags-101.com
renovations-101.com
fitness-equipment-101.com
renovations-102.com
buy-realestate-101.com
Telecom Advertiser
canada-movies-101.ca
canadian-news-101.ca
canadian-jobs-101.ca
canadian-teacher-rating-101.ca
watch-tv-online.com
phone-system-review-101.com
Callouts: Canadian, Telecom, Renovations
Challenges (from earlier)
Sites are numerous
and ever-changing
Need to build one
model per advertiser
Positive training cases
are sparse
Models run frequently:
every few hours
Resolutions
Graph built just in
time for training
Graph built once;
propagation runs per
advertiser
Propagation resolves
sparsity: intuitive and
interpretable
Evaluating users is fast
and does not require GraphX
General Spark Learnings
• Many small jobs > one large job: We split big jobs into multiple smaller
jobs and increased throughput (more jobs could run concurrently).
• Serialization: Don’t save SparkContext as a member variable, define Python
classes in a separate file, and check that your objects serialize/deserialize well!
• Prefer rdd.reduceByKey() and similar by-key aggregations over rdd.groupByKey().
• Be careful with rdd.coalesce() vs rdd.repartition(); rdd.partitionBy() can be
your friend in the right circumstances.
THANK YOU.
lzhang@rubiconproject.com
