Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Zhang

Rubicon Project Buyer Cloud

Problem
• Identify possible new customers for our advertisers using intent data, one source of which is browsing history (e.g. visits to travel-site-101.com, spark-summit.org)

Challenges
• Sites are numerous and ever-changing
• Need to build one model per advertiser
• Positive training cases are sparse
• Models run frequently: every few hours

Offline Evaluation Metrics
• AUC: area under the ROC curve
• Precision at top 5% of score: the model is used to identify top users only
• Baseline: the previous solution, prior to Spark
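
The two metrics above can be sketched in plain Scala. This is only an illustration, not the evaluation code used in the talk: the object name `OfflineMetrics`, the `(score, isConverter)` input shape, and the pairwise AUC formulation (quadratic in the number of users, so suitable for small samples only) are assumptions.

```scala
object OfflineMetrics {
  // Each element is (model score, whether the user actually converted).
  // Precision among the top `fraction` of users ranked by score.
  def precisionAtTop(scored: Seq[(Double, Boolean)], fraction: Double): Double = {
    val k = math.max(1, math.ceil(scored.size * fraction).toInt)
    val top = scored.sortBy(-_._1).take(k)
    top.count(_._2).toDouble / k
  }

  // AUC as the probability that a random converter outranks a random
  // non-converter; ties count as half a win.
  def auc(scored: Seq[(Double, Boolean)]): Double = {
    val pos = scored.collect { case (s, true) => s }
    val neg = scored.collect { case (s, false) => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one positive and one negative")
    val wins = (for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0).sum
    wins / (pos.size.toLong * neg.size)
  }
}
```

For the 5% metric from the slide, `fraction` would be `0.05`.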

Linear Dimensionality Reduction
INPUT → SVD (dimension reduction) → GBT (classification) → OUTPUT
[diagram label: per advertiser]

SVD: Top Sites
Home Improvement Advertiser
deal-site-101.com
chat-site-001.com
ecommerce-site-001.com
chat-site-002.com
invitation-site-001.com
classified-site-001.com
Telecom Advertiser
developer-forum-001.com
chat-site-001.com
invitation-site-001.com
deal-site-101.com
college-site-001.com
chat-site-002.com

The Issue with SVDs
• Dominated by the same signal across all advertisers
• Identify online buyers, but not those specific to each advertiser
• Not appropriate for our use case

SVD per Advertiser?
INPUT → SVD (dimension reduction) → GBT (classification) → OUTPUT
[diagram label: per advertiser]

Non-linear Approaches?
Trade-off between speed and complexity:
• Too complex: cannot run frequently, so we become slow to learn about new sites
• Too simple: possibly the same problem as SVD

Can We Simplify?
Intuition: given a known positive training case, target other users whose site history is similar to that user's.
One natural way to do this is to treat sites as a graph.

Sites as Graphs
• Easy to interpret
• Easy to visualize
• Graph algorithms are well studied

Spark GraphX
• Spark’s API for parallel graph computations
• Comes with some common graph algorithms
• API for developing new graph algorithms, e.g. via Pregel

Pregel API
• Pass messages from vertices to other, typically adjacent, vertices: “Think like a vertex”
• Define an algorithm by stating:
    • how to send messages
    • how to merge multiple messages
    • how to update a vertex with a message
    • repeat
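
The send / merge / update / repeat cycle can be illustrated without Spark. The sketch below is a hypothetical in-memory stand-in (`MiniPregel`, `Edge`, and `example` are made-up names), not GraphX itself. It also simplifies the model: every edge sends on every iteration and there is no initial message, whereas GraphX tracks which vertices are active.

```scala
object MiniPregel {
  final case class Edge(src: Long, dst: Long, weight: Double)

  // V = vertex state, M = message type.
  // Assumes every edge endpoint is present in `vertices`.
  def run[V, M](vertices: Map[Long, V], edges: Seq[Edge], maxIterations: Int)(
      update: (Long, V, M) => V,
      send: (Edge, V, V) => Seq[(Long, M)],
      merge: (M, M) => M): Map[Long, V] = {
    var state = vertices
    for (_ <- 0 until maxIterations) {
      // 1. Send: each edge emits (recipientId, message) pairs.
      val messages: Seq[(Long, M)] =
        edges.flatMap(e => send(e, state(e.src), state(e.dst)))
      // 2. Merge: combine all messages addressed to the same vertex.
      val inbox: Map[Long, M] = messages.groupBy(_._1).map {
        case (id, ms) => id -> ms.map(_._2).reduce(merge)
      }
      // 3. Update: vertices with mail compute their new state.
      state = state.map {
        case (id, v) => id -> inbox.get(id).fold(v)(m => update(id, v, m))
      }
    }
    state
  }

  // Two sites joined by one edge of weight 0.5; one iteration.
  def example(): Map[Long, Double] =
    run[Double, Double](
      Map(1L -> 1.0, 2L -> 0.0),
      Seq(Edge(1L, 2L, 0.5)),
      maxIterations = 1)(
      (_, w, delta) => w + delta,
      (e, srcW, dstW) => Seq(e.src -> e.weight * dstW, e.dst -> e.weight * srcW),
      _ + _)
}
```

In `example`, vertex 2 receives 0.5 * 1.0 from vertex 1 and ends at 0.5, while vertex 1 receives 0.5 * 0.0 and stays at 1.0.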

Propagation Based Approach
• Pass positive (converter) information across edges
• Give credit to “similar” sites

Example Scenario
travel-site-101.com
book-my-travel-103.com
canoe-travel-102.com
1 converter / 40,000 visitors

Sending Messages
ω = 1/40,000
Δω = ω * edge_weight (one message per edge)
[Diagram nodes: travel-site-101.com, canoe-travel-102.com, book-my-travel-103.com]

Receiving Messages
Incoming messages: Δω1, Δω2, …, Δωn
ωnew = ωold + λ · Σ Δωi
[Diagram nodes: travel-site-101.com, hawaii-999.com, …]
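
The two formulas can be worked through numerically. The following is a sketch only: the starting weight ω = 1/40,000 comes from the slides, while the edge weights (0.4, 0.2), λ = 0.5, and the names `PropagationStep`, `message`, and `receive` are assumptions made for illustration.

```scala
object PropagationStep {
  // Sending: Δω = ω * edge_weight
  def message(omega: Double, edgeWeight: Double): Double =
    omega * edgeWeight

  // Receiving: ω_new = ω_old + λ * Σ Δω_i
  def receive(omegaOld: Double, lambda: Double, deltas: Seq[Double]): Double =
    omegaOld + lambda * deltas.sum

  val omega: Double = 1.0 / 40000            // 1 converter / 40,000 visitors
  val toNeighbourA: Double = message(omega, 0.4) // assumed edge weight 0.4
  val toNeighbourB: Double = message(omega, 0.2) // assumed edge weight 0.2

  // A neighbour that starts at 0 and receives only the first message,
  // with an assumed λ of 0.5.
  val neighbourAfter: Double = receive(0.0, 0.5, Seq(toNeighbourA))
}
```

With these assumed values the two messages are 1.0 x 10^(-5) and 0.5 x 10^(-5), and the receiving site ends the iteration at 0.5 x 10^(-5).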

Weights After One Iteration
[Diagram: site weights after one iteration are 2.5 x 10^(-5), 1.2 x 10^(-5) and 0.8 x 10^(-5); labelled nodes include book-my-travel-103.com and travel-site-101.com]

Simplified Code
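import org.apache.spark.graphx._

// MT: message type, ED: edge attribute (edge weight), VD: vertex attribute (site weight ω)
type MT = Double; type ED = Double; type VD = Double
val lambda = …; val maxIterations = …
val initialMsg = 0.0

// ω_new = ω_old + λ * (sum of incoming Δω)
def updateVertex(id: VertexId, w: VD, delta_w: MT): VD =
  w + lambda * delta_w
// Each endpoint of an edge sends Δω = ω * edge_weight to the other endpoint
def sendMessage(edge: EdgeTriplet[VD, ED]): Iterator[(VertexId, MT)] = {
  Iterator((edge.srcId, edge.attr * edge.dstAttr),
           (edge.dstId, edge.attr * edge.srcAttr))
}
// Messages arriving at the same vertex are summed
def mergeMsgs(w1: MT, w2: MT): MT = w1 + w2

val graph: Graph[VD, ED] = …
graph.pregel(initialMsg, maxIterations, EdgeDirection.Out)(
  updateVertex, sendMessage, mergeMsgs)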

Model Output & Application
• Model output is a mapping of sites to final scores
• To apply the model, aggregate scores of sites visited by user

SITE                    SCORE
travel-site-101.com     0.5
canoe-travel-102.com    0.4
sport-team-101.com      0.1
…                       …
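
Applying the model needs only the site-to-score mapping, not the graph. A minimal sketch follows; the slides say "aggregate" without naming the function, so the use of a sum here is an assumption, as are the names `ApplyModel` and `scoreUser` and the choice to count each site once and give unknown sites a score of 0.

```scala
object ApplyModel {
  // Example scores taken from the table on the slide.
  val siteScores: Map[String, Double] = Map(
    "travel-site-101.com"  -> 0.5,
    "canoe-travel-102.com" -> 0.4,
    "sport-team-101.com"   -> 0.1
  )

  // Sum the scores of the distinct sites a user visited;
  // sites missing from the model contribute 0.
  def scoreUser(scores: Map[String, Double], visited: Seq[String]): Double =
    visited.distinct.map(site => scores.getOrElse(site, 0.0)).sum
}
```

For instance, a user who visited travel-site-101.com, sport-team-101.com, and a site not in the model would score about 0.6 under this assumed aggregation.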

Other Factors
• Edge Weights: Cosine Similarity, Jaccard Index, Conditional Probability
• Edge/Vertex Removal: Remove sites and edges on the long tail
• Hyperparameter Tuning: lambda, numIterations and others through testing (there is no convergence)
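
The three edge-weight options can be sketched for a pair of sites, each represented by the set of users who visited it. This representation, the object name `EdgeWeights`, and the binary (visited / not visited) form of cosine similarity are assumptions; the slides do not say how the weights were computed.

```scala
object EdgeWeights {
  // Jaccard index: |A ∩ B| / |A ∪ B|
  def jaccard(a: Set[String], b: Set[String]): Double = {
    val unionSize = a.union(b).size
    if (unionSize == 0) 0.0 else a.intersect(b).size.toDouble / unionSize
  }

  // Cosine similarity of binary visit vectors: |A ∩ B| / sqrt(|A| * |B|)
  def cosine(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty || b.isEmpty) 0.0
    else a.intersect(b).size / math.sqrt(a.size.toDouble * b.size)

  // Conditional probability P(visits B | visits A) = |A ∩ B| / |A|
  def conditional(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty) 0.0 else a.intersect(b).size.toDouble / a.size
}
```

Jaccard and cosine are symmetric, so one weight per edge suffices. Conditional probability is not: P(B | A) and P(A | B) generally differ, which would give each direction of an edge its own weight.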

Propagation: Top Sites
Home Improvement Advertiser
label-maker-101.com
laptop-bags-101.com
renovations-101.com
fitness-equipment-101.com
renovations-102.com
buy-realestate-101.com
Telecom Advertiser
canada-movies-101.ca
canadian-news-101.ca
canadian-jobs-101.ca
canadian-teacher-rating-101.ca
watch-tv-online.com
phone-system-review-101.com
[Slide annotations on the lists above: Canadian, Telecom, Renovations]

Challenges (from earlier)
• Sites are numerous and ever-changing
• Need to build one model per advertiser
• Positive training cases are sparse
• Models run frequently: every few hours

Resolutions (in the same order as the challenges)
• Graph built just in time for training
• Graph built once; propagation runs per advertiser
• Propagation resolves sparsity: intuitive and interpretable
• Evaluating users is fast; does not require GraphX
General Spark Learnings
• Many small jobs > one large job: we split big jobs into multiple smaller, concurrent jobs and increased throughput (more jobs could run concurrently).
• Serialization: don’t save SparkContext as a member variable; define Python classes in a separate file; check that your objects serialize and deserialize well!
• Prefer rdd.reduceByKey() and similar aggregations over rdd.groupByKey().
• Be careful with rdd.coalesce() vs rdd.repartition(); rdd.partitionBy() can be your friend in the right circumstances.

THANK YOU.
lzhang@rubiconproject.com
