A More Scalable Way of Making Recommendations with MLlib (Xiangrui Meng, Databricks)

A More Scalable Way of Making
Recommendations with MLlib
Xiangrui Meng
Spark Summit 2015

More interested in application than implementation?
iRIS: A Large-Scale Food and Recipe
Recommendation System Using Spark
Joohyun Kim (MyFitnessPal, Under Armour)
3:30 – 4:00 PM
Imperial Ballroom (Level 2)

About Databricks
• Founded by the creators of Apache Spark
• Largest contributor to Spark project, committed to
keeping Spark 100% open source
• End-to-end hosted platform
https://www.databricks.com/product/databricks

Spark MLlib
Large-scale machine learning on Apache Spark

About MLlib
• Started in UC Berkeley AMPLab
• Shipped with Spark 0.8
• Currently (Spark 1.4)
• Contributions from 50+ organizations, 150+
individuals
• Good coverage of algorithms

MLlib’s Mission
MLlib’s mission is to make practical machine
learning easy and scalable.
• Easy to build machine learning applications
• Capable of learning from large-scale datasets

A Brief History of MLlib
(timeline columns: Optim, Algo, API, App)
v0.8
• GLMs
• k-means
• ALS
• Java/Scala
• gradient descent

v0.9
• Python
• naive Bayes
• implicit ALS

v1.0
• decision tree
• PCA
• SVD
• sparse data
• L-BFGS

v1.1
• tree reduce
• torrent broadcast
• statistics
• NMF
• streaming linear regression
• Word2Vec
• gap

v1.2
• pipeline
• random forest
• gradient boosted trees
• streaming k-means

v1.3
• ALSv2
• latent Dirichlet allocation (LDA)
• multinomial logistic regression
• Gaussian mixture model (GMM)
• distributed block matrix
• FP-growth / isotonic regression
• power iteration clustering
• pipeline in Python
• model import/export
• Spark Packages

v1.4
• GLMs with elastic-net
• online LDA
• ALS.recommendAll
• feature transformers
• estimators
• Python pipeline API
• OWL-QN

Alternating Least Squares (ALS)
Collaborative filtering via matrix factorization

Collaborative Filtering
[Figure: A, a rating matrix (rows: users, columns: items)]

Low-Rank Assumption
• What kind of movies do you like?
• sci-fi / crime / action
Perception of preferences usually takes place in a
low dimensional latent space.
So the rating matrix is approximately low-rank.
$$A \approx U V^T, \quad U \in \mathbb{R}^{m \times k}, \quad V \in \mathbb{R}^{n \times k}$$
$$a_{ij} \approx u_i^T v_j$$
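As a small worked illustration of the low-rank model, assume k = 2 latent dimensions (say, sci-fi and crime) and made-up factor values. The numbers below are hypothetical and do not come from the talk; they only show how one predicted rating is formed from a user factor and an item factor.

```latex
u_i = \begin{pmatrix} 0.9 \\ 0.1 \end{pmatrix}, \qquad
v_j = \begin{pmatrix} 5 \\ 1 \end{pmatrix}, \qquad
a_{ij} \approx u_i^T v_j = 0.9 \cdot 5 + 0.1 \cdot 1 = 4.6
```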

Objective Function
• minimize the reconstruction error
$$\text{minimize} \quad \frac{1}{2} \| A - U V^T \|_F^2$$
• only check observed ratings
$$\text{minimize} \quad \frac{1}{2} \sum_{(i,j) \in \Omega} (a_{ij} - u_i^T v_j)^2$$
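The objective restricted to observed ratings can be sketched in a few lines of plain Scala. This is a minimal illustration, not MLlib code: the `Rating` case class, the `ObjectiveSketch` object, and the dense `Array[Array[Double]]` layout for U and V are assumptions made for the example.

```scala
object ObjectiveSketch {
  // One observed rating a_ij; the collection of these plays the role of Omega.
  final case class Rating(user: Int, item: Int, value: Double)

  // Dot product of two factor vectors of the same rank k.
  def dot(u: Array[Double], v: Array[Double]): Double =
    u.indices.map(t => u(t) * v(t)).sum

  // 1/2 * sum over observed (i, j) of (a_ij - u_i^T v_j)^2.
  // Unobserved entries of the rating matrix never enter the sum.
  def objective(
      ratings: Seq[Rating],
      U: Array[Array[Double]],
      V: Array[Array[Double]]): Double =
    0.5 * ratings.map { r =>
      val err = r.value - dot(U(r.user), V(r.item))
      err * err
    }.sum
}
```

The cost of one evaluation is proportional to the number of observed ratings times k, which is why only the observed set matters for scaling.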

Alternating Least Squares (ALS)
• If we fix U, the objective becomes convex and
separable:
$$\text{minimize} \quad \frac{1}{2} \sum_j \left( \sum_{i,\,(i,j) \in \Omega} (a_{ij} - u_i^T v_j)^2 \right)$$
• Each sub-problem is a least squares problem, which
can be solved in parallel. So we take alternating
directions to minimize the objective:
• fix U, solve for V;
• fix V, solve for U.
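To make the separability concrete, the sub-problem for a single item j can be written out and its optimality condition derived. The index set \Omega_j is notation introduced here for convenience (it is not on the slides), and the last line assumes the k-by-k matrix on the left is invertible.

```latex
\Omega_j = \{\, i : (i,j) \in \Omega \,\}

\min_{v_j} \; \frac{1}{2} \sum_{i \in \Omega_j} (a_{ij} - u_i^T v_j)^2

% Setting the gradient with respect to v_j to zero:
-\sum_{i \in \Omega_j} (a_{ij} - u_i^T v_j)\, u_i = 0
\quad\Longrightarrow\quad
\Big( \sum_{i \in \Omega_j} u_i u_i^T \Big) v_j = \sum_{i \in \Omega_j} a_{ij}\, u_i
```

Each v_j depends only on the user factors that rated item j, so the sub-problems for different items can be solved independently.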

Complexity
• To solve a least squares problem of size n-by-k, we need
O(n k^2) time. So the total computation cost is O(nnz k^2),
where nnz is the total number of ratings.
• We take the normal equation approach in ALS:
$$A^T A x = A^T b$$
• Solving each subproblem requires O(k^2) storage. We call
LAPACK’s routine to solve this problem.
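The normal equation approach can be sketched as follows. This is an illustration under stated assumptions, not the MLlib implementation: the talk says MLlib calls a LAPACK routine, whereas this sketch substitutes a small hand-written Gaussian elimination so that it stays self-contained. The object and method names are hypothetical.

```scala
object NormalEquationSketch {
  /** Solves min_x 1/2 * ||A x - b||^2 through A^T A x = A^T b.
    * rows holds the n rows of A (each of length k); b holds the n targets. */
  def solve(rows: Seq[Array[Double]], b: Seq[Double], k: Int): Array[Double] = {
    val ata = Array.ofDim[Double](k, k) // k-by-k Gram matrix: O(k^2) storage
    val atb = new Array[Double](k)
    // Accumulate A^T A and A^T b one row at a time: O(n k^2) time in total.
    for ((row, target) <- rows.zip(b); p <- 0 until k) {
      atb(p) += row(p) * target
      for (q <- 0 until k) ata(p)(q) += row(p) * row(q)
    }
    // Gaussian elimination with partial pivoting (stand-in for LAPACK).
    for (col <- 0 until k) {
      val pivot = (col until k).maxBy(r => math.abs(ata(r)(col)))
      val tmpRow = ata(col); ata(col) = ata(pivot); ata(pivot) = tmpRow
      val tmpVal = atb(col); atb(col) = atb(pivot); atb(pivot) = tmpVal
      require(math.abs(ata(col)(col)) > 1e-12, "normal equations are singular")
      for (r <- col + 1 until k) {
        val f = ata(r)(col) / ata(col)(col)
        for (c <- col until k) ata(r)(c) -= f * ata(col)(c)
        atb(r) -= f * atb(col)
      }
    }
    // Back substitution on the upper-triangular system.
    val x = new Array[Double](k)
    for (r <- (k - 1) to 0 by -1) {
      var s = atb(r)
      for (c <- r + 1 until k) s -= ata(r)(c) * x(c)
      x(r) = s / ata(r)(r)
    }
    x
  }
}
```

In the ALS setting, the rows passed in would be the factors of the users who rated one item and b would be their ratings, so only the k-by-k system has to be held in memory per sub-problem.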

ALS Implementation in MLlib
How to scale to 100,000,000,000 ratings?

Communication Cost
The most important factor in implementing an
algorithm in parallel is the communication cost.
To make ALS scale to billions of ratings, millions of
users/items, we have to distribute ratings (A), user
factors (U), and item factors (V). How?
• all-to-all
• block-to-block
• …

Communication: All-to-All
• users: u1, u2, u3; items: v1, v2, v3, v4
• shuffle size: O(nnz k) (nnz: number of nonzeros, i.e., ratings)
• sending the same factor multiple times

Communication: Block-to-Block
• OutBlocks (P1, P2)
• for each item block, which user factors to send
• InBlocks (Q1, Q2)
• for each item, which user factors to use

Communication: Block-to-Block
• Shuffle size is significantly reduced.
• We cache two copies of ratings — InBlocks for users and
InBlocks for items.
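The difference between the two communication patterns can be illustrated by counting how many user-factor messages each one sends. This is a rough sketch under simplifying assumptions: ratings are plain (user, item) pairs, all users live in one block, and items are assigned to blocks by item ID modulo the number of blocks. The partitioning rule and all names are hypothetical, not MLlib's.

```scala
object ShuffleSizeSketch {
  // (user, item) pair of an observed rating.
  type Rating = (Int, Int)

  // All-to-all: one user-factor message per rating, so the same
  // factor is sent repeatedly. Total is nnz messages of size k.
  def allToAllMessages(ratings: Seq[Rating]): Int = ratings.size

  // OutBlock-style routing table: for each item block, the distinct
  // users whose factors must be sent there.
  def outBlocks(ratings: Seq[Rating], numItemBlocks: Int): Map[Int, Seq[Int]] =
    ratings
      .map { case (user, item) => (user, item % numItemBlocks) }
      .distinct
      .groupBy(_._2)
      .map { case (block, pairs) => block -> pairs.map(_._1).sorted }

  // Block-to-block: a user factor is sent at most once per item block.
  def blockToBlockMessages(ratings: Seq[Rating], numItemBlocks: Int): Int =
    outBlocks(ratings, numItemBlocks).values.map(_.size).sum
}
```

With three users, four items, and two item blocks, seven ratings can collapse to four messages; the saving grows as users rate more items that fall in the same block.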

DAG Visualization of an ALS Job
[Figure: DAG of an ALS job. Preparation: ratingBlocks, userInBlocks, userOutBlocks, itemInBlocks, itemOutBlocks. Iterations: itemFactors 0, userFactors 1, itemFactors 1.]

Compressed Storage for InBlocks
Array of rating tuples
• huge storage overhead
• high garbage collection (GC) pressure
[(v1, u1, a11), (v2, u1, a12), (v1, u2, a21), (v2, u2, a22), (v2, u3, a32)]

Three primitive arrays
• low GC pressure
• constructing all sub-problems together
• O(n_j k^2) storage
([v1, v2, v1, v2, v2], [u1, u1, u2, u2, u3], [a11, a12, a21, a22, a32])

Primitive arrays with items ordered:
• solving sub-problems in sequence:
• O(k^2) storage
• TimSort
([v1, v1, v2, v2, v2], [u1, u2, u1, u2, u3], [a11, a21, a12, a22, a32])

Compressed items:
• no duplicated items
• map lookup for user factors
([v1, v2], [0, 2, 5], [u1, u2, u1, u2, u3], [a11, a21, a12, a22, a32])
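The step from three parallel arrays to the compressed form can be sketched as below. This is an illustration rather than MLlib's code: the `InBlock` case class and `compress` method are hypothetical names, IDs are plain `Int`s, and the standard library's `sortBy` stands in for the specialized TimSort mentioned on the slides.

```scala
object CompressedInBlockSketch {
  /** Compressed storage: unique item IDs, offsets into the user and rating
    * arrays, and the user IDs and ratings grouped by item. */
  final case class InBlock(
      items: Array[Int],
      ptrs: Array[Int],
      users: Array[Int],
      ratings: Array[Double])

  def compress(items: Array[Int], users: Array[Int], ratings: Array[Double]): InBlock = {
    require(items.length == users.length && users.length == ratings.length)
    // Stable sort of positions by item ID, so ratings of one item become contiguous.
    val order = items.indices.sortBy(i => items(i)).toArray
    val uniqueItems = Array.newBuilder[Int]
    val ptrs = Array.newBuilder[Int]
    var pos = 0
    while (pos < order.length) {
      // A new item starts wherever the sorted item ID changes.
      if (pos == 0 || items(order(pos)) != items(order(pos - 1))) {
        uniqueItems += items(order(pos))
        ptrs += pos
      }
      pos += 1
    }
    ptrs += order.length // closing offset: item t owns [ptrs(t), ptrs(t + 1))
    InBlock(
      uniqueItems.result(),
      ptrs.result(),
      order.map(i => users(i)),
      order.map(i => ratings(i)))
  }
}
```

Applied to the five ratings used on the slides, this yields items [v1, v2] with offsets [0, 2, 5], so the sub-problem for each item reads one contiguous slice of users and ratings.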

Store block IDs and local indices instead of user IDs. For example, u3
is the first vector sent from P2.
Encode (block ID, local index) into an integer
• use higher bits for block ID
• use lower bits for local index
• works for ~4 billion unique users/items
01 | 00 0000 0000 0000
([v1, v2], [0, 2, 5], [0|0, 0|1, 0|0, 0|1, 1|0], [a11, a21, a12, a22, a32])
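Packing a (block ID, local index) pair into one integer can be sketched with a few bit operations. This is a minimal illustration of the idea using a 32-bit `Int`; the `Encoder` class, its method names, and the way the bit split is derived from the number of blocks are assumptions for the example, not a copy of the MLlib source.

```scala
object LocalIndexEncoderSketch {
  /** Packs (blockId, localIndex) into one Int: the higher bits hold the
    * block ID and the lower bits hold the local index. */
  final class Encoder(numBlocks: Int) {
    require(numBlocks > 0, "numBlocks must be positive")
    // Bits left for the local index after reserving enough bits
    // to represent block IDs 0 .. numBlocks - 1.
    private val numLocalIndexBits =
      math.min(java.lang.Integer.numberOfLeadingZeros(numBlocks - 1), 31)
    private val localIndexMask = (1 << numLocalIndexBits) - 1

    def encode(blockId: Int, localIndex: Int): Int = {
      require(blockId >= 0 && blockId < numBlocks, "block ID out of range")
      require((localIndex & ~localIndexMask) == 0, "local index does not fit")
      (blockId << numLocalIndexBits) | localIndex
    }

    // Unsigned shift, so a set top bit is still read as part of the block ID.
    def blockId(encoded: Int): Int = encoded >>> numLocalIndexBits

    def localIndex(encoded: Int): Int = encoded & localIndexMask
  }
}
```

With four blocks, two bits go to the block ID and thirty to the local index, so `encode(1, 0)` corresponds to the "first vector sent from block 1" pattern shown above, just widened to 32 bits.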

Avoid Garbage Collection
We use specialized code to replace the following:
• initial partitioning of ratings
    ratings.map { r =>
      ((srcPart.getPartition(r.user), dstPart.getPartition(r.item)), r)
    }.aggregateByKey(new RatingBlockBuilder)(
        seqOp = (b, r) => b.add(r),
        combOp = (b0, b1) => b0.merge(b1.build()))
      .mapValues(_.build())
• map IDs to local indices
    dstIds.toSet.toSeq.sorted.zipWithIndex.toMap
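One way to avoid the intermediate Set, Seq, and Map objects in the ID-to-index one-liner is to work on primitive arrays only. The sketch below is a guess at the general technique (sort, de-duplicate in place, look up by binary search), not the specialized code MLlib actually uses; `LocalIndexSketch` and its methods are hypothetical names.

```scala
object LocalIndexSketch {
  // Sorted, de-duplicated copy of ids, built without boxing any element.
  def uniqueSorted(ids: Array[Int]): Array[Int] = {
    val sorted = ids.clone()
    java.util.Arrays.sort(sorted)
    var n = 0 // number of unique values kept so far
    var i = 0
    while (i < sorted.length) {
      if (i == 0 || sorted(i) != sorted(i - 1)) {
        sorted(n) = sorted(i)
        n += 1
      }
      i += 1
    }
    java.util.Arrays.copyOf(sorted, n)
  }

  // Local index of id = its position among the sorted unique IDs.
  // Returns a negative value if the id is not present.
  def localIndex(unique: Array[Int], id: Int): Int =
    java.util.Arrays.binarySearch(unique, id)
}
```

The lookup gives the same index the `zipWithIndex.toMap` version would, at the price of an O(log n) search instead of a hash lookup, while allocating only two `Int` arrays.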

Amazon Reviews Dataset
• Amazon Reviews: ~6.6 million users, ~2.2 million items, and ~30 million
ratings
• Tested ALS on stacked copies on a 16-node m3.2xlarge cluster with
rank=10, iter=10:

Storage Comparison
                Spark 1.2    Spark 1.3/1.4
userInBlock     941 MB       277 MB
userOutBlock    355 MB       65 MB
itemInBlock     1380 MB      243 MB
itemOutBlock    119 MB       37 MB

Spotify Dataset
• Spotify: 75+ million users and 30+ million songs
• Tested ALS on a subset with ~50 million users, ~5
million songs, and ~50 billion ratings.
• thanks to Chris Johnson and Anders Arpteg
• 32 r3.8xlarge nodes (~$10/hr with spot instances)
• It took 1 hour to finish 10 iterations with rank 10.
• 10 mins to prepare in/out blocks
• 5 mins per iteration

ALS Implementation in MLlib
• Save communication by duplicating data
• Efficient storage format
• Watch out for GC
• Native LAPACK calls

Future Directions
• Leverage Project Tungsten to replace some of the
specialized code that avoids GC.
• Solve issues with really popular items.
• Explore other recommendation algorithms, e.g.,
factorization machines.

Thank you.
• Spark: http://spark.apache.org
• Databricks: http://databricks.com
