A More Scalable Way of Making
Recommendations with MLlib
Xiangrui Meng
Spark Summit 2015
More interested in application than implementation?
iRIS: A Large-Scale Food and Recipe
Recommendation System Using Spark
Joohyun Kim (MyFitnessPal, Under Armour)
3:30 – 4:00 PM
Imperial Ballroom (Level 2)
About Databricks
• Founded by the creators of Apache Spark
• Largest contributor to Spark project, committed to
keeping Spark 100% open source
• End-to-end hosted platform
https://www.databricks.com/product/databricks
Spark MLlib
Large-scale machine learning on Apache Spark
About MLlib
• Started in UC Berkeley AMPLab
• Shipped with Spark 0.8
• Currently (Spark 1.4)
• Contributions from 50+ organizations, 150+
individuals
• Good coverage of algorithms
MLlib’s Mission
MLlib’s mission is to make practical machine
learning easy and scalable.
• Easy to build machine learning applications
• Capable of learning from large-scale datasets
A Brief History of MLlib
Optim Algo API App
v0.8
GLMs
k-means
ALS
Java/Scala
gradient descent
A Brief History of MLlib
Optim Algo API App
v0.9
Python
naive Bayes
implicit ALS
A Brief History of MLlib
Optim Algo API App
v1.0
decision tree
PCA
SVD
sparse data
L-BFGS
A Brief History of MLlib
v1.1
Optim Algo API App
tree reduce
torrent broadcast
statistics
NMF
streaming linear regression
Word2Vec
gap
A Brief History of MLlib
v1.2
Optim Algo API App
pipeline
random forest
gradient boosted trees
streaming k-means
A Brief History of MLlib
v1.3
Optim Algo API App
ALSv2
latent Dirichlet allocation (LDA)
multinomial logistic regression
Gaussian mixture model (GMM)
distributed block matrix
FP-growth / isotonic regression
power iteration clustering
pipeline in Python
model import/export
SparkPackages
A Brief History of MLlib
Optim Algo API App
GLMs with elastic-net
online LDA
ALS.recommendAll
feature transformers
estimators
Python pipeline API
v1.4
OWL-QN
Alternating Least Squares (ALS)
Collaborative filtering via matrix factorization
Collaborative Filtering
[Figure: the rating matrix A, with rows indexed by users and columns indexed by items]
Low-Rank Assumption
• What kind of movies do you like?
• sci-fi / crime / action
Perception of preferences usually takes place in a low-dimensional latent
space, so the rating matrix is approximately low-rank:
$$A \approx U V^T, \quad U \in \mathbb{R}^{m \times k}, \quad V \in \mathbb{R}^{n \times k}$$
$$a_{ij} \approx u_i^T v_j$$
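To make the low-rank assumption concrete, here is a minimal sketch in plain Python. The toy factors `U` and `V` and the helper names `dot`, `reconstruct`, and `factor_params` are hypothetical and chosen only for illustration. The sketch rebuilds every entry of the approximation as an inner product of a user factor and an item factor.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def reconstruct(U, V):
    """Return U V^T as a list of rows; entry (i, j) is u_i . v_j."""
    return [[dot(u, v) for v in V] for u in U]

def factor_params(m, n, k):
    """Number of floats stored by the two factor matrices."""
    return (m + n) * k

# Hypothetical toy factors: m = 3 users, n = 4 items, rank k = 2.
U = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[4.0, 1.0], [5.0, 2.0], [1.0, 5.0], [2.0, 4.0]]
A_hat = reconstruct(U, V)
```

The toy case is too small to save space, but the factors grow as (m + n) k rather than m n. With 50 million users, 5 million items, and rank 10, `factor_params` gives 550 million floats.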
Objective Function
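• minimize the reconstruction error:
$$\text{minimize} \quad \frac{1}{2} \| A - U V^T \|_F^2$$
• only check observed ratings, where $\Omega$ is the set of observed $(i, j)$ pairs:
$$\text{minimize} \quad \frac{1}{2} \sum_{(i,j) \in \Omega} (a_{ij} - u_i^T v_j)^2$$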
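As a rough illustration of the objective restricted to observed ratings, the sketch below evaluates it on a tiny example. The dictionary layout for `ratings` and the function name `observed_loss` are assumptions made for this example, not part of MLlib.

```python
def observed_loss(ratings, U, V):
    """Return 0.5 * sum over observed (i, j) of (a_ij - u_i . v_j)^2."""
    total = 0.0
    for (i, j), a in ratings.items():
        pred = sum(x * y for x, y in zip(U[i], V[j]))
        total += (a - pred) ** 2
    return 0.5 * total

# Hypothetical data: 2 users, 2 items, 3 observed ratings; entry (1, 0) is missing.
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 1): 4.0}
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[5.0, 0.0], [2.0, 4.0]]
loss = observed_loss(ratings, U, V)
```

Only the three observed pairs contribute. The missing entry (1, 0) is never touched, which is what keeps the cost proportional to the number of ratings rather than to m n.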
Alternating Least Squares (ALS)
• If we fix U, the objective becomes convex and separable:
$$\text{minimize} \quad \frac{1}{2} \sum_{j} \left( \sum_{i,\,(i,j) \in \Omega} (a_{ij} - u_i^T v_j)^2 \right)$$
• Each sub-problem is a least squares problem, which can be solved in
parallel. So we take alternating directions to minimize the objective:
• fix U, solve for V;
• fix V, solve for U.
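The alternating scheme can be sketched in a few dozen lines of plain Python. This is a single-machine illustration under stated assumptions, not the MLlib implementation. It adds a small ridge term `lam` (not mentioned on the slides) so that every k-by-k system is solvable, and it uses a hand-written Gaussian elimination instead of LAPACK. All function names are hypothetical.

```python
import random

def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for t in range(c, n + 1):
                aug[r][t] -= f * aug[c][t]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][t] * x[t] for t in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def update(fixed, obs_by_row, k, lam):
    """Solve one independent least squares sub-problem per row."""
    out = []
    for obs in obs_by_row:  # obs: list of (index into fixed, rating)
        gram = [[lam if p == q else 0.0 for q in range(k)] for p in range(k)]
        rhs = [0.0] * k
        for i, a in obs:
            f = fixed[i]
            for p in range(k):
                rhs[p] += a * f[p]
                for q in range(k):
                    gram[p][q] += f[p] * f[q]
        out.append(solve(gram, rhs))
    return out

def objective(ratings, U, V, lam):
    """Squared error on observed ratings plus the assumed ridge penalty."""
    err = sum((a - sum(x * y for x, y in zip(U[i], V[j]))) ** 2
              for i, j, a in ratings)
    reg = sum(x * x for row in U + V for x in row)
    return 0.5 * err + 0.5 * lam * reg

def als(ratings, m, n, k=2, iters=10, lam=1e-3, seed=0):
    rng = random.Random(seed)
    U = [[rng.random() for _ in range(k)] for _ in range(m)]
    V = [[rng.random() for _ in range(k)] for _ in range(n)]
    by_item = [[] for _ in range(n)]
    by_user = [[] for _ in range(m)]
    for i, j, a in ratings:
        by_item[j].append((i, a))
        by_user[i].append((j, a))
    history = []
    for _ in range(iters):
        V = update(U, by_item, k, lam)  # fix U, solve for V
        U = update(V, by_user, k, lam)  # fix V, solve for U
        history.append(objective(ratings, U, V, lam))
    return U, V, history

# Hypothetical toy data: (user, item, rating) triples.
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
       (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
U, V, history = als(toy, m=3, n=3)
```

Because each half-step solves its sub-problems exactly, the regularized objective recorded in `history` cannot increase from one iteration to the next. The rows handled inside `update` are independent of each other, which is the property a distributed implementation exploits.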
Complexity
• To solve a least squares problem of size n-by-k, we need $O(n k^2)$ time.
So the total computation cost is $O(\mathrm{nnz}\, k^2)$, where nnz is the
total number of ratings.
• We take the normal equation approach in ALS:
$$A^T A x = A^T b$$
• Solving each subproblem requires $O(k^2)$ storage. We call LAPACK’s
routine to solve this problem.
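For reference, the normal equations can be written out for a single item sub-problem. The notation $\Omega_j$ and $n_j$ is introduced here for illustration and is not on the slide. Note also that the $A$ in the normal equations above denotes the design matrix of one sub-problem (its rows are the fixed factors $u_i^T$), not the rating matrix.

```latex
% Sub-problem for item j with U fixed; \Omega_j = \{ i : (i,j) \in \Omega \}, n_j = |\Omega_j|.
\Big( \sum_{i \in \Omega_j} u_i u_i^T \Big) v_j \;=\; \sum_{i \in \Omega_j} a_{ij}\, u_i
% Forming the k-by-k matrix on the left costs O(n_j k^2) time and O(k^2) storage.
% Summing over all items: \sum_j O(n_j k^2) = O(\mathrm{nnz}\, k^2).
```

Solving each k-by-k system adds roughly $O(k^3)$ per item. That term is dominated by the cost of forming the system whenever items have, on average, at least about k ratings.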
ALS Implementation in MLlib
How to scale to 100,000,000,000 ratings?
Communication Cost
The most important factor in implementing a parallel algorithm
is the communication cost.
To make ALS scale to billions of ratings and millions of
users/items, we have to distribute ratings (A), user
factors (U), and item factors (V). How?
• all-to-all
• block-to-block
• …
Communication: All-to-All
• users: u1, u2, u3; items: v1, v2, v3, v4
• shuffle size: O(nnz k) (nnz: number of nonzeros, i.e., ratings)
• sending the same factor multiple times
Communication: Block-to-Block
• OutBlocks (P1, P2)
• for each item block, which user factors to send
• InBlocks (Q1, Q2)
• for each item, which user factors to use
Communication: Block-to-Block
• Shuffle size is significantly reduced.
• We cache two copies of ratings: InBlocks for users and
InBlocks for items.
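The difference between the two communication patterns can be illustrated by counting how many user factor vectors cross the network in one half-iteration. This is a sketch under assumptions: items are assigned to blocks by a simple modulo (a stand-in for whatever partitioner is actually used), and the function names are hypothetical.

```python
def block_of(ident, num_blocks):
    """Hypothetical partitioner: assign an ID to a block by modulo."""
    return ident % num_blocks

def all_to_all_messages(ratings):
    """All-to-all sends one user factor per rating, so the count is nnz."""
    return len(ratings)

def block_to_block_messages(ratings, num_item_blocks):
    """Block-to-block sends each user factor at most once per item block."""
    return len({(user, block_of(item, num_item_blocks))
                for user, item in ratings})

# Hypothetical (user, item) pairs: users 1..3, items 1..4, 7 ratings.
ratings = [(1, 1), (1, 3), (2, 1), (2, 2), (2, 3), (3, 2), (3, 4)]
sent_all = all_to_all_messages(ratings)
sent_block = block_to_block_messages(ratings, num_item_blocks=2)
```

Each message carries k floats, so multiplying either count by k gives the shuffle size. In this toy case, user 1 rated items 1 and 3, which fall in the same block, so the factor for user 1 is sent once instead of twice.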
DAG Visualization of an ALS Job
[Figure: DAG of an ALS job in two phases, preparation and iterations. Nodes: ratingBlocks, userInBlocks, userOutBlocks, itemInBlocks, itemOutBlocks, itemFactors 0, userFactors 1, itemFactors 1.]
Compressed Storage for InBlocks
Array of rating tuples:
[(v1, u1, a11), (v2, u1, a12), (v1, u2, a21), (v2, u2, a22), (v2, u3, a32)]
• huge storage overhead
• high garbage collection (GC) pressure
Compressed Storage for InBlocks
Three primitive arrays:
([v1, v2, v1, v2, v2], [u1, u1, u2, u2, u3], [a11, a12, a21, a22, a32])
• low GC pressure
• constructing all sub-problems together
• $O(n_j k^2)$ storage
Compressed Storage for InBlocks
Primitive arrays with items ordered:
([v1, v1, v2, v2, v2], [u1, u2, u1, u2, u3], [a11, a21, a12, a22, a32])
• solving sub-problems in sequence
• $O(k^2)$ storage
• TimSort
Compressed Storage for InBlocks
Compressed items:
([v1, v2], [0, 2, 5], [u1, u2, u1, u2, u3], [a11, a21, a12, a22, a32])
• no duplicated items
• map lookup for user factors
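The progression from rating tuples to the compressed layout can be sketched as follows. This is an illustrative Python version, not the MLlib code. The function names are hypothetical, and Python's built-in stable sort stands in for the TimSort step.

```python
def compress(tuples):
    """Convert (item, user, rating) tuples into four parallel arrays.

    Returns (items, ptrs, users, ratings), where the ratings of items[p]
    occupy positions ptrs[p] .. ptrs[p + 1] - 1 of users and ratings.
    """
    ordered = sorted(tuples, key=lambda t: t[0])  # stable sort by item
    items, ptrs, users, ratings = [], [], [], []
    for item, user, rating in ordered:
        if not items or items[-1] != item:
            items.append(item)
            ptrs.append(len(users))  # start offset of this item's ratings
        users.append(user)
        ratings.append(rating)
    ptrs.append(len(users))          # end sentinel
    return items, ptrs, users, ratings

def sub_problem(block, p):
    """Return the (user, rating) pairs needed to solve for items[p]."""
    items, ptrs, users, ratings = block
    lo, hi = ptrs[p], ptrs[p + 1]
    return list(zip(users[lo:hi], ratings[lo:hi]))

# The example from the slides, with symbolic names as strings.
tuples = [("v1", "u1", "a11"), ("v2", "u1", "a12"), ("v1", "u2", "a21"),
          ("v2", "u2", "a22"), ("v2", "u3", "a32")]
block = compress(tuples)
```

With the items ordered, each sub-problem is a contiguous slice, so they can be solved one after another while reusing a single k-by-k workspace.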
Compressed Storage for InBlocks
Store block IDs and local indices instead of user IDs. For example, u3
is the first vector sent from P2.
([v1, v2], [0, 2, 5], [0|0, 0|1, 0|0, 0|1, 1|0], [a11, a21, a12, a22, a32])
Encode (block ID, local index) into an integer:
• use higher bits for block ID
• use lower bits for local index
• works for ~4 billion unique users/items
Example bit layout (block ID | local index): 01 | 00 0000 0000 0000
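The packing of a (block ID, local index) pair into one integer can be sketched with shifts and masks. The class name `BlockIndexCodec` and its parameters are hypothetical. The 16-bit width in the usage lines only mirrors the bit pattern drawn on the slide, and the default of 32 bits is an assumption based on the "~4 billion" figure.

```python
class BlockIndexCodec:
    """Pack (block ID, local index) into a single non-negative integer."""

    def __init__(self, num_blocks, total_bits=32):
        # Bits needed to represent block IDs 0 .. num_blocks - 1 (at least 1).
        block_bits = max(1, (num_blocks - 1).bit_length())
        self.num_blocks = num_blocks
        self.local_bits = total_bits - block_bits
        self.local_mask = (1 << self.local_bits) - 1

    def encode(self, block_id, local_index):
        if not 0 <= block_id < self.num_blocks:
            raise ValueError("block ID out of range")
        if not 0 <= local_index <= self.local_mask:
            raise ValueError("local index out of range")
        return (block_id << self.local_bits) | local_index  # high bits | low bits

    def block_id(self, encoded):
        return encoded >> self.local_bits

    def local_index(self, encoded):
        return encoded & self.local_mask

# Mirror the slide: 2 bits of block ID, 14 bits of local index.
codec = BlockIndexCodec(num_blocks=4, total_bits=16)
encoded = codec.encode(1, 0)
bits = format(encoded, "016b")
```

Decoding is one shift and one mask, so looking up the factor for a rating becomes two array accesses (block, then position within the block) instead of a map lookup.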
Avoid Garbage Collection
We use specialized code to replace the following:
• initial partitioning of ratings
    ratings.map { r =>
      ((srcPart.getPartition(r.user), dstPart.getPartition(r.item)), r)
    }.aggregateByKey(new RatingBlockBuilder)(
        seqOp = (b, r) => b.add(r),
        combOp = (b0, b1) => b0.merge(b1.build()))
      .mapValues(_.build())
• map IDs to local indices
    dstIds.toSet.toSeq.sorted.zipWithIndex.toMap
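One way to avoid building a map of boxed entries for the ID-to-local-index step is to keep the distinct IDs in a sorted flat array and find positions by binary search. The sketch below shows that idea in Python using the standard `array` and `bisect` modules. It is an assumption about the general approach, not a transcription of the specialized MLlib code, and the function names are hypothetical.

```python
from array import array
from bisect import bisect_left

def unique_sorted(ids):
    """Return the distinct IDs, sorted, in one flat array of 64-bit ints."""
    buf = array("q", sorted(ids))
    n = 0
    for x in buf:                  # compact duplicates in place
        if n == 0 or buf[n - 1] != x:
            buf[n] = x
            n += 1
    return buf[:n]

def local_index(uniq, ident):
    """Position of ident in the sorted array, found by binary search."""
    pos = bisect_left(uniq, ident)
    if pos == len(uniq) or uniq[pos] != ident:
        raise KeyError(ident)
    return pos

# Hypothetical destination IDs with duplicates.
uniq = unique_sorted([30, 10, 20, 10, 30])
```

The result plays the same role as the sorted, zipped map in the one-liner: an ID's local index is its rank among the distinct IDs.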
Amazon Reviews Dataset
• Amazon Reviews: ~6.6 million users, ~2.2 million items, and ~30 million
ratings
• Tested ALS on stacked copies on a 16-node m3.2xlarge cluster with
rank=10, iter=10:
Storage Comparison
               Spark 1.2    Spark 1.3/1.4
userInBlock    941 MB       277 MB
userOutBlock   355 MB       65 MB
itemInBlock    1380 MB      243 MB
itemOutBlock   119 MB       37 MB
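The reduction factors implied by these figures can be worked out directly. The arithmetic below uses only the numbers in the table, and the ratios are rounded to two decimals.

```latex
% Per-block reduction (1.2 size divided by 1.3/1.4 size):
\frac{941}{277} \approx 3.40, \qquad
\frac{355}{65} \approx 5.46, \qquad
\frac{1380}{243} \approx 5.68, \qquad
\frac{119}{37} \approx 3.22
% Totals:
941 + 355 + 1380 + 119 = 2795 \text{ MB}, \qquad
277 + 65 + 243 + 37 = 622 \text{ MB}, \qquad
\frac{2795}{622} \approx 4.49
```

Across the four block types, the cached data is about 4.5 times smaller in total.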
Spotify Dataset
• Spotify: 75+ million users and 30+ million songs
• Tested ALS on a subset with ~50 million users, ~5
million songs, and ~50 billion ratings.
• thanks to Chris Johnson and Anders Arpteg
• 32 r3.8xlarge nodes (~$10/hr with spot instances)
• It took 1 hour to finish 10 iterations with rank 10.
• 10 mins to prepare in/out blocks
• 5 mins per iteration
ALS Implementation in MLlib
• Save communication by duplicating data
• Efficient storage format
• Watch out for GC
• Native LAPACK calls
Future Directions
• Leverage Project Tungsten to replace some of the
specialized code that avoids GC.
• Solve issues with really popular items.
• Explore other recommendation algorithms, e.g.,
factorization machines.
Thank you.
• Spark: http://spark.apache.org
• Databricks: http://databricks.com

A More Scalable Way of Making Recommendations with MLlib (Xiangrui Meng, Databricks)
