MMDS 2014 Talk - Distributing ML Algorithms: from GPUs to the Cloud

Distributing Large-scale ML Algorithms: from GPUs to the Cloud
MMDS 2014
June, 2014
Xavier Amatriain
Director - Algorithms Engineering @xamat

Outline
■ Introduction
■ Emmy-winning Algorithms
■ Distributing ML Algorithms in Practice
■ An example: ANN over GPUs & AWS Cloud

What we were interested in:
■ High quality recommendations
Proxy question:
■ Accuracy in predicted rating
■ Improve by 10% = $1 million!
Data size:
■ 100M ratings (back then “almost massive”)

Netflix Scale
▪ > 44M members
▪ > 40 countries
▪ > 1000 device types
▪ > 5B hours in Q3 2013
▪ Plays: > 50M/day
▪ Searches: > 3M/day
▪ Ratings: > 5M/day
▪ Log 100B events/day
▪ 31.62% of peak US downstream traffic

Smart Models
■ Regression models (Logistic, Linear, Elastic nets)
■ GBDT/RF
■ SVD & other MF models
■ Factorization Machines
■ Restricted Boltzmann Machines
■ Markov Chains & other graphical models
■ Clustering (from k-means to modern non-parametric models)
■ Deep ANN
■ LDA
■ Association Rules
■ …

Netflix Algorithms
“Emmy Winning”

2007 Progress Prize
▪ Top 2 algorithms
▪ MF/SVD - Prize RMSE: 0.8914
▪ RBM - Prize RMSE: 0.8990
▪ Linear blend - Prize RMSE: 0.88
▪ Currently in use as part of Netflix's rating prediction component
▪ Limitations
▪ Designed for 100M ratings; we have 5B ratings
▪ Not adaptable as users add ratings
▪ Performance issues
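
The linear blend of the two Progress Prize models can be illustrated with a small sketch. It is not the Prize-winning implementation: the toy ratings, the variable names (`mf_pred`, `rbm_pred`) and the choice of a two-weight least-squares fit without intercept are all assumptions made for illustration, and the blend is scored on the same data it was fitted on.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    n = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)

def fit_linear_blend(pred_a, pred_b, targets):
    """Least-squares weights (w_a, w_b) so that w_a*pred_a + w_b*pred_b ~ targets."""
    # Entries of the 2x2 normal equations (no intercept term).
    s_aa = sum(a * a for a in pred_a)
    s_bb = sum(b * b for b in pred_b)
    s_ab = sum(a * b for a, b in zip(pred_a, pred_b))
    s_at = sum(a * t for a, t in zip(pred_a, targets))
    s_bt = sum(b * t for b, t in zip(pred_b, targets))
    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("predictors are collinear; blend weights are not unique")
    # Cramer's rule for the two weights.
    w_a = (s_at * s_bb - s_bt * s_ab) / det
    w_b = (s_bt * s_aa - s_at * s_ab) / det
    return w_a, w_b

# Hypothetical toy data: true ratings and the outputs of two models.
targets = [4.0, 3.0, 5.0, 2.0, 1.0]
mf_pred = [3.8, 3.4, 4.5, 2.5, 1.2]   # stand-in for an MF/SVD model
rbm_pred = [4.3, 2.7, 4.9, 1.6, 1.5]  # stand-in for an RBM model

w_mf, w_rbm = fit_linear_blend(mf_pred, rbm_pred, targets)
blend = [w_mf * a + w_rbm * b for a, b in zip(mf_pred, rbm_pred)]

rmse_mf = rmse(mf_pred, targets)
rmse_rbm = rmse(rbm_pred, targets)
rmse_blend = rmse(blend, targets)
```

On the fitting data the blend can never be worse than either input model, because using one model alone (weights 1 and 0) is itself a candidate solution. In practice the weights would be fitted on held-out ratings.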

Distributing ML algorithms in practice

1. Do I need all that data?
2. At what level should I distribute/parallelize?
3. What latency can I afford?

Really?
Anand Rajaraman: Former Stanford Prof. &
Senior VP at Walmart

Sometimes, it’s not about more data

[Banko and Brill, 2001]
Norvig: “Google does not have better Algorithms, only more Data”
Many features/low-bias models

At what level should I parallelize?

The three levels of Distribution/Parallelization
1. For each subset of the population (e.g. region)
2. For each combination of the hyperparameters
3. For each subset of the training data
Each level has different requirements

Level 1 Distribution
■ We may have subsets of the population for which we need to train an independently optimized model.
■ Training can be fully distributed, requiring no coordination or data communication
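
A minimal sketch of this level, assuming a toy one-feature least-squares model per region. The region names, the data and the helper names are hypothetical. Threads stand in for separate machines here; the point is that each training job reads only its own region's data and never talks to the others.

```python
from concurrent.futures import ThreadPoolExecutor

def train_region_model(examples):
    """Fit y ~ slope*x + intercept by ordinary least squares on one region's data."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in examples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

def train_all_regions(data_by_region, max_workers=4):
    """Launch one independent training job per region and collect the models."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            region: pool.submit(train_region_model, examples)
            for region, examples in data_by_region.items()
        }
        # Jobs never exchange data; we only wait for each one to finish.
        return {region: future.result() for region, future in futures.items()}

# Hypothetical per-region training sets of (feature, target) pairs.
data_by_region = {
    "region_a": [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],  # generated from y = 2x + 1
    "region_b": [(0.0, 4.0), (1.0, 3.0), (2.0, 2.0)],  # generated from y = -x + 4
}
models = train_all_regions(data_by_region)
```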

Level 2 Distribution
■ For a given subset of the population we need to find the “optimal” model
■ Train several models with different hyperparameter values
■ Worst-case: grid search
■ Can do much better than this (e.g. Bayesian Optimization with Gaussian Process Priors)
■ This process *does* require coordination
■ Need to decide on next “step”
■ Need to gather final optimal result
■ Requires data distribution, not sharing
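
The coordination pattern above can be sketched with the worst-case strategy, grid search. This is an illustrative sketch only: the toy model (a one-weight ridge fit on `x ** degree`), the data and the names `train_and_score` and `grid_search` are assumptions. A Bayesian optimizer would replace the fixed candidate list with a loop that picks the next candidate from the results gathered so far.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical data. On a real cluster each worker would hold its own copy
# (data distribution), and workers would not exchange data while training.
TRAIN = [(1.0, 1.1), (2.0, 3.9), (3.0, 9.2), (4.0, 15.8)]
VALID = [(1.5, 2.3), (2.5, 6.2), (3.5, 12.3)]

def train_and_score(params):
    """Train one model for one hyperparameter combination; return (params, validation MSE)."""
    degree, lam = params["degree"], params["lam"]
    # One-weight ridge regression on the feature x**degree (closed form).
    s_xy = sum((x ** degree) * y for x, y in TRAIN)
    s_xx = sum((x ** degree) ** 2 for x, _ in TRAIN)
    w = s_xy / (s_xx + lam)
    mse = sum((w * x ** degree - y) ** 2 for x, y in VALID) / len(VALID)
    return params, mse

def grid_search(grid, max_workers=4):
    """Coordinator: enumerate all combinations, dispatch them, gather the best result."""
    names = sorted(grid)
    candidates = [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(train_and_score, candidates))
    # Gathering step: keep the combination with the lowest validation error.
    return min(results, key=lambda result: result[1])

best_params, best_mse = grid_search({"degree": [1, 2], "lam": [0.0, 1.0, 10.0]})
```

The only communication is between the coordinator and each worker: a hyperparameter combination goes out and a score comes back.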

Level 3 Distribution
■ For each combination of hyperparameters, model training may still be expensive
■ Process requires coordination and data sharing/communication
■ Can distribute computation over machines splitting examples or parameters (e.g. ADMM)
■ Or parallelize on a single multicore machine (e.g. Hogwild)
■ Or… use GPUs
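
To make the "splitting examples" option concrete, here is a minimal sketch of synchronous data-parallel gradient descent. It is neither ADMM nor Hogwild; it is a simpler stand-in chosen to show why this level needs communication on every step. The one-weight model, the data, the learning rate and the function names are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def shard_gradient(w, shard):
    """Sum of squared-error gradients for the model y ~ w*x over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)

def data_parallel_gd(shards, lr=0.01, steps=200):
    """Each step: broadcast w, compute per-shard gradients in parallel, average, update."""
    n = sum(len(shard) for shard in shards)
    w = 0.0
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for _ in range(steps):
            # Workers need the current w (communication out) ...
            grads = pool.map(partial(shard_gradient, w), shards)
            # ... and the coordinator needs every partial gradient (communication in).
            w -= lr * sum(grads) / n
    return w

# Hypothetical training data generated from y = 3x, split into three shards.
examples = [(float(x), 3.0 * x) for x in range(1, 7)]
shards = [examples[0:2], examples[2:4], examples[4:6]]
w_fit = data_parallel_gd(shards)
```

Unlike the two previous levels, the workers cannot run to completion on their own: every update depends on results from all shards.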

ANN Training over GPUs and AWS
■ Level 1 distribution: machines over different AWS regions
■ Level 2 distribution: machines within the same AWS region
■ Use coordination tools
■ Spearmint or similar for parameter optimization
■ Condor, StarCluster, Mesos… for distributed cluster coordination
■ Level 3 parallelization: highly optimized parallel CUDA code on GPUs

3 shades of latency
▪ Blueprint for multiple algorithm services
▪ Ranking
▪ Row selection
▪ Ratings
▪ Search
▪ …
▪ Multi-layered Machine Learning

Xavier Amatriain (@xamat)
xavier@netflix.com
Thanks!
(and yes, we are hiring)
