Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16

Foundations for Scaling ML
in Apache Spark
Joseph K. Bradley
August 14, 2016
® ™

Who am I?
Apache Spark committer & PMC member
Software Engineer @ Databricks
Machine Learning Department @ Carnegie Mellon
2

•  General engine for big data computing
•  Fast
•  Easy to use
•  APIs in Python, Scala, Java & R
3
Apache Spark
Spark SQL Streaming MLlib GraphX
Largest cluster:
8000 Nodes (Tencent)
Open source
•  Apache Software Foundation
•  1000+ contributors
•  200+ companies & universities

NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN
FRANCISCO
Source: Slide 5 of Spark Community Update

MLlib: Spark’s ML library
5
0
500
1000
v0.8 v0.9 v1.0 v1.1 v1.2 v1.3 v1.4 v1.5 v1.6 v2.0
commits/release
Learning tasks
Classification
Regression
Recommendation
Clustering
Frequent itemsets
Data utilities
Featurization
Statistics
Linear algebra
Workflow utilities
Model import/export
Pipelines
DataFrames
Cross validation
Goals
Scale-out ML
Standard library
Extensible API

MLlib: original design
RDDs
Challenges for scalability
6

Resilient Distributed Datasets (RDDs)
7
Map Reduce
master

8

9
Resiliency
•  Lineage
•  Caching &
checkpointing

ML on RDDs: the good
Flexible: GLMs, trees, matrix factorization, etc.
Scalable: E.g., Alternating Least Squares on Spotify data
•  50+ million users x 30+ million songs
•  50 billion ratings
Cost ~ $10
•  32 r3.8xlarge nodes (spot instances)
•  For rank 10 with 10 iterations, ~1 hour running time.
10

ML on RDDs: the challenges
Partitioning
•  Data partitioning impacts performance.
•  E.g., for Alternating Least Squares
Lineage
•  Iterative algorithms à long RDD lineage
•  Solvable via careful caching and checkpointing
JVM
•  Garbage collection (GC)
•  Boxed types
11

MLlib: current status
DataFrame & Dataset integration
Pipelines API
12

Spark DataFrames & Datasets
13
dept age name
Bio 48 H Smith
CS 34 A Turing
Bio 43 B Jones
Chem 61 M Kennedy
Data grouped into
named columns
DSL for common tasks
•  Project, filter, aggregate, join, …
•  100+ functions available
•  User-Defined Functions (UDFs)
data.groupBy(“dept”).avg(“age”)
Datasets: Strongly typed DataFrames

DataFrame optimizations
Catalyst query optimizer
Project Tungsten
• Memory management
• Code generation
14
Predicate pushdown
Join selection
…
Oﬀ-heap
Avoid JVM GC
Compressed format
Combine operations into single,
eﬀicient code blocks

ML Pipelines
•  DataFrames: unified ML dataset API
•  Flexible types
•  Add & remove columns during Pipeline execution
15

Load data
Feature
extracIon
Original
dataset
16
PredicIve
model
EvaluaIon
Text Label
I bought the game... 4
Do NOT bother try... 1
this shirt is aweso... 5
never got it. Seller... 1
I ordered this to... 3

Extract features
Feature
extracIon
Original
dataset
17
PredicIve
model
EvaluaIon
Text Label Words Features
I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...]
Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...]
this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...]
never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...]
I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...]

Fit a model
Feature
extracIon
Original
dataset
18
PredicIve
model
EvaluaIon
Text Label Words Features Prediction Probability
I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8
Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6
this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9
never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7
I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7

Evaluate
Feature
extracIon
Original
dataset
19
PredicIve
model
EvaluaIon
Text Label Words Features Prediction Probability
I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8
Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6
this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9
never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7
I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7

ML Pipelines
DataFrames: unified ML dataset API
•  Flexible types
•  Add & remove columns during Pipeline execution
•  Materialize columns lazily
•  Inspect intermediate results
20

Under the hood: optimizations
Current use of DataFrames
•  API
•  Transformations & predictions
21
Feature transformation & model
prediction are phrased as User-
Defined Functions (UDFs)
à Catalyst query optimizer
à Tungsten memory management
+ code generation

MLlib: future scaling
DataFrames for training
Potential benefits
•  Spilling to disk
•  Catalyst
•  Tungsten
Challenges remaining
22

Implementing ML on DataFrames
23
Map Reduce
master

Scalability
DataFrames automatically spill to disk
à Classic pain point of RDDs
24
java.lang.OutOfMemoryError
Goal: Smoothly scale, without custom per-algorithm optimizations

Catalyst in ML
Key idea: automatic query (ML algorithm) optimization
•  DataFrame operations are lazy.
•  Express entire algorithm as DataFrame operations.
•  Let Catalyst reorganize the algorithm, data, etc.
à Fewer manual optimizations
25

Tungsten in ML
Tungsten: oﬀ-heap memory management
•  Avoids JVM GC
•  Uses eﬀicient storage formats
•  Code generation
26
Issue in ML:
object creation during each iteration
Issue in ML:
Array[(Int,Double,Double)]
Issue in ML:
Volcano iterator model in MR/RDDs

Prototyping ML on DataFrames
Currently:
•  Belief propagation
•  Connected components
Current challenges:
•  DataFrame query plans do not have iteration as a top-level concept
•  ML/Graph-specific optimizations for Catalyst query planner
Eventual goal: Port all ML algorithms to run on top of DataFrames
à speed & scalability
27

To summarize...
MLlib on RDDs
•  Required custom optimizations
MLlib with a DataFrame-based API
•  Friendly API
•  Improvements for prediction
MLlib on DataFrames
•  Potential for even greater scaling for training
•  Simpler for non-experts to write new algorithms
28

Get started
Get involved
•  JIRA http://issues.apache.org
•  mailing lists http://spark.apache.org
•  Github http://github.com/apache/spark
•  Spark Packages http://spark-packages.org
Learn more
•  New in Apache Spark 2.0
http://databricks.com/blog/2016/06/01
•  MOOCs on EdX http://databricks.com/spark/training
29
Try out Apache Spark 2.0 in
Databricks Community Edition
http://databricks.com/ce
Many thanks to the community
for contributions & support!

Databricks
Founded by the creators of Apache Spark
Oﬀers hosted service
•  Spark on EC2
•  Notebooks
•  Visualizations
•  Cluster management
•  Scheduled jobs
30
We’re hiring!

Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16

More Related Content

Similar to Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16 (20)

More from BigMine (10)

Recently uploaded (20)

Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16