Foundations for Scaling ML in Apache Spark
Joseph K. Bradley
August 14, 2016
Who am I?
Apache Spark committer & PMC member
Software Engineer @ Databricks
Machine Learning Department @ Carnegie Mellon
Apache Spark
•  General engine for big data computing
•  Fast
•  Easy to use
•  APIs in Python, Scala, Java & R
Components: Spark SQL, Streaming, MLlib, GraphX
Largest cluster: 8000 nodes (Tencent)
Open source
•  Apache Software Foundation
•  1000+ contributors
•  200+ companies & universities
Notable users that presented at Spark Summit 2015 San Francisco
Source: Slide 5 of Spark Community Update
MLlib: Spark’s ML library
[Chart: commits per release, v0.8 through v2.0; y-axis from 0 to 1000]
Learning tasks
•  Classification
•  Regression
•  Recommendation
•  Clustering
•  Frequent itemsets
Data utilities
•  Featurization
•  Statistics
•  Linear algebra
Workflow utilities
•  Model import/export
•  Pipelines
•  DataFrames
•  Cross validation
Goals
•  Scale-out ML
•  Standard library
•  Extensible API
MLlib: original design
RDDs
Challenges for scalability
Resilient Distributed Datasets (RDDs)
[Diagram labels: Map, Reduce, master]
Resiliency
•  Lineage
•  Caching & checkpointing
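The resiliency idea can be sketched in a few lines of plain Python. This is not Spark code: `ToyRDD` and its methods are hypothetical names used only to show how a dataset that remembers its lineage can rebuild a lost cached copy by replaying its transformations.

```python
# Toy illustration of lineage-based recovery. ToyRDD is a hypothetical
# class for this sketch and is not part of the Spark API.

class ToyRDD:
    def __init__(self, source, lineage=()):
        self._source = list(source)      # the original input data
        self._lineage = tuple(lineage)   # functions applied so far, in order
        self._cache = None               # materialized result, if cached

    def map(self, fn):
        # A transformation only extends the lineage; nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (fn,))

    def cache(self):
        self._cache = self._compute()
        return self

    def lose_cache(self):
        # Simulate losing the in-memory copy (for example, a failed worker).
        self._cache = None

    def _compute(self):
        data = self._source
        for fn in self._lineage:
            data = [fn(x) for x in data]
        return data

    def collect(self):
        # Use the cached copy if present; otherwise replay the lineage.
        return self._cache if self._cache is not None else self._compute()


squares_plus_one = ToyRDD([1, 2, 3]).map(lambda x: x * x).map(lambda x: x + 1).cache()
before = squares_plus_one.collect()
squares_plus_one.lose_cache()
after = squares_plus_one.collect()   # rebuilt from the source via the lineage
print(before == after)               # True
```

Caching trades memory for speed; the lineage is what makes losing the cache safe.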
ML on RDDs: the good
Flexible: GLMs, trees, matrix factorization, etc.
Scalable: E.g., Alternating Least Squares on Spotify data
•  50+ million users x 30+ million songs
•  50 billion ratings
Cost ~ $10
•  32 r3.8xlarge nodes (spot instances)
•  For rank 10 with 10 iterations, ~1 hour running time.
ML on RDDs: the challenges
Partitioning
•  Data partitioning impacts performance.
•  E.g., for Alternating Least Squares
Lineage
•  Iterative algorithms → long RDD lineage
•  Solvable via careful caching and checkpointing
JVM
•  Garbage collection (GC)
•  Boxed types
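The lineage challenge above can be illustrated with a toy model in plain Python. The names (`Node`, `step`, `checkpoint`, `run`) are hypothetical and the update rule is arbitrary; the sketch only shows that each iteration adds one link to the chain and that periodic checkpointing bounds the chain length without changing the result.

```python
# Toy model of lineage growth in an iterative algorithm. All names here
# are hypothetical and not Spark API.

class Node:
    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent   # the dataset this one was derived from

    def step(self):
        # One iteration derives a new dataset from the current one.
        return Node(self.value * 0.5 + 1.0, parent=self)

    def depth(self):
        # Length of the chain that would be replayed after a failure.
        d, node = 0, self
        while node.parent is not None:
            d, node = d + 1, node.parent
        return d

    def checkpoint(self):
        # Persist the current value and forget how it was derived.
        return Node(self.value)


def run(iterations, checkpoint_every=None):
    node, max_depth = Node(0.0), 0
    for i in range(1, iterations + 1):
        node = node.step()
        max_depth = max(max_depth, node.depth())
        if checkpoint_every and i % checkpoint_every == 0:
            node = node.checkpoint()
    return node.value, max_depth


value_plain, depth_plain = run(100)
value_ckpt, depth_ckpt = run(100, checkpoint_every=10)
print(depth_plain, depth_ckpt)   # 100 10
```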
MLlib: current status
DataFrame & Dataset integration
Pipelines API
Spark DataFrames & Datasets
dept   age   name
Bio    48    H Smith
CS     34    A Turing
Bio    43    B Jones
Chem   61    M Kennedy
Data grouped into named columns
DSL for common tasks
•  Project, filter, aggregate, join, …
•  100+ functions available
•  User-Defined Functions (UDFs)
data.groupBy("dept").avg("age")
Datasets: Strongly typed DataFrames
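To show what the group-by-and-average expression on this slide computes, here is a plain-Python stand-in applied to the four example rows. It does not use Spark; `group_by_avg` is a hypothetical helper written for this sketch.

```python
from collections import defaultdict

# Rows from the example table: (dept, age, name).
rows = [
    ("Bio", 48, "H Smith"),
    ("CS", 34, "A Turing"),
    ("Bio", 43, "B Jones"),
    ("Chem", 61, "M Kennedy"),
]

def group_by_avg(rows, key_index, value_index):
    """Group rows by one column and average another (plain-Python stand-in)."""
    totals = defaultdict(lambda: [0.0, 0])   # key -> [running sum, count]
    for row in rows:
        acc = totals[row[key_index]]
        acc[0] += row[value_index]
        acc[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

avg_age = group_by_avg(rows, key_index=0, value_index=1)
print(avg_age)  # {'Bio': 45.5, 'CS': 34.0, 'Chem': 61.0}
```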
DataFrame optimizations
Catalyst query optimizer
•  Predicate pushdown
•  Join selection
•  …
Project Tungsten
•  Memory management: off-heap, avoids JVM GC, compressed format
•  Code generation: combines operations into single, efficient code blocks
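Predicate pushdown can be sketched without Spark. In the plain-Python example below, the `buildings` table and the helper names are made up for illustration. Filtering on the join key before the join gives the same rows as filtering afterwards, while comparing far fewer row pairs. A real optimizer performs this rewrite on a query plan rather than on Python lists.

```python
# Plain-Python sketch of predicate pushdown. The second table and all
# helper names are hypothetical.

people = [
    {"dept": "Bio", "age": 48, "name": "H Smith"},
    {"dept": "CS", "age": 34, "name": "A Turing"},
    {"dept": "Bio", "age": 43, "name": "B Jones"},
    {"dept": "Chem", "age": 61, "name": "M Kennedy"},
]
buildings = [  # made-up second table
    {"dept": "Bio", "building": "North"},
    {"dept": "CS", "building": "East"},
    {"dept": "Chem", "building": "West"},
]

def join(left, right, key):
    """Nested-loop inner join; also reports how many row pairs were compared."""
    out, compared = [], 0
    for l in left:
        for r in right:
            compared += 1
            if l[key] == r[key]:
                out.append({**l, **r})
    return out, compared

def is_bio(row):
    return row["dept"] == "Bio"

# Plan as written: join everything, then filter.
joined, cost_naive = join(people, buildings, "dept")
naive = [row for row in joined if is_bio(row)]

# Plan after pushdown: filter each input first, then join the survivors.
pushed, cost_pushed = join(
    [p for p in people if is_bio(p)],
    [b for b in buildings if is_bio(b)],
    "dept",
)

print(naive == pushed, cost_naive, cost_pushed)   # True 12 2
```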
ML Pipelines
•  DataFrames: unified ML dataset API
•  Flexible types
•  Add & remove columns during Pipeline execution
Load data
Pipeline: Original dataset → Feature extraction → Predictive model → Evaluation
Text                      Label
I bought the game...      4
Do NOT bother try...      1
this shirt is aweso...    5
never got it. Seller...   1
I ordered this to...      3
Extract features
Text                      Label   Words                Features
I bought the game...      4       "i", "bought",...    [1, 0, 3, 9, ...]
Do NOT bother try...      1       "do", "not",...      [0, 0, 11, 0, ...]
this shirt is aweso...    5       "this", "shirt"      [0, 2, 3, 1, ...]
never got it. Seller...   1       "never", "got"       [1, 2, 0, 0, ...]
I ordered this to...      3       "i", "ordered"       [1, 0, 0, 3, ...]
Fit a model
Text                      Label   Words                Features             Prediction   Probability
I bought the game...      4       "i", "bought",...    [1, 0, 3, 9, ...]    4            0.8
Do NOT bother try...      1       "do", "not",...      [0, 0, 11, 0, ...]   2            0.6
this shirt is aweso...    5       "this", "shirt"      [0, 2, 3, 1, ...]    5            0.9
never got it. Seller...   1       "never", "got"       [1, 2, 0, 0, ...]    1            0.7
I ordered this to...      3       "i", "ordered"       [1, 0, 0, 3, ...]    4            0.7
Evaluate
Text                      Label   Words                Features             Prediction   Probability
I bought the game...      4       "i", "bought",...    [1, 0, 3, 9, ...]    4            0.8
Do NOT bother try...      1       "do", "not",...      [0, 0, 11, 0, ...]   2            0.6
this shirt is aweso...    5       "this", "shirt"      [0, 2, 3, 1, ...]    5            0.9
never got it. Seller...   1       "never", "got"       [1, 2, 0, 0, ...]    1            0.7
I ordered this to...      3       "i", "ordered"       [1, 0, 0, 3, ...]    4            0.7
ML Pipelines
DataFrames: unified ML dataset API
•  Flexible types
•  Add & remove columns during Pipeline execution
•  Materialize columns lazily
•  Inspect intermediate results
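The column-adding behavior described on this slide can be sketched in plain Python. This is not the MLlib Pipelines API. The stage functions, the shortened review texts, and the feature size of 8 are all assumptions made for illustration. Each stage reads existing columns of a row and adds a new one, rows are produced lazily, and intermediate columns stay available for inspection.

```python
import zlib

# Plain-Python sketch of the pipeline idea: each stage reads existing
# columns of a row (a dict) and adds a new one. Stage names are hypothetical.

def tokenize(row):
    return {**row, "words": row["text"].lower().split()}

def hashing_tf(row, num_features=8):
    # Count words into a fixed-size vector; crc32 gives a stable hash.
    features = [0] * num_features
    for word in row["words"]:
        features[zlib.crc32(word.encode("utf-8")) % num_features] += 1
    return {**row, "features": features}

def run_pipeline(rows, stages):
    # Generator: rows are transformed only when the caller asks for them.
    for row in rows:
        for stage in stages:
            row = stage(row)
        yield row

# Made-up, shortened versions of the example reviews.
data = [
    {"text": "I bought the game", "label": 4},
    {"text": "never got it", "label": 1},
]
result = list(run_pipeline(data, [tokenize, hashing_tf]))
print(result[0]["words"], sum(result[0]["features"]))
```

Because every stage returns a new dict with all earlier columns intact, the `words` column can still be inspected after `features` has been computed.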
Under the hood: optimizations
Current use of DataFrames
•  API
•  Transformations & predictions
Feature transformation & model prediction are phrased as User-Defined Functions (UDFs)
→ Catalyst query optimizer
→ Tungsten memory management + code generation
MLlib: future scaling
DataFrames for training
Potential benefits
•  Spilling to disk
•  Catalyst
•  Tungsten
Challenges remaining
Implementing ML on DataFrames
[Diagram labels: Map, Reduce, master]
Scalability
DataFrames automatically spill to disk
→ Classic pain point of RDDs: java.lang.OutOfMemoryError
Goal: Smoothly scale, without custom per-algorithm optimizations
Catalyst in ML
Key idea: automatic query (ML algorithm) optimization
•  DataFrame operations are lazy.
•  Express entire algorithm as DataFrame operations.
•  Let Catalyst reorganize the algorithm, data, etc.
→ Fewer manual optimizations
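The key idea on this slide (record operations lazily, then let an optimizer rewrite them before execution) can be sketched in plain Python. `LazyFrame` is a hypothetical toy class, not Catalyst or the DataFrame API, and it applies only one rewrite rule: merging adjacent filters.

```python
# Toy lazy "query plan": operations are recorded first and optimized before
# running. LazyFrame and its methods are hypothetical, not Spark's API.

class LazyFrame:
    def __init__(self, rows, plan=()):
        self.rows = rows
        self.plan = tuple(plan)          # recorded (op, function) steps

    def filter(self, pred):
        return LazyFrame(self.rows, self.plan + (("filter", pred),))

    def map(self, fn):
        return LazyFrame(self.rows, self.plan + (("map", fn),))

    def optimized_plan(self):
        # One simple rewrite: merge adjacent filters into a single predicate.
        merged = []
        for op, fn in self.plan:
            if op == "filter" and merged and merged[-1][0] == "filter":
                prev = merged.pop()[1]
                merged.append(("filter", lambda x, a=prev, b=fn: a(x) and b(x)))
            else:
                merged.append((op, fn))
        return merged

    def collect(self):
        # Execution happens only here, in one pass over the rows.
        plan = self.optimized_plan()
        out = []
        for x in self.rows:
            keep = True
            for op, fn in plan:
                if op == "filter":
                    if not fn(x):
                        keep = False
                        break
                else:
                    x = fn(x)
            if keep:
                out.append(x)
        return out


query = (LazyFrame(range(10))
         .filter(lambda x: x % 2 == 0)
         .filter(lambda x: x > 2)
         .map(lambda x: x * 10))
print(len(query.plan), len(query.optimized_plan()), query.collect())
# 3 2 [40, 60, 80]
```

The caller writes three steps and never merges anything by hand; the rewrite happens between recording and execution.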
Tungsten in ML
Tungsten: off-heap memory management
•  Avoids JVM GC
•  Uses efficient storage formats
•  Code generation
Issue in ML: object creation during each iteration
Issue in ML: Array[(Int,Double,Double)]
Issue in ML: Volcano iterator model in MR/RDDs
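The storage-format issue can be illustrated by analogy in CPython, which is not the JVM. A list of `(int, float, float)` tuples stores every element as a separate object, loosely like an array of boxed tuples, while one packed primitive array per field stores raw values. The byte counts are rough, interpreter-dependent estimates; only the comparison is meaningful.

```python
import sys
from array import array

# CPython analogy for boxed vs. packed storage. Exact byte counts depend
# on the interpreter; only the comparison matters.

n = 10_000

# Row-oriented and boxed: a list of (int, float, float) tuples, loosely
# analogous to Array[(Int,Double,Double)].
boxed = [(i, i * 0.5, i * 0.25) for i in range(n)]

# Column-oriented and packed: one primitive array per field.
ids = array("i", range(n))
xs = array("d", (i * 0.5 for i in range(n)))
ys = array("d", (i * 0.25 for i in range(n)))

def boxed_bytes(rows):
    # Rough estimate: container + every tuple + every element object.
    total = sys.getsizeof(rows)
    for row in rows:
        total += sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row)
    return total

packed_bytes = sum(sys.getsizeof(col) for col in (ids, xs, ys))
print(boxed_bytes(boxed) > packed_bytes)
```

Both layouts hold the same values; the packed layout also avoids creating one object per element, which is the kind of allocation pressure the slide attributes to GC.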
Prototyping ML on DataFrames
Currently:
•  Belief propagation
•  Connected components
Current challenges:
•  DataFrame query plans do not have iteration as a top-level concept
•  ML/Graph-specific optimizations for Catalyst query planner
Eventual goal: Port all ML algorithms to run on top of DataFrames
→ speed & scalability
To summarize...
MLlib on RDDs
•  Required custom optimizations
MLlib with a DataFrame-based API
•  Friendly API
•  Improvements for prediction
MLlib on DataFrames
•  Potential for even greater scaling for training
•  Simpler for non-experts to write new algorithms
Get started
Get involved
•  JIRA http://issues.apache.org
•  mailing lists http://spark.apache.org
•  GitHub http://github.com/apache/spark
•  Spark Packages http://spark-packages.org
Learn more
•  New in Apache Spark 2.0 http://databricks.com/blog/2016/06/01
•  MOOCs on EdX http://databricks.com/spark/training
Try out Apache Spark 2.0 in Databricks Community Edition
http://databricks.com/ce
Many thanks to the community for contributions & support!
Databricks
Founded by the creators of Apache Spark
Offers hosted service
•  Spark on EC2
•  Notebooks
•  Visualizations
•  Cluster management
•  Scheduled jobs
We’re hiring!
Thank you!
Twitter: @jkbatcmu

Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
