Asynchronous Hyperparameter Optimization with Apache Spark
Jim Dowling, Logical Clocks AB and KTH (@jim_dowling)
Moritz Meister, Logical Clocks AB (@morimeister)
#UnifiedDataAnalytics #SparkAISummit
The Bitter Lesson (of AI)*
“Methods that scale with computation are the future of AI”**
Rich Sutton (Father of Reinforcement Learning)
“The two (general purpose) methods that seem to scale ... are search and learning.”*
* http://www.incompleteideas.net/IncIdeas/BitterLesson.html
** https://www.youtube.com/watch?v=EeMCEQa85tw
Spark scales with available compute => Spark is the answer!
This talk is about why bulk-synchronous parallel compute (Spark) does not scale efficiently for search, and how we made Spark efficient for directed search (task-based asynchronous parallel compute).
Inner and Outer Loop of Deep Learning
[Figure: the Inner Loop trains on the Training Data across worker1, worker2, …, workerN, which synchronize their updates ∆1, ∆2, …, ∆N. The Outer Loop feeds the resulting Metric to a Search Method, which proposes new HParams.]
http://tiny.cc/51yjdz
Inner Loop: LEARNING. Outer Loop: SEARCH.
Hopsworks: a platform for Data-Intensive AI
Hopsworks Technical Milestones
Timeline: 2017, 2018, 2019
• World’s first Hadoop platform to support GPUs-as-a-Resource
• World’s fastest HDFS, published at USENIX FAST with Oracle and Spotify
• World’s first open-source Feature Store for Machine Learning
• World’s first distributed filesystem to store small files in metadata on NVMe disks
• Winner of the IEEE Scale Challenge 2017 with HopsFS (1.2m ops/sec)
• World’s most scalable POSIX-like hierarchical filesystem with multi-data-center availability (1.6m ops/sec on GCP)
• First non-Google ML platform with TensorFlow Extended (TFX) support through Beam/Flink
• World’s first unified hyperparameter and ablation study framework
The Complexity of Deep Learning
Data → Model φ(x) → Prediction, surrounded by:
• Data validation
• Distributed Training
• Model Serving
• A/B Testing
• Monitoring
• Pipeline Management
• HyperParameter Tuning
• Feature Engineering
• Data Collection
• Hardware Management
[Adapted from Sculley et al. “Technical Debt of ML”]
Hopsworks Feature Store
Hopsworks REST API
[Figure: Hopsworks architecture. Labels: Datasources; Applications; API; Dashboards; Apache Beam; Apache Spark; Pip; Conda; Tensorflow; scikit-learn; Keras; Jupyter Notebooks; Tensorboard; Apache Flink; Kubernetes; Batch; Streaming; Distributed ML & DL; Model Serving; Hopsworks Feature Store; Kafka + Spark Streaming; Model Monitoring.]
Orchestration in Airflow:
• Data Preparation & Ingestion
• Experimentation & Model Training
• Deploy & Productionalize
Filesystem and Metadata storage: HopsFS
“AI is the new Electricity” (Andrew Ng)
What engine should we use?
Engines Matter!
[Image from https://twitter.com/_youhadonejob1/status/1143968337359187968?s=20]
Engines Really Matter!
Photo by Zbynek Burival on Unsplash
Hopsworks Engine: ML Pipelines
Horizontal Scalability at all Stages
• Data Pipelines: Ingest & Prep
• Feature Store
• Machine Learning Experiments: Data Parallel Training, Ablation Studies, Hyperparameter Optimization
• Model Serving
Bottleneck, due to
• iterative nature
• human interaction
Iterative Model Development
• Trial and Error is slow
• Iterative approach is greedy
• Search spaces are usually large
• Sensitivity and interaction of hyperparameters
Cycle: Set Hyperparameters → Train Model → Evaluate Performance
Black Box Optimization
[Figure: meta-level learning & optimization picks hyperparameters from the search space, passes them to the learning black box, and receives a metric back.]
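The black-box loop above can be sketched in a few lines. This is a minimal illustration using random search, not Maggy's implementation; `toy_objective` and the bounds in `space` are hypothetical stand-ins for "train a model and return its validation metric".

```python
import random


def random_search(objective, search_space, num_trials, seed=0):
    """Treat `objective` as a black box: sample hyperparameters, observe a metric."""
    rng = random.Random(seed)
    best_params, best_metric = None, float("-inf")
    for _ in range(num_trials):
        # Sample one configuration uniformly from the search space.
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in search_space.items()
        }
        # The metric is the only information the optimizer sees.
        metric = objective(**params)
        if metric > best_metric:
            best_params, best_metric = params, metric
    return best_params, best_metric


def toy_objective(lr, dropout):
    # Hypothetical stand-in for training a model; its maximum is 1.0.
    return 1.0 - (lr - 0.1) ** 2 - (dropout - 0.3) ** 2


space = {"lr": (0.0, 1.0), "dropout": (0.0, 0.9)}
best_params, best_metric = random_search(toy_objective, space, num_trials=200)
```

Because the optimizer only calls `objective(**params)`, the same loop works for any model; smarter search methods differ only in how they choose the next `params`.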
Parallel Black Box Optimization
• Which algorithm to use for search?
• How to monitor progress?
• Fault tolerance?
• How to aggregate results?
[Figure: meta-level learning & optimization fills a queue with trials drawn from the search space; parallel workers run the learning black box on each trial and return a metric.]
This should be managed with platform support!
Maggy
A flexible framework for running different black-box optimization algorithms on Hopsworks: ASHA, Bayesian Optimization, Random Search, Grid Search and more to come…
Synchronous Search
[Figure: the Driver runs trials in three stages of N tasks each (Task11 … Task1N, Task21 … Task2N, Task31 … Task3N). Every stage ends at a Barrier and produces Metrics1, Metrics2, Metrics3; HDFS is shown as the shared storage.]
Add Early Stopping
[Figure: the same three-stage execution with early stopping. Tasks that stop early still wait at each Barrier, and that idle time is marked as Wasted Compute in every stage.]
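The cost of the barrier can be made concrete with a small simulation. The sketch below compares stage-based execution with barrier-free scheduling; the trial durations are hypothetical (short trials stand for early-stopped ones), and the scheduling model is a simplification, not a measurement of Spark.

```python
import heapq


def synchronous_makespan(durations, num_workers):
    """Run trials in stages; a barrier makes each stage last as long as its slowest trial."""
    total, idle = 0.0, 0.0
    for start in range(0, len(durations), num_workers):
        stage = durations[start:start + num_workers]
        stage_time = max(stage)
        total += stage_time
        # Workers that finished early (or had no trial) wait at the barrier.
        idle += stage_time * num_workers - sum(stage)
    return total, idle


def asynchronous_makespan(durations, num_workers):
    """Each worker takes the next trial as soon as it is free (no barrier)."""
    free_at = [0.0] * num_workers  # min-heap of times at which workers become free
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    makespan = max(free_at)
    idle = makespan * num_workers - sum(durations)
    return makespan, idle


# Hypothetical workload: trials that were stopped early finish after 2 time units.
durations = [10, 2, 2, 2, 10, 2, 2, 2]
sync_time, sync_idle = synchronous_makespan(durations, num_workers=4)
async_time, async_idle = asynchronous_makespan(durations, num_workers=4)
```

With these numbers the synchronous run takes 20 time units with 48 worker-units idle, while the asynchronous run takes 12 with 16 idle.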
Performance Enhancement
Early Stopping:
● Median Stopping Rule
● Performance curve prediction
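As an illustration of the median stopping rule, the sketch below stops a trial when its best metric so far is below the median of the running averages of completed trials at the same step. It assumes a metric where higher is better; the function name and the example histories are hypothetical.

```python
from statistics import median


def should_stop(trial_history, completed_histories, step):
    """Median stopping rule for a metric where higher is better."""
    # Running average up to `step` for every completed trial that got that far.
    running_averages = [
        sum(history[:step]) / step
        for history in completed_histories
        if len(history) >= step
    ]
    if not running_averages:
        return False  # nothing to compare against yet
    best_so_far = max(trial_history[:step])
    return best_so_far < median(running_averages)


# Hypothetical per-step validation accuracies of three finished trials.
completed = [[0.5, 0.6, 0.7], [0.4, 0.5, 0.6], [0.6, 0.7, 0.8]]
stop_weak = should_stop([0.2, 0.3], completed, step=2)
stop_strong = should_stop([0.5, 0.7], completed, step=2)
```

The weak trial is stopped and the strong one continues; the rule needs no model of the learning curve, only the metric histories collected so far.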
Multi-fidelity Methods:
● Successive Halving Algorithm
● Hyperband
Figure: Successive Halving Algorithm
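A minimal sketch of synchronous successive halving: evaluate all configurations on a small budget, keep the best 1/eta, multiply the budget by eta, and repeat. The `evaluate` function and the candidate learning rates are hypothetical; a real run would train each model for `budget` epochs.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the top 1/eta of configurations at each rung, growing the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    rungs = []  # (budget, number of configurations evaluated) per rung
    while len(survivors) > 1:
        rungs.append((budget, len(survivors)))
        # Every survivor must be evaluated before anyone is promoted.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], rungs


def evaluate(lr, budget):
    # Hypothetical score: more budget helps, and lr = 0.1 is the best setting.
    return budget * (1.0 - abs(lr - 0.1))


learning_rates = [0.001, 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1.0]
best, rungs = successive_halving(learning_rates, evaluate)
```

The `sorted` call over all survivors is the synchronization point: no configuration moves up until the whole rung has finished.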
Synchronous Successive Halving
Kevin G. Jamieson et al. “Non-stochastic Best Arm Identification and Hyperparameter Optimization” (2015).
Animation: https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/
Asynchronous Successive Halving
Liam Li et al. “Massively Parallel Hyperparameter Tuning” (2018).
Animation: https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/
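The asynchronous variant removes the wait for a full rung: whenever a worker is free, it promotes a configuration that is already in the top 1/eta of its rung, and otherwise starts a new configuration at the bottom. The class below is a simplified sketch of that promotion rule; the class and method names are hypothetical and this is not Maggy's optimizer.

```python
class ASHA:
    """Simplified asynchronous successive halving scheduler."""

    def __init__(self, eta=2, max_rung=3):
        self.eta = eta
        self.max_rung = max_rung
        # rungs[k] maps config id -> metric for trials finished at rung k.
        self.rungs = [dict() for _ in range(max_rung + 1)]
        # promoted[k] holds config ids already promoted out of rung k.
        self.promoted = [set() for _ in range(max_rung + 1)]
        self.next_id = 0

    def report(self, config_id, rung, metric):
        self.rungs[rung][config_id] = metric

    def get_job(self):
        """Return (config_id, rung) for a free worker without waiting for a rung to fill."""
        for k in reversed(range(self.max_rung)):
            results = self.rungs[k]
            top = sorted(results, key=results.get, reverse=True)
            for config_id in top[:len(results) // self.eta]:
                if config_id not in self.promoted[k]:
                    self.promoted[k].add(config_id)
                    return config_id, k + 1
        # Nothing is promotable: grow the bottom rung with a new configuration.
        config_id = self.next_id
        self.next_id += 1
        return config_id, 0


asha = ASHA(eta=2, max_rung=2)
first = asha.get_job()
second = asha.get_job()
asha.report(0, rung=0, metric=0.9)
asha.report(1, rung=0, metric=0.5)
third = asha.get_job()
fourth = asha.get_job()
```

Here the third job promotes configuration 0 to rung 1 as soon as two results exist, and the fourth job starts a fresh configuration instead of idling.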
Challenge
How can we fit this into the bulk synchronous execution model of Spark?
Mismatch: Spark Tasks and Stages vs. Trials
Databricks’ approach: Project Hydrogen (barrier execution mode) & SparkTrials in Hyperopt
The Solution
Long running tasks:
[Figure: the Driver starts Task11 … Task1N once, with a single Barrier at the end. Tasks send Metrics to the Driver, which answers with a New Trial or an Early Stop.]
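The long-running-task pattern can be sketched with plain Python threads standing in for Spark tasks. Each worker stays alive and keeps pulling trials, and the driver issues a new trial whenever a result arrives. This is an assumption-laden illustration of the idea, not Maggy's code: `run_experiment`, `suggest`, and `train` are hypothetical names, and queues replace the driver-to-task communication.

```python
import queue
import threading


def worker(worker_id, trials, results, train):
    """Long-running task: keeps pulling trials until it receives the None sentinel."""
    while True:
        trial = trials.get()
        if trial is None:
            break
        results.put((worker_id, trial, train(trial)))


def run_experiment(suggest, train, num_trials, num_workers):
    trials, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(i, trials, results, train))
        for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    finished, issued = [], 0
    # Give every worker an initial trial.
    for _ in range(min(num_workers, num_trials)):
        trials.put(suggest(finished))
        issued += 1
    # Hand out a new trial as soon as any result comes back (no barrier).
    while len(finished) < num_trials:
        _, trial, metric = results.get()
        finished.append((trial, metric))
        if issued < num_trials:
            trials.put(suggest(finished))
            issued += 1
    for _ in threads:
        trials.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return finished


learning_rates = [0.001, 0.01, 0.05, 0.1, 0.2, 0.4]
pending = iter(learning_rates)
finished = run_experiment(
    suggest=lambda history: next(pending),      # hypothetical search method
    train=lambda lr: 1.0 - (lr - 0.1) ** 2,     # hypothetical training function
    num_trials=len(learning_rates),
    num_workers=3,
)
```

Since `suggest` receives the results collected so far, a directed search method can use them to choose the next trial while other workers are still busy.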
Enter Maggy
User API
Developer API
Results
[Figures: Hyperparameter Optimization Task and ASHA Validation Task, each comparing ASHA, RS-ES, and RS-NS.]
Ablation
[Figure: example feature sets, one with PClass, name, survive, sex and one with sex, name, survive.]
Replacing the Maggy Optimizer with an Ablator:
• Feature Ablation using the Feature Store
• Leave-One-Layer-Out Ablation
• Leave-One-Component-Out (LOCO)
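A leave-one-component-out study can be expressed as a trial generator: one baseline trial with every component, then one trial per removed component. The sketch below is a hypothetical illustration of that idea, not the Maggy Ablation API; the feature names are placeholders.

```python
def leave_one_out_trials(components):
    """Yield (ablated_component, remaining_components), starting with the full baseline."""
    yield None, list(components)
    for ablated in components:
        yield ablated, [c for c in components if c != ablated]


# Hypothetical feature names; layers or other model components work the same way.
features = ["pclass", "name", "sex"]
trials = list(leave_one_out_trials(features))
```

Each generated trial can be trained independently, so ablation trials can be scheduled on parallel workers in the same way as hyperparameter trials.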
Ablation API
Conclusion
● Avoid iterative Hyperparameter Optimization
● Black box optimization is hard
● State-of-the-art algorithms can be deployed asynchronously
● Maggy: platform support for automated hyperparameter optimization and ablation studies
● Save resources with asynchronism
● Early stopping for sensible models
What next?
• More algorithms
• Comparability of experiments
• Implicit Provenance
• Support for PyTorch
Thank you!
Box 1263, Isafjordsgatan 22
Kista, Stockholm
https://www.logicalclocks.com
Register for a free account at www.hops.site
Twitter: @logicalclocks, @hopsworks
GitHub:
https://github.com/hopshadoop/maggy
https://maggy.readthedocs.io/en/latest/
https://github.com/logicalclocks/hopsworks
Acknowledgements and References
Thanks to the entire Logical Clocks Team
Contributions from colleagues:
Robin Andersson @robzor92
Sina Sheikholeslami @cutlash
Kim Hammar @KimHammar1
Alex Ormenisan @alex_ormenisan
• Maggy
https://github.com/logicalclocks/maggy or https://maggy.readthedocs.io/en/latest/
• Feature Store: the missing data layer in ML pipelines?
https://www.logicalclocks.com/feature-store/
• Hopsworks white paper.
https://www.logicalclocks.com/whitepapers/hopsworks
• ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata, CCGrid, 2019
