Distributed ML with
Dask & Kubernetes
Ray Hilton, Eliiza
@rayh @EliizaAI
Distributed ML with Dask and Kubernetes
What is Machine Learning?
[Diagram] Traditional Software: DATA + LOGIC → COMPUTE → OUTPUT
[Diagram] Machine Learning: DATA + OUTPUT → COMPUTE (lots and lots of) → LOGIC
[Diagram] Learning: TRAINING DATA + LABELS/OUTPUT → COMPUTE (lots and lots of) → LOGIC
[Diagram] Inference: RUNTIME DATA → COMPUTE (not much of) → OUTPUT
[Diagram] Engineering: BUSINESS REQUIREMENTS → ENGINEER’S BRAIN / DATA SCIENTIST’S BRAIN → LOGIC; Runtime: RUNTIME DATA → COMPUTE → OUTPUT
Make predictions based on
previous experience
Distributed ML with Dask and Kubernetes
What is Dask?
It’s like Spark,
but idiomatically Python
“Dask uses existing Python APIs and data structures to make it easy to switch between Numpy, Pandas, Scikit-learn to their Dask-powered equivalents.”
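The quote is easiest to see in code. Below is a minimal sketch (the file path and column names are made up) of how a pandas workflow maps onto dask.dataframe: the API is nearly identical, but Dask builds a lazy, partitioned computation that only runs when .compute() is called.

    import pandas as pd
    import dask.dataframe as dd

    # pandas: the whole file is loaded into memory at once
    pdf = pd.read_csv("trips.csv")
    print(pdf.groupby("vendor").fare.mean())

    # Dask: the same expression builds a lazy, partitioned dataframe instead
    ddf = dd.read_csv("trips-*.csv")                     # one partition per file/block
    print(ddf.groupby("vendor").fare.mean().compute())   # nothing runs until .compute()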
Distributed ML with Dask and Kubernetes
[Diagram] f(df) split across partitions: MAP applies f() to f(df1) … f(df5) in parallel; REDUCE gathers the partial results into a single result
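As a rough sketch of this map/reduce idea in Dask (the file glob and column name are hypothetical): a pure, per-partition function is mapped over each partition independently, and the partial results are then reduced.

    import dask.dataframe as dd

    ddf = dd.read_csv("events-*.csv")      # df1..dfN, one partition per file/block

    def f(partition):                      # pure, per-partition function, no side effects
        return partition["value"].sum()

    partials = ddf.map_partitions(f)       # MAP: f() runs on every partition in parallel
    result = partials.compute().sum()      # REDUCE: gather partial sums into one result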
What functions do we apply
where?
Distributed ML with Dask and Kubernetes
Directed
Acyclic
Graph
Basic DAG
from dask import delayed

@delayed
def add(x, y):
    return x + y

four = add(
    add(1, 1),
    add(1, 1)
)

four.compute()
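Nothing is executed while the graph is being built; add() just returns Delayed objects, and only four.compute() actually runs the tasks. If graphviz is installed, four.visualize() will render the task graph.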
Complex DAG
Dask makes scaling data
operations easy*
*YMMV
Distributed ML with Dask and Kubernetes
Why?
● Open Source
● De Facto Standard
● Proven at Scale
● Infrastructure as Configuration
● Modular & Extensible
● Efficiency
Example Architecture
[Diagram] Example architecture: Kubernetes across Node1–Node5, each node with CPU, DISK and GPU, running a mix of Jupyter, Airflow, Dask, Grafana and Spark pods
Kubernetes makes deployment
and orchestration easy and
efficient
Distributed ML with Dask and Kubernetes
Dask Cluster
worker:
  image:
    repository: eliiza/dsp-dask
    tag: latest
    pullPolicy: Always
  replicas: 10
  resources:
    limits:
      cpu: 2
      memory: 6G
    requests:
      cpu: 2
      memory: 6G
scheduler:
  image:
    repository: eliiza/dsp-dask
    tag: latest
    pullPolicy: Always
jupyter:
  enabled: false
$ helm upgrade --install dask-cluster stable/dask -f config.yml
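Once the release is up, a notebook or script attaches to the scheduler with a distributed Client. A minimal sketch, assuming the chart's default scheduler service name for a release called dask-cluster (check kubectl get svc for the real name and address):

    from dask.distributed import Client

    client = Client("tcp://dask-cluster-scheduler:8786")  # scheduler address is an assumption
    print(client)  # reports the workers, cores and memory the cluster exposes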
Demo Cluster
Demo: Dataframes
Distributed ML with Dask and Kubernetes
2GB             Local     Cluster   Speed Up
Counts          56.23     10.46     5.38
Market Share    50.60      9.46     5.35

10GB            Local     Cluster   Speed Up
Counts         429.69     73.74     5.83
Market Share   382.01     64.60     5.91
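The demo code itself isn't reproduced in the slides; a hedged sketch of the kind of work being timed above (the dataset path and column name are placeholders) could look like this, with the same code running locally or against the cluster:

    import dask.dataframe as dd

    ddf = dd.read_csv("s3://some-bucket/vehicles-*.csv")   # hypothetical source data

    counts = ddf["make"].value_counts()                    # the "Counts" benchmark
    market_share = counts / counts.sum()                   # the "Market Share" benchmark

    print(counts.compute().head())
    print(market_share.compute().head())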
Demo: Monte Carlo Simulation
Distributed ML with Dask and Kubernetes
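The speaker notes describe the Monte Carlo graph as a map step (run the simulation, count & sum) followed by a reduce step (aggregate, mean). A hedged sketch of that shape, with a stand-in simulation (estimating pi) since the actual simulation code isn't shown:

    import numpy as np
    from dask import delayed

    @delayed
    def simulate(n):
        # stand-in simulation: estimate pi from n random points in the unit square
        xy = np.random.uniform(size=(n, 2))
        return 4.0 * np.mean((xy ** 2).sum(axis=1) <= 1.0)

    batches = [simulate(10_000) for _ in range(10)]   # 100,000 iterations in total
    estimate = delayed(np.mean)(batches)              # the reduce step: aggregate and mean
    print(estimate.compute())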
Demo: Random Forest
Distributed ML with Dask and Kubernetes
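The random forest demo code isn't included in the deck. One common pattern for this, sketched here under the assumption that scikit-learn with the Dask joblib backend was used (the dataset and scheduler address are placeholders), is to let joblib fan the tree-building out to the Dask workers:

    import joblib
    from dask.distributed import Client        # importing this registers the "dask" joblib backend
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    client = Client("tcp://dask-cluster-scheduler:8786")
    X, y = make_classification(n_samples=100_000, n_features=20)

    model = RandomForestClassifier(n_estimators=500)
    with joblib.parallel_backend("dask"):      # individual trees are fitted on the workers
        model.fit(X, y)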
Question:
How do you know which model
architecture to use?
Distributed ML with Dask and Kubernetes
Answer:
Try random shit until shit looks
right
Answer:
Hyperparameter Search
Demo: RandomSearch & Dask
Distributed ML with Dask and Kubernetes
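Again the demo source isn't in the slides; a hedged sketch of random hyperparameter search fanned out over the cluster, reusing the same joblib/Dask pattern (parameter ranges, data and scheduler address are purely illustrative):

    import joblib
    from dask.distributed import Client
    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    client = Client("tcp://dask-cluster-scheduler:8786")
    X, y = make_classification(n_samples=50_000, n_features=20)

    search = RandomizedSearchCV(
        RandomForestClassifier(),
        param_distributions={"n_estimators": randint(50, 500),
                             "max_depth": randint(2, 20)},
        n_iter=50,
        cv=3,
    )
    with joblib.parallel_backend("dask"):   # every candidate fit runs on the Dask workers
        search.fit(X, y)
    print(search.best_params_)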
Learnings
Why no TensorFlow Love?
XGBoost
Fewer large nodes > many
small nodes
Diagnosing Graphs
What’s Next?
RAPIDS
Distributed ML with Dask and Kubernetes
TensorFlow
TF2 & AutoKeras:
Watch This Space
Thank You
@rayh
elz.ai/dask-ml
Questions?


Editor's Notes

  • #2: Thanks Derek & Melbourne Distributed
  • #3: So, in this talk, I'll briefly explain: machine learning; Dask, and how it works; Kubernetes. Then we will work through some examples (demo gods permitting). I'll be touching on a lot of disparate areas, so I will try and keep it relatively high level, but I'm going to assume at least some passing knowledge of these areas. Feel free to ask for clarification along the way, but please save the bigger questions for the end.
  • #4: First...
  • #7: So what does this mean in practice?
  • #8: Traditionally, humans create the logic; in ML, humans curate the data and the desired output state, and the machines derive the logic. As a side note, this doesn't remove the need for humans from the development process, it just shifts their role to one of data wrangling, curation and modelling of expected system output. The logic that is output by the ML training process can then be used at inference time.
  • #9: More generally, the difference is that an engineer turns requirements into logic, while a data scientist turns requirements into training/test data and expected output (labels). This approach can be applied to problems that are too hard for mere mortal engineers, such as object detection in images and robust reading (formerly known as OCR).
  • #10: Essentially, the power of machine learning is that it enables us to make predictions based on previous experience, without us humans having to necessarily understand the underlying relationships.
  • #11: First...
  • #12: Dask is a distributed processing library for Python. It provides a pandas-compatible API to easily perform operations on massive dataframes across many nodes
  • #13: But it doesn't support SQL, HDFS, Hive, etc.
  • #14: You don't have to completely rewrite your code or retrain to scale up.
  • #15: Imagine we have a set of pandas dataframes; you can think of them as sets of structured data, and they are broken up by date. These dataframes could be processed by many threads or processes at once, perhaps across many machines. With appropriate partitioning, this would allow for massive concurrency. So how can we process data in parallel?
  • #16: Imagine we have some linear function f() that we want to apply to all the data, that is to say, a function that is applied per element and has no side-effects or dependencies. We could send this function to each dataframe and apply, or "map", it in parallel. Once all those functions have been applied, we can gather, or "reduce", the results.
  • #17: So, how do we work out what functions to apply? Let's start with what a DAG is.
  • #19: Directed: flows in one direction. Acyclic: it doesn't have any loops. Graph: a general topology primitive. A directed acyclic graph (DAG) is commonly used to solve task-scheduling problems. By breaking complex tasks into a DAG, a scheduler can scale work across a cluster. Dask is a library for delayed task computation that makes use of directed graphs at its core.
  • #20: From https://matthewrocklin.com/blog/work/2018/02/09/credit-models-with-dask: @delayed def add(x, y): return x + y; four = add(add(1, 1), add(1, 1)); four.compute()
  • #21: Here we can see a larger DAG: it's clear that there is an opportunity for concurrency at the bottom, where operations have few or no dependencies. As the task nears completion, it is performing a simpler set of operations on a larger set of data, and there is less opportunity for concurrency. Ideally, we want to avoid "reducing" until as late as possible.
  • #22: If you take advantage of dask primitives (bags, arrays, dataframes, delayed functions), and keep in mind how your operation will be decomposed and distributed, you can, in some cases, achieve effectively linear scaling (see Monte Carlo).
  • #23: First...
  • #24: Google released this to the community; since then many people have contributed work to it, or to its ecosystem. It's becoming a de facto standard: every cloud provider has some kind of managed Kubernetes service. Kube can scale to large numbers of nodes and complex configurations. Desired infrastructure state is described in simple YAML files, and kube attempts to satisfy that state. If Kubernetes doesn't support something "out of the box" it can be extended through things like CRDs/Operators, CSI, etc. Instead of deploying & managing many clusters for different purposes (EMR, storage, API/web hosting, batch jobs), we can use a single underlying cluster and make more efficient use of the resources.
  • #25: We’re running Dask on Kubernetes here. This allows us to use the same underlying compute cluster for a variety of tasks such as notebooks (such as what you will see soon) and other compute (such as TensorFlow, Spark, etc)
  • #26: Node resources can be used for many purposes
  • #27: Now we get to the awesome
  • #29: This is the Helm config for deploying the Dask cluster. You can see we specify memory/CPU limits as well as the number of nodes we want. The underlying cluster will autoscale to accommodate the desired compute. We also have our custom Dask image here, which has a lot of Python packages pre-installed, as well as things like CUDA drivers, etc.
  • #30: Deploying using helm is pretty simple
  • #31: 10 nodes, 2 CPUs and 6GB each
  • #34: We could make this go even faster: more cores; convert from CSV to Parquet.
  • #36: This shows how Dask structured the DAG: apply lambda (i.e. run the simulation), get item, count & sum; then the reduce steps: aggregate counts & sums, mean.
  • #37: This shows how Dask structured the DAG: apply lambda (i.e. run the simulation), get item, count & sum; then the reduce steps: aggregate counts & sums, mean.
  • #38: 100,000 iterations, not 10,000. This took over a thousand seconds on a local low-power machine, but came down to 11s when running on a 128-core cluster (c5.4xlarge instances). The linearity fell off towards the end, as the time taken to distribute tasks and gather results was about 8 seconds.
  • #39: Logistic and XGBoost
  • #43: But that doesn't sound too good, so we use the fancier term.
  • #44: Or hyperparameter optimisation. There are a number of different algorithms, but for the general case, nothing beats RandomSearch (see Patrick's talk). No free lunch theorem.
  • #45: Or hyperparameter optimisation. There are a number of different algorithms, but for the general case, nothing beats RandomSearch (see Patrick's talk).
  • #49: TensorFlow support in Dask has been abandoned! TensorFlow is quite hard to scale, as we have to be quite explicit about how the graph scales onto multiple CPUs and GPUs. With TF2, we have distribution strategies that will make it easy to copy the graph to many nodes, process batches of training data on each node, and then combine the results.
  • #50: Dask-xgboost is broken on Kubernetes right now. While trying to get this to work, I realised the issue is being actively discussed; the last comment was from just a few days ago. This is bleeding-edge stuff.
  • #51: Running many pods on one large machine gives greater opportunity to burst and use under-utilised resources, whereas smaller nodes tend to remain under-utilised, as you can only fit a couple of pods on them.
  • #52: It can be hard to understand how code maps to graphs. You have to try different approaches (see Monte Carlo).
  • #55: Matthew Rocklin, who made Dask, now works for NVIDIA, and NVIDIA have created an "open" ecosystem for doing ML on GPUs. cuDNN sounds very interesting.
  • #59: We are quite heavy users of Keras & TensorFlow So...
  • #60: With TF2, we have distribution strategies that will make it easy to copy the graph to many nodes, process batches of training data on each node, and then combine the results. With AutoKeras we have a way of performing search across TF architectures; this is generally much easier to parallelize than the model itself. It currently uses pytorch.multiprocessing as a backend, and it seems possible to refactor this to use joblib, and thus dask.
  • #61: https://rapids.ai/index.html https://github.com/nvidia/nvidia-docker https://github.com/rapidsai/cudf