Distributed ML with
Dask & Kubernetes
Ray Hilton, Eliiza
@rayh @EliizaAI
Distributed ML with Dask and Kubernetes
What is Machine Learning?
[Diagram] Traditional Software: DATA + LOGIC → COMPUTE → OUTPUT
[Diagram] Machine Learning: DATA + OUTPUT → COMPUTE (lots and lots of) → LOGIC
[Diagram] Learning: TRAINING DATA + LABELS/OUTPUT → COMPUTE (lots and lots of) → LOGIC
[Diagram] Inference: RUNTIME DATA → COMPUTE (not much of) → OUTPUT
[Diagram] Engineering: BUSINESS REQUIREMENTS → ENGINEER’S BRAIN / DATA SCIENTIST’S BRAIN → LOGIC; Runtime: RUNTIME DATA → COMPUTE → OUTPUT
Make predictions based on
previous experience
Distributed ML with Dask and Kubernetes
What is Dask?
It’s like Spark,
but idiomatically Python
“Dask uses existing Python APIs and data structures to make it easy to switch between Numpy, Pandas, Scikit-learn to their Dask-powered equivalents.”
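The quote is easiest to see in code. Below is a minimal sketch (the file path and column names are made up) of how a pandas workflow maps onto dask.dataframe: the API is nearly identical, but Dask builds a lazy, partitioned computation that only runs when .compute() is called.

    import pandas as pd
    import dask.dataframe as dd

    # pandas: the whole file is loaded into memory at once
    pdf = pd.read_csv("trips.csv")
    print(pdf.groupby("vendor").fare.mean())

    # Dask: the same expression builds a lazy, partitioned dataframe instead
    ddf = dd.read_csv("trips-*.csv")                     # one partition per file/block
    print(ddf.groupby("vendor").fare.mean().compute())   # nothing runs until .compute()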
Distributed ML with Dask and Kubernetes
[Diagram] f(df) split across partitions: MAP applies f() to f(df1) … f(df5) in parallel; REDUCE gathers the partial results into a single result
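As a rough sketch of this map/reduce idea in Dask (the file glob and column name are hypothetical): a pure, per-partition function is mapped over each partition independently, and the partial results are then reduced.

    import dask.dataframe as dd

    ddf = dd.read_csv("events-*.csv")      # df1..dfN, one partition per file/block

    def f(partition):                      # pure, per-partition function, no side effects
        return partition["value"].sum()

    partials = ddf.map_partitions(f)       # MAP: f() runs on every partition in parallel
    result = partials.compute().sum()      # REDUCE: gather partial sums into one result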
What functions do we apply
where?
Distributed ML with Dask and Kubernetes
Directed
Acyclic
Graph
Basic DAG
from dask import delayed

@delayed
def add(x, y):
    return x + y

four = add(
    add(1, 1),
    add(1, 1)
)

four.compute()
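Nothing is executed while the graph is being built; add() just returns Delayed objects, and only four.compute() actually runs the tasks. If graphviz is installed, four.visualize() will render the task graph.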
Complex DAG
Dask makes scaling data
operations easy*
*YMMV
Distributed ML with Dask and Kubernetes
Why?
● Open Source
● De Facto Standard
● Proven at Scale
● Infrastructure as Configuration
● Modular & Extensible
● Efficiency
Example Architecture
[Diagram] Example architecture: Kubernetes across Node1–Node5, each node with CPU, DISK and GPU, running a mix of Jupyter, Airflow, Dask, Grafana and Spark pods
Kubernetes makes deployment
and orchestration easy and
efficient
Distributed ML with Dask and Kubernetes
Dask Cluster
worker:
  image:
    repository: eliiza/dsp-dask
    tag: latest
    pullPolicy: Always
  replicas: 10
  resources:
    limits:
      cpu: 2
      memory: 6G
    requests:
      cpu: 2
      memory: 6G
scheduler:
  image:
    repository: eliiza/dsp-dask
    tag: latest
    pullPolicy: Always
jupyter:
  enabled: false
$ helm upgrade --install dask-cluster stable/dask -f config.yml
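Once the release is up, a notebook or script attaches to the scheduler with a distributed Client. A minimal sketch, assuming the chart's default scheduler service name for a release called dask-cluster (check kubectl get svc for the real name and address):

    from dask.distributed import Client

    client = Client("tcp://dask-cluster-scheduler:8786")  # scheduler address is an assumption
    print(client)  # reports the workers, cores and memory the cluster exposes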
Demo Cluster
Demo: Dataframes
Distributed ML with Dask and Kubernetes
2GB             Local     Cluster   Speed Up
Counts          56.23     10.46     5.38
Market Share    50.60      9.46     5.35

10GB            Local     Cluster   Speed Up
Counts         429.69     73.74     5.83
Market Share   382.01     64.60     5.91
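The demo code itself isn't reproduced in the slides; a hedged sketch of the kind of work being timed above (the dataset path and column name are placeholders) could look like this, with the same code running locally or against the cluster:

    import dask.dataframe as dd

    ddf = dd.read_csv("s3://some-bucket/vehicles-*.csv")   # hypothetical source data

    counts = ddf["make"].value_counts()                    # the "Counts" benchmark
    market_share = counts / counts.sum()                   # the "Market Share" benchmark

    print(counts.compute().head())
    print(market_share.compute().head())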
Demo: Monte Carlo Simulation
Distributed ML with Dask and Kubernetes
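The speaker notes describe the Monte Carlo graph as a map step (run the simulation, count & sum) followed by a reduce step (aggregate, mean). A hedged sketch of that shape, with a stand-in simulation (estimating pi) since the actual simulation code isn't shown:

    import numpy as np
    from dask import delayed

    @delayed
    def simulate(n):
        # stand-in simulation: estimate pi from n random points in the unit square
        xy = np.random.uniform(size=(n, 2))
        return 4.0 * np.mean((xy ** 2).sum(axis=1) <= 1.0)

    batches = [simulate(10_000) for _ in range(10)]   # 100,000 iterations in total
    estimate = delayed(np.mean)(batches)              # the reduce step: aggregate and mean
    print(estimate.compute())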
Demo: Random Forest
Distributed ML with Dask and Kubernetes
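The random forest demo code isn't included in the deck. One common pattern for this, sketched here under the assumption that scikit-learn with the Dask joblib backend was used (the dataset and scheduler address are placeholders), is to let joblib fan the tree-building out to the Dask workers:

    import joblib
    from dask.distributed import Client        # importing this registers the "dask" joblib backend
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    client = Client("tcp://dask-cluster-scheduler:8786")
    X, y = make_classification(n_samples=100_000, n_features=20)

    model = RandomForestClassifier(n_estimators=500)
    with joblib.parallel_backend("dask"):      # individual trees are fitted on the workers
        model.fit(X, y)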
Question:
How do you know which model
architecture to use?
Distributed ML with Dask and Kubernetes
Answer:
Try random shit until shit looks
right
Answer:
Hyperparameter Search
Demo: RandomSearch & Dask
Distributed ML with Dask and Kubernetes
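Again the demo source isn't in the slides; a hedged sketch of random hyperparameter search fanned out over the cluster, reusing the same joblib/Dask pattern (parameter ranges, data and scheduler address are purely illustrative):

    import joblib
    from dask.distributed import Client
    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    client = Client("tcp://dask-cluster-scheduler:8786")
    X, y = make_classification(n_samples=50_000, n_features=20)

    search = RandomizedSearchCV(
        RandomForestClassifier(),
        param_distributions={"n_estimators": randint(50, 500),
                             "max_depth": randint(2, 20)},
        n_iter=50,
        cv=3,
    )
    with joblib.parallel_backend("dask"):   # every candidate fit runs on the Dask workers
        search.fit(X, y)
    print(search.best_params_)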
Learnings
Why no TensorFlow Love?
XGBoost
Fewer large nodes > many
small nodes
Diagnosing Graphs
What’s Next?
RAPIDS
Distributed ML with Dask and Kubernetes
TensorFlow
TF2 & AutoKeras:
Watch This Space
Thank You
@rayh
elz.ai/dask-ml
Questions?


Editor's Notes

  • #2: Thanks Derek & Melbourne Distributed
  • #3: So, in this talk, I'll briefly explain: machine learning; Dask, and how it works; Kubernetes. Then we will work through some examples (demo gods permitting). I'll be touching on a lot of disparate areas, so I will try and keep it relatively high level, but I'm going to assume at least some passing knowledge of these areas. Feel free to ask for clarification along the way, but please save the bigger questions for the end.
  • #4: First...
  • #7: So what does this mean in practice?
  • #8: Traditionally, humans create the logic; in ML, humans curate the data and the desired output state, and the machines derive the logic. As a side note, this doesn't remove the need for humans from the development process, it just shifts their role to one of data wrangling, curation and modelling of expected system output. The logic that is output by the ML training process can then be used at inference time.
  • #9: More generally, the difference is that an engineer turns requirements into logic, while a data scientist turns requirements into training/test data and expected output (labels). This approach can be applied to problems that are too hard for mere mortal engineers, such as object detection in images and robust reading (formerly known as OCR).
  • #10: Essentially, the power of machine learning is that it enables us to make predictions based on previous experience, without us humans having to necessarily understand the underlying relationships.
  • #11: First...
  • #12: Dask is a distributed processing library for Python. It provides a pandas-compatible API to easily perform operations on massive dataframes across many nodes
  • #13: But it doesn't support SQL, HDFS, Hive, etc.
  • #14: You don't have to completely rewrite your code or retrain to scale up.
  • #15: Imagine we have a set of pandas dataframes; you can think of them as sets of structured data, and they are broken up by date. These dataframes could be processed by many threads or processes at once, perhaps across many machines. With appropriate partitioning, this would allow for massive concurrency. So how can we process data in parallel?
  • #16: Imagine we have some linear function f() that we want to apply to all the data, that is to say, a function that is applied per element and has no side-effects or dependencies. We could send this function to each dataframe and apply, or "map", it in parallel. Once all those functions have been applied, we can gather, or "reduce", the results.
  • #17: So, how do we work out what functions to apply? Let's start with what a DAG is.
  • #19: Directed: flows in one direction. Acyclic: it doesn't have any loops. Graph: a general topology primitive. A directed acyclic graph (DAG) is commonly used to solve task-scheduling problems. By breaking complex tasks into a DAG, a scheduler can scale work across a cluster. Dask is a library for delayed task computation that makes use of directed graphs at its core.
  • #20: From https://matthewrocklin.com/blog/work/2018/02/09/credit-models-with-dask: @delayed def add(x, y): return x + y; four = add(add(1, 1), add(1, 1)); four.compute()
  • #21: Here we can see a larger DAG: it's clear that there is an opportunity for concurrency at the bottom, where operations have few or no dependencies. As the task nears completion, it is performing a simpler set of operations on a larger set of data, and there is less opportunity for concurrency. Ideally, we want to avoid "reducing" until as late as possible.
  • #22: If you take advantage of dask primitives (bags, arrays, dataframes, delayed functions), and keep in mind how your operation will be decomposed and distributed, you can, in some cases, achieve effectively linear scaling (see Monte Carlo).
  • #23: First...
  • #24: Google released this to the community; since then many people have contributed work to it, or to its ecosystem. It's becoming a de facto standard: every cloud provider has some kind of managed Kubernetes service. Kube can scale to large numbers of nodes and complex configurations. Desired infrastructure state is described in simple YAML files, and kube attempts to satisfy that state. If Kubernetes doesn't support something "out of the box" it can be extended through things like CRDs/Operators, CSI, etc. Instead of deploying & managing many clusters for different purposes (EMR, storage, API/web hosting, batch jobs), we can use a single underlying cluster and make more efficient use of the resources.
  • #25: We’re running Dask on Kubernetes here. This allows us to use the same underlying compute cluster for a variety of tasks such as notebooks (such as what you will see soon) and other compute (such as TensorFlow, Spark, etc)
  • #26: Node resources can be used for many purposes
  • #27: Now we get to the awesome
  • #29: This is the Helm config for deploying the Dask cluster. You can see we specify memory/CPU limits as well as the number of nodes we want. The underlying cluster will autoscale to accommodate the desired compute. We also have our custom Dask image here, which has a lot of Python packages pre-installed, as well as things like CUDA drivers, etc.
  • #30: Deploying using helm is pretty simple
  • #31: 10 nodes, 2 CPUs and 6GB each
  • #34: We could make this go even faster: more cores; convert from CSV to Parquet.
  • #36: This shows how Dask structured the DAG: apply lambda (i.e. run the simulation), get item, count & sum; then the reduce steps: aggregate counts & sums, mean.
  • #37: This shows how Dask structured the DAG: apply lambda (i.e. run the simulation), get item, count & sum; then the reduce steps: aggregate counts & sums, mean.
  • #38: 100,000 iterations, not 10,000. This took over a thousand seconds on a local low-power machine, but came down to 11s when running on a 128-core cluster (c5.4xlarge instances). The linearity fell off towards the end, as the time taken to distribute tasks and gather results was about 8 seconds.
  • #39: Logistic and XGBoost
  • #43: But that doesn't sound too good, so we use the fancier term.
  • #44: Or hyperparameter optimisation. There are a number of different algorithms, but for the general case, nothing beats RandomSearch (see Patrick's talk). No free lunch theorem.
  • #45: Or hyperparameter optimisation. There are a number of different algorithms, but for the general case, nothing beats RandomSearch (see Patrick's talk).
  • #49: TensorFlow support in Dask has been abandoned! TensorFlow is quite hard to scale, as we have to be quite explicit about how the graph scales onto multiple CPUs and GPUs. With TF2, we have distribution strategies that will make it easy to copy the graph to many nodes, process batches of training data on each node, and then combine the results.
  • #50: Dask-xgboost is broken on Kubernetes right now. While trying to get this to work, I realised the issue is being actively discussed; the last comment was from just a few days ago. This is bleeding-edge stuff.
  • #51: Running many pods on one large machine gives greater opportunity to burst and use under-utilised resources, whereas smaller nodes tend to remain under-utilised, as you can only fit a couple of pods on them.
  • #52: It can be hard to understand how code maps to graphs. You have to try different approaches (see Monte Carlo).
  • #55: Matthew Rocklin, who made Dask, now works for NVIDIA, and NVIDIA have created an "open" ecosystem for doing ML on GPUs. cuDNN sounds very interesting.
  • #59: We are quite heavy users of Keras & TensorFlow So...
  • #60: With TF2, we have distribution strategies that will make it easy to copy the graph to many nodes, process batches of training data on each node, and then combine the results. With AutoKeras we have a way of performing search across TF architectures; this is generally much easier to parallelize than the model itself. It currently uses pytorch.multiprocessing as a backend, and it seems possible to refactor this to use joblib, and thus dask.
  • #61: https://rapids.ai/index.html https://github.com/nvidia/nvidia-docker https://github.com/rapidsai/cudf