© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Arun Gupta, @arungupta
Machine Learning using Kubeflow and Kubernetes
https://dilbert.com/strip/2013-02-02
Supervised/unsupervised/reinforcement learning …
Data sourcing, cleanup, tagging & classification
Linear/logistic regression, Random forest, Decision tree, …
Linear algebra, Statistics, Probability
TensorFlow, PyTorch, MXNet, Caffe2, Keras, SciKit-Learn, …
Python, Julia, R, …
Training and evaluating models
Distributed training
IntelliJ, VSCode, PyCharm, Jupyter notebook
Hyperparameter Tuning
GPU or CPU
MLOps
https://happykaty.com/2018/05/15/drinking-from-a-fire-hose/
Machine Learning is Hard!
Machine Learning 101
The AWS ML Stack
Broadest and deepest set of capabilities
AI Services
• Categories: Vision, Speech, Language, Chatbots, Forecasting, Recommendations
• Services: Polly, Transcribe, Translate, Comprehend & Comprehend Medical, Lex, Forecast, Rekognition Image, Rekognition Video, Textract, Personalize
ML Services
• Amazon SageMaker: Ground Truth, Notebooks, Algorithms + Marketplace, Reinforcement Learning, Training, Optimization, Deployment, Hosting
ML Frameworks + Infrastructure
• Columns: Frameworks, Interfaces, Infrastructure
• FPGAs, EC2 P3 & P3dn, EC2 G4, EC2 C5, Inferentia, Greengrass, Elastic Inference, DL Containers & AMIs, Elastic Kubernetes Service, Elastic Container Service
Storage and Analytics for Machine Learning
Analytics
• Amazon Redshift + Redshift Spectrum
• Amazon QuickSight
• Amazon EMR (Hadoop, Spark, Presto, Pig, Hive…19 total)
• Amazon Athena
• Amazon Kinesis
• Amazon Elasticsearch Service
• AWS Glue
Storage
• Amazon S3 Standard
• Amazon S3 Standard-IA
• Amazon S3 One Zone-IA
• Amazon S3 Intelligent-Tiering (new)
• Amazon Glacier
• Amazon S3 Glacier Deep Archive (new)
• Amazon EBS
Why Machine Learning on Kubernetes?
Composability, Portability, Scalability
On-premises, Cloud
http://www.shutterstock.com/gallery-635827p1.html
Machine Learning on K8s: Without Kubeflow
@aronchik
Machine Learning on K8s: With Kubeflow
@aronchik
What’s in Kubeflow?
Amazon EKS: run Kubernetes in the cloud
Managed Kubernetes control plane, attach data plane
Native upstream Kubernetes experience
Platform for enterprises to run production-grade workloads
Integrates with additional AWS services
Getting started with Amazon EKS
eksctl CLI—create Amazon EKS clusters (eksctl.io)
Creates all resources needed for the cluster
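As a rough sketch of the eksctl step above, the following Python helper assembles an `eksctl create cluster` command line. The cluster name, region, instance type, and node count are placeholder values chosen for illustration, not values from the talk. The helper only builds the argument list; running it (for example with `subprocess.run`) is left to the caller.

```python
import shlex


def build_eksctl_command(name, region, node_type, nodes):
    """Return the argv list for creating an EKS cluster with eksctl.

    All values are caller-supplied placeholders; nothing is executed here.
    """
    return [
        "eksctl", "create", "cluster",
        "--name", name,
        "--region", region,
        "--node-type", node_type,
        "--nodes", str(nodes),
    ]


# Hypothetical values for illustration only.
command = build_eksctl_command("kubeflow-demo", "us-west-2", "p3.2xlarge", 2)
print(shlex.join(command))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if it is later passed to `subprocess.run`.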
Set up K8s for ML: Option 1
[Diagram: Data, Train, Trained model, and Inference, connected by steps 1-4]
Set up K8s for ML: Option 2a
Train & inference
[Diagram: three nodes labeled role: train and two labeled role: inference; Data and Trained model connected by steps 1-4]
Set up K8s for ML: Option 2b
Train, inference, & applications
[Diagram: three nodes labeled role: train, two labeled role: inference, and two labeled role: apps]
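One way to steer workloads onto the labeled node groups in Options 2a and 2b is a Kubernetes `nodeSelector`. The sketch below builds a Pod manifest as a plain Python dict. It assumes the nodes carry a `role` label with the values shown on the slides; the pod and image names are hypothetical.

```python
import json


def pod_for_role(name, image, role):
    """Build a minimal Pod manifest that only schedules onto nodes labeled role=<role>."""
    if role not in {"train", "inference", "apps"}:
        raise ValueError(f"unknown role: {role}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # nodeSelector restricts scheduling to nodes whose labels match.
            "nodeSelector": {"role": role},
            "containers": [{"name": name, "image": image}],
        },
    }


# Hypothetical pod and image names.
train_pod = pod_for_role("mnist-train", "example.com/mnist-train:latest", "train")
print(json.dumps(train_pod, indent=2))
```

The same helper with `role="inference"` or `role="apps"` would target the other node groups.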
Scaling the cluster
Kubeflow Requirements
4 CPU, 12 GB memory, 50 GB storage
kubeflow.org/docs/started/k8s/overview/
Kubeflow on Desktop
MiniKF: Local Kubeflow deployment using VirtualBox and Vagrant
• Minikube -> Kubernetes
• MiniKF -> Kubeflow (includes minikube)
Runs on macOS, Linux, and Windows
Does not require k8s-specific knowledge
Kubeflow on Cloud
Major cloud providers supported
Choices on Amazon Web Services
• Self-managed k8s on EC2: Kops, CloudFormation, Terraform
• Amazon EKS
Getting Started with Kubeflow on Amazon EKS
Jupyter Notebook
Web application to build, deploy, and train ML models
Create and share documents that contain live code, equations, visualizations, and narrative text
40+ programming languages
Use cases
• Data cleaning and transformation
• Numerical simulation
• Data visualization
• Machine learning
Training using Jupyter Notebook
Kubeflow Fairing
Python SDK to build, train, and deploy ML models remotely
Goals:
• Easily package ML training jobs
• Train ML models in the cloud
• Streamline the process of deploying a trained model
https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/02_Fairing/02_06_fairing_e2e.ipynb
Setup Kubeflow Fairing for training and prediction
Train an XGBoost model remotely on Kubeflow
Deploy the trained model to Kubeflow for prediction
Call the prediction endpoint
Hyperparameter Tuning using Katib
Hyperparameters are parameters external to the model that control the training
e.g. learning rate, batch size, epochs
Tuning finds a set of hyperparameters that optimizes an objective function
e.g. find the batch size and learning rate that maximize prediction accuracy
Katib enables hyperparameter tuning in Kubeflow
Credits @Richard Liu @Johnu (Kubeflow slack)
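To make the idea of tuning against an objective function concrete, here is a minimal sketch in plain Python. It does not use Katib. The `toy_accuracy` function is a made-up stand-in for "train a model and measure its accuracy", and exhaustive grid search is just one simple strategy among several.

```python
import itertools


def toy_accuracy(learning_rate, batch_size):
    """Stand-in objective that peaks at learning_rate=0.01, batch_size=64.

    A real objective would train a model and return its validation accuracy.
    """
    lr_penalty = abs(learning_rate - 0.01) * 10
    batch_penalty = abs(batch_size - 64) / 1000
    return 0.95 - lr_penalty - batch_penalty


def grid_search(objective, search_space):
    """Evaluate every combination in search_space and return the best one."""
    names = list(search_space)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


space = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
best_params, best_score = grid_search(toy_accuracy, space)
print(best_params, round(best_score, 3))
```

Grid search evaluates all nine combinations here; with more hyperparameters the number of combinations grows multiplicatively, which is why sampling-based strategies are often preferred.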
Katib Concepts
Extensible
Framework agnostic: TensorFlow, PyTorch, MXNet, …
Customizable algorithm backend
Experiment: “optimization loop” for some specific problem
Suggestion: a proposed solution to the problem
Trial: one iteration of the loop
Job: evaluate a trial and calculate objective value
Katib System Architecture
[Diagram components: Create Experiment, Experiment CR, Experiment Controller, Suggestion CR, Suggestion Controller, Hyperparameters, Trial CR, Trial Controller, Trials, Metrics, Metric DB]
Experiment = CreateExperimentCR()
while not Experiment.Objective reached:
    Suggestion = CreateSuggestionCR()
    HyperParameters = Suggestion.Assignments
    Metrics = CreateTrialCR(HyperParameters)
    ReportMetrics(Metrics)
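The control loop in the pseudocode can be mimicked in a single Python process. This is a simulation under stated assumptions, not Katib code: `suggest` stands in for the Suggestion step with random choices, `run_trial` stands in for a Trial's job with a made-up accuracy formula, and a plain list plays the role of the metric store. The `max_trials` cap is an added safeguard that the pseudocode does not show.

```python
import random


def suggest(rng):
    """Stand-in for a Suggestion: propose one set of hyperparameter assignments."""
    return {
        "learning_rate": rng.choice([0.001, 0.01, 0.1]),
        "batch_size": rng.choice([32, 64, 128]),
    }


def run_trial(assignments):
    """Stand-in for a Trial's job: return a toy accuracy for the assignments."""
    lr_penalty = abs(assignments["learning_rate"] - 0.01) * 10
    batch_penalty = abs(assignments["batch_size"] - 64) / 1000
    return 0.95 - lr_penalty - batch_penalty


def run_experiment(goal, max_trials, seed=0):
    """Loop suggest -> trial -> report until the goal is met or trials run out."""
    rng = random.Random(seed)
    metric_db = []  # plays the role of the metric store
    best = None
    for _ in range(max_trials):
        assignments = suggest(rng)
        accuracy = run_trial(assignments)
        metric_db.append((assignments, accuracy))
        if best is None or accuracy > best[1]:
            best = (assignments, accuracy)
        if accuracy >= goal:
            break
    return best, metric_db


best, history = run_experiment(goal=0.9, max_trials=20)
print(f"trials run: {len(history)}, best: {best}")
```

The loop stops as soon as one trial reaches the goal, mirroring the `while not Experiment.Objective reached` condition.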
Experiment CR Hyperparameters
https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/08_Hyperparameter_Tuning/random-search-example.yaml
Trial template
Kubeflow KFServing
Simple and pluggable platform for ML inference
Intuitive and consistent experience
Serving models on arbitrary frameworks
e.g. TensorFlow, XGBoost, SciKitLearn
Encapsulates GPU auto-scaling, canary rollouts
Credits @ellis-bigelow (Kubeflow slack)
KFServing Custom Resource
[Annotated manifest: S3 secret attached to Service Account; Trained model]
https://github.com/kubeflow/kfserving/blob/master/docs/samples/s3/tensorflow_s3.yaml
Pluggable Interface
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  default:
    sklearn:
      storageUri: "gs://kfserving-samples/models/sklearn/iris"

apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "InferenceService"
metadata:
  name: "flowers-sample"
spec:
  default:
    tensorflow:
      storageUri: "gs://kfserving-samples/models/tensorflow/flowers"

apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "KFService"
metadata:
  name: "pytorch-cifar10"
spec:
  default:
    pytorch:
      storageUri: "gs://kfserving-samples/models/pytorch/cifar10"
      modelClassName: "Net"
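The three manifests differ only in the framework key and its options, which is what makes the interface pluggable. A small sketch of that pattern follows: a hypothetical helper (not part of KFServing) that produces a dict with the same shape as the YAML above. The `kind` default follows the first two manifests and can be overridden.

```python
def inference_service(name, framework, storage_uri, kind="InferenceService",
                      **framework_options):
    """Build a manifest dict shaped like the slide's YAML for one framework."""
    supported = {"sklearn", "tensorflow", "pytorch"}
    if framework not in supported:
        raise ValueError(f"unsupported framework: {framework}")
    # Framework-specific settings sit next to storageUri under the framework key.
    framework_spec = {"storageUri": storage_uri, **framework_options}
    return {
        "apiVersion": "serving.kubeflow.org/v1alpha1",
        "kind": kind,
        "metadata": {"name": name},
        "spec": {"default": {framework: framework_spec}},
    }


sklearn_svc = inference_service(
    "sklearn-iris", "sklearn", "gs://kfserving-samples/models/sklearn/iris")
pytorch_svc = inference_service(
    "pytorch-cifar10", "pytorch", "gs://kfserving-samples/models/pytorch/cifar10",
    modelClassName="Net")
print(sklearn_svc["spec"])
print(pytorch_svc["spec"])
```

Swapping frameworks changes one argument while the surrounding structure stays the same.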
KFServing Interface – Scikit Learn
apiVersion: "serving.kubeflow.org/v1alpha1"
kind: "KFService"
metadata:
  name: "sklearn-iris"
spec:
  default:
    sklearn:
      storageUri: "gs://kfserving-samples/models/sklearn/iris"
    serviceAccount: inferencing-robot
    minReplicas: 3
    maxReplicas: 10
    resources:
      requests:
        cpu: 2
        gpu: 1
        memory: 10Gi
  canaryTrafficPercent: 25
  canary:
    sklearn:
      storageUri: "gs://kfserving-samples/models/sklearn/iris-v2"
    serviceAccount: inferencing-robot
    minReplicas: 3
    maxReplicas: 10
    resources:
      requests:
        cpu: 2
        gpu: 1
        memory: 10Gi
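The `canaryTrafficPercent: 25` setting means roughly a quarter of requests should reach the canary model and the rest the default. The simulation below only illustrates that percentage semantics; it is not how KFServing routes traffic, and the function name is made up for this sketch.

```python
import random


def pick_backend(canary_traffic_percent, rng):
    """Return 'canary' for about canary_traffic_percent% of calls, else 'default'."""
    return "canary" if rng.uniform(0, 100) < canary_traffic_percent else "default"


rng = random.Random(42)
counts = {"default": 0, "canary": 0}
for _ in range(10_000):
    counts[pick_backend(25, rng)] += 1

print(counts)
```

Raising the percentage step by step, while watching the canary's metrics, is the usual way such a rollout is promoted.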
Distributed Training using Horovod
Created by Uber, hosted by LF AI Foundation
Distributed training framework for TensorFlow, Keras, PyTorch, and MXNet
Compared to distributed TensorFlow
• Far fewer code changes
• ~2x faster
Examples at Uber: Self-driving vehicles, fraud detection, and trip forecasting
Named after a traditional Russian dance
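Horovod's own API is not shown on the slide, so the following does not use it. It is a single-process illustration of the data-parallel idea behind such frameworks: each worker computes a gradient on its own data shard, the gradients are averaged across workers (an allreduce), and every worker applies the same update. The dataset, model, and learning rate are toy assumptions.

```python
def local_gradient(weight, shard):
    """Gradient of mean squared error for y = weight * x on one worker's shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Average one value per worker and hand every worker the same result."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


# Toy dataset for y = 3x, split evenly across two simulated workers.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[:2], data[2:]]

weight = 0.0
learning_rate = 0.05
for _ in range(100):
    grads = [local_gradient(weight, shard) for shard in shards]
    averaged = allreduce_mean(grads)
    # Every worker applies the same averaged gradient, so replicas stay in sync.
    weight -= learning_rate * averaged[0]

print(round(weight, 4))
```

Because the shards are equal in size, the averaged gradient equals the gradient over the full dataset, so the result matches single-worker training.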
Kubeflow Pipelines
Compose, deploy, and manage end-to-end ML workflows
• End-to-end orchestration
• Easy, rapid, and reliable experimentation
• Easy re-use
Built using Pipelines SDK
• kfp.compiler, kfp.components, kfp.Client
Uses Argo under the hood to orchestrate resources
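The core job of a workflow engine is running steps in dependency order and passing outputs downstream. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+). It is not the Pipelines SDK and does not involve Argo or containers; the step names and functions are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies):
    """Run each step after all of its upstream steps, passing their results along."""
    results = {}
    # static_order() yields every step after the steps it depends on.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = steps[name](*upstream)
    return results


# Hypothetical three-step workflow.
steps = {
    "preprocess": lambda: [1.0, 2.0, 3.0],
    "train": lambda data: sum(data) / len(data),  # the "model" is just the mean
    "evaluate": lambda model: abs(model - 2.0) < 1e-9,
}
dependencies = {
    "preprocess": (),
    "train": ("preprocess",),
    "evaluate": ("train",),
}

results = run_pipeline(steps, dependencies)
print(results)
```

In a real pipeline each step would run as its own containerized task, which is where re-use across workflows comes from.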
Kubeflow Pipelines Platform
UI for managing and tracking experiments, jobs, and runs
Engine for scheduling multi-step ML workflows
SDK for defining and manipulating pipelines and components
• kubeflow-pipelines.readthedocs.io/en/latest/
Notebooks for interacting with the system using the SDK
Creating Kubeflow Pipeline Components
Consumer Loan Acceptance Scoring
Objective
• Putting hundreds of data products live
• Single development -> deployment -> delivery environment
• First go batch, then real-time
Analytics environment
• AWS
• High security and compliance with regulation
Typical modeling context
• Structured data
• Supervised learning
• Internalizing interpretable models and hybrid pipelines
www.credo.be
Requirements
Data Scientists
• Hybrid, integrated, cloud-based dev env
• Python
• PySpark (locally + remotely on Spark cluster)
• R
• SQL
• Version control (scripts & artifacts)
ML DevOps
• Seamless deployment of hybrid pipelines
• Trigger-based scheduling & orchestration of runs
• Monitoring & dashboard
• Version control (runs & pipelines)
Architecture
AWS
Infrastructure, connections, security
S3, Spark cluster, VMs, …
Amazon EKS Amazon ECR
Kubeflow 0.6
Notebook dev environment
Pipelines for dev & delivery
ElasticStack
Dashboarding
Custom notebook servers
Storage: FSx for Lustre
High-performance file system for processing Amazon S3 or on-premises data
Low latency and high throughput
Works natively with Amazon S3
Container Storage Interface (CSI) driver
Best Practices for Optimizing Distributed Deep Learning Performance on Amazon EKS
https://aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
Advantages of Kubeflow on AWS
EKS cluster provision with
External traffic with
to manage Lustre file system
Centralized and unified K8s logs in
TLS and Auth with and
for your K8s API server endpoint
Detect GPU instance and install
kubeflow.org/docs/aws
Kanban Board
github.com/orgs/kubeflow/projects/25
Application Requirements 1.0
https://github.com/kubeflow/community/blob/master/guidelines/application_requirements.md
Machine Learning pipeline
• Collect & prepare training data
• Choose and optimize your ML algorithm
• Set up and manage environments for training
• Train and tune model (trial and error)
• Deploy model in production
• Scale & manage environment in production
Machine Learning pipeline for Kubernetes on AWS
• Collect & prepare training data: EMR, Redshift, S3
• Choose and optimize your ML algorithm: Linear regression, decision tree, BYOA
• Set up and manage environments for training: GPU- and CPU-based clusters, *operators (TensorFlow, MXNet, …)
• Train and tune model (trial and error): TensorFlow, Horovod, MXNet, PyTorch, Keras, …
• Deploy model in production: TensorFlow Serving, MXNet Model Server, Seldon, …
• Scale & manage environment in production: EKS or self-managed K8s
References
Workshop: eksworkshop.com/kubeflow
Jupyter notebooks: github.com/aws-samples/eks-kubeflow-workshop/
Optimizing Machine Learning performance: aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/

Machine Learning using Kubeflow and Kubernetes

  • 1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Arun Gupta, @arungupta Machine Learning using Kubeflow and Kubernetes
  • 2. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. https://dilbert.com/strip/2013-02-02
  • 3. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Supervised/unsupervised/reinforcement learning … Data sourcing, cleanup, tagging & classification Linear/logistic regression, Random forest, Decision tree, … Linear algebra, Statistics, Probability TensorFlow, PyTorch, MXNet, Caffe2, Keras, SciKit-Learn, … Python, Julia, R, … Training and evaluating models Distributed training IntelliJ, VSCode, PyCharm, Jupyter notebook Hyperparameter Tuning GPU or CPU MLOps https://happykaty.com/2018/05/15/drinking-from-a-fire-hose/ Machine Learning is Hard!
  • 4. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Machine Learning 101
  • 5. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved. FRAMEWORKS INTERFACES INFRASTRUCTURE AI Services Broadest and deepest set of capabilities T H E AW S M L S TA C K VISION SPEECH LANGUAGE CHATBOTS FORECASTING RECOMMENDATIONS ML Services ML Frameworks + Infrastructure P O L L Y T R A N S C R I B E T R A N S L A T E C O M P R E H E N D & C O M P R E H E N D M E D I C A L L E X F O R E C A S TR E K O G N I T I O N I M A G E R E K O G N I T I O N V I D E O T E X T R A C T P E R S O N A L I Z E Ground Truth Notebooks Algorithms + Marketplace Reinforcement Learning Training Optimization Deployment HostingAmazon SageMaker F P G A SE C 2 P 3 & P 3 D N E C 2 G 4 E C 2 C 5 I N F E R E N T I AG R E E N G R A S S E L A S T I C I N F E R E N C E D L C O N T A I N E R S & A M I s E L A S T I C K U B E R N E T E S S E R V I C E E L A S T I C C O N T A I N E R S E R V I C E
  • 9. Storage and Analytics for Machine Learning. Analytics: Amazon Redshift + Redshift Spectrum, Amazon QuickSight, Amazon EMR (Hadoop, Spark, Presto, Pig, Hive… 19 total), Amazon Athena, Amazon Kinesis, Amazon Elasticsearch Service, AWS Glue. Storage: Amazon S3 Standard, Amazon S3 Standard-IA, Amazon S3 One Zone-IA, Amazon Glacier, Amazon S3 Intelligent-Tiering (new), Amazon S3 Glacier Deep Archive (new), Amazon EBS.
  • 10. Why Machine Learning on Kubernetes? Composability, portability, scalability; on-premises and cloud. Image: http://www.shutterstock.com/gallery-635827p1.html
  • 11. Machine Learning on K8s: Without KubeFlow @aronchik
  • 12. Machine Learning on K8s: With KubeFlow @aronchik
  • 14. What’s in KubeFlow?
  • 15. Amazon EKS: run Kubernetes in the cloud. Managed Kubernetes control plane, attach data plane. Native upstream Kubernetes experience. Platform for enterprises to run production-grade workloads. Integrates with additional AWS services.
  • 16. Getting started with Amazon EKS. eksctl CLI: create Amazon EKS clusters (eksctl.io). Creates all resources needed for the cluster.
  • 17. Set up K8s for ML: Option 1. [Diagram: Train, Inference, Data, Trained model; numbered steps 1-4]
  • 18. Set up K8s for ML: Option 2a. Train & inference. [Diagram: nodes labeled role: train (x3) and role: inference (x2); Data, Trained model; numbered steps 1-4]
  • 19. Set up K8s for ML: Option 2b. Train, inference, & applications. [Diagram: nodes labeled role: train (x3), role: inference (x2), role: apps (x2)]
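One way to get nodes labeled by role, as in Options 2a and 2b, is an eksctl cluster config with one node group per role. The sketch below builds such a config in Python and prints it as JSON (which is also valid YAML). The cluster name, region, instance types, and group names are assumptions chosen for illustration; they do not come from the slides.

```python
import json


def node_group(role, instance_type, count):
    """Build one eksctl nodeGroup entry whose nodes carry a `role` label."""
    return {
        "name": f"{role}-nodes",          # hypothetical naming scheme
        "instanceType": instance_type,
        "desiredCapacity": count,
        "labels": {"role": role},         # matches the role: ... labels on the slides
    }


def cluster_config(name, region, groups):
    """Assemble an eksctl ClusterConfig document as a plain dict."""
    return {
        "apiVersion": "eksctl.io/v1alpha5",
        "kind": "ClusterConfig",
        "metadata": {"name": name, "region": region},
        "nodeGroups": groups,
    }


# Assumed values: 3 training nodes, 2 inference nodes, 2 application nodes.
config = cluster_config(
    "ml-cluster",
    "us-west-2",
    [
        node_group("train", "p3.8xlarge", 3),
        node_group("inference", "c5.2xlarge", 2),
        node_group("apps", "m5.large", 2),
    ],
)

print(json.dumps(config, indent=2))
```

Saved to a file, the output could be passed to `eksctl create cluster -f <file>`; pods can then be pinned to a group with a `nodeSelector` on the `role` label.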
  • 20. Scaling the cluster
  • 21. Kubeflow Requirements: 4 CPU, 12 GB memory, 50 GB storage. kubeflow.org/docs/started/k8s/overview/
  • 22. Kubeflow on Desktop. MiniKF: local Kubeflow deployment using VirtualBox and Vagrant (Minikube -> Kubernetes; MiniKF -> Kubeflow, includes minikube). Runs on macOS, Linux, and Windows. Does not require k8s-specific knowledge.
  • 23. Kubeflow on Cloud. Major cloud providers supported. Choices on Amazon Web Services: self-managed k8s on EC2 (Kops, CloudFormation, Terraform) or Amazon EKS.
  • 24. Getting Started with Kubeflow on Amazon EKS
  • 25. Jupyter Notebook. Web application to build, deploy, and train ML models. Create and share documents that contain live code, equations, visualizations, and narrative text. 40+ programming languages. Use cases: data cleaning and transformation, numerical simulation, data visualization, machine learning.
  • 26. Training using Jupyter Notebook
  • 27. Kubeflow Fairing. Python SDK to build, train, and deploy ML models remotely. Goals: easily package ML training jobs; train ML models in the cloud; streamline the process of deploying a trained model.
  • 28. Set up Kubeflow Fairing for training and prediction. https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/02_Fairing/02_06_fairing_e2e.ipynb
  • 29. Train an XGBoost model remotely on Kubeflow. Deploy the trained model to Kubeflow for prediction.
  • 30. Call the prediction endpoint
  • 31. Hyperparameter Tuning using Katib. Hyperparameters are parameters external to the model that control training, e.g. learning rate, batch size, epochs. Tuning finds a set of hyperparameters that optimizes an objective function, e.g. find the batch size and learning rate that maximize prediction accuracy. Katib enables hyperparameter tuning in Kubeflow. Credits: @Richard Liu, @Johnu (Kubeflow slack).
  • 32. Katib Concepts. Extensible. Framework agnostic: TensorFlow, PyTorch, MXNet, … Customizable algorithm backend. Experiment: “optimization loop” for some specific problem. Suggestion: a proposed solution to the problem. Trial: one iteration of the loop. Job: evaluate a trial and calculate objective value.
  • 33. Katib System Architecture. [Diagram: Create Experiment -> Experiment CR -> Experiment Controller; Suggestion CR -> Suggestion Controller; Trial CR -> Trial Controller -> Trials; Hyperparameters; Metrics -> Metric DB] Control loop:
      Experiment = CreateExperimentCR()
      while not Experiment.Objective reached:
          Suggestion = CreateSuggestionCR()
          HyperParameters = Suggestion.Assignments
          Metrics = CreateTrialCR(HyperParameters)
          ReportMetrics(Metrics)
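The control loop above can be simulated in plain Python to show how Experiment, Suggestion, Trial, and the metric store relate. This is only a sketch: it does not call Katib, the function names mirror the pseudocode rather than any real API, random search is assumed as the suggestion algorithm, and `create_trial` is a hypothetical stand-in for a real training job.

```python
import random

# Assumed search space: (low, high) bounds per hyperparameter.
SEARCH_SPACE = {"lr": (0.01, 0.05), "momentum": (0.5, 0.99)}


def create_suggestion(space, rng):
    """Suggestion: propose one value per hyperparameter (random search)."""
    return {name: rng.uniform(low, high) for name, (low, high) in space.items()}


def create_trial(params):
    """Trial: hypothetical stand-in for a training job.

    Returns an accuracy-like score that is highest near lr=0.03, momentum=0.9.
    """
    lr_penalty = abs(params["lr"] - 0.03) / 0.05
    momentum_penalty = abs(params["momentum"] - 0.9) / 0.5
    return 1.0 - (lr_penalty + momentum_penalty) / 2


def run_experiment(space, goal, max_trials, seed=0):
    """Experiment: loop until the objective goal or the trial budget is reached."""
    rng = random.Random(seed)
    metric_db = []  # stands in for the Metric DB: (hyperparameters, metric) pairs
    best = None
    while len(metric_db) < max_trials:
        params = create_suggestion(space, rng)
        metric = create_trial(params)
        metric_db.append((params, metric))  # ReportMetrics(...)
        if best is None or metric > best[1]:
            best = (params, metric)
        if best[1] >= goal:
            break
    return best, metric_db


best, metric_db = run_experiment(SEARCH_SPACE, goal=0.95, max_trials=50)
print(f"trials run: {len(metric_db)}, best metric: {best[1]:.3f}")
print("best hyperparameters:", best[0])
```

In Katib itself each trial runs as a Kubernetes job and the controllers reconcile the custom resources; the loop here only mirrors the data flow.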
  • 34. Experiment CR. Hyperparameters. https://github.com/aws-samples/eks-kubeflow-workshop/blob/master/notebooks/08_Hyperparameter_Tuning/random-search-example.yaml
  • 35. Trial template
  • 37. Kubeflow KFServing. Simple and pluggable platform for ML inference. Intuitive and consistent experience. Serving models on arbitrary frameworks, e.g. TensorFlow, XGBoost, SciKitLearn. Encapsulates GPU auto-scaling, canary rollouts. Credits: @ellis-bigelow (Kubeflow slack).
  • 39. KFServing Custom Resource. [Screenshot callouts: S3 secret attached to Service Account; trained model] https://github.com/kubeflow/kfserving/blob/master/docs/samples/s3/tensorflow_s3.yaml
  • 40. Pluggable Interface
      apiVersion: "serving.kubeflow.org/v1alpha1"
      kind: "InferenceService"
      metadata:
        name: "sklearn-iris"
      spec:
        default:
          sklearn:
            storageUri: "gs://kfserving-samples/models/sklearn/iris"
      ---
      apiVersion: "serving.kubeflow.org/v1alpha1"
      kind: "InferenceService"
      metadata:
        name: "flowers-sample"
      spec:
        default:
          tensorflow:
            storageUri: "gs://kfserving-samples/models/tensorflow/flowers"
      ---
      apiVersion: "serving.kubeflow.org/v1alpha1"
      kind: "KFService"
      metadata:
        name: "pytorch-cifar10"
      spec:
        default:
          pytorch:
            storageUri: "gs://kfserving-samples/models/pytorch/cifar10"
            modelClassName: "Net"
  • 41. KFServing Interface – Scikit Learn apiVersion: "serving.kubeflow.org/v1alpha1" kind: "KFService" metadata: name: "sklearn-iris" spec: default: sklearn: storageUri: "gs://kfserving-samples/models/sklearn/iris" serviceAccount: inferencing-robot minReplicas: 3 maxReplicas: 10 resources: requests: cpu: 2 gpu: 1 memory: 10Gi canaryTrafficPercent: 25 canary: sklearn: storageUri: "gs://kfserving-samples/models/sklearn/iris-v2" serviceAccount: inferencing-robot minReplicas: 3 maxReplicas: 10 resources: requests: cpu: 2 gpu: 1 memory: 10Gi
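The `canaryTrafficPercent: 25` field asks for roughly a quarter of requests to reach the canary model (`iris-v2`) while the rest stay on the default. The sketch below illustrates that weighted split in plain Python. It is a conceptual simulation only: in a real deployment the serving layer does the routing, and the `route` function is a hypothetical name introduced here.

```python
import random


def route(canary_percent, rng):
    """Pick a backend for one request, sending about canary_percent% to the canary."""
    return "canary" if rng.random() * 100 < canary_percent else "default"


rng = random.Random(42)  # fixed seed so the simulation is repeatable
counts = {"default": 0, "canary": 0}
total_requests = 10_000

for _ in range(total_requests):
    counts[route(25, rng)] += 1

canary_share = counts["canary"] / total_requests
print(f"default: {counts['default']}, canary: {counts['canary']}")
print(f"canary share: {canary_share:.1%}")
```

Raising the percentage step by step, and setting it back to 0 if the canary misbehaves, is the usual way such a rollout is driven.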
  • 42. Distributed Training using Horovod. Created by Uber, hosted by LF AI Foundation. Distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. Compared to distributed TensorFlow: far fewer code changes, ~2x faster. Examples at Uber: self-driving vehicles, fraud detection, and trip forecasting. Named after a traditional Russian dance.
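Horovod follows the data-parallel pattern: each worker computes gradients on its own shard of the data, the gradients are averaged across workers with an allreduce, and every worker applies the same update. The sketch below shows only that averaging idea in plain Python. It does not use the Horovod API, and the gradient values, learning rate, and function names are made up for illustration.

```python
def allreduce_average(worker_grads):
    """Average gradients element-wise across workers (the result every worker sees)."""
    num_workers = len(worker_grads)
    return [sum(column) / num_workers for column in zip(*worker_grads)]


def sgd_step(weights, grad, lr):
    """Apply one plain SGD update with the averaged gradient."""
    return [w - lr * g for w, g in zip(weights, grad)]


# Hypothetical gradients computed by 4 workers for a 2-parameter model.
worker_grads = [
    [0.2, -0.4],
    [0.4, 0.0],
    [0.0, -0.2],
    [0.2, -0.2],
]

avg_grad = allreduce_average(worker_grads)
weights = sgd_step([1.0, 1.0], avg_grad, lr=0.5)

print("averaged gradient:", avg_grad)
print("updated weights:", weights)
```

Because every worker ends up with the same averaged gradient, the model replicas stay in sync without a central parameter server.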
  • 43. Kubeflow Pipelines. Compose, deploy, and manage end-to-end ML workflows: end-to-end orchestration; easy, rapid, and reliable experimentation; easy re-use. Built using Pipelines SDK: kfp.compiler, kfp.components, kfp.Client. Uses Argo under the hood to orchestrate resources.
  • 44. Kubeflow Pipelines Platform. UI for managing and tracking experiments, jobs, and runs. Engine for scheduling multi-step ML workflows. SDK for defining and manipulating pipelines and components (kubeflow-pipelines.readthedocs.io/en/latest/). Notebooks for interacting with the system using the SDK.
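A multi-step ML workflow is a graph of steps in which each step consumes the outputs of the steps it depends on. The sketch below runs such a graph in dependency order using only the Python standard library (`graphlib`, Python 3.9+). It is not the `kfp` SDK: the step names, the toy "model", and `run_pipeline` are hypothetical, and a real pipeline would run each step as a container on the cluster.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, deps):
    """Run steps in dependency order, passing upstream outputs as arguments."""
    order = list(TopologicalSorter(deps).static_order())
    outputs = {}
    for name in order:
        inputs = [outputs[upstream] for upstream in deps.get(name, ())]
        outputs[name] = steps[name](*inputs)
    return order, outputs


# Hypothetical three-step workflow: prepare data, "train", then evaluate.
steps = {
    "prepare": lambda: [1.0, 2.0, 3.0, 4.0],
    "train": lambda data: sum(data) / len(data),       # toy model: the mean
    "evaluate": lambda model: {"model": model, "ok": model > 0},
}

# Each key lists the steps that must finish before it can start.
deps = {
    "train": ["prepare"],
    "evaluate": ["train"],
}

order, outputs = run_pipeline(steps, deps)
print("execution order:", order)
print("evaluation:", outputs["evaluate"])
```

Declaring dependencies separately from the step functions is what makes individual steps easy to reuse across pipelines.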
  • 45. Creating Kubeflow Pipeline Components
  • 46. Consumer Loan Acceptance Scoring (www.credo.be). Objective: putting hundreds of data products live; single development -> deployment -> delivery environment; first go batch, then real-time. Analytics environment: AWS; high security and compliance with regulation. Typical modeling context: structured data; supervised learning; internalizing interpretable models and hybrid pipelines.
  • 47. Requirements (www.credo.be). Data Scientists: hybrid, integrated, cloud-based dev env; Python; PySpark (locally + remotely on Spark cluster); R; SQL; version control (scripts & artifacts). ML DevOps: seamless deployment of hybrid pipelines; trigger-based scheduling & orchestration of runs; monitoring & dashboard; version control (runs & pipelines).
  • 48. Architecture (www.credo.be). [Diagram labels: AWS (infrastructure, connections, security; S3, Spark cluster, VMs, …); Amazon EKS; Amazon ECR; Kubeflow 0.6 (notebook dev environment; pipelines for dev & delivery); ElasticStack (dashboarding); custom notebook servers]
  • 49. Storage: FSx for Lustre. High performance file system for processing Amazon S3 or on-premises data. Low latency and high throughput. Works natively with Amazon S3. Container Storage Interface (CSI) driver.
  • 50. Best Practices for Optimizing Distributed Deep Learning Performance on Amazon EKS. https://aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/
  • 51. Advantages of KubeFlow on AWS (product names appeared as logos on the slide and are not in the transcript). EKS cluster provision with […]; external traffic with […]; […] to manage Lustre file system; centralized and unified K8s logs in […]; TLS and Auth with […] and […]; […] for your K8s API server endpoint; detect GPU instance and install […]. kubeflow.org/docs/aws
  • 53. Kanban Board: github.com/orgs/kubeflow/projects/25
  • 54. Application Requirements 1.0: https://github.com/kubeflow/community/blob/master/guidelines/application_requirements.md
  • 55. Machine Learning pipeline. Choose and optimize your ML algorithm; set up and manage environments for training; deploy model in production; collect & prepare training data; train and tune model (trial and error); scale & manage environment in production.
  • 56. Machine Learning pipeline for Kubernetes on AWS. Choose and optimize your ML algorithm: linear regression, decision tree, BYOA. Set up and manage environments for training: GPU- and CPU-based clusters, *operators (TensorFlow, MXNet, …). Deploy model in production: TensorFlow Serving, MXNet Model Server, Seldon, … Collect & prepare training data: EMR, Redshift, S3. Train and tune model (trial and error): TensorFlow, Horovod, MXNet, PyTorch, Keras, … Scale & manage environment in production: EKS or self-managed K8s.
  • 57. References. Workshop: eksworkshop.com/kubeflow. Jupyter notebooks: github.com/aws-samples/eks-kubeflow-workshop/. Optimizing Machine Learning performance: aws.amazon.com/blogs/opensource/optimizing-distributed-deep-learning-performance-amazon-eks/