How To Build Efficient ML Pipelines
From the Startup Perspective
Jaeman An <jaeman@aitrics.com>
GPU Technology Conference, 2019
Machine Learning Pipelines
Challenges that many fast-growing startups face
Solutions we came up with
Several tools and tips that may be useful for you: kubernetes, polyaxon,
kubeflow, terraform, ...
How to build your own training farm, step by step
How to deploy & manage trained models, step by step
What you can get from this talk
01 Why we built a ML pipeline
02 Brief introduction to kubernetes
03 Model building & training phase
- Building training farm from zero (step by step)

- Terraform, Polyaxon
04 Model deployment & production phase
- Building inference farm from zero (step by step)

- Several ways to make microservices

- Kubeflow
05 Conclusion
06 What's next?
Why we built a ML pipeline
Buy GPU machines
Build (Explore) your own models
Train models
Freeze and deploy as a service
Conduct fitting and re-training
Earn money and exit
Very simple way to start a machine learning startup
(Pipeline diagram: Data refining -> Model building -> Training -> Deploying -> Fitting, re-training)
Mostly a time-consuming job
Sometimes we need to do large-scale data processing
Use Apache Spark!
(This won't be covered in this talk)
We don't handle real-time data *yet*
Kafka Streams is a feasible solution
(This won't be covered in this talk)
Have to manage several data versions
due to sampling policies and operational definitions (labeling)
Can use Git-like solutions (see the sketch below)
It'd be great to import data easily in the training phase, e.g.
./train --data=images_v1
Permission Control
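As one concrete (hedged) sketch of such a Git-like option, DVC (mentioned again at the end of this talk) versions datasets alongside Git and pushes the data itself to S3; the paths and bucket here are hypothetical:
# version a dataset with DVC
$ dvc init
$ dvc add images_v1/                 # writes the images_v1.dvc metafile
$ git add images_v1.dvc .gitignore && git commit -m "data: images_v1"
$ dvc remote add -d storage s3://aitrics-training-data/dvc
$ dvc push                           # upload the data to the remote
# later, in the training phase
$ dvc pull images_v1.dvc             # fetch exactly this version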
What's going on in data refining phase
Reviewing tons of prior research
Pick a simple baseline model with a small set of data
Check minimal accuracy and debug our model
(if data matters) refine the data more precisely
(if model matters) iteratively improve our model
Mostly we only need a GPU instance or a notebook and small
datasets; we don't want to care about anything else!
./run-notebook tf-v12-gpu --gpu=4 --data=images_v1
./ssh tf-v12-gpu --gpu=2 --data=images_v1
What's going on in model building phase
Training on large datasets
Researchers have to "hunt" idle GPU resources by
accessing 10+ servers one by one
Scalability: Sometimes there are no idle GPU resources
(depends on product timeline / paper deadline)
Access Control: Sometimes all resources are
occupied by outside collaborators
Data accessibility: Fetching / moving training data
from server to server is very painful!
Monitoring: Want to know how our experiments are
going and what's going on with our resources
What's going on in training phase
In the middle of machine learning
engineering and software engineering
Want to manage the model independently of
the product
Build micro-services that run inference on test data
synchronously / asynchronously
Have to consider high availability in
production usage
What's going on in deploying phase
Data distribution always changes; therefore, we have to
keep fitting the model to the real data
Want to easily change the model code interactively
Try to build an online-learning model, or re-train the model
on a certain schedule
Sometimes need to create a real-time data flow with
Kafka
Have to manage several model versions
As new models are developed
As the usage varies
What's going on to us in fitting phase
Model building & training phase:
We need to know the status of resources without accessing our
physical servers one by one.
We want to easily use idle GPUs with the proper training datasets
We have to control permissions on our resources and datasets
We mainly want to focus on our research: developing innovative
models, conducting experiments, ... not infrastructure
Problems and requirements
Model deploying & updating phase:
It's hard to control because it sits in the middle of machine learning
engineering and software engineering
We want to create simple micro-services that don't need much management
There are many models with different purposes;
- some models need real-time inference
- some models do not require real time, but need inference within a
certain time range
We have to consider high-availability configuration
Models must be fitted and re-trained easily
We have to manage several versions of models
Problems and requirements
Managing resources over multiple servers, deploying microservices,
permission controls, ...
These can be solved with orchestration solutions.
We are going to build a training farm using kubernetes.
Before that, what is kubernetes?
How to solve
Kubernetes in 5 minutes
Kubernetes (k8s) is an open-source
system for automating deployment,
scaling, and management of
containerized applications.
It orchestrates computing, networking,
and storage infrastructure on behalf of
user workloads.
NVIDIA GPUs can also be orchestrated
through NVIDIA's k8s device plugin
Kubernetes
(Architecture diagram: a k8s master schedules work onto minion nodes that run pods, containers, services, and ingress/NodePort endpoints exposed to the internet, with storage attached to the nodes.)
Give me 4 CPUs, 1 GB of memory, 1 GPU
I'm Jaeman An, and I'm in team A's
namespace
With 4 external ports
With the abcd.aitrics.com hostname
With the latest GPU tensorflow image
With 100GB of writable volumes and data
from a readable source
Kubernetes:
OK, here you are
No, you have no permission
No, you've already used all the resources
you're allowed
No, there are no idle resources, please wait
Kubernetes
<Objects>
Workload & Services: Pod, Service, Ingress,
Deployment, ReplicationController, ...
Storage: StorageClass, PersistentVolume,
PersistentVolumeClaim, ...
Workload Controllers: Job, CronJob, ReplicaSet,
ReplicationController, DaemonSet, ...
<Meta & Policies>
Namespace
Role & Authorization
Resource Quota
Kubernetes
A Pod is the basic building block of Kubernetes ‒
the smallest and simplest unit in the Kubernetes
object model that you create or deploy. A Pod
represents a running process on your cluster.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-base
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
    command: ["nvidia-smi"]
Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/
Kubernetes
A Service is an abstraction which defines a logical
set of Pods and a policy by which to access them -
sometimes called a micro-service.
kind: Service
apiVersion: v1
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
Ref: https://kubernetes.io/docs/concepts/services-networking/service/
Kubernetes
Ingress exposes HTTP and HTTPS routes from
outside the cluster to services within the cluster.
Traffic routing is controlled by rules defined on
the Ingress resource.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test-ingress
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - backend:
          serviceName: MyService
          servicePort: 80
Ref: https://kubernetes.io/docs/concepts/services-networking/ingress/
Kubernetes
A PersistentVolume (PV) is a piece of storage in
the cluster that has been provisioned by an
administrator. It is a resource in the cluster just
like a node is a cluster resource.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0003
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  nfs:
    path: /tmp
    server: 172.17.0.2
Ref: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
Kubernetes
A PersistentVolumeClaim (PVC) is a request for
storage by a user. Claims can request specific size
and access modes (e.g., can be mounted once
read/write or many times read-only).
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 8Gi
Ref: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims
Kubernetes
A Job creates one or more Pods and ensures that
a specified number of them successfully
terminate. As pods successfully complete,
the Job tracks the successful completions.
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
Ref: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
Kubernetes
Kubernetes supports multiple virtual clusters backed by
the same physical cluster. These virtual clusters are
called namespaces. They are intended for use in
environments with many users spread across multiple
teams or projects.
$ kubectl get namespaces
NAME STATUS AGE
default Active 1d
kube-system Active 1d
kube-public Active 1d
Ref: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
Kubernetes
A resource quota, defined by a ResourceQuota
object, provides constraints that limit aggregate
resource consumption per namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.nvidia.com/gpu: 1
Ref: https://kubernetes.io/docs/concepts/policy/resource-quotas/
Kubernetes
In Kubernetes, you must be authenticated
(logged in) before your request can be authorized
(granted permission to access).
Kubernetes uses client certificates, bearer tokens,
an authenticating proxy, or HTTP basic auth to
authenticate API requests through authentication
plugins.
Ref: https://kubernetes.io/docs/reference/access-authn-authz/authentication/
Kubernetes
Role-based access control (RBAC) is a method of
regulating access to computer or network
resources based on the roles of individual users
within an enterprise.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
Ref: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
Kubernetes
A RoleBinding grants the permissions defined in a
role to a user or set of users. It holds a list of
subjects and a reference to the role being granted.
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Ref: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
Model building & training phase
- Building training farm from zero (step by step)
- Polyaxon
- Terraform
We need to know GPU resource status without accessing our physical
servers one by one.
We want to easily use idle GPUs with the proper training datasets
We have to control permissions on our resources and datasets
We only want to focus on our research: building models, doing the
experiments, ... not infrastructure!
./run-notebook tf-v12-gpu --gpu=4 --data=images_v1
./train tf-v12-gpu model.py --gpu=4 --data=images_v1
./ssh tf-v12-gpu --gpu=4 --data=images_v1 --expose-ports=4
RECAP: Our requirements
Blueprint
Step 1. Install Kubernetes master on AWS
Step 2. Install Kubernetes as nodes in physical servers
Step 3. Run hello world training containers
Step 4. RBAC Authorization & resource quota
Step 5. Expand GPU servers on demand with AWS
Step 6. Attach training data
Step 7. Web dashboard or cli tools to run training container
Step 8. With other tools (Polyaxon)
Instructions
There are several ways to install kubernetes
We use kubeadm in this session.
Other options: conjure-up, kops
Network option: flannel (https://github.com/coreos/flannel)
Server configuration that I've used for the k8s master:
AWS t3.large: 2 vCPUs, 8GB memory
Ubuntu 18.04, docker version 18.09
Step 1. Install Kubernetes master on AWS
Ref: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
Step 1. Install Kubernetes master on AWS
# Install kubeadm
# https://kubernetes.io/docs/setup/independent/install-kubeadm/
$ curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
$ cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
$ apt-get update
$ apt-get install -y kubelet kubeadm kubectl
Ref: https://kubernetes.io/docs/setup/independent/install-kubeadm/
Step 1. Install Kubernetes master on AWS
# Initialize with Flannel (https://github.com/coreos/flannel)
$ kubeadm init --pod-network-cidr=10.244.0.0/16
Your kubernetes master has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You can now join any number of machines by running the following on each node
as root:
kubeadm join 172.31.30.194:6443 --token *** --discovery-token-ca-cert-hash ***
Ref: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
Step 1. Install Kubernetes master on AWS
# Initialize with Flannel (https://github.com/coreos/flannel)
$ kubectl -n kube-system apply -f https://raw.githubusercontent.com/coreos/flannel/62e44c867a2846fefb68bd5f178daf4da3095ccb/Documentation/kube-flannel.yml
Ref: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
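At this point it's worth a quick sanity check that the control plane came up (plain kubectl; not in the original slides):
# all control-plane pods, including flannel, should reach Running
$ kubectl get pods -n kube-system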
Step 1. Install Kubernetes master on AWS
# Install NVIDIA k8s-device-plugin
# https://github.com/NVIDIA/k8s-device-plugin
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
Ref: https://github.com/NVIDIA/k8s-device-plugin
In this step,
install nvidia-docker
join the kubernetes master
use the kubeadm join command
install NVIDIA's k8s-device-plugin
create the kubernetes dashboard to check resources
Server configuration that I've used for the k8s nodes:
32 CPU cores, 128GB memory
4 GPUs (Titan Xp), driver version: 396.44
Ubuntu 16.04, docker version 18.09
Step 2. Install kubernetes as nodes in physical servers
Step 2. Install kubernetes as nodes in physical servers
# Install nvidia-docker (https://github.com/NVIDIA/nvidia-docker)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
    | tee /etc/apt/sources.list.d/nvidia-docker.list
$ apt-get update
$ apt-get install -y nvidia-docker2
Ref: https://github.com/NVIDIA/nvidia-docker
Step 2. Install kubernetes as nodes in physical servers
# change the docker default runtime to nvidia-docker
$ vi /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
$ systemctl restart docker
Ref: https://github.com/NVIDIA/nvidia-docker
Step 2. Install kubernetes as nodes in physical servers
# test nvidia-docker is successfully installed
$ docker run --rm -it nvidia/cuda nvidia-smi
+----------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 CUDA Version: 10.0 |
|----------------------------------------------------------------------|
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|=============================+=================+======================|
| 0 Titan Xp On | 00 :00:1E.0 Off | 0 |
+-----------------------------+-----------------+-------- -------------+
+----------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|======================================================================|
| No running processes found |
+----------------------------------------------------------------------+
Ref: https://github.com/NVIDIA/nvidia-docker
Step 2. Install kubernetes as nodes in physical servers
# join the kubernetes master with kubeadm
$ kubeadm join 172.31.30.194:6443 --token *** --discovery-token-ca-cert-hash ***
...
This node has joined the cluster.
* Certificate signing request was sent to apiserver and a response was
received
* The Kubelet was informed of the new secure connection details
Run 'kubectl get nodes' on the master to see this node join the
cluster.
Step 2. Install kubernetes as nodes in physical servers
# check that the node (named 'stark') has joined the cluster
# run this command on the master
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-99-9 Ready master 99d v1.12.2
stark Ready <none> 99d v1.12.2
Step 2. Install kubernetes as nodes in physical servers
# create the kubernetes dashboard
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v1.10.1/src/deploy/recommended/kubernetes-dashboard.yaml
$ kubectl proxy
Ref: https://github.com/kubernetes/dashboard
Write a pod definition
Run nvidia-smi with the CUDA image
Train MNIST with tensorflow and save the model to S3
Step 3. Run hello-world container
Example: nvidia-smi
# run nvidia-smi in a container
# pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
    command: ["nvidia-smi"]
Example: nvidia-smi
# create pod from definition
$ kubectl create -f pod.yml
pod/gpu-pod created
Example: nvidia-smi
# check the pod logs
$ kubectl logs gpu-pod
+----------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 CUDA Version: 10.0 |
|----------------------------------------------------------------------|
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|=============================+=================+======================|
| 0 Titan Xp On | 00 :00:1E.0 Off | 0 |
+-----------------------------+-----------------+-------- -------------+
+----------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|======================================================================|
| No running processes found |
+----------------------------------------------------------------------+
Example: MNIST
# train_mnist.py
import argparse

import tensorflow as tf

def main(args):
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation=tf.nn.relu),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=args.epoch)
    model.evaluate(x_test, y_test)
    saved_model_path = tf.contrib.saved_model.save_keras_model(model, args.save_dir)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--epoch', type=int, default=1)
    parser.add_argument('--save_dir', default='saved_models/')
    main(parser.parse_args())
Example: MNIST
# Dockerfile
FROM tensorflow/tensorflow:latest-gpu-py3
WORKDIR /train_demo/
COPY . /train_demo/
RUN pip --no-cache-dir install --upgrade awscli
ENTRYPOINT ["/train_demo/run.sh"]

# run.sh
#!/bin/bash
set -e
python train_mnist.py --epoch 1
aws s3 sync saved_models/ $MODEL_S3_PATH
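To run this on the cluster, build and push the image under the tag that the pod definition below expects (pushing to the aitrics registry account assumes you own it):
$ docker build -t aitrics/train-mnist:1.0 .
$ docker push aitrics/train-mnist:1.0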
Example: MNIST
# pod definition
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: aitrics/train-mnist:1.0
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
    env:
    - name: MODEL_S3_PATH
      value: "s3://aitrics-model-bucket/saved_model"
Example: MNIST
# create pod from definition
$ kubectl create -f pod.yml
pod/gpu-pod created
It works!
Example: MNIST
Now we have,
A minimally working proof of concept
Researchers can train on kubernetes with kubectl
We still have to do,
RBAC (role-based access control) between researchers, engineers, and outside
collaborators
Training data & output volume attachment
Researchers don't want to know what kubernetes is. They only need
an instance accessible via SSH (with frameworks and training data),
or a nice web view and jupyter notebook,
or automatic hyperparameter searching...
Summary
Instructions:
Create user (team) namespace
Create user credentials with cluster CA key
default CA key location: /etc/kubernetes/pki
Create role and role binding with proper permissions
Create resource quota per namespace
References:
https://docs.bitnami.com/kubernetes/how-to/configure-rbac-in-your-kubernetes-
cluster/
https://kubernetes.io/docs/reference/access-authn-authz/rbac/
Step 4. Role Based Access Control & Resource Quota
Step 4. Role Based Access Control & Resource Quota
# create user (team) namespace
$ kubectl create namespace team-a
Step 4. Role Based Access Control & Resource Quota
# list the namespaces
$ kubectl get namespaces
NAME STATUS AGE
default Active 99d
team-a Active 4s
kube-public Active 99d
kube-system Active 99d
Step 4. Role Based Access Control & Resource Quota
# create user credentials
$ openssl genrsa -out jaeman.key 2048
$ openssl req -new -key jaeman.key -out jaeman.csr -subj "/CN=jaeman/O=aitrics"
$ openssl x509 -req -in jaeman.csr -CA CA_LOCATION/ca.crt -CAkey \
    CA_LOCATION/ca.key -CAcreateserial -out jaeman.crt -days 500
Ref: https://kubernetes.io/docs/reference/access-authn-authz/authentication/
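The slides don't show wiring the new certificate into kubectl; a minimal sketch with standard kubectl commands (assuming kubeadm's default cluster name, kubernetes):
$ kubectl config set-credentials jaeman \
    --client-certificate=jaeman.crt --client-key=jaeman.key
$ kubectl config set-context jaeman-team-a \
    --cluster=kubernetes --namespace=team-a --user=jaeman
$ kubectl config use-context jaeman-team-a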
Step 4. Role Based Access Control & Resource Quota
# create Role definition
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: team-a
  name: software-engineer-role
rules:
- apiGroups: ["", "extensions", "apps"]
  resources: ["deployments", "replicasets", "pods", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] # You can also use ["*"]
Ref: https://kubernetes.io/docs/reference/access-authn-authz/authentication/
Step 4. Role Based Access Control & Resource Quota
# create RoleBinding definition
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: team-a
  name: jaeman-software-engineer-role-binding
subjects:
- kind: User
  name: jaeman
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: software-engineer-role
  apiGroup: rbac.authorization.k8s.io
Ref: https://kubernetes.io/docs/reference/access-authn-authz/authentication/
Step 4. Role Based Access Control & Resource Quota
# create resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: 1
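Applying and verifying these is plain kubectl (the file names here are hypothetical):
$ kubectl apply -f role.yml -f role-binding.yml -f quota.yml
$ kubectl describe resourcequota compute-resources -n team-a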
Store the kubeadm join script in S3
Write userdata (instance bootstrap script)
install kubeadm, nvidia-docker
join the cluster
Add an Auto Scaling Group
Step 5. Expand GPU servers on AWS
Step 5. Expand GPU servers on AWS
# save master join command in AWS S3
# s3://k8s-training-cluster/join.sh
kubeadm join 172.31.75.62:6443 --token *** --discovery-token-ca-cert-
hash ***
Step 5. Expand GPU servers on AWS
# userdata script file
# RECAP: install kubernetes as a node to join the master (step 2)
# install kubernetes
apt-get install -y kubelet kubeadm kubectl
# install nvidia-docker
apt-get install -y nvidia-docker2
...
# fetch the join command from S3 and execute it
$(aws s3 cp s3://k8s-training-cluster/join.sh -)
Step 5. Expand GPU servers on AWS
# check bootstrapping log
$ tail -f /var/log/cloud-init-output.log
...
++ aws s3 cp s3://k8s-training-cluster/join.sh -
+ kubeadm join 172.31.75.62:6443 --token *** --discovery-token-ca-cert-
hash ***
[preflight] Running pre-flight checks
[discovery] Trying to connect to API Server "172.31.75.62:6443"
[discovery] Created cluster-info discovery client, requesting info from
"https://172.31.75.62:6443"
[discovery] Requesting info from "https://172.31.75.62:6443" again to
validate TLS against the pinned public key
...
Initially store training data in S3 (with encryption)
Option 1: Download training data when the pod starts
training data is usually big
the same training data is often used, so this would be very inefficient
caching on host machine volumes --> disks fill up easily
use a storage server and mount its volumes instead!
Option 2: Create an NFS on AWS EC2 or a storage server (e.g. NAS)
Sync all data with S3
Mount as a Persistent Volume with ReadOnlyMany / ReadWriteMany
Option 3: shared storage with s3fs
https://icicimov.github.io/blog/virtualization/Kubernetes-shared-storage-with-S3-backend/
Step 6. Training data attachment
Step 6. Training data attachment
# make an nfs server on EC2 (or a physical storage server)
# https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-16-04
$ apt-get update
$ apt-get install nfs-kernel-server
$ mkdir /var/nfs -p
$ cat <<EOF > /etc/exports
/var/nfs 172.31.75.62(rw,sync,no_subtree_check)
EOF
$ systemctl restart nfs-kernel-server
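Before wiring this into kubernetes, it's worth checking the export from a node (standard NFS client tooling; not shown in the original slides):
# on each k8s node
$ apt-get install -y nfs-common
$ showmount -e <server ip>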
Step 6. Training data attachment
# define the persistent volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 3Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: <server ip>
    path: "/var/nfs"
Step 6. Training data attachment
# define the persistent volume claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 3Gi
Step 6. Training data attachment
# mount the volume in a pod
apiVersion: v1
kind: Pod
metadata:
  name: pvpod
spec:
  volumes:
  - name: testpv
    persistentVolumeClaim:
      claimName: nfs-pvc
  containers:
  - name: test
    image: python:3.7.2
    volumeMounts:
    - name: testpv
      mountPath: /data/test
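A quick sanity check that the claim binds and the mount works (plain kubectl; the file name pvpod.yml is hypothetical):
$ kubectl create -f pvpod.yml
$ kubectl exec pvpod -- ls /data/test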
Make scripts like
./kono ssh --image tensorflow/tensorflow --expose-ports 4
./kono train --image tensorflow/tensorflow --entrypoint main.py .
Create a web dashboard
Step 7. Web dashboard or cli tools to run training container
Step 7. Web dashboard or cli tools to run training container
# cli tool to use our cluster
$ kono login

Username: jaeman
Password: [hidden]
Step 7. Web dashboard or cli tools to run training container
# cli tool to use our cluster
$ kono train \
    --image tensorflow/tensorflow:latest-gpu \
    --gpu 1 \
    --script train.py \
    --input-data /var/project-a-data/:/opt/project-a-data/ \
    --output-dir /opt/outputs/:./outputs/ \
    -- \
    --epoch=1 --checkpoint=/opt/outputs/ckpts/

...
...
training completed!
Sending output directory to s3... [>>>>>>>>>>>>>>>>>>>>>>>] 100%
Pulling output directory to local... [>>>>>>>>>>>>>>>>>>>>>>>] 100%
Check your directory ./outputs/
Step 7. Web dashboard or cli tools to run training container
# cli tool to use our cluster
$ kono ssh \
    --image tensorflow/tensorflow:latest-gpu \
    --gpu 1 \
    --expose-ports 4 \
    --input-data /var/project-a-data/:/opt/project-a-data/

...
...
...

Your container is ready!
ssh ubuntu@k8s.aitrics.com -p 31546
Step 7. Web dashboard or cli tools to run training container
# cli tool to use our cluster
$ kono terminate-all --force

terminate all your containers? [Y/n]: Y

...
...
...
Success!
Step 7. Web dashboard or cli tools to run training container
We are still working on it
Check our progress, or contribute, at
https://github.com/AITRICS/kono
Step 7. Web dashboard or cli tools to run training container
A platform for reproducing and managing the whole life cycle of
machine learning and deep learning applications.
https://polyaxon.com/
The most feasible tool for
our training cluster
Can be installed on
kubernetes easily
Step 8. Use other tools (polyaxon)
Ref: https://www.polyaxon.com/
Polyaxon usage
# Polyaxon usage
# Create a project
$ polyaxon project create --name=quick-start --description='Polyaxon quick start.'

# Initialize
$ polyaxon init quick-start

# Upload code and start experiments
$ polyaxon run -u
Ref: https://github.com/polyaxon/polyaxon
Polyaxon is a platform for managing the whole lifecycle of large scale deep
learning and machine learning applications, and it supports all the major
deep learning frameworks such as Tensorflow, MXNet, Caffe, Torch, etc.
Features
Powerful workspace
Reproducible results
Developer-friendly API
Built-in Optimization engine
Plugins & integrations
Roles & permissions
Polyaxon
Ref: https://docs.polyaxon.com/concepts/features/
Polyaxon architecture
Ref: https://docs.polyaxon.com/concepts/architecture/
1. Create project on polyaxon
polyaxon project create --name=quick-start
2. Initialize the project
polyaxon init quick-start
3. Create polyaxonfile.yml
See next slide
4. Upload your code and start an experiment with it
How to run my experiment on polyaxon?
Polyaxon usage
# polyaxonfile.yml
version: 1
kind: experiment
build:
  image: tensorflow/tensorflow:1.4.1-py3
  build_steps:
  - pip3 install polyaxon-client
run:
  cmd: python model.py
Ref: https://docs.polyaxon.com/concepts/quick-start-internal-repo/
Polyaxon usage
# model.py
# https://github.com/polyaxon/polyaxon-quick-start/blob/master/model.py
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
from polyaxon_client.tracking import Experiment, get_data_paths, get_outputs_path

data_paths = list(get_data_paths().values())[0]
mnist = input_data.read_data_sets(data_paths, one_hot=False)
experiment = Experiment()
...
estimator = tf.estimator.Estimator(
    get_model_fn(learning_rate=learning_rate, dropout=dropout, activation=activation),
    model_dir=get_outputs_path())
estimator.train(input_fn, steps=num_steps)
...
experiment.log_metrics(loss=metrics['loss'],
                       accuracy=metrics['accuracy'],
                       precision=metrics['precision'])
Ref: https://github.com/polyaxon/polyaxon-quick-start/blob/master/model.py
Polyaxon usage
# Integrations in polyaxon
# Notebook
$ polyaxon notebook start -f polyaxon_notebook.yml
# Tensorboard
$ polyaxon tensorboard -xp 23 start
Ref: https://github.com/polyaxon/polyaxon
How to?
Make a single file train.py that accepts 2 parameters
learning rate - lr
batch size - batch_size
Update the polyaxonfile.yml with a matrix
Make an experiment group
Experiment group search algorithms:
grid search / random search / Hyperband / Bayesian optimization
https://docs.polyaxon.com/references/polyaxon-optimization-engine/
Experiment Groups - Hyperparameter Optimization
Ref: https://docs.polyaxon.com/concepts/experiment-groups-hyperparameters-optimization/
Experiment Groups - Hyperparameter Optimization
# polyaxonfile.yml
version: 1
kind: group
declarations:
  batch_size: 128
hptuning:
  matrix:
    lr:
      logspace: 0.01:0.1:5
build:
  image: tensorflow/tensorflow:1.4.1-py3
  build_steps:
  - pip install scikit-learn
run:
  cmd: python3 train.py --batch-size={{ batch_size }} --lr={{ lr }}
Ref: https://docs.polyaxon.com/concepts/experiment-groups-hyperparameters-optimization/
Experiment Groups - Hyperparameter Optimization
# polyaxonfile_override.yml
version: 1
hptuning:
  concurrency: 2
  random_search:
    n_experiments: 4
  early_stopping:
  - metric: accuracy
    value: 0.9
    optimization: maximize
  - metric: loss
    value: 0.05
    optimization: minimize
Ref: https://docs.polyaxon.com/concepts/experiment-groups-hyperparameters-optimization/
Instructions
Install helm - the kubernetes package manager
Create a polyaxon namespace
Write your own config for polyaxon
Run polyaxon with helm
How to install polyaxon?
How to install polyaxon?
# install helm (kubernetes package manager)
$ snap install helm --classic
$ helm init
Ref: https://github.com/polyaxon/polyaxon
How to install polyaxon?
# install polyaxon with helm
$ kubectl create namespace polyaxon
$ helm repo add polyaxon https://charts.polyaxon.com
$ helm repo update
Ref: https://github.com/polyaxon/polyaxon
How to install polyaxon?
# config.yaml
rbac:
  enabled: true
ingress:
  enabled: true
serviceType: LoadBalancer
persistence:
  data:
    training-data-a-s3:
      store: s3
      bucket: s3://aitrics-training-data
    data-pvc1:
      mountPath: "/data-pvc/1"
      existingClaim: "data-pvc-1"
  outputs:
    devtest-s3:
      store: s3
      bucket: s3://aitrics-dev-test
integrations:
  slack:
  - url: https://hooks.slack.com/services/***/***
    channel: research-feed
Ref: https://github.com/polyaxon/polyaxon
How to install polyaxon?
# install polyaxon with helm
$ helm install polyaxon/polyaxon \
    --name=polyaxon \
    --namespace=polyaxon \
    -f config.yml

1. Get the application URL by running these commands:
export POLYAXON_IP=$(kubectl get svc --namespace polyaxon polyaxon-polyaxon-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export POLYAXON_HTTP_PORT=80
export POLYAXON_WS_PORT=80
echo http://$POLYAXON_IP:$POLYAXON_HTTP_PORT
2. Set up your cli by running these commands:
polyaxon config set --host=$POLYAXON_IP --http_port=$POLYAXON_HTTP_PORT --ws_port=$POLYAXON_WS_PORT
Summary
(Architecture diagram: the control plane runs on AWS (a single EC2 instance or multiple EC2 instances) and hosts the k8s master, kono-web, and polyaxon behind a k8s Service/Ingress and an ELB; the training farm consists of physical GPU servers plus auto-scaled AWS GPU nodes joined as kubernetes minions; storage is shared via S3, NAS, and NFS; users access everything through kono-cli and the web, governed by namespaces, RBAC, and resource quotas.)
Need to know GPU resource status without accessing our physical servers one
by one
Use the web dashboard or other monitoring tools like Prometheus +
cAdvisor
Want to easily use idle GPUs with the proper training datasets
Use kubernetes objects to get resources and to mount volumes
Have to control permissions on our resources and datasets
RBAC / resource quotas in kubernetes
Want to focus on our research: building models, doing the experiments, ... not
infrastructure!
Use kono / polyaxon
RECAP: Our requirements
Too many steps to build my own cluster!
Make it a reusable component
Use Terraform
Infrastructure as code
Terraform
resource "aws_instance" "master" {
ami = "ami-593801f1"
instance_type = "t3.small"
key_name = "aitrics-secret-master-key"
iam_instance_profile = "kubernetes-master-iam-role"
user_data = "${data.template_file.master.rendered}"
root_block_device = {
volume_size = "15"
}
}
$ terraform apply
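The resource above references data.template_file.master, which the slides don't show; a hedged sketch of what it might look like (Terraform 0.11 syntax; the template path and variable are hypothetical):
# renders the node bootstrap script (userdata) from a template file
data "template_file" "master" {
  template = "${file("userdata/master.sh.tpl")}"

  vars {
    join_script_s3 = "s3://k8s-training-cluster/join.sh"
  }
}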
We publish our infrastructure as code
https://github.com/AITRICS/kono
Configure your settings and just type `terraform apply` to get your
own training cluster!
Terraform
Model deployment & production phase
- Building inference farm from zero (step by step)

- Several ways to make microservices

- Kubeflow
It's hard to control because it sits in the middle of machine learning
engineering and software engineering
We want to create simple micro-services that don't need much
management
There are many models with different purposes;
- some models need real-time inference
- some models do not require real time, but need inference within a
certain time range
We have to consider high-availability configuration
Models must be fitted and re-trained easily
We have to manage several versions of models
RECAP: Our requirements
Step 1. Build another kubernetes cluster for production
Step 2. Make simple web-based micro services for trained models
2-1. HTTP API Server Example
2-2. Asynchronous inference farm example
Step 3. Deploy
3-1. on the kubernetes with ingress
3-2. standalone server with docker and auto scaling group
Step 4. Using TensorRT Inference Server
Step 5. Terraform
Case Study. Kubeflow
Instructions
Launch another cluster, just like the training cluster!
Step 1. Build production kubernetes cluster
2-1. For real-time inference (synchronous)
Use a simple web framework to build an HTTP-based microservice!
We use bottle (or flask)
2-2. For asynchronous inference (inference farm)
with kubernetes jobs - has overhead per execution
with celery - which I prefer
Step 2. Make simple web-based microservices for trained models
Example. Using bottle for HTTP-based microservices
from bottle import run, get, post, request, response
from bottle import app as bottle_app
from aws import aws_client

@post('/v1/<location>/<prediction_type>/')
def inference(location, prediction_type):
    model = select_model(location, prediction_type)
    input_array = deserialize(request.json)
    output_array = model.inference(input_array)  # run the selected model
    return serialize(output_array)

if __name__ == '__main__':
    args = parse_args()
    aws_client.download_model(args.model_path, args.model_version)
    app = bottle_app()
    run(app=app, host=args.host, port=args.port)
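Calling the service is then an ordinary HTTP request; a hedged usage sketch (the host, route values, and payload shape are hypothetical):
import requests

resp = requests.post(
    'http://inference.aitrics.com/v1/seoul/character/',
    json={'image': [0.0] * 784})  # hypothetical serialized input
print(resp.json())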
Example. Using kubernetes job for inference
# job.yml
apiVersion: batch/v1
kind: Job
metadata:
  name: inference-job
spec:
  template:
    spec:
      containers:
      - name: inference
        image: inference
        command: ["python", "main.py", "s3://ps-images/images.png"]
      restartPolicy: Never
  backoffLimit: 4
Ref: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
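For models that "need inference within a certain time range" rather than in real time, a CronJob can run the same image on a schedule; a hedged sketch (batch/v1beta1 was the CronJob API version in kubernetes of this era; the name and schedule are hypothetical):
# cronjob.yml: nightly batch inference at 03:00
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly-inference
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: inference
            image: inference
            command: ["python", "main.py", "s3://ps-images/"]
          restartPolicy: Never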
Celery is an asynchronous task queue/job queue based on distributed
message passing. It is focused on real-time operation, but supports
scheduling as well.
Celery is used in production systems to process millions of tasks a day.
Celery
from celery import Celery

app = Celery('hello', broker='amqp://guest@localhost//')

@app.task
def hello():
    return 'hello world'
Ref: http://www.celeryproject.org/
Example. Using celery for asynchronous inference farm
from celery import task
from aws import aws_client
from db import IdentifyResult
from aitrics.models import FasterRCNN
# settings: the project configuration module (import elided in the slide)

model = FasterRCNN(model_path=settings.MODEL_PATH)

@task
def task_identify_image_color_shape(id, s3_path):
    image = aws_client.download_image(s3_path)
    color, shape = model.inference(image)
    IdentifyResult.objects.create(id, s3_path, color, shape)
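Enqueueing work and running workers is standard celery; a hedged usage sketch (the module name tasks and the arguments are hypothetical):
# enqueue an asynchronous inference from the web service
task_identify_image_color_shape.delay(42, 's3://ps-images/images.png')

# start a worker process (e.g., one per GPU node)
$ celery -A tasks worker --concurrency=1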
on the kubernetes cluster
service & ingress to expose it
use a workload controller like a deployment, replica set, or replication
controller; don't use a bare pod, so you get high availability
on an AWS instance directly
simple docker run example
use an auto scaling group and load balancers with userdata
Step 3. Deploy
Step 3-1. Deploy on kubernetes cluster (ingress)
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: inference-ingress
spec:
  rules:
  - host: inference.aitrics.com
    http:
      paths:
      - backend:
          serviceName: inference-service
          servicePort: 80
Ref: https://kubernetes.io/docs/concepts/services-networking/ingress/
Step 3-1. Deploy on kubernetes cluster (deployment)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: ps-inference
        image: ps-inference:latest
        ports:
        - containerPort: 80
Ref: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
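The slides don't show the Service that links the ingress to the deployment; a minimal sketch (the name inference-service matches the ingress above, and the selector matches the deployment's pod labels):
apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  selector:
    app: inference
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80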
Step 3-2. Deploy on EC2 directly
#!/bin/bash
docker kill ps-inference || true
docker rm ps-inference || true
docker run -d -p 35000:8000 \
  --name ps-inference \
  --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  docker-registry.aitrics.com/ps-inference:gpu \
  --host=0.0.0.0 \
  --port=8000 \
  --sentry-dsn=http://somesecretstring@sentry.aitricsdev.com/13 \
  --gpus=0 \
  --character-model=best_model.params/faster_rcnn_renet101_v1b \
  --shape-model=scnet_shape.params/ResNet50_v2 \
  --color-model=scnet_color.params/ResNet50_v2 \
  --s3-bucket=aitrics-research \
  --s3-path=faster_rcnn/result/181109 \
  --model-path=.data/models \
  --aws-access-key=*** \
  --aws-secret-key=***
TensorRT is a high-performance deep learning inference optimizer and
runtime engine for production deployment of deep learning
applications.
Step 4. Using TensorRT Inference Server
Ref: https://developer.nvidia.com/tensorrt
Use Tensorflow or Caffe to apply TensorRT easily
Consider TensorRT when you build the model
Some operations might not be supported
Add some TensorRT-related code to the Python script
Use the TensorRT docker image to run the inference server.
Step 4. Using TensorRT Inference Server
Step 4. Using TensorRT Inference Server
# TensorRT From ONNX with Python Example
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, \
        builder.create_network() as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open(model_path, 'rb') as model:
        parser.parse(model.read())
    ...
Ref: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#import_onnx_python
Step 4. Using TensorRT Inference Server
# Dockerfile
# https://github.com/NVIDIA/tensorrt-inference-server/blob/master/Dockerfile
FROM aitrics/tensorrt-inference-server:cuda9-cudnn7-onnx
ADD . /ps-inference/
ENTRYPOINT ["/ps-inference/run.sh"]
Ref: https://github.com/onnx/onnx-tensorrt/blob/master/Dockerfile
You can also find our inference cluster as code!
https://github.com/AITRICS/kono
Configure your settings and test the example microservices and inference
farm with terraform!
Step 5. Terraform
The Kubeflow project is dedicated to making deployments of machine learning (ML)
workflows on Kubernetes simple, portable and scalable.
https://www.kubeflow.org/
When to use
You want to train/serve TensorFlow models in different environments (e.g. local, on
prem, and cloud)
You want to use Jupyter notebooks to manage TensorFlow training jobs
You want to launch training jobs that use resources ‒ such as additional CPUs or
GPUs ‒ that aren’t available on your personal computer
You want to combine TensorFlow with other processes
For example, you may want to use tensorflow/agents to run simulations to
generate data for training reinforcement learning models.
Case Study. Kubeflow
Ref: https://www.kubeflow.org/
Re-defines machine learning workflow objects as kubernetes
objects
Runs training, inference, serving, and other things on kubernetes
Needs ksonnet, a configuration management tool for kubernetes manifests
https://www.kubeflow.org/docs/components/ksonnet/
Only works well with tensorflow (support for PyTorch, MPI, and MXNet is in
the alpha/beta stage)
Some functions only work on a GKE cluster
A very early-stage product (less than 1 year old)
Case Study. Kubeflow
TF Job
# TF Job
# https://www.kubeflow.org/docs/components/tftraining/
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  labels:
    experiment: experiment10
  name: tfjob
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Ps:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            ...
Ref: https://www.kubeflow.org/docs/components/tftraining/
Pipelines
Ref: https://www.kubeflow.org/docs/pipelines/
Conclusion
You can build your own training cluster!
You can also build your own inference cluster!
If you don't want to get your hands dirty, you can use our terraform
code and cli.
https://github.com/AITRICS/kono
Summary
What's next?
Monitoring resources
Prometheus + cAdvisor
https://devopscube.com/setup-prometheus-monitoring-on-kubernetes/
Training models from real-time data streaming
Real-time with Kafka Streams (+ Spark Streaming) + online learning
https://github.com/kaiwaehner/kafka-streams-machine-learning-examples
Large-scale data preprocessing
Apache Spark
What's next topic (which is not covered)?
Distributed training
Polyaxon supports it: https://github.com/polyaxon/polyaxon-examples/blob/master/in_cluster/tensorflow/cifar10/polyaxonfile_distributed.yml
Use horovod: https://github.com/horovod/horovod
Model & Data Versioning
https://github.com/iterative/dvc
What's next topic (which is not covered)?
contact@aitrics.com
Tel. +82 2 569 5507 Fax. +82 2 569 5508
www.aitrics.com
Thank you!
Jaeman An <jaeman@aitrics.com>
Contact:
Jaeman An <jaeman@aitrics.com>
Yongseon Lee <yongseon@aitrics.com>
Tony Kim <tonykim@aitrics.com>

More Related Content

Similar to How To Build Efficient ML Pipelines From The Startup Perspective (GTC Silicon Valley 2019) (20)

PPTX
Machine learning in the wild deployment
Birger Moell
 
PDF
Intro - End to end ML with Kubeflow @ SignalConf 2018
Holden Karau
 
PDF
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
Edge AI and Vision Alliance
 
PDF
Infrastructure Agnostic Machine Learning Workload Deployment
Databricks
 
PPTX
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Tyrone Systems
 
PDF
Democratizing machine learning on kubernetes
Docker, Inc.
 
PPTX
Leonid Kuligin "Training ML models with Cloud"
Lviv Startup Club
 
PDF
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
Chris Fregly
 
PPTX
Kubernetes for machine learning
Akash Agrawal
 
PDF
Introduction to DL platform
xiaogaozi
 
PPTX
Getting Started with TensorFlow on Google Cloud
Mariam Aslam
 
PDF
Hydrosphere.io for ODSC: Webinar on Kubeflow
Rustem Zakiev
 
PDF
MLOps pipelines using MLFlow - From training to production
Fabian Hadiji
 
PDF
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Chris Fregly
 
PDF
How To Build Efficient ML Pipelines From The Startup Perspective (OpenInfraDa...
Jaeman An
 
PPTX
Scaling TensorFlow Models for Training using multi-GPUs & Google Cloud ML
Seldon
 
PPTX
Kubernetes data science and machine learning
Kublr
 
PDF
Machine learning using Kubernetes
Arun Gupta
 
PDF
Machine Learning using Kubeflow and Kubernetes
Arun Gupta
 
PPTX
Machine Learning using Kubernetes - AI Conclave 2019
Arun Gupta
 
Machine learning in the wild deployment
Birger Moell
 
Intro - End to end ML with Kubeflow @ SignalConf 2018
Holden Karau
 
“Parallelizing Machine Learning Applications in the Cloud with Kubernetes: A ...
Edge AI and Vision Alliance
 
Infrastructure Agnostic Machine Learning Workload Deployment
Databricks
 
Explore Deep Learning Architecture using Tensorflow 2.0 now! Part 2
Tyrone Systems
 
Democratizing machine learning on kubernetes
Docker, Inc.
 
Leonid Kuligin "Training ML models with Cloud"
Lviv Startup Club
 
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to...
Chris Fregly
 
How To Build Efficient ML Pipelines From The Startup Perspective (GTC Silicon Valley 2019)

  • 11. Data distribution always changes; therefore, we have to keep fitting the model with the real data Want to easily change the model code interactively Try to build an online-learning model or re-train the model on a schedule (a CronJob sketch follows below) Sometimes need to create real-time data flows with Kafka Have to manage several model versions As new models are developed As the usage varies What's going on in the fitting phase Model building Training Deploying Fitting, re-training Data refining
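One way to re-train on a schedule in this kind of setup is a Kubernetes CronJob. A minimal sketch, assuming a hypothetical aitrics/retrain image, training script, and S3 data path (none of these are from the deck):
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly-retrain
spec:
  schedule: "0 3 * * *" # every day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: retrain
            image: aitrics/retrain:latest # hypothetical image
            command: ["python", "train.py", "--data=s3://training-data/latest"]
            resources:
              limits:
                nvidia.com/gpu: 1 # request one GPU for the re-training run
          restartPolicy: OnFailure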
  • 12. Model building & training phase: We need to know the status of resources without accessing our physical servers one by one. We want to easily use idle GPUs with the proper training datasets We have to control permissions on our resources and datasets We mainly want to focus on our research: developing innovative models, conducting experiments, and so on ... not infrastructure Problems and requirements
  • 13. Model deploying & updating phase: It's hard to control because it sits in the middle of machine learning engineering and software engineering We want to create simple micro-services that don't need much management There are many models with different purposes; 
 - some models need real-time inference
 - some models do not require real time, but they need inference within a certain time range We have to consider high availability configuration Models must be fitted and re-trained easily We have to manage several versions of models Problems and requirements
  • 14. Managing resources over multiple servers, deploying microservices, permission controls, ... These can be solved with orchestration solutions. We are going to build a training farm using kubernetes. Before that, what is kubernetes? How to solve
  • 15. Kubernetes in 5 minutes
  • 16. Kubernetes (k8s) is an open-source system for automating deployment, scaling, and management of containerized applications. It orchestrates computing, networking, and storage infrastructure on behalf of user workloads. NVIDIA GPUs can also be orchestrated through NVIDIA's k8s device plugin Kubernetes (diagram: a k8s master with attached storage, and k8s minions running pods behind Services, Ingress, and NodePorts exposed to the Internet)
  • 17. Give me 4 CPU, 1 Memory, 1 GPU I'm Jaeman An, and I'm in the team A namespace With 4 external ports With the abcd.aitrics.com hostname With the latest GPU tensorflow image With 100GB writable volumes and data from a readable source Kubernetes OK, here you are No, you have no permission No, you've already used all the resources you can No, there are no idle resources, please wait (diagram: the same k8s master/minion layout as the previous slide)
  • 18. Kubernetes Workload & Services Pod Service Ingress Deployment ReplicationController ... Storage Class PersistentVolume PersistentVolumeClaim ... Workload Controllers Job CronJob ReplicaSet ReplicationController DaemonSet ... Namespace Role & Authorization Resource Quota <Objects> <Meta & Policies> (diagram: the same k8s master/minion layout as the previous slides)
  • 19. Kubernetes A Pod is the basic building block of Kubernetes ‒ the smallest and simplest unit in the Kubernetes object model that you create or deploy. A Pod represents a running process on your cluster.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-base
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
    command: ["nvidia-smi"]
Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/
  • 20. Kubernetes A Service is an abstraction which defines a logical set of Pods and a policy by which to access them - sometimes called a micro-service.
kind: Service
apiVersion: v1
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
Ref: https://kubernetes.io/docs/concepts/services-networking/service/
  • 21. Kubernetes Ingress exposes HTTP and HTTPS routes from outside the cluster to services within the cluster. Traffic routing is controlled by rules defined on the Ingress resource.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test-ingress
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - backend:
          serviceName: MyService
          servicePort: 80
Ref: https://kubernetes.io/docs/concepts/services-networking/ingress/
  • 22. Kubernetes A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator. It is a resource in the cluster just like a node is a cluster resource.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0003
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  nfs:
    path: /tmp
    server: 172.17.0.2
Ref: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
  • 23. Kubernetes A PersistentVolumeClaim (PVC) is a request for storage by a user. Claims can request specific size and access modes (e.g., can be mounted once read/write or many times read-only).
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 8Gi
Ref: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims
  • 24. Kubernetes A Job creates one or more Pods and ensures that a specified number of them successfully terminate. As pods successfully complete, the Job tracks the successful completions.
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
Ref: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
  • 25. Kubernetes Kubernetes supports multiple virtual clusters backed by the same physical cluster. These virtual clusters are called namespaces. They are intended for use in environments with many users spread across multiple teams, or projects.
$ kubectl get namespaces
NAME          STATUS   AGE
default       Active   1d
kube-system   Active   1d
kube-public   Active   1d
Ref: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
  • 26. Kubernetes A resource quota, defined by a ResourceQuota object, provides constraints that limit aggregate resource consumption per namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.nvidia.com/gpu: 1
Ref: https://kubernetes.io/docs/concepts/policy/resource-quotas/
  • 27. Kubernetes In Kubernetes, you must be authenticated (logged in) before your request can be authorized (granted permission to access). Kubernetes uses client certificates, bearer tokens, an authenticating proxy, or HTTP basic auth to authenticate API requests through authentication plugins. Ref: https://kubernetes.io/docs/reference/access-authn-authz/authentication/
  • 28. Kubernetes Role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within an enterprise.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
Ref: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
  • 29. Kubernetes Role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within an enterprise.
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Ref: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
  • 30. Model building & training phase - Building training farm from zero (step by step) - Polyaxon - Terraform
  • 31. We need to know GPU resource status without accessing our physical servers one by one. We want to easily use idle GPUs with the proper training datasets We have to control permissions on our resources and datasets We only want to focus on our research: building models, doing the experiments, ... not infrastructure!
./run-notebook tf-v12-gpu --gpu=4 --data=images_v1
./train tf-v12-gpu model.py --gpu=4 --data=images_v1
./ssh tf-v12-gpu --gpu=4 --data=images_v1 --expose-ports=4
RECAP: Our requirements
  • 34. Step 1. Install Kubernetes master on AWS Step 2. Install Kubernetes as nodes in physical servers Step 3. Run hello world training containers Step 4. RBAC Authorization & resource quota Step 5. Expand GPU servers on demand with AWS Step 6. Attach training data Step 7. Web dashboard or cli tools to run training container Step 8. With other tools (Polyaxon) Instructions
  • 35. There are several ways to install kubernetes Use kubeadm in this session. Other options: conjure-up, kops Network option: flannel (https://github.com/coreos/flannel) Server configuration that I've used in k8s master: AWS t3.large: 2 vCPUs, 8GB Memory Ubuntu 18.04, docker version 18.09 Step 1. Install Kubernetes master on AWS Ref: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
  • 36. Step 1. Install Kubernetes master on AWS
# Install kubeadm
# https://kubernetes.io/docs/setup/independent/install-kubeadm/
$ curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
$ cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
$ apt-get update
$ apt-get install -y kubelet kubeadm kubectl
Ref: https://kubernetes.io/docs/setup/independent/install-kubeadm/
  • 37. Step 1. Install Kubernetes master on AWS # Initialize with Flannel (https://github.com/coreos/flannel) $ kubeadm init --pod-network-cidr=10.244.0.0/16 Ref: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
  • 38. Step 1. Install Kubernetes master on AWS
# Initialize with Flannel (https://github.com/coreos/flannel)
$ kubeadm init --pod-network-cidr=10.244.0.0/16
Your kubernetes master has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config
You can now join any number of machines by running the following on each node as root:
  kubeadm join 172.31.30.194:6443 --token *** --discovery-token-ca-cert-hash ***
Ref: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
  • 39. Step 1. Install Kubernetes master on AWS
# Apply the Flannel pod network (https://github.com/coreos/flannel)
$ kubectl -n kube-system apply -f https://raw.githubusercontent.com/coreos/flannel/62e44c867a2846fefb68bd5f178daf4da3095ccb/Documentation/kube-flannel.yml
Ref: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/
  • 40. Step 1. Install Kubernetes master on AWS
# Install NVIDIA k8s-device-plugin
# https://github.com/NVIDIA/k8s-device-plugin
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
Ref: https://github.com/NVIDIA/k8s-device-plugin
  • 41. In this step, install nvidia-docker join the kubernetes master using the kubeadm join command install NVIDIA's k8s-device-plugin create the kubernetes dashboard to check resources Server configuration that I've used in the k8s node: 32 CPU cores, 128GB Memory 4 GPUs (Titan Xp), Driver version: 396.44 Ubuntu 16.04, docker version 18.09 Step 2. Install kubernetes as nodes in physical servers
  • 42. Step 2. Install kubernetes as nodes in physical servers
# Install nvidia-docker (https://github.com/NVIDIA/nvidia-docker)
# use the repo list matching the node's distribution (this node runs Ubuntu 16.04)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu16.04/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
$ apt-get update
$ apt-get install -y nvidia-docker2
Ref: https://github.com/NVIDIA/nvidia-docker
  • 43. Step 2. Install kubernetes as nodes in physical servers
# change docker default runtime to nvidia-docker
$ vi /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
$ systemctl restart docker
Ref: https://github.com/NVIDIA/nvidia-docker
  • 44. Step 2. Install kubernetes as nodes in physical servers # test nvidia-docker is successfully installed $ docker run --rm -it nvidia/cuda nvidia-smi Ref: https://github.com/NVIDIA/nvidia-docker
  • 45. Step 2. Install kubernetes as nodes in physical servers
# test nvidia-docker is successfully installed
$ docker run --rm -it nvidia/cuda nvidia-smi
(nvidia-smi output: NVIDIA-SMI 396.44, Driver Version 396.44, CUDA Version 10.0; GPU 0: Titan Xp; no running processes found)
Ref: https://github.com/NVIDIA/nvidia-docker
  • 46. Step 2. Install kubernetes as nodes in physical servers
# join the kubernetes master with kubeadm
$ kubeadm join 172.31.30.194:6443 --token *** --discovery-token-ca-cert-hash ***
  • 47. Step 2. Install kubernetes as nodes in physical servers
# join the kubernetes master with kubeadm
$ kubeadm join 172.31.30.194:6443 --token *** --discovery-token-ca-cert-hash ***
...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received
* The Kubelet was informed of the new secure connection details
Run 'kubectl get nodes' on the master to see this node join the cluster.
  • 48. Step 2. Install kubernetes as nodes in physical servers
# check that the node joined the cluster
# run this on the master
$ kubectl get nodes
  • 49. Step 2. Install kubernetes as nodes in physical servers
# check that the node (named 'stark') joined the cluster
# run this command on the master
$ kubectl get nodes
NAME             STATUS   ROLES    AGE   VERSION
ip-172-31-99-9   Ready    master   99d   v1.12.2
stark            Ready    <none>   99d   v1.12.2
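Once a GPU node has joined and the device plugin from step 1 is running, one way to confirm the GPUs are schedulable (run on the master; the count should match the node's 4 Titan Xps):
$ kubectl describe nodes | grep nvidia.com/gpu
  nvidia.com/gpu:  4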
  • 50. Step 2. Install kubernetes as nodes in physical servers
# create kubernetes dashboard
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v1.10.1/src/deploy/recommended/kubernetes-dashboard.yaml
$ kubectl proxy
Ref: https://github.com/kubernetes/dashboard
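With kubectl proxy running, dashboard v1.10 is typically reachable at the standard proxy path:
http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/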
  • 52. Write pod definition Run nvidia-smi with cuda image Train MNIST with tensorflow and save model in S3 Step 3. Run hello-world container
  • 53. Example: nvidia-smi
# run nvidia-smi in container
# pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
    command: ["nvidia-smi"]
  • 54. Example: nvidia-smi # create pod from definition $ kubectl create -f pod.yml
  • 55. Example: nvidia-smi # create pod from definition $ kubectl create -f pod.yml pod/gpu-pod created
  • 56. Example: nvidia-smi
# check the pod's logs
$ kubectl logs gpu-pod
(nvidia-smi output: NVIDIA-SMI 396.44, Driver Version 396.44, CUDA Version 10.0; GPU 0: Titan Xp; no running processes found)
  • 57. Example: MNIST
# train_mnist.py
import tensorflow as tf

def main(args):
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation=tf.nn.relu),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=args.epoch)
    model.evaluate(x_test, y_test)
    saved_model_path = tf.contrib.saved_model.save_keras_model(model, args.save_dir)
  • 58. Example: MNIST
# Dockerfile
FROM tensorflow/tensorflow:latest-gpu-py3
WORKDIR /train_demo/
COPY . /train_demo/
RUN pip --no-cache-dir install --upgrade awscli
ENTRYPOINT ["/train_demo/run.sh"]

# run.sh
python train_mnist.py --epoch 1
aws s3 sync saved_models/ $MODEL_S3_PATH
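Before the pod on the next slide can pull aitrics/train-mnist:1.0, the image has to be built and pushed; a minimal sketch, assuming push access to a registry under the aitrics namespace:
$ docker build -t aitrics/train-mnist:1.0 .
$ docker push aitrics/train-mnist:1.0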
  • 59. Example: MNIST
# pod definition
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: aitrics/train-mnist:1.0
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
    env:
    - name: MODEL_S3_PATH
      value: "s3://aitrics-model-bucket/saved_model"
  • 60. Example: MNIST # create pod from definition $ kubectl create -f pod.yml pod/gpu-pod created
  • 62. Now we have, Minimally working proof of concept Researchers can train on kubernetes with kubectl We still have to do, RBAC (Role based access control) between researchers, engineers, and outside collaborators. Training data & output volume attachment Researchers don't want to know what kubernetes is. They only need an instance which is accessible via SSH (with frameworks and training data), or a nice webview and jupyter notebook, or automatic hyperparameter searching... Summary
  • 63. Instructions: Create user (team) namespace Create user credentials with cluster CA key default CA key location: /etc/kubernetes/pki Create role and role binding with proper permissions Create resource quota per namespace References: https://docs.bitnami.com/kubernetes/how-to/configure-rbac-in-your-kubernetes- cluster/ https://kubernetes.io/docs/reference/access-authn-authz/rbac/ Step 4. Role Based Access Control & Resource Quota
  • 64. Step 4. Role Based Access Control & Resource Quota # create user (team) namespace $ kubectl create namespace team-a
  • 65. Step 4. Role Based Access Control & Resource Quota
# check that the namespace was created
$ kubectl get namespaces
NAME          STATUS   AGE
default       Active   99d
team-a        Active   4s
kube-public   Active   99d
kube-system   Active   99d
  • 66. Step 4. Role Based Access Control & Resource Quota
# create user credentials
$ openssl genrsa -out jaeman.key 2048
$ openssl req -new -key jaeman.key -out jaeman.csr -subj "/CN=jaeman/O=aitrics"
$ openssl x509 -req -in jaeman.csr -CA CA_LOCATION/ca.crt -CAkey CA_LOCATION/ca.key -CAcreateserial -out jaeman.crt -days 500
Ref: https://kubernetes.io/docs/reference/access-authn-authz/authentication/
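To actually use the signed certificate with kubectl, the credentials and a context can be registered in the kubeconfig. A sketch, assuming the cluster keeps kubeadm's default name 'kubernetes':
$ kubectl config set-credentials jaeman --client-certificate=jaeman.crt --client-key=jaeman.key
$ kubectl config set-context jaeman-context --cluster=kubernetes --namespace=team-a --user=jaeman
$ kubectl config use-context jaeman-context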
  • 67. Step 4. Role Based Access Control & Resource Quota
# create Role definition
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: team-a
  name: software-engineer-role
rules:
- apiGroups: ["", "extensions", "apps"]
  resources: ["deployments", "replicasets", "pods", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] # You can also use ["*"]
Ref: https://kubernetes.io/docs/reference/access-authn-authz/authentication/
  • 68. Step 4. Role Based Access Control & Resource Quota
# create RoleBinding definition
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: team-a
  name: jaeman-software-engineer-role-binding
subjects:
- kind: User
  name: jaeman
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: software-engineer-role
  apiGroup: rbac.authorization.k8s.io
Ref: https://kubernetes.io/docs/reference/access-authn-authz/authentication/
  • 69. Step 4. Role Based Access Control & Resource Quota
# create resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.nvidia.com/gpu: 1
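The quota only takes effect once it is created inside the team's namespace; assuming the definition above is saved as quota.yml:
$ kubectl apply -f quota.yml --namespace=team-a
$ kubectl describe quota compute-resources --namespace=team-a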
  • 70. Store the kubeadm join script in S3 Write userdata (instance bootstrap script) that installs kubeadm and nvidia-docker and joins the cluster Add an Auto Scaling Group Step 5. Expand GPU servers on AWS
  • 71. Step 5. Expand GPU servers on AWS
# save master join command in AWS S3
# s3://k8s-training-cluster/join.sh
kubeadm join 172.31.75.62:6443 --token *** --discovery-token-ca-cert-hash ***
  • 72. Step 5. Expand GPU servers on AWS
# userdata script file
# RECAP: install kubernetes as a node to join master (step 2)
# install kubernetes
apt-get install -y kubelet kubeadm kubectl
# install nvidia-docker
apt-get install -y nvidia-docker2
...
$(aws s3 cp s3://k8s-training-cluster/join.sh -)
  • 73. Step 5. Expand GPU servers on AWS
  • 74. Step 5. Expand GPU servers on AWS
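Slides 73-74 (not extracted here) presumably show the Auto Scaling setup in the AWS console; roughly the same can be scripted with the AWS CLI. A sketch, where the names, AMI ID, subnet, and instance type are placeholders:
$ aws autoscaling create-launch-configuration \
    --launch-configuration-name gpu-node-lc \
    --image-id ami-xxxxxxxx \
    --instance-type p3.2xlarge \
    --user-data file://userdata.sh
$ aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name gpu-node-asg \
    --launch-configuration-name gpu-node-lc \
    --min-size 0 --max-size 4 \
    --vpc-zone-identifier subnet-xxxxxxxx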
  • 75. Step 5. Expand GPU servers on AWS # check bootstrapping log $ tail -f /var/log/cloud-init-output.log
  • 76. Step 5. Expand GPU servers on AWS
# check bootstrapping log
$ tail -f /var/log/cloud-init-output.log
...
++ aws s3 cp s3://k8s-training-cluster/join.sh -
+ kubeadm join 172.31.75.62:6443 --token *** --discovery-token-ca-cert-hash ***
[preflight] Running pre-flight checks
[discovery] Trying to connect to API Server "172.31.75.62:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://172.31.75.62:6443"
[discovery] Requesting info from "https://172.31.75.62:6443" again to validate TLS against the pinned public key
...
  • 77. Initially store training data in S3 (with encryption) Option 1: Download training data when the pod starts training data is usually big the same training data is often used, so this would be very inefficient caching to host machine volumes --> disk gets occupied easily use a storage server and mount volumes from it! Option 2: Create NFS on AWS EC2 or a storage server (e.g. NAS) Sync all data with S3 Mount as Persistent Volume with ReadOnlyMany / ReadWriteMany Option 3: shared storage with s3fs https://icicimov.github.io/blog/virtualization/Kubernetes-shared-storage-with-S3-backend/ Step 6. Training data attachment
  • 78. Step 6. Training data attachment
# make nfs server on EC2 (or a physical storage server)
# https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-16-04
$ apt-get update
$ apt-get install nfs-kernel-server
$ mkdir /var/nfs -p
$ cat <<EOF > /etc/exports
/var/nfs 172.31.75.62(rw,sync,no_subtree_check)
EOF
$ systemctl restart nfs-kernel-server
  • 79. Step 6. Training data attachment
# define persistent volume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 3Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: <server ip>
    path: "/var/nfs"
  • 80. Step 6. Training data attachment
# define persistent volume claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 3Gi
  • 81. Step 6. Training data attachment
# mount volume in pod
apiVersion: v1
kind: Pod
metadata:
  name: pvpod
spec:
  volumes:
  - name: testpv
    persistentVolumeClaim:
      claimName: nfs-pvc
  containers:
  - name: test
    image: python:3.7.2
    volumeMounts:
    - name: testpv
      mountPath: /data/test
  • 82. Make scripts like ./kono ssh --image tensorflow/tensorflow --expose-ports 4 ./kono train --image tensorflow/tensorflow --entrypoint main.py . Create web dashboard Step 7. Web dashboard or cli tools to run training container
  • 83. Step 7. Web dashboard or cli tools to run training container # cli tool to use our cluster $ kono login
  • 84. Step 7. Web dashboard or cli tools to run training container # cli tool to use our cluster $ kono login
 
 Username: jaeman
 Password: [hidden]
  • 85. Step 7. Web dashboard or cli tools to run training container # cli tool to use our cluster $ kono train 
 --image tensorflow/tensorflow:latest-gpu 
 --gpu 1 
 --script train.py 
 --input-data /var/project-a-data/:/opt/project-a-data/ 
 --output-dir /opt/outputs/:./outputs/ 
 -- 
 --epoch=1 --checkpoint=/opt/outputs/ckpts/

  • 86. Step 7. Web dashboard or cli tools to run training container # cli tool to use our cluster $ kono train 
 --image tensorflow/tensorflow:latest-gpu 
 --gpu 1 
 --script train.py 
 --input-data /var/project-a-data/:/opt/project-a-data/ 
 --output-dir /opt/outputs/:./outputs/ 
 -- 
 --epoch=1 --checkpoint=/opt/outputs/ckpts/
 
 ...
 ...
 training completed!
 Sending output directory to s3... [>>>>>>>>>>>>>>>>>>>>>>>] 100%
 Pulling output directory to local... [>>>>>>>>>>>>>>>>>>>>>>>] 100%
 Check your directory ./outputs/
  • 87. Step 7. Web dashboard or cli tools to run training container # cli tool to use our cluster $ kono ssh 
 --image tensorflow/tensorflow:latest-gpu 
 --gpu 1 
 --expose-ports 4 
 --input-data /var/project-a-data/:/opt/project-a-data/

  • 88. Step 7. Web dashboard or cli tools to run training container # cli tool to use our cluster $ kono ssh 
 --image tensorflow/tensorflow:latest-gpu 
 --gpu 1 
 --expose-ports 4 
 --input-data /var/project-a-data/:/opt/project-a-data/
 
 ...
 ...
 ...
 
 Your container is ready!
 ssh [email protected] -p 31546
  • 89. Step 7. Web dashboard or cli tools to run training container # cli tool to use our cluster $ kono terminate-all --force
  • 90. Step 7. Web dashboard or cli tools to run training container # cli tool to use our cluster $ kono terminate-all --force
 
 terminate all your containers? [Y/n]: Y
  • 91. Step 7. Web dashboard or cli tools to run training container # cli tool to use our cluster $ kono terminate-all --force
 
 terminate all your containers? [Y/n]: Y
 
 ...
 ...
 ...
 Success!
  • 92. Step 7. Web dashboard or cli tools to run training container
  • 93. We are still working on it Check our improvements or contribute to us https://github.com/AITRICS/kono Step 7. Web dashboard or cli tools to run training container
  • 94. A platform for reproducing and managing the whole life cycle of machine learning and deep learning applications. https://polyaxon.com/ The most feasible tool for our training cluster Can be installed on kubernetes easily Step 8. Use other tools (polyaxon) Ref: https://www.polyaxon.com/
  • 95. Polyaxon usage
# Polyaxon usage
# Create a project
$ polyaxon project create --name=quick-start --description='Polyaxon quick start.'
# Initialize
$ polyaxon init quick-start
# Upload code and start experiments
$ polyaxon run -u
Ref: https://github.com/polyaxon/polyaxon
  • 98. Polyaxon is a platform for managing the whole lifecycle of large scale deep learning and machine learning applications, and it supports all the major deep learning frameworks such as Tensorflow, MXNet, Caffe, Torch, etc. Features Powerful workspace Reproducible results Developer-friendly API Built-in Optimization engine Plugins & integrations Roles & permissions Polyaxon Ref: https://docs.polyaxon.com/concepts/features/
  • 100. 1. Create project on polyaxon polyaxon project create --name=quick-start 2. Initialize the project polyaxon init quick-start 3. Create polyaxonfile.yml See next slide 4. Upload your code and start an experiment with it How to run my experiment on polyaxon?
  • 101. Polyaxon usage
# polyaxonfile.yml
version: 1
kind: experiment
build:
  image: tensorflow/tensorflow:1.4.1-py3
  build_steps:
  - pip3 install polyaxon-client
run:
  cmd: python model.py
Ref: https://docs.polyaxon.com/concepts/quick-start-internal-repo/
  • 102. Polyaxon usage
# model.py
# https://github.com/polyaxon/polyaxon-quick-start/blob/master/model.py
from polyaxon_client.tracking import Experiment, get_data_paths, get_outputs_path

data_paths = list(get_data_paths().values())[0]
mnist = input_data.read_data_sets(data_paths, one_hot=False)
experiment = Experiment()
...
estimator = tf.estimator.Estimator(
    get_model_fn(learning_rate=learning_rate, dropout=dropout, activation=activation),
    model_dir=get_outputs_path())
estimator.train(input_fn, steps=num_steps)
...
experiment.log_metrics(loss=metrics['loss'],
                       accuracy=metrics['accuracy'],
                       precision=metrics['precision'])
Ref: https://github.com/polyaxon/polyaxon-quick-start/blob/master/model.py
  • 103. Polyaxon usage
# Integrations in polyaxon
# Notebook
$ polyaxon notebook start -f polyaxon_notebook.yml
# Tensorboard
$ polyaxon tensorboard -xp 23 start
Ref: https://github.com/polyaxon/polyaxon
  • 104. How to? Make single file train.py that accepts 2 parameters learning rate - lr batch size - batch_size Update the polyaxonfile.yml with matrix Make experiment group Experiment group search algorithm grid search / random search / Hyperband / Bayesian Optimization https://docs.polyaxon.com/references/polyaxon-optimization-engine/ Experiment Groups - Hyperparameter Optimization Ref: https://docs.polyaxon.com/concepts/experiment-groups-hyperparameters-optimization/
  • 105. Experiment Groups - Hyperparameter Optimization
# polyaxonfile.yml
version: 1
kind: group
declarations:
  batch_size: 128
hptuning:
  matrix:
    lr:
      logspace: 0.01:0.1:5
build:
  image: tensorflow/tensorflow:1.4.1-py3
  build_steps:
  - pip install scikit-learn
run:
  cmd: python3 train.py --batch-size={{ batch_size }} --lr={{ lr }}
Ref: https://docs.polyaxon.com/concepts/experiment-groups-hyperparameters-optimization/
  • 106. Experiment Groups - Hyperparameter Optimization
# polyaxonfile_override.yml
version: 1
hptuning:
  concurrency: 2
  random_search:
    n_experiments: 4
  early_stopping:
  - metric: accuracy
    value: 0.9
    optimization: maximize
  - metric: loss
    value: 0.05
    optimization: minimize
Ref: https://docs.polyaxon.com/concepts/experiment-groups-hyperparameters-optimization/
  • 107. Instructions Install helm - kubernetes application manager Create polyaxon namespace Write your own config for polyaxon Run polyaxon with helm How to install polyaxon?
  • 108. How to install polyaxon? # install helm (kubernetes package manager) $ snap install helm --classic $ helm init Ref: https://github.com/polyaxon/polyaxon
  • 109. How to install polyaxon? # install polyaxon with helm $ kubectl create namespace polyaxon $ helm repo add polyaxon https://charts.polyaxon.com $ helm repo update Ref: https://github.com/polyaxon/polyaxon
  • 110. How to install polyaxon?
# config.yaml
rbac:
  enabled: true
ingress:
  enabled: true
serviceType: LoadBalancer
persistent:
  data:
    training-data-a-s3:
      store: s3
      bucket: s3://aitrics-training-data
    data-pvc1:
      mountPath: "/data-pvc/1"
      existingClaim: "data-pvc-1"
  outputs:
    devtest-s3:
      store: s3
      bucket: s3://aitrics-dev-test
integrations:
  slack:
  - url: https://hooks.slack.com/services/***/***
    channel: research-feed
Ref: https://github.com/polyaxon/polyaxon
  • 111. How to install polyaxon? # install polyaxon with helm $ helm install polyaxon/polyaxon 
 --name=polyaxon 
 --namespace=polyaxon 
 -f config.yml
  • 112. How to install polyaxon? # install polyaxon with helm $ helm install polyaxon/polyaxon 
 --name=polyaxon 
 --namespace=polyaxon 
 -f config.yml
1. Get the application URL by running these commands:
export POLYAXON_IP=$(kubectl get svc --namespace polyaxon polyaxon-polyaxon-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export POLYAXON_HTTP_PORT=80
export POLYAXON_WS_PORT=80
echo http://$POLYAXON_IP:$POLYAXON_HTTP_PORT
2. Set up your cli by running these commands:
polyaxon config set --host=$POLYAXON_IP --http_port=$POLYAXON_HTTP_PORT --ws_port=$POLYAXON_WS_PORT
  • 113. Summary (architecture diagram: a training farm of GPU nodes, physical servers plus auto-scaling AWS EC2, joined to a Kubernetes master; storage via S3, NAS, and NFS; a control plane of kono-cli, kono-web, and Polyaxon on single or multiple EC2 instances, exposed through k8s Services, Ingress, and an ELB; RBAC and resource quotas applied per namespace)
  • 114. Need to know GPU resource status without accessing our physical servers one by one. Use the web dashboard or other monitoring tools like Prometheus + cAdvisor Want to easily use idle GPUs with the proper training datasets Use kubernetes objects to get resources and to mount volumes Have to control permissions on our resources and datasets RBAC / Resource quota in kubernetes Want to focus on our research: building models, doing the experiments, ... not infrastructure! Use kono / polyaxon RECAP: Our requirements
  • 115. Make it a reusable component Use Terraform Too many steps to build my own cluster!
  • 116. Infrastructure as code Terraform
  • 117. Infrastructure as code Terraform
resource "aws_instance" "master" {
  ami                  = "ami-593801f1"
  instance_type        = "t3.small"
  key_name             = "aitrics-secret-master-key"
  iam_instance_profile = "kubernetes-master-iam-role"
  user_data            = "${data.template_file.master.rendered}"

  root_block_device {
    volume_size = "15"
  }
}
$ terraform apply
  • 118. Infrastructure as code Terraform
resource "aws_instance" "master" {
  ami                  = "ami-593801f1"
  instance_type        = "t3.small"
  key_name             = "aitrics-secret-master-key"
  iam_instance_profile = "kubernetes-master-iam-role"
  user_data            = "${data.template_file.master.rendered}"

  root_block_device {
    volume_size = "15"
  }
}
  • 119. We publish our infrastructure as code https://github.com/AITRICS/kono Configure your settings and just type `terraform apply` to get your own training cluster! Terraform
  • 120. Model deployment & production phase - Building inference farm from zero (step by step)
 - Several ways to make microservices
 - Kubeflow
  • 121. It's hard to control because it sits in the middle of machine learning engineering and software engineering We want to create simple micro-services that don't need much management There are many models with different purposes; 
 - some models need real-time inference
 - some models do not require real time, but they need inference within a certain time range We have to consider high availability configuration Models must be fitted and re-trained easily We have to manage several versions of models RECAP: Our requirements
  • 122. Step 1. Build another kubernetes cluster for production Step 2. Make simple web-based micro services for trained models 2-1. HTTP API Server Example 2-2. Asynchronous inference farm example Step 3. Deploy 3-1. on the kubernetes with ingress 3-2. standalone server with docker and auto scaling group Step 4. Using TensorRT Inference Server Step 5. Terraform Case Study. Kubeflow Instructions
  • 123. Launch it again, just like the training cluster! Step 1. Build production kubernetes cluster
  • 124. 2-1. For real time inference (synchronous) Use a simple web framework to build an HTTP-based microservice! We use bottle (or flask) 2-2. For asynchronous inference (inference farm) with kubernetes Jobs - has overhead per execution with celery - which I prefer Step 2. Make simple web-based microservices for trained models
  • 125. Example. Using bottle for HTTP based microservices
from bottle import run, get, post, request, response
from bottle import app as bottle_app
from aws import aws_client

@post('/v1/<location>/<prediction_type>/')
def inference(location, prediction_type):
    model = select_model(location, prediction_type)
    input_array = deserialize(request.json)
    output_array = model.inference(input_array) # run the selected model
    return serialize(output_array)

if __name__ == '__main__':
    args = parse_args()
    aws_client.download_model(args.model_path, args.model_version)
    app = bottle_app()
    run(app=app, host=args.host, port=args.port)
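Once the service is running, a request could look like the following; the host, port, route values, and payload shape here are assumptions for illustration:
$ curl -X POST http://localhost:8000/v1/seoul/shape/ \
    -H 'Content-Type: application/json' \
    -d '{"input": [[0.1, 0.2, 0.3]]}'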
  • 126. Example. Using kubernetes job for inference
# job.yml
apiVersion: batch/v1
kind: Job
metadata:
  name: inference-job
spec:
  template:
    spec:
      containers:
      - name: inference
        image: inference
        command: ["python", "main.py", "s3://ps-images/images.png"]
      restartPolicy: Never
  backoffLimit: 4
Ref: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
  • 127. Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well. Celery is used in production systems to process millions of tasks a day. Celery
from celery import Celery

app = Celery('hello', broker='amqp://guest@localhost//')

@app.task
def hello():
    return 'hello world'
Ref: http://www.celeryproject.org/
  • 128. Example. Using celery for asynchronous inference farm
from celery import task
from aws import aws_client
from db import IdentifyResult
from aitrics.models import FasterRCNN

model = FasterRCNN(model_path=settings.MODEL_PATH)

@task
def task_identify_image_color_shape(id, s3_path):
    image = aws_client.download_image(s3_path)
    color, shape = model.inference(image)
    IdentifyResult.objects.create(id, s3_path, color, shape)
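The web or API tier then only enqueues work. With Celery, calling the task's delay() method pushes it onto the broker; a usage sketch (the id and S3 path are made-up values):
task_identify_image_color_shape.delay(42, 's3://ps-images/images.png')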
  • 129. on the kubernetes cluster service & ingress to expose use a workload controller like deployments, replica sets, or replication controllers; don't use a pod by itself, to get high availability. on the AWS instance directly simple docker run example use auto scaling groups and load balancers with userdata Step 3. Deploy
  • 130. Step 3-1. Deploy on kubernetes cluster (ingress)
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: inference-ingress
spec:
  rules:
  - host: inference.aitrics.com
    http:
      paths:
      - backend:
          serviceName: MyInferenceService
          servicePort: 80
Ref: https://kubernetes.io/docs/concepts/services-networking/ingress/
  • 131. Step 3-1. Deploy on kubernetes cluster (deployment)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: ps-inference
        image: ps-inference:latest
        ports:
        - containerPort: 80
Ref: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
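The ingress above routes traffic to a Service. A matching definition for this deployment might look like the sketch below; note that real Kubernetes object names must be lowercase DNS labels, so the slide's MyInferenceService would in practice be written like my-inference-service in both the Service and the ingress backend:
apiVersion: v1
kind: Service
metadata:
  name: my-inference-service
spec:
  selector:
    app: inference # matches the deployment's pod label
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80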
  • 132. Step 3-2. Deploy on EC2 directly
#!/bin/bash
docker kill ps-inference || true
docker rm ps-inference || true
docker run -d -p 35000:8000 --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  docker-registry.aitrics.com/ps-inference:gpu \
  --host=0.0.0.0 --port=8000 \
  --sentry-dsn=http://[email protected]/13 \
  --gpus=0 \
  --character-model=best_model.params/faster_rcnn_renet101_v1b \
  --shape-model=scnet_shape.params/ResNet50_v2 \
  --color-model=scnet_color.params/ResNet50_v2 \
  --s3-bucket=aitrics-research \
  --s3-path=faster_rcnn/result/181109 \
  --model-path=.data/models \
  --aws-access-key=*** \
  --aws-secret-key=***
  • 133. TensorRT is a high-performance deep learning inference optimizer and runtime engine for production deployment of deep learning applications. Step 4. Using TensorRT Inference Server Ref: https://developer.nvidia.com/tensorrt
  • 134. Use Tensorflow or Caffe to apply TensorRT easily Consider TensorRT when you build model Some operations might not be supported Add some TensorRT related code in Python script Use TensorRT docker image to run inference server. Step 4. Using TensorRT Inference Server
  • 135. Step 4. Using TensorRT Inference Server
# TensorRT From ONNX with Python Example
import tensorrt as trt

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open(model_path, 'rb') as model:
        parser.parse(model.read())
    ...
Ref: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#import_onnx_python
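The parsed network still has to be built into an engine before it can serve requests. A continuation sketch against the TensorRT 5.x Python API; the batch size and workspace size are arbitrary choices, not values from the deck:
    builder.max_batch_size = 1
    builder.max_workspace_size = 1 << 30 # 1 GiB of workspace for tactic selection
    engine = builder.build_cuda_engine(network) # compile the optimized engine
    with open('model.plan', 'wb') as f:
        f.write(engine.serialize()) # serialized engine for later loading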
  • 136. Step 4. Using TensorRT Inference Server
# Dockerfile
# https://github.com/NVIDIA/tensorrt-inference-server/blob/master/Dockerfile
FROM aitrics/tensorrt-inference-server:cuda9-cudnn7-onnx
ADD . /ps-inference/
ENTRYPOINT ["/ps-inference/run.sh"]
Ref: https://github.com/onnx/onnx-tensorrt/blob/master/Dockerfile
  • 137. You can also find our inference cluster as a code! https://github.com/AITRICS/kono Configure your settings and test example microservices and inference farm with terraform! Step 5. Terraform
  • 138. The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. https://www.kubeflow.org/ When to use You want to train/serve TensorFlow models in different environments (e.g. local, on prem, and cloud) You want to use Jupyter notebooks to manage TensorFlow training jobs You want to launch training jobs that use resources ‒ such as additional CPUs or GPUs ‒ that aren’t available on your personal computer You want to combine TensorFlow with other processes For example, you may want to use tensorflow/agents to run simulations to generate data for training reinforcement learning models. Case Study. Kubeflow Ref: https://www.kubeflow.org/
  • 139. Re-define a machine learning workflow object with kubernetes objects Run training, inference, serving, and other things on kubernetes Needs ksonnet, a configuration management tool for kubernetes manifests https://www.kubeflow.org/docs/components/ksonnet/ Only works well with tensorflow (support for PyTorch, MPI, MXNet is in alpha/beta stage) Some functions only work on GKE clusters Very early-stage product (less than 1 year old) Case Study. Kubeflow
  • 140. TF Job
# TF Job
# https://www.kubeflow.org/docs/components/tftraining/
apiVersion: kubeflow.org/v1beta1
kind: TFJob
metadata:
  labels:
    experiment: experiment10
  name: tfjob
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Ps:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            ...
Ref: https://www.kubeflow.org/docs/components/tftraining/
  • 143. You can build your own training cluster! You can also build your own inference cluster! If you do not want to get your hands dirty, you can use our terraform code and cli. https://github.com/AITRICS/kono Summary
  • 145. Monitoring resources Prometheus + cAdvisor https://devopscube.com/setup-prometheus-monitoring-on-kubernetes/ Training models from real-time data streaming Kafka Streams (+ Spark Streaming) + online learning https://github.com/kaiwaehner/kafka-streams-machine-learning-examples Large-scale data preprocessing Apache Spark What's next (topics not covered in this talk)?
  • 146. Distributed training Polyaxon supports it: https://github.com/polyaxon/polyaxon-examples/blob/master/in_cluster/tensorflow/cifar10/polyaxonfile_distributed.yml Use horovod: https://github.com/horovod/horovod Model & Data Versioning https://github.com/iterative/dvc What's next (topics not covered in this talk)?
  • 147. [email protected] Tel. +82 2 569 5507 Fax. +82 2 569 5508 www.aitrics.com Thank you! Jaeman An <[email protected]> Contact: Jaeman An <[email protected]> Yongseon Lee <[email protected]> Tony Kim <[email protected]>