bit.ly/kubemaster1
GPU Enablement for Data Science on OpenShift
Pete MacKinnon
Red Hat AI Center of Excellence
@pdmackinnon
● pmackinn@redhat.com
● Principal Engineer in the Red Hat AI Center of Excellence
● Kubeflow committer since project formation
● Open Data Hub and NVIDIA GPU Operator contributor
● KubeCon, TensorFlow World, GTC, ODSC, OpenShift Commons, and SCaLE 17x presenter
● Technical Editor for upcoming Kubeflow publication
● Co-author of “Linux Unleashed”
● Thirty years of distributed computing consulting and engineering experience
Agenda
• Data science: data and models
• AI/ML lifecycle: training to inference
• Scalars, vectors, and tensors
• CPU and GPU
• Notebooks and frameworks
• The OpenShift GPU operator “family”
• The components of GPU enablement
• Installation and demo
Data
Models
The AI/ML lifecycle
Inference/Serving
Training
Data collection
Feature extraction
Labeling
Monitoring
Logging
Analysis
Transformation
Validation
Splitting
Model validation
Hyperparameter tuning
Algorithm selection or development
Model
Data and Model in Production
Data
Scalars, vectors, and tensors
Scalar - a real number whose magnitude measures something: volume, density, speed, energy, mass, time, etc.
Vector - a one-dimensional array of scalars: force, velocity, momentum, etc.
Tensor - a higher-order algebraic object that could be a scalar, a vector, a multidimensional array, a multilinear map, etc.
Modern CPUs have advanced instruction sets for vector algebra, but modern GPUs are built specifically to perform complex tensor operations with a high degree of parallelism.
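The scalar, vector, and tensor definitions can be illustrated with nested Python lists. This is only a sketch: the `rank` helper and the sample values are hypothetical, and a real notebook would more likely use a framework's own array type.

```python
def rank(x):
    """Count nested list levels: 0 for a scalar, 1 for a vector, and so on."""
    depth = 0
    while isinstance(x, list):
        depth += 1
        x = x[0]  # assumes non-empty, regularly shaped lists
    return depth


speed = 12.5                                 # scalar: a single magnitude
velocity = [3.0, 4.0, 0.0]                   # vector: one-dimensional array of scalars
image = [[[0, 0, 0], [255, 255, 255]],
         [[128, 128, 128], [64, 64, 64]]]    # rank-3 tensor: height x width x channels

for name, value in [("speed", speed), ("velocity", velocity), ("image", image)]:
    print(name, "has rank", rank(value))
```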
Scalars, vectors, and tensors
How many matrix multiplications can be done in one clock cycle?
Image: https://iq.opengenus.org/
10¹ 10⁴ 10⁵
So, in one clock cycle...
CPU (scalar)
CPU/GPU (vector)
GPU (tensor)
Or, DL with real world data...
Object (scalar)
Movement (vector)
Classification, velocity, bearing, and much more (tensor)
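To see why matrix work rewards parallel hardware, a naive matrix multiply can count its own scalar multiply-adds. This is a plain-Python sketch with hypothetical names, not how a GPU library implements the operation. It shows that every output element is computed independently of the others, so the work can in principle be spread across many cores.

```python
def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix and count the multiply-adds."""
    n, k, m = len(a), len(b), len(b[0])
    ops = 0
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # c[i][j] depends only on row i of a and column j of b,
            # so each (i, j) pair could be handled by a separate core.
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
                ops += 1
    return c, ops


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product, ops = matmul(a, b)
print(product)  # [[19.0, 22.0], [43.0, 50.0]]
print(ops)      # 8 multiply-adds for a 2 x 2 product (n * k * m)
```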
CPU and GPU
NVIDIA Ampere A100
• 6912 FP32 CUDA Cores
• 432 Gen3 Tensor Cores
but
• FP32 -> 19.5 TFLOPS
AMD EPYC 7702 (Rome)
• 64 CPU Cores
• 128 Threads
• 2.0GHz Base Clock
• FP32 -> 1-2 TFLOPS
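The 19.5 TFLOPS figure can be roughly reproduced from the core count. The sketch below rests on two assumptions that are not on the slide: each CUDA core retires one fused multiply-add (two floating-point operations) per cycle, and the GPU runs at a boost clock of about 1.41 GHz. The CPU range is taken from the slide as-is.

```python
CUDA_CORES = 6912               # from the slide
FLOPS_PER_CORE_PER_CYCLE = 2    # assumption: one fused multiply-add = 2 operations
BOOST_CLOCK_HZ = 1.41e9         # assumption: approximate A100 boost clock

gpu_peak_tflops = CUDA_CORES * FLOPS_PER_CORE_PER_CYCLE * BOOST_CLOCK_HZ / 1e12

CPU_TFLOPS_LOW, CPU_TFLOPS_HIGH = 1.0, 2.0   # FP32 range quoted on the slide

print(round(gpu_peak_tflops, 1))                    # 19.5
print(round(gpu_peak_tflops / CPU_TFLOPS_HIGH, 1))  # lower bound on the FP32 ratio
print(round(gpu_peak_tflops / CPU_TFLOPS_LOW, 1))   # upper bound on the FP32 ratio
```

On these numbers the peak FP32 advantage is roughly 10x to 20x, before Tensor Cores are taken into account.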
A GPU notebook
Profit
380x speedup over CPU in basic CNN smoke test (Intel Xeon E5-2686 vs. NVIDIA V100-SXM2-16GB)
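Before timing anything in a notebook, it helps to confirm that the framework can see the GPU at all. The sketch below assumes TensorFlow as the framework, which the slides do not state, and the `visible_gpus` helper is hypothetical. It returns an empty list when TensorFlow is not installed or no GPU is exposed to the pod.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or [] if none are available."""
    try:
        import tensorflow as tf  # assumption: TensorFlow is the notebook framework
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
print("GPUs visible to the notebook:", gpus if gpus else "none")
```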
GPU operators
Special Resource Operator (SRO)
● Community operator
● Reference implementation for other specialized hardware
○ NIC, FPGA
● Provided the code basis for the NVIDIA GPU Operator
● Deployed from OperatorHub
NVIDIA GPU Operator
● Certified and supported on OpenShift by NVIDIA and Red Hat
● Can be deployed from embedded OperatorHub or with Helm
Both operators require node feature discovery (NFD)
NVIDIA also provides the GPU feature operator for enhanced labeling
Operator components
• Container-runtime-toolkit: The NVIDIA GPU Operator supports Docker and CRI-O container runtimes. This DaemonSet ensures the correct runtime setup for the GPU hook.
• Driver: A container deployed as a DaemonSet that holds all user-space and kernel-space software needed to make the GPU device work.
• Device plugin: A DaemonSet that monitors the health and availability of the GPUs on the node. Vital for pod scheduling.
• DCGM: Data Center GPU Manager - a node exporter that captures GPU metrics for use by Prometheus.
nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: "true"
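The node selector above matches the label that node feature discovery sets on nodes with a PCI device from vendor 10de (NVIDIA). A workload might combine it with a request for the `nvidia.com/gpu` resource advertised by the device plugin. The following is a minimal sketch that builds such a pod manifest as JSON. The pod name and image are hypothetical placeholders, and the `gpu_pod_manifest` helper is not part of any operator.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a pod manifest that targets NVIDIA nodes and requests GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Label from the slide: set by NFD on nodes with an NVIDIA PCI device.
            "nodeSelector": {"feature.node.kubernetes.io/pci-10de.present": "true"},
            "containers": [{
                "name": name,
                "image": image,
                # Extended resource exposed by the NVIDIA device plugin.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


# Hypothetical name and image, for illustration only.
manifest = gpu_pod_manifest("gpu-smoke-test", "registry.example.com/cnn-smoke-test:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with the cluster CLI, since manifests are accepted in JSON as well as YAML.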
Installation
Demo
Thank You

GPU enablement for data science on OpenShift | DevNation Tech Talk
