From Idle to Optimal:
Maximize GPU Utilization
for Model Training
Dr. Beinan Wang
Senior Staff Engineer @ Alluxio
Trino Contributor
PrestoDB Committer
Tarik Bennett
Senior Solutions Engineer @ Alluxio
Source: https://www.wsj.com/articles/rush-to-use-generative-ai-pushes-companies-to-get-data-in-order
“Training large language models requires ready
access to vast amounts of data, whose storage,
processing, and protection can be costly.”
High Scalability
Training on billions of files
ESSENTIAL
High Availability
99.99%
ESSENTIAL
High Performance
Higher GPU utilization
ESSENTIAL
Always Increasing Expectations…
Icons created by kerismaker, HJ Studio - Flaticon
What Does Managing Data Involve?
Data Preprocessing
Improving the quality and reliability
of the data for model training
Model Training
Reading training data, vision (image) or
NLP/LLM (text), for deep learning on GPUs
Model Deployment
Consumption of trained models for
online or offline inference
Feature Engineering
Selecting relevant and informative
features from raw data
[Diagram: compute engines used across the stages (Spark | Trino | Presto; Spark; PyTorch | TensorFlow | Spark), with curated data, training data, models, and results passed between them]
Not discussed today
- Security
- Privacy (PII)
- Data Cleaning
- Data Pipelines
- Data Governance
Curated /
Processed Data
100,000,000,000,000,000,000,000
bytes (100 zettabytes) of data will be stored in the cloud by 2025
Source: Cybersecurity Ventures
Issues Managing Ultra-Large Datasets
Non-Functional Storage Requirements
High Performance
- Many options
Cost-Effective
- Commodity storage
10%
of your data is hot data
Source: Alluxio
Data Caching Helps
- Boost Performance
- Save Costs
- Prevent Network Congestion
- Offload Under Storage
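The effect of caching a small hot set can be sketched with a toy read-through cache. This is a minimal simulation, not Alluxio's implementation: the LRU policy, the `simulate` function, and the assumption that 90% of reads target the hot 10% of files are all illustrative choices that do not come from the slides.

```python
import random
from collections import OrderedDict


class LRUReadThroughCache:
    """Toy read-through cache with least-recently-used eviction."""

    def __init__(self, capacity, fetch_from_storage):
        self.capacity = capacity
        self.fetch_from_storage = fetch_from_storage  # slow path (under storage)
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.fetch_from_storage(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return value


def simulate(num_files=1000, hot_fraction=0.10, hot_read_share=0.90,
             num_reads=20000, seed=0):
    """Replay a skewed read pattern; return (hit_ratio, storage_reads)."""
    rng = random.Random(seed)
    hot_count = int(num_files * hot_fraction)
    storage_reads = []

    def fetch(key):
        # Hypothetical stand-in for a read against remote storage.
        storage_reads.append(key)
        return f"bytes-of-{key}"

    # Assumption: the cache is sized to hold exactly the hot set.
    cache = LRUReadThroughCache(capacity=hot_count, fetch_from_storage=fetch)
    for _ in range(num_reads):
        if rng.random() < hot_read_share:
            key = rng.randrange(hot_count)             # read a hot file
        else:
            key = rng.randrange(hot_count, num_files)  # read a cold file
        cache.read(key)
    return cache.hits / num_reads, len(storage_reads)


if __name__ == "__main__":
    hit_ratio, storage_reads = simulate()
    print(f"hit ratio: {hit_ratio:.2f}, reads reaching storage: {storage_reads}")
```

Every cache hit is a read that never reaches the under storage or crosses the network, which is the mechanism behind the benefits listed above.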
Data Caching at Uber Scale
3 Clusters, 1500 Nodes
Source: https://www.uber.com/blog/speed-up-presto-with-alluxio-local-cache/
50% Input Read Performance
10% Data Read Traffic to HDFS
Maximizing GPUs
Challenges as you try to scale
- GPUs are scarce
- GPUs are expensive
- Low GPU utilization
Addressing Low GPU Utilization with Caching
Architecture Overview
[Diagram: Online ML platform (inference cluster, models) and offline training platform (training cluster, Alluxio, training data), with numbered steps 1 to 5 moving training data and models between them]
AI Training Test with Alluxio
[Diagram: Remote Storage -> Alluxio -> Local Folder /dataset -> GPU Training, running on Kubernetes with an Interactive Notebook, the Alluxio Operator, and a Visualization Dashboard]
Test Setup
● Alluxio via Kubernetes - Provides caching for training data
● GPU server - AWS EC2/Kubernetes
● Deep learning algorithm (CV) - ResNet (one of the most popular CV algorithms)
● Deep learning framework - PyTorch
● Dataset - ImageNet subset (~35k images, each ~100 kB to 200 kB)
● Dataset storage - S3 (single region)
● Mounting - FUSE
● Visualization - TensorBoard
● Code execution - Jupyter notebook
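Because the dataset is exposed through a FUSE mount, training code can use ordinary file I/O on a local path. The sketch below indexes such a folder with the standard library only. The `root/<class_name>/<image>` layout, the helper names, and the demo folder are assumptions for illustration; in the test setup the root would be the mounted dataset folder rather than a temporary directory.

```python
import os
import tempfile

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def index_image_folder(root):
    """Return (samples, classes) for an assumed root/<class_name>/<image> layout."""
    classes = sorted(e.name for e in os.scandir(root) if e.is_dir())
    samples = []
    for label, class_name in enumerate(classes):
        class_dir = os.path.join(root, class_name)
        for name in sorted(os.listdir(class_dir)):
            if name.lower().endswith(IMAGE_EXTENSIONS):
                samples.append((os.path.join(class_dir, name), label))
    return samples, classes


def read_sample(path):
    """Read raw bytes with plain file I/O; a FUSE mount needs no special client."""
    with open(path, "rb") as f:
        return f.read()


def make_demo_folder(root, classes=("cat", "dog"), images_per_class=3):
    """Create a tiny fake dataset so the sketch runs without a real mount."""
    for class_name in classes:
        class_dir = os.path.join(root, class_name)
        os.makedirs(class_dir)
        for i in range(images_per_class):
            with open(os.path.join(class_dir, f"{i}.jpg"), "wb") as f:
                f.write(b"fake image bytes")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_demo_folder(root)
        samples, classes = index_image_folder(root)
        print(len(samples), "samples in classes", classes)
```

A list of `(path, label)` pairs like this is the usual starting point for a map-style dataset that a data loader iterates over each epoch.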
Training Test Steps
1. Loading the dataset into Alluxio
2. Running the training job
3. Reading the dataset from Alluxio through PyTorch DataLoader in each epoch
4. Visualizing the GPU utilization and other metrics
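One way to see how much of an epoch goes to data loading is to time the wait for each batch separately from the training step. The sketch below is a generic host-side timer, not the profiler used in the webinar; `profile_epoch`, `slow_batches`, and the sleep durations are hypothetical. With real GPU work the step would also need a device synchronization before its timer is read, since GPU kernels are launched asynchronously.

```python
import time


def profile_epoch(batches, train_step):
    """Return (load_seconds, compute_seconds) for one pass over `batches`."""
    load_seconds = 0.0
    compute_seconds = 0.0
    iterator = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)  # time blocked waiting for the next batch
        except StopIteration:
            break
        loaded = time.perf_counter()
        train_step(batch)           # time spent in the training step
        done = time.perf_counter()
        load_seconds += loaded - start
        compute_seconds += done - loaded
    return load_seconds, compute_seconds


def slow_batches(num_batches, delay):
    """Hypothetical loader that waits on slow storage before each batch."""
    for i in range(num_batches):
        time.sleep(delay)
        yield i


if __name__ == "__main__":
    load, compute = profile_epoch(
        slow_batches(num_batches=5, delay=0.02),
        train_step=lambda batch: time.sleep(0.005),  # stand-in for GPU work
    )
    share = load / (load + compute)
    print(f"share of time waiting on data: {share:.0%}")
```

The same loop works with any iterable of batches, so it can wrap a real data loader to compare runs with and without a cache in front of storage.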
Training Directly from Storage
- More than 80% of total time is spent in DataLoader
- This results in a low GPU utilization rate (<20%)
Visualization Dashboard Results (Control)
Visualization Dashboard Results (Alluxio)
Training with Alluxio
- Reduced the share of time spent in DataLoader from 82% to 1% (82X)
- Increased the GPU utilization rate from 17% to 93% (5X)
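The arithmetic behind these figures can be checked in a few lines. The percentages are the ones reported above; the simple bound, which assumes the GPU sits idle whenever training waits on DataLoader, is an illustrative model and not something stated in the slides.

```python
def improvement_factor(before, after):
    """Ratio between two measurements of the same quantity."""
    return before / after


def utilization_upper_bound(loader_share):
    """Assumed model: the GPU is idle while waiting on data, so
    utilization cannot exceed 1 minus the DataLoader share."""
    return 1.0 - loader_share


if __name__ == "__main__":
    # DataLoader share of total time: 82% -> 1%
    print(f"DataLoader share reduced {improvement_factor(82, 1):.0f}x")
    # GPU utilization: 17% -> 93% (the slide rounds this to 5X)
    print(f"GPU utilization increased {improvement_factor(93, 17):.1f}x")
    # Bound implied by the assumed model, before and after caching
    print(f"bound without cache: {utilization_upper_bound(0.82):.0%}")
    print(f"bound with cache:    {utilization_upper_bound(0.01):.0%}")
```

Under that assumption the bounds come out at 18% and 99%, which is consistent with the measured 17% and 93% utilization.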
Source: https://developer.nvidia.com/blog/accelerating-analytics-and-ai-with-alluxio-and-nvidia-gpus/
“The benefits from GPU acceleration are limited
if data access dominates the execution time.”
Thank You
twitter.com/alluxio slackin.alluxio.io
linkedin.com/alluxio
www.alluxio.io
JOIN THE CONVERSATION
ON SLACK
ALLUXIO.IO/SLACK

Alluxio Webinar - Maximize GPU Utilization for Model Training