How Coupang Leverages Distributed Cache to Accelerate ML Model Training
April 22, 2025
Hyun Jung Baek, Staff Backend Engineer @ Coupang
Coupang Confidential and Proprietary
Coupang is a Technology and Fortune 200 Company (NYSE: CPNG)
Coupang is a technology and Fortune 200 company listed on the New York Stock Exchange (NYSE: CPNG) that provides retail, restaurant delivery, video streaming, and fintech services to customers around the world under brands that include Coupang, Coupang Eats, Coupang Play and Farfetch.
Machine Learning Impacts Every Aspect of Commerce Experiences of Coupang Customers
Product Catalog, Search, Pricing, Robotics, Inventory, Fulfilment
Meet Coupang’s Machine Learning Platform: https://medium.com/coupang-engineering/meet-coupangs-machine-learning-platform-cd00e9ccc172
Coupang’s ML Platform Overview
Core offerings
• Notebooks & ML Pipeline Authoring
• Model Training
• Model Inference
• Monitoring & Observability
Hybrid & Multi-Region Compute & Storage Due to GPU Shortage
Both AWS Multi-Region & On-prem GPU Clusters
● Cloud GPU clusters across AWS Asia-Pacific & US regions
● On-prem data center (compute & storage)
Requirements
● Resource efficiency
○ GPU utilization
● High I/O throughput
● Developer experience
● Cloud cost optimization
[Figure: Monitoring GPU utilization of Training cluster]
Previous Architecture
[Diagram: data is copied ("Data Copy") from the Data Lake into the Local Storage attached to each GPU Training Cluster. Locations shown: On Prem, ap-region, us-region.]
Challenges of the Previous Architecture
● Required a preparation step (copy and validation) before training jobs
○ Added a day-long delay before training on a dataset
● Challenges in fully utilizing GPU resources across regions
○ Difficult to run overflow training jobs in a different region when the local cluster is at peak load, as the data may not be available or may exist under different paths
● Data silos and growing storage costs
● Operational overhead to keep storage organized and under capacity
○ Required coordination across teams to manage and maintain local storage
New Architecture with Distributed Cache
[Diagram: each GPU Training Cluster reads through a Distributed Cache; the Data Lake is accessed only on a cache miss. Locations shown: On Prem, ap-region, us-region.]
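The "only on cache miss" behavior in the diagram is the classic read-through pattern. The following is a minimal sketch of that pattern, not Coupang's implementation: the class name `ReadThroughCache`, the `lake_reads` counter, and the example path are hypothetical, and an in-memory dict stands in for the data lake.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy read-through cache: the data lake is contacted only on a miss."""

    def __init__(self, fetch_from_lake: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_lake
        self._store: Dict[str, bytes] = {}
        self.lake_reads = 0  # counts how often the data lake was hit

    def read(self, path: str) -> bytes:
        if path not in self._store:
            # Cache miss: fetch from the data lake and keep a copy.
            self.lake_reads += 1
            self._store[path] = self._fetch(path)
        # Cache hit (or freshly filled entry): serve from the cache.
        return self._store[path]


# Hypothetical stand-in for the data lake.
lake = {"s3://example-lake/train/part-0": b"abc"}
cache = ReadThroughCache(lambda p: lake[p])

first = cache.read("s3://example-lake/train/part-0")   # miss, goes to the lake
second = cache.read("s3://example-lake/train/part-0")  # hit, served locally
```

After the two reads, `cache.lake_reads` is 1: the second read never leaves the cache, which is why repeated epochs over the same dataset stop paying the cross-region fetch cost.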
Inside Distributed Cache
[Diagram: a Training Job Pod sends I/O requests through hostpath: /mnt/cache-fuse to the FUSE Pods. The Distributed Cache Service consists of Worker Pods and etcd Pods; etcd holds the Mount Table & Membership. The Data Lake is read on a Cache Miss.]
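To make the roles of the mount table and the membership list concrete, here is a small sketch under stated assumptions. The slide only says that etcd holds a mount table and membership; the mount table format, the bucket name `s3://example-lake`, the worker names, and the use of rendezvous hashing to choose a worker are all illustrative assumptions, not details from the talk.

```python
import hashlib
from typing import Dict, List

FUSE_ROOT = "/mnt/cache-fuse"  # hostpath shown on the slide


def resolve(fuse_path: str, mount_table: Dict[str, str]) -> str:
    """Translate a path under the FUSE mount into its data lake address."""
    if not fuse_path.startswith(FUSE_ROOT + "/"):
        raise ValueError(f"{fuse_path} is not under {FUSE_ROOT}")
    rel = fuse_path[len(FUSE_ROOT):]  # e.g. "/datasets/train/part-0.parquet"
    # Try the longest mount point first so nested mounts take precedence.
    for mount_point in sorted(mount_table, key=len, reverse=True):
        prefix = mount_point.rstrip("/")
        if rel == prefix or rel.startswith(prefix + "/"):
            return mount_table[mount_point].rstrip("/") + rel[len(prefix):]
    raise KeyError(f"no mount covers {fuse_path}")


def pick_worker(address: str, workers: List[str]) -> str:
    """Rendezvous hashing (an assumption): highest hash score owns the path."""
    if not workers:
        raise ValueError("no live workers")

    def score(worker: str) -> str:
        return hashlib.sha256(f"{worker}|{address}".encode()).hexdigest()

    return max(workers, key=score)


# Hypothetical contents of the mount table and membership list.
mount_table = {"/datasets": "s3://example-lake/datasets"}
workers = ["worker-0", "worker-1", "worker-2"]

address = resolve("/mnt/cache-fuse/datasets/train/part-0.parquet", mount_table)
owner = pick_worker(address, workers)
```

With this scheme every FUSE pod that sees the same membership list picks the same worker for a given file, and removing a worker that does not own the file leaves the assignment unchanged.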
New Architecture: Benefits for Model Developers
● Instant Data Availability
○ Eliminates lengthy data preparation
■ Training jobs can start immediately without waiting for data to be cached
○ Model developers can still pre-load datasets using the --skip-if-exists flag
■ If already cached, this step is a no-op
○ No coordination required across teams, simplifying the workflow
● Improved GPU Utilization Across Regions
○ Maintains a consistent view of all data paths from the original data lake address, enabling seamless access across regions
○ During peak GPU hours, developers can submit training jobs, unmodified, to an overflow GPU cluster, ensuring higher GPU utilization across multiple regions
● Faster Training Jobs
○ Provides higher performance compared to traditional HPC storage solutions (e.g., AWS FSx), significantly reducing training time and boosting productivity
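The pre-load semantics described on this slide can be sketched in a few lines. This only illustrates what "skip if exists" means, not the actual tool behind the --skip-if-exists flag: the function `preload`, its `skip_if_exists` parameter, and the dicts standing in for the cache and the data lake are hypothetical.

```python
from typing import Dict, Iterable, List


def preload(
    paths: Iterable[str],
    cache: Dict[str, bytes],
    lake: Dict[str, bytes],
    skip_if_exists: bool = True,
) -> List[str]:
    """Copy the given paths into the cache; return the paths actually loaded."""
    loaded: List[str] = []
    for path in paths:
        if skip_if_exists and path in cache:
            continue  # already cached: nothing to do for this path
        cache[path] = lake[path]
        loaded.append(path)
    return loaded


# Hypothetical data lake contents and an initially empty cache.
lake = {
    "s3://example-lake/train/part-0": b"aaaa",
    "s3://example-lake/train/part-1": b"bbbb",
}
cache: Dict[str, bytes] = {}

first_run = preload(lake, cache, lake)   # loads both files
second_run = preload(lake, cache, lake)  # no-op: everything is cached
```

Because the second call returns an empty list, a pipeline can always include the pre-load step without checking first whether someone else already warmed the cache.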
New Architecture: Benefits for Platform Engineers
● Reduced Storage Costs & Operational Overhead
○ Avoids buying storage for full copies by eliminating duplicates of data lake datasets
■ Data lake (many PBs) vs cache capacity (TB to PB)
○ No coordination required for cache space cleanup
● Easy Expansion & Operation
○ Seamlessly scales the architecture to new GPU clusters without complex reconfiguration
○ Fully managed with Kubernetes (K8s) for simplified deployment, scaling, and maintenance across environments
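A cache that is much smaller than the data lake needs some automatic eviction for cleanup to require no coordination. The sketch below uses least-recently-used (LRU) eviction as one plausible policy; the talk does not state which policy is used, and the class `BoundedCache` and the byte sizes are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class BoundedCache:
    """Capacity-bounded cache that evicts least-recently-used entries."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.used = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, path: str) -> Optional[bytes]:
        if path not in self._items:
            return None
        self._items.move_to_end(path)  # mark as most recently used
        return self._items[path]

    def put(self, path: str, data: bytes) -> None:
        if len(data) > self.capacity:
            raise ValueError("object larger than total cache capacity")
        if path in self._items:
            # Replacing an entry: release its old size first.
            self.used -= len(self._items.pop(path))
        while self.used + len(data) > self.capacity:
            # Evict from the least-recently-used end until the object fits.
            _, evicted = self._items.popitem(last=False)
            self.used -= len(evicted)
        self._items[path] = data
        self.used += len(data)


small = BoundedCache(capacity_bytes=10)
small.put("a", b"1111")
small.put("b", b"2222")
small.get("a")           # "a" becomes most recently used
small.put("c", b"3333")  # does not fit, so "b" is evicted
```

Here "b" is dropped without anyone deleting it by hand, while "a" and "c" stay cached and `small.used` is 8 bytes.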
THANK YOU
Copyright © 2024 Coupang, Inc. All rights reserved. All Coupang trademarks, Coupang logos and service marks displayed herein are property of Coupang, Inc. and/or its affiliates (collectively, "Coupang"),
registered in the U.S. and other countries. Any other company mentioned herein is merely for identification purposes. Coupang acknowledges that the company name may be a registered trademark of
the company and recognizes that any such trademark is owned solely and exclusively by such company. The information contained herein are based on the author, Hyun Jung Baek's own individual
experience as an employee and are not representative of any views or opinions of Coupang. Coupang has not verified, and it makes no representation as to, the adequacy, fairness, accuracy, or
completeness of any information contained herein.