Unified read-only cache proposal
Design goals & current status
• A standalone SSD caching library that can be reused by both librbd and RGW
• Current status:
• Librbd read-only cache: caching block contents on SSD
• Librbd parent/clone images: caching parent RBD contents on SSD; all cloned images can read from the parent image cache before COW happens
• PR will be sent out soon
• RGW immutable caching: caching rados objects on SSD
• A small CDN farm behind RGW cluster
• PR against Jewel is ready (#13144) but needs cleanup
General architecture
• Libcachefile: common library that handles reads and writes on SSD
• Sparse-file based cache
• Policy: controls cache promotion/demotion and cache sizing
• Simple LRU based
• librbd/librgw hooks: call the libcachefile API
[Figure: architecture diagram. Labels: RBD_0, RBD_1, RBD_2; RGW_civetweb (three instances); librbd; librgw; FileImageCache; RGW_DataCache; hooks and policy (one pair per side); libCacheStore; SSD; librbd/librados; RADOS.]
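The "simple LRU based" policy mentioned above can be illustrated with a short sketch. This is not the libcachefile API: the class name `LRUPolicy`, the method `touch`, and the block-count capacity are all assumptions made for illustration only.

```python
from collections import OrderedDict


class LRUPolicy:
    """Hypothetical LRU promotion/demotion policy (illustrative only)."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        # Keys are block ids; insertion order tracks recency (oldest first).
        self._blocks = OrderedDict()

    def touch(self, block_id):
        """Record an access to block_id.

        Returns the block id that must be demoted to make room,
        or None if nothing has to be evicted.
        """
        if block_id in self._blocks:
            # Already cached: just mark it as most recently used.
            self._blocks.move_to_end(block_id)
            return None
        # Not cached yet: promote it.
        self._blocks[block_id] = None
        if len(self._blocks) > self.capacity_blocks:
            evicted, _ = self._blocks.popitem(last=False)
            return evicted
        return None

    def __contains__(self, block_id):
        return block_id in self._blocks
```

In this sketch the policy only decides what to promote and demote; the actual reads and writes on the SSD would stay in the common library, which matches the split between "policy" and "libcachefile" described on the slide.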
Shared rbd read-only SSD cache
Shared Read-only cache for RBD – rbd clone flow
[Figure: RBD clone flow. RBD_0 (template image) → RBD_0@snap1 (protected snapshot) → RBD_1, RBD_2, …, RBD_N (cloned images). Annotation: "This is the shared image content."]
Shared Read-only cache for RBD -- overview
• There will be a shared cache (from the parent image) on each compute node
• Cloned images will read from the shared cache unless COW has happened
[Figure: overview. Each compute node holds local caches and a shared cache on an SSD backend and serves read I/O and write I/O; RADOS with its OSDs sits underneath.]
Shared Read-only cache for RBD -- Cache metadata
Each cloned image will have its own COW cache mapping:
- Each read hit is served either from the shared cache or from the image's own cache
- Cache mapping bits for COWed data
- Updated when COW happens
Mapping entry layout:
- 2 bits: state (not_in_cache, in_shared_cache, in_cache)
- 62 bits: block_id
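The 2-bit state plus 62-bit block_id layout fits in one 64-bit word. The sketch below shows one way to pack and unpack such an entry. Placing the state in the top two bits, the numeric values of the states, and the function names are assumptions; the slides only give the field widths and state names.

```python
# State values are assumed; the slides only name the three states.
NOT_IN_CACHE = 0
IN_SHARED_CACHE = 1
IN_CACHE = 2

BLOCK_ID_BITS = 62
BLOCK_ID_MASK = (1 << BLOCK_ID_BITS) - 1


def pack_entry(state, block_id):
    """Pack a 2-bit state and a 62-bit block id into one 64-bit integer."""
    if state not in (NOT_IN_CACHE, IN_SHARED_CACHE, IN_CACHE):
        raise ValueError("unknown cache state")
    if not 0 <= block_id <= BLOCK_ID_MASK:
        raise ValueError("block_id does not fit in 62 bits")
    # Assumed layout: state in the two most significant bits.
    return (state << BLOCK_ID_BITS) | block_id


def unpack_entry(entry):
    """Return (state, block_id) from a packed 64-bit entry."""
    return entry >> BLOCK_ID_BITS, entry & BLOCK_ID_MASK
```

With three states in use, one of the four 2-bit values remains free.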
Shared Read-only cache for RBD – IO flow (read)
[Figure: read path on a compute node. RBD_1 and RBD_2 (cloned) each use librbd with FileImageCache and have their own cache file for COW data on SSD. RBD_0 (parent) image_store backs the shared cache file, which is filled from RADOS.]
1. Shared cache file: fully promoted on first cloned image open
2. Cache lookup, in shared cache:
- Read from shared cache
2'. Cache lookup, in COW cache:
- Read from COW cache

COW cache mapping (RBD_1):
rbd_id  lba      length
Rbd_1   8192     4096
Rbd_1   1048576  4096

COW cache mapping (RBD_2):
rbd_id  lba      length
Rbd_2   8192     4096
Rbd_2   1048576  4096
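The read path can be sketched as a lookup that consults the per-clone COW mapping first and falls back to the shared cache. Everything named here (`CloneReadPath`, the dict-based caches, the `read_from_rados` callback) is hypothetical and stands in for the real librbd hooks. Promoting COWed data into the clone's own cache on a read miss is an assumption based on the note that promotion happens only on reads.

```python
class CloneReadPath:
    """Illustrative read path for one cloned image (not librbd code)."""

    def __init__(self, rbd_id, shared_cache, read_from_rados):
        self.rbd_id = rbd_id
        # Parent image blocks, shared by all clones on this node: lba -> bytes.
        self.shared_cache = shared_cache
        # LBAs this clone has overwritten (COWed).
        self.cow_mapping = set()
        # This clone's own cache for COWed data: lba -> bytes.
        self.cow_cache = {}
        # Callback standing in for a RADOS read: (rbd_id, lba) -> bytes.
        self.read_from_rados = read_from_rados

    def read(self, lba):
        """Return (data, source) for the block at lba."""
        if lba in self.cow_mapping:
            # Step 2': the block was COWed, so the shared copy is stale.
            if lba in self.cow_cache:
                return self.cow_cache[lba], "cow_cache"
            data = self.read_from_rados(self.rbd_id, lba)
            self.cow_cache[lba] = data  # promote on read
            return data, "rados"
        if lba in self.shared_cache:
            # Step 2: untouched parent data comes from the shared cache.
            return self.shared_cache[lba], "shared_cache"
        # Not cached anywhere: go to the cluster.
        return self.read_from_rados(self.rbd_id, lba), "rados"
```

The key point of the design is the ordering: the COW mapping is checked before the shared cache, so a clone never reads parent data for a block it has already overwritten.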
Shared Read-only cache for RBD – IO flow (write)
[Figure: write path on a compute node. RBD_1 and RBD_2 (cloned) each use librbd with FileImageCache and have their own cache file for COW data on SSD. RBD_0 (parent) image_store backs the shared cache file, which is filled from RADOS.]
1. Shared cache file: fully promoted on first cloned image open
2. Cache lookup, in shared cache:
- Create entry in COW mapping table
- Write to RADOS
2'. Cache lookup, in COW cache:
- Invalidate the chunk in the cache file
- Write to RADOS

COW cache mapping (RBD_1):
rbd_id  lba      length
rbd_1   8192     4096
rbd_1   1048576  4096

COW cache mapping (RBD_2):
rbd_id  lba      length
rbd_2   81920    4096
rbd_2   1048576  4096
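The two write cases can be sketched as follows. The class `CloneWritePath`, the dict-based mapping, and the `write_to_rados` callback are hypothetical; only the two branches (create a COW mapping entry, or invalidate the cached chunk, then write to RADOS in both cases) come from the slide.

```python
class CloneWritePath:
    """Illustrative write path for one cloned image (not librbd code)."""

    def __init__(self, rbd_id, write_to_rados):
        self.rbd_id = rbd_id
        # COW mapping table: lba -> length, as in the tables above.
        self.cow_mapping = {}
        # This clone's own cache for COWed data: lba -> bytes.
        self.cow_cache = {}
        # Callback standing in for a RADOS write: (rbd_id, lba, data).
        self.write_to_rados = write_to_rados

    def write(self, lba, data):
        if lba in self.cow_mapping:
            # Step 2': already COWed, so drop any cached copy of the chunk.
            self.cow_cache.pop(lba, None)
        else:
            # Step 2: first write to a block still served by the shared
            # cache, so record it in the COW mapping table.
            self.cow_mapping[lba] = len(data)
        # Writes always go to RADOS; the cache itself is read-only.
        self.write_to_rados(self.rbd_id, lba, data)
```

Because writes never populate the cache, a later read of the same block misses and is fetched from RADOS again, which keeps the cache strictly read-only.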
Shared Read-only cache for RBD – initial results
4k random read:
Configuration                           | Op_Size | Op_Type  | QD   | Runtime (sec) | IOPS  | BW (MB/s) | Latency (ms) | 99.99% latency (ms)
Baseline (w/o cache)                    | 4k      | randread | qd32 | 300           | 12927 | 50.5      | 2.437        | 8.89
Read-only cache                         | 4k      | randread | qd32 | 300           | 52351 | 204.5     | 0.555        | 3.95
Independent read-only cache (2 volumes) | 4k      | randread | qd32 | 300           | 70079 | 273.74    | 0.860        | 5.56
Shared read-only cache (2 volumes)      | 4k      | randread | qd32 | 300           | 68612 | 268       | 0.875        | 2.98 (?)
PR will be sent out soon.
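As a sanity check on the numbers above, the reported bandwidth should equal IOPS times the 4k operation size, and the IOPS column gives the speedup over the baseline directly. The sketch below does that arithmetic. It assumes the BW column is IOPS × 4 KiB expressed in MiB/s, which is what the reported figures match; the dictionary and function names are made up for illustration.

```python
OP_SIZE_KIB = 4

# IOPS values copied from the results table.
iops = {
    "baseline": 12927,
    "read_only_cache": 52351,
    "independent_2_volumes": 70079,
    "shared_2_volumes": 68612,
}


def bandwidth_mib_s(ops_per_sec):
    """Bandwidth implied by an IOPS figure at a 4 KiB operation size."""
    return ops_per_sec * OP_SIZE_KIB / 1024


def speedup(name):
    """IOPS ratio of a configuration relative to the uncached baseline."""
    return iops[name] / iops["baseline"]


for name, value in iops.items():
    print(f"{name}: {bandwidth_mib_s(value):.2f} MiB/s, {speedup(name):.2f}x")
```

By this arithmetic the single-volume read-only cache delivers roughly four times the baseline IOPS, and the two-volume shared cache reaches within a few percent of two independent caches.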
Shared RGW read-only SSD cache
Shared Read-only cache for RGW
Chunk mapping example:
chunk_id: 7e21a6b2-89b9-4de6-869e-1ddc0198a82b.5228.1__shadow_.TzkbVV_syqJ2vumnFe8uAaiL9j6ghtC_34
RGW instance id: Rgw_1
Cache_chunk_id: 7e21a6b2-89b9-4de6-869e-1ddc0198a82b.5228.1__shadow_.TzkbVV_syqJ2vumnFe8uAaiL9j6ghtC_34
• A CDN cluster behind the RGW clusters
• L1 cache: allows reads from the SSD cache of the local RGW instance
• L2 cache (configurable): allows reads from the SSD caches of remote RGW instances
• Each object/chunk has a unique ID
• Needs a centralized distributed K/V store to hold the mapping, as the chunks may be spread across different RGW instances
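The L1/L2 lookup order can be sketched as below. This is an illustration under several assumptions: a plain dict stands in for the centralized distributed K/V store, peers are reached by direct object access rather than over the network, and the names `RgwInstanceCache`, `directory`, and `read_from_rados` are hypothetical. Whether an L2 hit should also be copied into the local cache is not stated in the slides, so the sketch leaves it out.

```python
class RgwInstanceCache:
    """Illustrative two-level chunk lookup for one RGW instance."""

    def __init__(self, instance_id, directory, peers, read_from_rados):
        self.instance_id = instance_id
        # L1: this instance's SSD cache, chunk_id -> bytes.
        self.local = {}
        # Stand-in for the distributed K/V: chunk_id -> owning instance id.
        self.directory = directory
        # Other instances by id (a real system would use the network).
        self.peers = peers
        # Callback standing in for a RADOS read: chunk_id -> bytes.
        self.read_from_rados = read_from_rados

    def get(self, chunk_id, l2_enabled=True):
        """Return (data, source), trying L1, then L2, then RADOS."""
        if chunk_id in self.local:
            return self.local[chunk_id], "L1"
        if l2_enabled:
            owner = self.directory.get(chunk_id)
            peer = self.peers.get(owner)
            if peer is not None and peer is not self and chunk_id in peer.local:
                return peer.local[chunk_id], "L2"
        # Miss everywhere: read from RADOS and publish the new location.
        data = self.read_from_rados(chunk_id)
        self.local[chunk_id] = data
        self.directory[chunk_id] = self.instance_id
        return data, "RADOS"
```

The `l2_enabled` flag mirrors the "configurable" note on the L2 cache: with it off, each instance behaves as an independent local cache.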
Shared Read-only cache for RGW
[Figure: rgw_1 and rgw_2 in front of RADOS (via librados). Labels: S3 API, Swift API, rgw_frontend, rgw_rados, rgw_cache, datacache, policy, Immutable Cache with a local cache per instance, L1, L2.]
Issues
• different caching semantics for block and object?
• Promoting at block level(default 8k) for librbd
• Promoting at object level for RGW
• #13144 is not compiling
• https://github.com/maniaabdi/engage1.git
• Jewel based, needs to be rebased against master
• Currently the logic is inside rgw_rados; it needs to be decoupled to fit our design (libcachefile + policy)
RGW datacache (PR #13144)
[Figure: rgw_1 and rgw_2 in front of RADOS (via librados). Labels: S3 API, Swift API, rgw_frontend, rgw_rados, rgw_cache, datacache, policy, Immutable Cache with a local cache per instance, L1, L2.]

Unified readonly cache for ceph


Editor's Notes

  • #3: How should the librbd parent/clone image table be maintained?
  • #9: When to promote the shared cache file? -> When the first cloned image is opened, the cache is promoted to local; this could be optimized. What data should we promote? -> parent_image@snapshot; librbd caching will promote at block-size granularity (4k default). What is the cache file format? -> Sparse-file based.
  • #10: Promotion happens only on reads. Writes go directly to the OSDs and invalidate the cache on a cache hit.
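The sparse-file cache format mentioned in the notes can be illustrated with a small sketch in which each block lives at offset block_id × block size, so unpromoted blocks are simply holes in the file. The class name `SparseCacheFile`, its methods, and the 4096-byte block size (taken from the "4k default" in the notes) are assumptions, not the format used by the actual patch.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed, following the "4k default" in the notes


class SparseCacheFile:
    """Illustrative block cache stored in a single sparse file."""

    def __init__(self, path):
        self.path = path
        self._f = open(path, "w+b")

    def promote(self, block_id, data):
        """Write one block at its natural offset; gaps stay unallocated."""
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self._f.seek(block_id * BLOCK_SIZE)
        self._f.write(data)
        self._f.flush()

    def read(self, block_id):
        """Read one block; holes below end-of-file read back as zeros."""
        self._f.seek(block_id * BLOCK_SIZE)
        return self._f.read(BLOCK_SIZE)

    def close(self):
        self._f.close()


# Usage: promote block 5 only, leaving blocks 0-4 as a hole.
tmp_dir = tempfile.mkdtemp()
cache = SparseCacheFile(os.path.join(tmp_dir, "rbd_0.cache"))
cache.promote(5, b"x" * BLOCK_SIZE)
```

A hole reads back as zeros and is therefore indistinguishable from a cached block of zeros, which is one reason a separate mapping of cached blocks is needed alongside the file.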