Ceph for Big Science
Dan van der Ster, CERN IT
Cephalocon APAC 2018
23 March 2018 | Beijing
27 km circumference
~30 MHz of interactions filtered to ~1 kHz of recorded collisions
ATLAS Detector, 100 m underground
Higgs boson candidate
300 petabytes of storage, 230 000 CPU cores
Worldwide LHC Computing Grid (WLCG)
Beijing IHEP WLCG centre
Ceph at CERN: Yesterday & Today
First production cluster built mid to late 2013 for OpenStack Cinder block storage.
3 PB: 48 × 24 × 3 TB drives, 200 journaling SSDs.
Ceph dumpling v0.67 on Scientific Linux 6.
We were very cautious: 4 replicas! (now 3)
History
• March 2013: 300 TB proof of concept
• Dec 2013: 3 PB in prod for RBD
• 2014-15: EC, radosstriper, radosfs
• 2016: 3 PB to 6 PB, no downtime
• 2017: 8 prod clusters
CERN Ceph Clusters                     Size     Version
OpenStack Cinder/Glance Production     5.5 PB   jewel
Satellite data centre (1000 km away)   0.4 PB   luminous
CephFS (HPC+Manila) Production         0.8 PB   luminous
Manila testing cluster                 0.4 PB   luminous
Hyperconverged HPC                     0.4 PB   luminous
CASTOR/XRootD Production               4.2 PB   luminous
CERN Tape Archive                      0.8 PB   luminous
S3+SWIFT Production                    0.9 PB   luminous
CephFS
CephFS: Filer Evolution
• Virtual NFS filers are stable and perform well:
  • nfsd, ZFS, zrep, OpenStack VMs, Cinder/RBD
  • We have ~60 TB on ~30 servers
• High performance, but not scalable:
  • Quota management is tedious
  • Labour-intensive to create new filers
  • Can’t scale performance horizontally
CephFS: Filer Evolution
• OpenStack Manila (with CephFS) has most of the needed features:
  • Multi-tenant with security isolation + quotas (see the quota sketch below)
  • Easy self-service share provisioning
  • Scalable performance (add more MDSs or OSDs as needed)
• Successful testing with pre-production users since mid-2017.
• A single MDS was seen as a bottleneck; Luminous has stable multi-MDS.
• Manila + CephFS is now in production:
  • One user already asked for 2000 shares
  • Also used for Kubernetes: we are working on a new CSI CephFS plugin
  • Really need kernel quota support!
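For context, CephFS quotas are just extended attributes on the share directory; this is what Manila sets behind the scenes. A minimal sketch, assuming a mounted filesystem (the mount point and share path are placeholders):

    # CephFS quotas are directory xattrs; Manila manages them for us.
    # The mount point and share path below are hypothetical.
    import os

    share = "/mnt/cephfs/volumes/_nogroup/myshare"   # placeholder Manila share path

    # 100 GiB byte quota and a 1M-file cap on the share's directory tree.
    os.setxattr(share, "ceph.quota.max_bytes", b"107374182400")
    os.setxattr(share, "ceph.quota.max_files", b"1000000")

    # Read the quota back (returns bytes, e.g. b"107374182400").
    print(os.getxattr(share, "ceph.quota.max_bytes"))

In luminous these limits are enforced by ceph-fuse/libcephfs clients but not by the kernel client, which is why the kernel quota support above matters to us.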
Multi-MDS in Production
• ~20 tenants on our pre-prod environment for several months
• 2 active MDSs since luminous
• Enabled multi-MDS on our production cluster on Jan 24
• Currently have 3 active MDSs
• Default balancer and pinning (sketch below)
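A minimal sketch of the two knobs involved, raising max_mds and pinning a subtree to a rank; the filesystem name, conf path and mount point are assumptions:

    # Raise the number of active MDS ranks, then pin one tenant's tree to rank 1.
    # Filesystem name ("cephfs"), conf path and mount point are placeholders.
    import json
    import os
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # admin keyring assumed
    cluster.connect()

    # Allow two active MDS ranks (luminous-era mon command).
    cmd = {"prefix": "fs set", "fs_name": "cephfs", "var": "max_mds", "val": "2"}
    ret, outbuf, outs = cluster.mon_command(json.dumps(cmd), b"")
    print(ret, outs)

    # ceph.dir.pin is an ordinary xattr: pin this directory tree to MDS rank 1
    # so the default balancer leaves its metadata alone.
    os.setxattr("/mnt/cephfs/volumes/tenant-a", "ceph.dir.pin", b"1")

    cluster.shutdown()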
HPC on CephFS?
• CERN is mostly a high throughput computing lab:
  • File-based parallelism
• Several smaller HPC use-cases exist within our lab:
  • Beams, plasma, CFD, QCD, ASICs
  • Need full POSIX, consistency, parallel IO
“Software Defined HPC”
• CERN’s approach is to build HPC clusters with commodity parts: “Software Defined HPC”
• Compute side is solved with HTCondor & SLURM
• Typical HPC storage is not very attractive (missing expertise + budget)
• 200-300 HPC nodes accessing ~1 PB of CephFS since mid-2016:
  • Manila + HPC use-cases on the same clusters. HPC is just another user.
  • Quite stable but not super high performance
IO-500
• Storage benchmark announced by John Bent on the ceph-users ML (from SuperComputing 2017)
• “goal is to improve parallel file systems by ensuring that sites publish results of both ‘hero’ and ‘anti-hero’ runs and by sharing the tuning and configuration”
• We have just started testing on our CephFS clusters:
  • IOR throughput tests, mdtest + find metadata tests
  • Easy/hard modes for shared/unique file tests
IO-500 First Look… No tuning!

Setup: Luminous v12.2.4, tested March 2018.
411 OSDs: 800 TB of SSDs, 2 OSDs per server. OSDs running on the same hardware as the clients. 2 active MDSs running on VMs.

Throughput:
Test              Result
ior_easy_write    2.595 GB/s
ior_hard_write    0.761 GB/s
ior_easy_read     4.951 GB/s
ior_hard_read     0.944 GB/s

Metadata:
Test                 Result
mdtest_easy_write    1.774 kiops
mdtest_hard_write    1.512 kiops
find                 50.000 kiops
mdtest_easy_stat     8.722 kiops
mdtest_hard_stat     7.076 kiops
mdtest_easy_delete   0.747 kiops
mdtest_hard_read     2.521 kiops
mdtest_hard_delete   1.351 kiops

[SCORE] Bandwidth 1.74 GB/s : IOPS 3.47 kiops : TOTAL 2.46
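The headline numbers are self-consistent: bandwidth and IOPS are the geometric means of the individual tests, and the total is the geometric mean of those two. A quick cross-check (this is our reading of the IO-500 scoring, not the official scorer):

    # Recompute the score from the per-test results in the table above.
    from math import prod

    bw_gbps = [2.595, 0.761, 4.951, 0.944]   # ior easy/hard, write/read
    md_kiops = [1.774, 1.512, 50.000, 8.722, 7.076, 0.747, 2.521, 1.351]

    def geomean(xs):
        return prod(xs) ** (1.0 / len(xs))

    bw = geomean(bw_gbps)         # ~1.74 GB/s
    iops = geomean(md_kiops)      # ~3.47 kiops
    total = geomean([bw, iops])   # ~2.46

    print(f"Bandwidth {bw:.2f} GB/s : IOPS {iops:.2f} kiops : TOTAL {total:.2f}")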
RGW
S3 @ CERN
• Ceph luminous cluster with VM gateways. Single region.
• 4+2 erasure coding. Physics data for small objects, volunteer computing, some backups.
• Pre-signed URLs and object expiration working well (sketch below).
• HAProxy is very useful:
  • High availability & mapping special buckets to dedicated gateways
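For illustration, pre-signed URLs work against RGW just as against AWS S3; a minimal sketch with boto3 (the endpoint, bucket, key and credentials are placeholders, not our real ones):

    # Generate a pre-signed GET URL against an RGW S3 endpoint.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example.cern.ch",   # hypothetical RGW endpoint
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # Anyone holding this URL can GET the object for one hour, no credentials needed.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "physics-data", "Key": "run2018/events.root"},
        ExpiresIn=3600,
    )
    print(url)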
RBD
RBD: Ceph + OpenStack
Cinder Volume Types

Volume Type   Size (TB)   Count
standard            871   4,758
io1                 440     608
cp1                  97     118
cpio1               269     107
wig-cp1              26      19
wig-cpio1           106      13
io-test10k           20       1
Totals:           1,811   5,624
RBD @ CERN
• OpenStack Cinder + Glance use-cases continue to be highly reliable:
  • QoS via IOPS/BW throttles is essential.
  • Spectre/Meltdown reboots updated all clients to luminous!
• Ongoing work:
  • Recently finished an expanded rbd trash feature (sketch below)
  • Just starting work on a persistent cache for librbd
    • CERN Openlab collaboration with Rackspace!
  • Writing a backup driver for Glance (RBD to S3)
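For context, a hedged sketch of the basic rbd trash cycle through the librbd Python binding (pool and image names are invented; the expanded feature builds on these primitives):

    # Move an image to the trash with a deferment window, list the trash,
    # and (optionally) restore. Pool/image names are placeholders.
    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("volumes")              # hypothetical Cinder pool

    r = rbd.RBD()
    r.trash_move(ioctx, "volume-deleted-by-user", delay=86400)   # keep 24h before purge

    for entry in r.trash_list(ioctx):
        print(entry["id"], entry["name"])
        # r.trash_restore(ioctx, entry["id"], entry["name"])     # undelete if needed

    ioctx.close()
    cluster.shutdown()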
Hyperconverged Ceph/Cloud
• Experimenting with co-located ceph-osd on hypervisors and HPC nodes:
  • New cluster with 384 SSDs on HPC nodes
• Minor issues related to server isolation:
  • cgroups or NUMA pinning are options, but not yet used (sketch below).
• Issues related to our operations culture:
  • We (the Ceph team) don’t own the servers – we need to co-operate with the cloud/HPC teams.
  • E.g. when is it OK to reboot a node? How to drain a node? Software upgrade procedures.
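We do not run anything like this yet; it is only a sketch of what simple CPU pinning of co-located OSDs could look like (the reserved core set is an arbitrary example):

    # Confine ceph-osd processes to a reserved set of cores so they don't
    # compete with guest VMs / HPC jobs. Purely illustrative.
    import os
    import subprocess

    OSD_CORES = set(range(0, 8))    # assumption: first 8 cores reserved for OSDs

    pids = subprocess.check_output(["pgrep", "-x", "ceph-osd"]).split()
    for pid in pids:
        os.sched_setaffinity(int(pid), OSD_CORES)
        print(f"pinned ceph-osd pid {int(pid)} to cores {sorted(OSD_CORES)}")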
User Feedback: From Jewel to Luminous
Jewel to Luminous Upgrade Notes
• In general, upgrades went well with no big problems.
• New/replaced OSDs are BlueStore (ceph-volume lvm)
  • Preparing a FileStore conversion script for our infrastructure (sketch below)
• ceph-mgr balancer is very interesting:
  • Actively testing the crush-compat mode
  • The Python module can be patched in place for quick fixes
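The conversion boils down to a per-OSD drain/destroy/re-create loop; a hedged sketch (OSD id and device are illustrative, and the real tooling in cernceph/ceph-scripts adds health checks, throttling and logging):

    # FileStore -> BlueStore conversion of one OSD, keeping its id and cephx key.
    import subprocess
    import time

    def sh(*args):
        subprocess.check_call(list(args))

    def convert_osd(osd_id: int, device: str):
        # 1. Drain: mark the OSD out and wait until its PGs are safe elsewhere.
        sh("ceph", "osd", "out", str(osd_id))
        while subprocess.call(["ceph", "osd", "safe-to-destroy", str(osd_id)]) != 0:
            time.sleep(60)

        # 2. Stop the daemon and destroy the FileStore OSD (id and key are kept).
        sh("systemctl", "stop", f"ceph-osd@{osd_id}")
        sh("ceph", "osd", "destroy", str(osd_id), "--yes-i-really-mean-it")

        # 3. Wipe the device and re-create the OSD as BlueStore with the same id.
        sh("ceph-volume", "lvm", "zap", device)
        sh("ceph-volume", "lvm", "create", "--bluestore",
           "--osd-id", str(osd_id), "--data", device)

        # 4. Let backfill repopulate it.
        sh("ceph", "osd", "in", str(osd_id))

    # convert_osd(123, "/dev/sdb")   # example invocation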
How to replace many OSDs?

Fully replaced 3 PB of block storage with 6 PB of new hardware over several weeks, transparent to users (reweighting sketch below).

Tooling: cernceph/ceph-scripts
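The replacement relied on gradual CRUSH reweighting so that backfill stayed bounded; a simplified sketch of the idea (step size, target weight, OSD name and sleep are illustrative; the real scripts watch cluster health between steps):

    # Bring a new OSD up to its full crush weight in small increments.
    import json
    import time
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    def crush_reweight(osd_name: str, weight: float):
        cmd = {"prefix": "osd crush reweight", "name": osd_name, "weight": weight}
        ret, outbuf, outs = cluster.mon_command(json.dumps(cmd), b"")
        if ret != 0:
            raise RuntimeError(outs)

    target, step = 5.458, 0.5            # e.g. a 6 TB drive, in 0.5 increments
    w = 0.0
    while w < target:
        w = min(w + step, target)
        crush_reweight("osd.1234", w)    # hypothetical new OSD
        time.sleep(600)                  # placeholder for "wait until backfill settles"

    cluster.shutdown()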
Current Challenges
• RBD / OpenStack Cinder:
  • Ops: how to identify active volumes?
    • Need an “rbd top”
  • Performance: µs latencies and kHz IOPS.
    • Need persistent SSD caches.
  • On-the-wire encryption, client-side volume encryption
  • OpenStack: volume type / availability zone coupling for hyper-converged clusters
Current Challenges
• CephFS HPC:
  • Parallel MPI I/O and single-MDS metadata performance (IO-500!)
  • Copying data across /cephfs: need "rsync --ceph"
• CephFS general use-case:
  • Scaling to 10,000 (or 100,000!) clients:
    • Client throttles, tools to block/disconnect noisy users/clients.
    • Need a “ceph client top”
  • Native Kerberos (without an NFS gateway), group accounting and quotas
  • HA CIFS and NFS gateways for non-Linux clients
  • How to back up a 10-billion-file CephFS?
    • E.g. how about a binary diff between snapshots, similar to ZFS send/receive?
Current Challenges
• RADOS:
  • How to phase in new features on old clusters?
    • E.g. we have 3 PB of RBD data with hammer tunables
  • Pool-level object backup (convert from replicated to EC, copy to non-Ceph)
    • rados export of the diff between two pool snapshots?
• Areas where we cannot use Ceph yet:
  • Storage for large enterprise databases (are we close?)
  • Large-scale batch processing
  • Single filesystems spanning multiple sites
  • HSM use-cases (CephFS with a tape backend? Glacier for S3?)
Future…
HEP Computing for the 2020s
• Run-2 (2015-18): ~50-80 PB/year
• Run-3 (2020-23): ~150 PB/year
• Run-4: ~600 PB/year?!
“Data Lakes” – globally distributed, flexible placement, ubiquitous access
https://arxiv.org/abs/1712.06982
Ceph Bigbang Scale Testing
• Bigbang scale tests mutually benefit CERN & the Ceph project
• Bigbang I: 30 PB, 7200 OSDs, Ceph hammer. Several osdmap limitations.
• Bigbang II: similar size, Ceph jewel. Scalability limited by OSD/MON messaging; motivated ceph-mgr.
• Bigbang III: 65 PB, 10800 OSDs
https://ceph.com/community/new-luminous-scalability/
Thanks…
Thanks to my CERN Colleagues
• Ceph team at CERN
  • Hervé Rousseau, Teo Mouratidis, Roberto Valverde, Paul Musset, Julien Collet
• Massimo Lamanna / Alberto Pace (Storage Group Leadership)
• Andreas-Joachim Peters (Intel EC)
• Sebastien Ponce (radosstriper)
• OpenStack & Containers teams at CERN
  • Tim Bell, Jan van Eldik, Arne Wiebalck (also co-initiator of Ceph at CERN), Belmiro Moreira, Ricardo Rocha, Jose Castro Leon
• HPC team at CERN
  • Nils Hoimyr, Carolina Lindqvist, Pablo Llopis