14th ANNUAL WORKSHOP 2018
ACCELERATING CEPH WITH RDMA AND NVME-OF
Haodong Tang, Jian Zhang and Fred Zhang
Intel Corporation
{haodong.tang, jian.zhang, fred.zhang}@intel.com
April, 2018
AGENDA
 Background and motivation
 RDMA as Ceph networking component
 RDMA as Ceph NVMe fabrics
 Summary & next step
BACKGROUND AND MOTIVATION
CEPH INTRODUCTION
 References: http://ceph.com/ceph-storage, http://thenewstack.io/software-defined-storage-ceph-way
[Ceph architecture diagram]
RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors.
LIBRADOS: a library allowing apps to directly access RADOS (a short librados example follows the feature list below).
RGW: a web services gateway for object storage, used by applications.
RBD: a reliable, fully distributed block device, used by hosts/VMs.
CephFS: a distributed file system with POSIX semantics, used by clients.
 Scalability – CRUSH data placement, no single point of failure
 Replicates and re-balances dynamically
 Enterprise features – snapshots, cloning, mirroring
 Most popular block storage for OpenStack use cases
 Commercial support from Red Hat
 Open-source, object-based scale-out storage
 Object, Block and File in single unified storage cluster
 Highly durable, available – replication, erasure coding
 Runs on economical commodity hardware
 10 years of hardening, vibrant community
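
To make the LIBRADOS layer above concrete, here is a minimal C sketch (not from the original slides) that connects to a cluster and writes and reads back one object; the pool name "rbd" and the client id "admin" are illustrative placeholders. Build roughly with: cc rados_hello.c -lrados

#include <rados/librados.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    const char *data = "hello rados";
    char buf[32] = {0};

    if (rados_create(&cluster, "admin") < 0) return 1;          /* connect as client.admin */
    rados_conf_read_file(cluster, NULL);                        /* read the default ceph.conf */
    if (rados_connect(cluster) < 0) return 1;                   /* contact the monitors */
    if (rados_ioctx_create(cluster, "rbd", &io) < 0) return 1;  /* open a pool */

    rados_write_full(io, "hello_object", data, strlen(data));   /* store an object */
    rados_read(io, "hello_object", buf, sizeof(buf) - 1, 0);    /* read it back */
    printf("read back: %s\n", buf);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}

RGW, RBD and CephFS are all built on top of librados calls like these.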
CEPH PERFORMANCE PROFILING
 CPU utilization is unevenly distributed across Ceph components.
 CPU tends to be the bottleneck for 4K random write and 4K random read.
 The Ceph networking layer consumes 20%+ of the total CPU used by Ceph in the 4K random read workload.
[CPU profile: the Async Messenger (AsyncMsg) accounts for roughly 22–24% of Ceph CPU.]
* This chart is from the Boston OpenStack Summit presentation.
MOTIVATION
 RDMA is direct access from the memory of one computer into that of another without involving either one's operating system.
 RDMA supports zero-copy networking (kernel bypass), as sketched in the verbs example below.
• Eliminates extra CPU usage, memory copies and context switches.
• Reduces latency and enables faster message transfer.
 Potential benefits for Ceph:
• Better resource allocation – bring additional disks to servers with spare CPU.
• Lower latency – reduce the latency generated by the Ceph network stack.
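
As an illustration of what kernel-bypass, zero-copy messaging looks like at the verbs level (not Ceph code), the hedged C sketch below posts a send directly from an application buffer. It assumes a protection domain pd and an already-connected queue pair qp were created earlier; device discovery, CQ/QP setup and address exchange are omitted. Link with -libverbs.

#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

int post_zero_copy_send(struct ibv_pd *pd, struct ibv_qp *qp, void *buf, size_t len)
{
    /* Register the application buffer once; the NIC then DMAs directly
     * from it, bypassing the kernel and avoiding an extra copy. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr = NULL;

    /* Hand the work request to the NIC; the completion is reaped later
     * from the completion queue with ibv_poll_cq(). */
    return ibv_post_send(qp, &wr, &bad_wr);
}

The payload is DMAed straight out of the registered buffer, which is where the CPU and latency savings listed above come from.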
RDMA AS CEPH NETWORKING COMPONENT
RDMA IN CEPH
 XIO Messenger.
• Based on Accelio, seamlessly supporting RDMA.
• Scalability issue.
• Merged into Ceph master three years ago; no longer supported.
 Async Messenger.
• Compatible with different network protocols: POSIX, RDMA and DPDK.
• The current RDMA implementation supports the IB protocol (see the ceph.conf sketch after the diagram).
[Diagram: Async Messenger over InfiniBand. The dispatcher and dispatch queue feed a shared workers pool of event drivers; the IO library calls the RDMA stack through the OFA verbs API (IB Transport API / IB Transport / IB Link), bypassing the kernel network stack and NIC driver to reach the RDMA-capable Ethernet NIC (RNIC).]
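
For reference, the Async Messenger transport is selected in ceph.conf. A minimal sketch of enabling the RDMA backend is shown below, assuming the option names as they appear in Ceph Luminous; the device name is an example and should match a device listed by ibv_devices on the host.

[global]
# Select the RDMA-backed Async Messenger
ms_type = async+rdma
# RDMA device to bind to (example value; check ibv_devices)
ms_async_rdma_device_name = mlx5_0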
RDMA OVER ETHERNET
 Motivation
• Leverage RDMA to improve performance (low CPU, low
latency).
• Leverage Intel RDMA NIC to accelerate Ceph.
• RDMA over Ethernet is one of the most convenient and practical ways to bring RDMA to datacenters already running Ceph over TCP/IP.
 To-do
• Need to introduce the rdma-cm library.
[Diagram: the same Async Messenger stack as on the previous slide, but with RDMA carried over Ethernet via iWARP (RDMAP / DDP / MPA over TCP/IP), bypassing the kernel network stack to reach the RDMA-capable Ethernet NIC (RNIC).]
IMPLEMENTATION DETAILS
 Current implementation for InfiniBand in Ceph:
• Connection management: Self-implemented TCP/IP based RDMA connection management
• RDMA verbs: RDMA send, RDMA recv
• Queue pairs: Shared receive queue (SRQ)
• Completion queue: all queue pairs share one completion queue
 iWARP protocol needs:
• Connection management: RDMA-CM based RDMA connection management
• Queue pairs: centralized memory pool for recv queue (RQ)
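
To show what RDMA-CM based connection management looks like in contrast to the self-implemented TCP/IP exchange above, here is a hedged C sketch of a client-side connect using librdmacm (not Ceph's code; error cleanup and the PD/CQ/QP creation that would normally precede rdma_connect() are trimmed). Link with -lrdmacm -libverbs.

#include <rdma/rdma_cma.h>
#include <netdb.h>
#include <string.h>

static int wait_for(struct rdma_event_channel *ch, enum rdma_cm_event_type t)
{
    struct rdma_cm_event *ev;
    if (rdma_get_cm_event(ch, &ev))
        return -1;
    int ok = (ev->event == t) ? 0 : -1;   /* confirm the expected CM event */
    rdma_ack_cm_event(ev);
    return ok;
}

int rdma_cm_client_connect(const char *host, const char *port, struct rdma_cm_id **out)
{
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct rdma_cm_id *id = NULL;
    struct addrinfo *ai = NULL;
    struct rdma_conn_param param;

    if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP))
        return -1;
    if (getaddrinfo(host, port, NULL, &ai))
        return -1;

    /* Resolve the peer address and route; each step completes with an
     * asynchronous CM event on the channel. */
    if (rdma_resolve_addr(id, NULL, ai->ai_addr, 2000) ||
        wait_for(ch, RDMA_CM_EVENT_ADDR_RESOLVED) ||
        rdma_resolve_route(id, 2000) ||
        wait_for(ch, RDMA_CM_EVENT_ROUTE_RESOLVED))
        return -1;
    freeaddrinfo(ai);

    /* A real client would create its PD/CQ and call rdma_create_qp()
     * here, before connecting. */
    memset(&param, 0, sizeof(param));
    param.initiator_depth = 1;
    param.responder_resources = 1;
    if (rdma_connect(id, &param) || wait_for(ch, RDMA_CM_EVENT_ESTABLISHED))
        return -1;

    *out = id;
    return 0;
}

Unlike the TCP/IP-based exchange, the CM events also work for iWARP, where connection setup must go through the RDMA-CM path.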
The networking protocol between the OSD nodes and client nodes is configurable, so we compared Ceph performance with TCP/IP against Ceph with the RDMA (iWARP) protocol.
BENCHMARK METHODOLOGY – SMALL SCALE
CPU: SKX platform (112 cores)
Memory: 128 GB
NIC: 10 GbE Intel® Ethernet Connection X722 with iWARP
Disk distribution: 4x P3700 as OSD drives, 1x Optane as DB drive
Software configuration: CentOS 7, Ceph Luminous (dev)
FIO version: 2.17
[Topology: one client node running six FIO processes against one OSD node hosting a MON and 8 OSDs.]
CEPH PERFORMANCE – TCP/IP VS RDMA – 1X OSD NODE
 Ceph w/ iWARP delivers higher 4K random write performance than Ceph w/ TCP/IP.
 Ceph w/ iWARP generates higher CPU utilization.
• Ceph w/ iWARP consumes more user-level CPU.
• Ceph w/ TCP/IP consumes more system-level CPU.
[Chart: Ceph Performance Comparison - RDMA vs TCP/IP - 1x OSD Node, 4K Random Write. IOPS and CPU utilization versus queue depth (QD=1 to QD=64) for the RDMA and TCP/IP clusters.]
[Chart: Ceph CPU Comparison - RDMA vs TCP/IP - QD=64, 4K Random Write, broken down into usr / sys / iowait / soft: user-level CPU 20.5% (RDMA) vs 18.7% (TCP/IP); system-level CPU 2.8% (RDMA) vs 4.5% (TCP/IP).]
We scaled out the OSD nodes to verify the scale-out ability of the RDMA protocol.
BENCHMARK METHODOLOGY – LARGER SCALE
CPU: SKX platform (72 cores)
Memory: 128 GB
NIC: 10 GbE Intel® Ethernet Connection X722 with iWARP
Disk distribution: 4x P4500 as OSD/DB drives
Software configuration: Ubuntu 17.10, Ceph Luminous (dev)
FIO version: 2.12
[Topology: two client nodes, each running FIO instances, against two OSD nodes, each hosting a MON and 8 OSDs.]
CEPH PERFORMANCE – TCP/IP VS RDMA – 2X OSD NODES
[Chart: Ceph Performance Comparison - RDMA vs TCP/IP - 2x OSD Nodes, 4K Random Write. IOPS and CPU utilization versus queue depth (QD=1 to QD=32) for the RDMA and TCP/IP clusters.]
[Chart: Ceph Performance Comparison - RDMA vs TCP/IP, 4K Random Write IOPS per CPU utilization% versus queue depth (QD=1 to QD=32) for the RDMA and TCP/IP clusters.]
 Ceph w/ iWARP delivers up to 17% higher 4K random write performance than Ceph w/ TCP/IP.
 Ceph w/ iWARP is more CPU efficient.
[Chart annotations: per-queue-depth RDMA vs TCP/IP deltas of 13%, 17%, 7% and 0%, 13%, 3%; higher is better.]
CEPH PERFORMANCE – TCP/IP VS RDMA – 3X OSD NODES
 Ceph node scale-out gain (2x to 3x OSD nodes): 48.7% for RDMA vs 50.3% for TCP/IP, so both scale out well.
 When QD is 16, Ceph w/ RDMA shows 12% higher 4K random write performance.
[Chart: Ceph Performance Comparison - RDMA vs TCP/IP, QD=16, scale-out performance. 4K random write IOPS: RDMA cluster 82,409 with 2x OSD nodes and 122,601 with 3x OSD nodes (+48.7%); TCP/IP cluster 72,289 and 108,685 (+50.3%). Chart annotations: 13% and 12% RDMA advantage.]
PERFORMANCE RESULT DEEP ANALYSIS
CPU Profiling
 Two polling threads: the Ceph epoll-based Async driver thread plus the RDMA polling thread.
 Not really zero-copy: there is one copy from the RDMA recv buffer to the Ceph Async driver buffer (modeled in the sketch below).
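
A simplified model of the receive path described above (not Ceph's actual code): a dedicated thread busy-polls the completion queue and copies each completed receive out of the registered RDMA buffer into a separate buffer for the Async driver, which is the extra copy noted above. The helpers recv_pool_addr(), deliver_to_dispatcher() and repost_recv() are hypothetical stand-ins for the messenger's buffer bookkeeping. Link with -libverbs.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BATCH 16

/* Hypothetical helpers standing in for the messenger's buffer management. */
extern char *recv_pool_addr(uint64_t wr_id);              /* registered RDMA recv buffer */
extern void  deliver_to_dispatcher(char *data, int len);  /* hand off to the Ceph side */
extern void  repost_recv(struct ibv_qp *qp, uint64_t wr_id);

void rdma_polling_loop(struct ibv_cq *cq, struct ibv_qp *qp)
{
    struct ibv_wc wc[BATCH];

    for (;;) {
        int n = ibv_poll_cq(cq, BATCH, wc);   /* busy-poll, no interrupt */
        for (int i = 0; i < n; i++) {
            if (wc[i].status != IBV_WC_SUCCESS || wc[i].opcode != IBV_WC_RECV)
                continue;

            /* The copy: registered recv buffer -> driver-side buffer. */
            char *src = recv_pool_addr(wc[i].wr_id);
            char *dst = malloc(wc[i].byte_len);
            memcpy(dst, src, wc[i].byte_len);
            deliver_to_dispatcher(dst, wc[i].byte_len);

            repost_recv(qp, wc[i].wr_id);     /* return the buffer to the SRQ/RQ */
        }
    }
}

The two-thread, one-copy structure is what the optimization work in the next-step section targets.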
RDMA AS CEPH NVME FABRICS
RDMA AS CEPH NVME FABRICS
 NVMe is a new specification optimized for NAND flash
and next-generation solid-state storage technologies.
 NVMe over Fabrics enables access to remote NVMe
devices over multiple network fabrics.
 Supported fabrics
 RDMA – InfiniBand, iWARP, RoCE
 Fibre Channel
 TCP/IP
 NVMe-oF benefits
 NVMe disaggregation.
 Delivers performance of remote NVMe on-par with local NVMe.
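
To make the fabric attach step concrete, the hedged C sketch below shows one way a host can attach a remote NVMe subsystem over RDMA by writing an option string to the kernel's /dev/nvme-fabrics interface, roughly what the nvme-cli command "nvme connect -t rdma -a <addr> -s 4420 -n <nqn>" does. The NQN and address are placeholders, and the nvme-rdma kernel module is assumed to be loaded.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Connection options: transport, target address, service id (port)
     * and subsystem NQN. All values here are examples. */
    const char *opts =
        "transport=rdma,"
        "traddr=192.168.100.11,"
        "trsvcid=4420,"
        "nqn=nqn.2018-04.io.example:p3700-0";

    int fd = open("/dev/nvme-fabrics", O_RDWR);
    if (fd < 0) {
        perror("open /dev/nvme-fabrics");
        return 1;
    }
    /* On success the kernel creates a new /dev/nvmeXnY block device,
     * which Ceph can then use as an OSD data drive. */
    if (write(fd, opts, strlen(opts)) < 0) {
        perror("nvme-of connect");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}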
RDMA AS CEPH NVME FABRICS
 Baseline and comparison
 The baseline setup used local NVMe drives.
 The comparison setup attached remote NVMe drives as OSD data drives.
 The 6x 2TB P3700 drives were split across the 2x storage nodes.
 The OSD nodes attached the 6x P3700 drives over a RoCE v2 fabric.
 NVMe-oF CPU offload was enabled on the target nodes.
 Hardware configuration
 2x Storage nodes, 3x OSD nodes, 3x Client nodes.
 6x P3700 (800 GB U.2), 3x Optane (375 GB)
 30x FIO processes worked on 30x RBD volumes.
 All 8x servers are BRW with 128 GB memory and Mellanox ConnectX-4 NICs.
[Diagram, baseline: 3x Ceph OSD nodes, each with 1x Optane and 2x local P3700 drives, serve 3x Ceph client nodes (FIO against 4x RBD volumes each) over TCP/IP.]
[Diagram, comparison: the same 3x Ceph OSD nodes attach 2x P3700 drives each as NVMf clients from 2x Ceph target nodes (3x P3700 each) over RDMA, while still serving the Ceph clients over TCP/IP.]
EXPECTATION BEFORE POC
 Expectations and questions before POC.
 Expectations: According to the benchmark from the first part, we’re expecting
 on-par 4K random write performance.
 on-par CPU utilization on NVMe-oF host node.
 Questions:
 How much CPU will be used on the NVMe-oF target node?
 How does tail latency (99th percentile) behave with NVMe-oF?
 Does NVMe-oF influence the scale-out ability of Ceph?
RDMA AS CEPH NVME FABRICS
 On-par 4K random write performance
 Running Ceph with NVMe-oF brings <1% CPU
overhead on target node.
 CPU is not the bottleneck on the host node.
Client-side performance comparison
[Chart: 4K Random Write - Ceph over NVMf vs Ceph over local NVMe, IOPS versus queue depth (QD=1 to QD=128). Companion charts: CPU utilization on the OSD node and on the target node.]
CEPH TAIL LATENCY
 When QD is higher than 16, Ceph with NVMe-oF shows higher tail latency (99%).
 When QD is lower than 16, Ceph with NVMe-oF is on par with Ceph over local NVMe.
[Chart: Tail Latency Comparison - Ceph over NVMf vs Ceph over local NVMe. 99th-percentile latency (ms) versus queue depth (QD=1 to QD=128); lower is better.]
RDMA AS CEPH NVME FABRICS
 Running Ceph over NVMe-oF did not limit Ceph OSD node scale-out.
 For 4K random write/read, the maximum ratio of 3x nodes to 2x nodes is 1.47, close to the ideal value of 1.5.
Scaling out performance
[Charts: Scaling Out Testing - Ceph over NVMf. 4K random write IOPS and 4K random read IOPS for 2x vs 3x OSD nodes across QD=1 to QD=128, plus 3x-node/2x-node performance-ratio charts (values roughly between 1.2 and 1.5).]
SUMMARY
SUMMARY & NEXT-STEP
 Summary
 RDMA is critical for future Ceph AFA solutions.
 Ceph with RDMA messenger provides up to ~17% performance advantage over TCP/IP.
 Ceph with RDMA messenger shows great scale-out ability.
 As a network fabric, RDMA performs well in Ceph NVMe-oF solutions.
 Running Ceph on NVMe-oF does not appreciably degrade Ceph write performance.
 Ceph with NVMe-oF brings more flexible provisioning and lower TCO.
 Next-step
 Ceph RDMA networking component optimization based on previous analysis.
 Leverage NVMe-oF with high-density storage nodes for lower TCO.
LEGAL DISCLAIMER & OPTIMIZATION NOTICE
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations
include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not
specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
Notice revision #20110804
 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software,
operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information
and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product
when combined with other products. For more complete information visit www.intel.com/benchmarks.
 INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE,
TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
 Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are
trademarks of Intel Corporation in the U.S. and other countries.
14th ANNUAL WORKSHOP 2018
THANK YOU!
Haodong Tang, Jian Zhang and Fred Zhang
Intel Corporation
{haodong.tang, jian.zhang, fred.zhang}@intel.com
April, 2018