Anu H Rao
Storage Software Product Line Manager
Datacenter Group, Intel® Corp
Notices & Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance
varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.
No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual
performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about
performance and benchmark results, visit http://www.intel.com/benchmarks .
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as
SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors
may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks .
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable
product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect
future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm
whether referenced data are accurate.
© 2017 Intel Corporation.
Accelerating Virtual Machine Access with the Storage Performance Development Kit (SPDK) – Vhost Deep Dive
The Challenge: Media Latency
[Chart: total I/O latency (µs) split into drive latency, controller latency, and driver latency for four configurations: HDD + SAS/SATA (roughly 10,000 µs), NAND SSD + SAS/SATA, NAND SSD + NVMe™, and Intel® Optane™ SSD + NVMe™ (all in the 0–200 µs range). As media latency shrinks, the kernel driver overhead grows from <0.01% on HDDs to 1–8% on NAND SSDs and 30%–50% on Optane™ SSDs.]
Technology claims are based on comparisons of latency, density and write cycling metrics amongst memory technologies recorded on published specifications of in-market memory products against internal Intel specifications.
Storage Performance Development Kit
Scalable and Efficient Software Ingredients
• User space, lockless, polled-mode components
• Up to millions of IOPS per core
• Designed to extract maximum performance from
non-volatile media
Storage Reference Architecture
• Optimized for latest generation CPUs and SSDs
• Open source composable building blocks (BSD
licensed)
• Available via SPDK.io
• Follow @SPDKProject on twitter for latest events
and activities
Benefits of using SPDK
SPDK: more performance from CPUs, non-volatile media, and networking
• Up to 10X more IOPS/core for NVMe-oF* vs. Linux kernel
• Up to 8X more IOPS/core for NVMe vs. Linux kernel
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using
specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to
assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
• Up to 50% better tail latency for RocksDB workloads
• Up to 3X better IOPS/core & latency for virtualized storage
• Up to 10.8 million IOPS with Intel® Xeon® Scalable family and 24 Intel® Optane™ SSD DC P4800X
• Up to 1.5X better latency for NVMe-oF vs. kernel for Optane SSD
SPDK Community
• Main web presence: http://SPDK.IO
• Real-time chat with the development community
• Backlog and ideas for things to do
• Email discussions
• Weekly calls
• Multiple annual meetups
• Code reviews & repo
• Continuous integration
Accelerating Virtual Machine Access with the Storage Performance Development Kit (SPDK) – Vhost Deep Dive
Architecture
[Diagram: SPDK component architecture, 18.01 era. Storage Protocols: iSCSI Target, NVMe-oF* Target, vhost-scsi Target, vhost-blk Target, Linux nbd, plus SCSI and NVMe emulation layers. Storage Services: Block Device Abstraction (bdev), Blobstore, BlobFS Integration, Logical Volumes, and bdev modules for Ceph RBD, Linux AIO, GPT, PMDK blk, virtio scsi/blk, QoS, NVMe and 3rd Party backends. Drivers: NVMe* PCIe Driver, NVMe-oF* Initiator, Intel® QuickData Technology Driver, RDMA, targeting NVMe devices. A core Application Framework underpins everything, with integrations for QEMU, RocksDB and Ceph. Components are marked as Released, New in release 18.01, or planned for 1H'18.]
Accelerating Virtual Machine Access with the Storage Performance Development Kit (SPDK) – Vhost Deep Dive
Virtio
• Paravirtualized driver specification
• Common mechanisms and layouts
for device discovery, I/O queues,
etc.
• virtio device types include:
• virtio-net
• virtio-blk
• virtio-scsi
• virtio-gpu
• virtio-rng
• virtio-crypto
[Diagram: the guest VM (Linux*, Windows*, FreeBSD*, etc.) runs the virtio front-end drivers; the hypervisor (e.g. QEMU/KVM) provides device emulation and the virtio back-end drivers, with the two sides connected by virtqueues.]
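For orientation, here is the split-virtqueue memory layout those front-end and back-end drivers share, sketched as C structs. Field names follow the VIRTIO specification's vring definitions; this is a reference sketch, not SPDK or kernel source, and the ring sizes are set by the device at queue-setup time.

```c
#include <stdint.h>

/* One entry in the descriptor table: a guest-physical buffer address,
 * its length, flags (VRING_DESC_F_NEXT for chaining, VRING_DESC_F_WRITE
 * for device-writable buffers), and the index of the next descriptor. */
struct vring_desc {
    uint64_t addr;   /* guest-physical address of the buffer */
    uint32_t len;    /* buffer length in bytes */
    uint16_t flags;
    uint16_t next;   /* index of the next descriptor when chained */
};

/* Driver -> device: indices of descriptor chains the guest has made available. */
struct vring_avail {
    uint16_t flags;
    uint16_t idx;      /* incremented by the guest as it posts requests */
    uint16_t ring[];   /* queue-size entries */
};

/* Device -> driver: one element per completed descriptor chain. */
struct vring_used_elem {
    uint32_t id;       /* head index of the completed descriptor chain */
    uint32_t len;      /* bytes written into the buffers by the device */
};

struct vring_used {
    uint16_t flags;
    uint16_t idx;      /* incremented by the device as it completes requests */
    struct vring_used_elem ring[];
};
```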
QEMU virtio-scsi
[Diagram: the guest application submits I/O through the guest kernel into a virtqueue; QEMU's I/O processing thread services the virtqueue and issues the I/O to the host kernel via AIO.]
1. Add IO to virtqueue
2. IO processed by QEMU
3. IO issued to kernel
4. Kernel pins memory
5. Device executes IO
6. Guest completion interrupt
Vhost
• Separate process for I/O processing: a vhost target, in the kernel or in userspace
• vhost protocol for communicating guest VM parameters:
  • memory
  • number of virtqueues
  • virtqueue locations
[Diagram: the guest VM (Linux*, Windows*, FreeBSD*, etc.) keeps its virtio front-end drivers and the hypervisor (e.g. QEMU/KVM) keeps device emulation, but the virtio back-end drivers and their virtqueues are serviced by the vhost target.]
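To make the protocol concrete, the sketch below lists a few real vhost-user request types and a simplified view of one guest memory region entry. The actual wire format adds a message header and per-request payloads, and the enumerator values here do not match the spec's numbering, so treat this as an outline of what gets communicated rather than the exact structures.

```c
#include <stdint.h>

/* A few of the vhost-user requests QEMU sends to the vhost target over the
 * UNIX domain socket (names from the vhost-user spec; values omitted and
 * payloads simplified for illustration). */
enum vhost_user_request {
    VHOST_USER_SET_MEM_TABLE,   /* guest memory regions, passed with shared-memory fds */
    VHOST_USER_SET_VRING_NUM,   /* number of descriptors in a virtqueue */
    VHOST_USER_SET_VRING_ADDR,  /* guest addresses of the desc/avail/used rings */
    VHOST_USER_SET_VRING_KICK,  /* eventfd the guest kicks to signal new requests */
    VHOST_USER_SET_VRING_CALL,  /* eventfd the target signals to interrupt the guest */
};

/* Simplified view of one guest memory region announced via SET_MEM_TABLE.
 * The vhost target mmap()s the accompanying fd so it can translate the
 * guest-physical addresses found in virtqueue descriptors into its own
 * virtual addresses. */
struct vhost_memory_region {
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;   /* QEMU's mapping of the region */
    uint64_t mmap_offset;      /* offset into the shared-memory fd */
};
```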
Kernel vhost
[Diagram: the guest application's I/O goes into a virtqueue serviced by the vhost kernel module, which issues AIO to the device; QEMU is out of the data path, with KVM handling doorbells and completion interrupts.]
1. Add IO to virtqueue
2. Write virtio doorbell
3. Wake vhost kernel
4. Kernel pins memory
5. Device executes IO
6. Guest completion interrupt
SPDK vhost Architecture
[Diagram: the guest VM's virtio-scsi driver places requests into virtqueues that live in shared guest VM memory, i.e. host memory mapped into both QEMU and the SPDK vhost process. The SPDK vhost target, built on DPDK's vhost library, services those virtqueues directly; QEMU and SPDK vhost exchange setup messages over a UNIX domain socket and signal each other via eventfd.]
SPDK vhost I/O flow
[Diagram: the guest application's requests land in a virtqueue that the SPDK vhost process polls directly from userspace; the host kernel and QEMU are out of the I/O path, and completions are signaled back to the guest through KVM.]
1. Add IO to virtqueue
2. Poll virtqueue
3. Device executes IO
4. Guest completion interrupt
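A hand-rolled sketch of what step 2, polling the virtqueue, amounts to. It is illustrative only: SPDK's real vhost target builds on DPDK's rte_vhost and vring helpers, and translate_gpa() / submit_request() below are hypothetical stand-ins for guest-address translation (via the vhost memory table) and bdev submission.

```c
#include <stdint.h>
#include <stdio.h>

/* Split-virtqueue pieces that live in shared guest memory. */
struct vring_desc  { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_avail { uint16_t flags; uint16_t idx; uint16_t ring[]; };

/* Hypothetical stand-ins for the real address translation and bdev hand-off. */
static void *translate_gpa(uint64_t gpa) { (void)gpa; return NULL; }
static void submit_request(void *buf, uint16_t head)
{
    (void)buf;
    printf("submitting descriptor chain starting at index %u\n", head);
}

struct vq_poller {
    struct vring_desc  *desc;            /* descriptor table */
    struct vring_avail *avail;           /* avail ring */
    uint16_t            size;            /* number of descriptors */
    uint16_t            last_avail_idx;  /* how far we have consumed */
};

/* One polling pass: no interrupts, no syscalls - just read the avail ring.
 * (A production poller also needs memory barriers before touching desc[].) */
void poll_virtqueue(struct vq_poller *vq)
{
    while (vq->last_avail_idx != vq->avail->idx) {
        uint16_t head = vq->avail->ring[vq->last_avail_idx % vq->size];
        struct vring_desc *d = &vq->desc[head];

        /* Translate the guest-physical buffer address and start the I/O. */
        submit_request(translate_gpa(d->addr), head);
        vq->last_avail_idx++;
    }
}
```

Because nothing blocks, the same core can run this loop for many virtqueues back to back, which is what the poller model on the following slides formalizes.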
Architecture
[Diagram: the SPDK architecture slide repeated, now showing the vhost-scsi Target in context among the Storage Protocols alongside the iSCSI and NVMe-oF* Targets; the other layers and the Released / New in 18.01 / 1H'18 legend are as before.]
Sharing SSDs in userspace
Typically not a 1:1 mapping between VMs and locally attached NVMe SSDs
 otherwise just use PCI direct assignment
What about SR-IOV?
 SR-IOV SSDs not prevalent yet
 precludes features such as snapshots
What about LVM?
 LVM depends on the Linux kernel block layer and storage drivers (e.g. nvme)
 SPDK wants to use userspace polled mode drivers
SPDK Blobstore and Logical Volumes!
Accelerating Virtual Machine Access with the Storage Performance Development Kit (SPDK) – Vhost Deep Dive
SPDK vhost Performance
[Chart: QD=1 latency in µs (0–50 scale) for Linux, QEMU, and SPDK configurations; see legend below.]
System Configuration: 2S Intel® Xeon® Platinum 8180: 28C, E5-2699v3: 18C, 2.5GHz (HT off), Intel® Turbo Boost Technology enabled, 12x16GB DDR4 2133 MT/s, 1 DIMM per channel, Ubuntu* Server 16.04.2 LTS, 4.11 kernel,
23x Intel® P4800x Optane SSD – 375GB, 1 SPDK lvolstore or LVM lvgroup per SSD, SPDK commit ID c5d8b108f22ab, 46 VMs (CentOS 3.10, 1vCPU, 2GB DRAM, 100GB logical volume), vhost dedicated to 10 cores
As measured by: fio 2.10.1 – Direct=Yes, 4KB random read I/O, Ramp Time=30s, Run Time=180s, Norandommap=1, I/O Engine = libaio, Numjobs=1
Legend: Linux = kernel vhost-scsi; QEMU = virtio-blk dataplane; SPDK = userspace vhost-scsi
SPDK up to 3x better efficiency and latency
48 VMs: vhost-scsi performance (SPDK vs. Kernel)
Intel Xeon Platinum 8180 Processor, 24x Intel P4800x 375GB
2 partitions per VM, 10 vhost I/O processing cores
Aggregate IOPS in millions across 48 VMs:
  4K 100% Read:            vhost-kernel 2.86   vhost-spdk 9.23   (3.2x)
  4K 100% Write:           vhost-kernel 2.77   vhost-spdk 8.98   (3.2x)
  4K 70% Read / 30% Write: vhost-kernel 3.40   vhost-spdk 9.49   (2.7x)
• Aggregate IOPS across all 48 VMs reported; all VMs run on separate cores from the vhost-scsi cores
• 10 vhost-scsi cores for I/O
processing
• SPDK vhost-scsi up to 3.2x better
with 4K 100% Random read I/Os
• Used cgroups to restrict kernel
vhost-scsi processes to 10 cores
System Configuration:Intel Xeon Platinum 8180 @ 2.5GHz. 56 physical cores 6x 16GB, 2667 DDR4, 6 memory Channels, SSD: Intel P4800x 375GB x24 drives, Bios: HT disabled, p-states enabled, turbo enabled, Ubuntu 16.04.1
LTS, 4.11.0 x86_64 kernel, 48 VMs, number of partition: 2, VM config : 1core 1GB memory, VM OS: fedora 25, blk-mq enabled, Software packages: Qemu-2.9, libvirt-3.0.0, spdk (3bfecec994), IO distribution: 10 vhost-cores for SPDK /
Kernel. Rest 46 cores for QEMU using cgroups, FIO-2.1.10 with SPDK plugin, io depth=1, 8, 32 numjobs=1, direct=1, block size 4k
VM Density: Rate Limiting 20K IOPS per VM
Intel Xeon Platinum 8180 Processor, 24x Intel P4800x 375GB
10 vhost-scsi cores
[Chart: aggregate IOPS (higher is better) and % CPU utilization (lower is better) at 24, 48, and 96 VMs, comparing Kernel IOPS, SPDK IOPS, Kernel CPU Util., and SPDK CPU Util., with each VM rate limited to 20K IOPS.]
• % CPU utilization measured from the VM side
• Each VM ran a queue depth=1, 4KB random read workload
• Hyper-Threading enabled, providing 112 logical cores
• Each VM rate limited to 20K IOPS using cgroups
• SPDK scales to 96 VMs while sustaining 20K IOPS per VM; the kernel scales to 48 VMs, beyond which the 10 vhost cores appear to be the bottleneck
System Configuration:Intel Xeon Platinum 8180 @ 2.5GHz. 56 physical cores 6x 16GB, 2667 DDR4, 6 memory Channels, SSD: Intel P4800x 375GB x24 drives, Bios: HT disabled, p-states enabled, turbo enabled,
Ubuntu 16.04.1 LTS, 4.11.0 x86_64 kernel, 48 VMs, number of partition: 2, VM config : 1core 1GB memory, VM OS: fedora 25, blk-mq enabled, Software packages: Qemu-2.9, libvirt-3.0.0, spdk (3bfecec994), IO
distribution: 10 vhost-cores for SPDK / Kernel. Rest 46 cores for QEMU using cgroups, FIO-2.1.10 with SPDK plugin, io depth=1, 8, 32 numjobs=1, direct=1, block size 4k
Accelerating Virtual Machine Access with the Storage Performance Development Kit (SPDK) – Vhost Deep Dive
VM ephemeral storage
[Diagram: the VM attaches through SPDK vhost to a logical volume bdev and an NVMe bdev in the block device abstraction layer (BDAL), backed by the SPDK NVMe driver and a local Intel® SSD for the datacenter.]
• Increased efficiency yields greater VM density

VM remote storage
[Diagram: the VM attaches through SPDK vhost to an NVMe-oF bdev; the SPDK NVMe-oF initiator connects to a remote NVMe-oF target fronting Intel® SSDs for the datacenter.]
• Enables disaggregation and migration of VMs using remote storage

VM Ceph storage
[Diagram: the VM attaches through SPDK vhost to a Ceph bdev that uses the Ceph RBD driver to reach a Ceph cluster.]
• Potential for innovation in data services
• Cache
• Deduplication
For More information on SPDK
• Visit SPDK.io for tutorials and links to GitHub, the mailing list, the IRC channel and other resources
• Follow @SPDKProject on Twitter for the latest events, blogs and other SPDK community information and activities
Accelerating Virtual Machine Access with the Storage Performance Development Kit (SPDK) – Vhost Deep Dive
Basic Architecture
Configure vhost-scsi controller
 via JSON RPC
 creates SPDK constructs for the vhost device and backing storage
 creates a controller-specific vhost domain socket
[Diagram: a vhost-scsi controller exposing a SCSI device and LUN backed by an nvme bdev on an NVMe SSD; the controller's domain socket is /spdk/vhost.0, and two logical cores (Logical Core 0, Logical Core 1) are available to run pollers.]
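As a rough illustration of the JSON RPC step, this C sketch sends a single request to the SPDK application's RPC UNIX domain socket. The socket path and the method name construct_vhost_scsi_controller follow my recollection of the 18.01-era defaults (later releases renamed the vhost RPCs), so treat both as assumptions and prefer the release's own scripts/rpc.py.

```c
/* Minimal sketch: send one JSON-RPC request to the SPDK RPC socket.
 * Assumptions: /var/tmp/spdk.sock as the default RPC socket and the
 * 18.01-era method name "construct_vhost_scsi_controller". */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
    const char *rpc_sock = "/var/tmp/spdk.sock";
    const char *request =
        "{ \"jsonrpc\": \"2.0\", \"id\": 1, "
        "  \"method\": \"construct_vhost_scsi_controller\", "
        "  \"params\": { \"ctrlr\": \"vhost.0\" } }";

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, rpc_sock, sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    /* Fire the request and echo whatever the target answers. */
    if (write(fd, request, strlen(request)) < 0)
        perror("write");

    char reply[4096];
    ssize_t n = read(fd, reply, sizeof(reply) - 1);
    if (n > 0) {
        reply[n] = '\0';
        printf("%s\n", reply);
    }
    close(fd);
    return 0;
}
```

In practice this workflow is driven with scripts/rpc.py rather than hand-written C; the point is only that controller creation is a plain JSON-RPC call, after which SPDK creates the vhost device, attaches backing storage via further RPCs, and opens the controller-specific socket (here /spdk/vhost.0) that QEMU connects to at VM launch.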
Basic Architecture
Launch VM
 QEMU connects to domain
socket
SPDK
 Assigns logical core
 Starts vhost dev poller
 Allocates NVMe queue pair
 Starts NVMe poller
[Diagram: once the VM connects to /spdk/vhost.0, Logical Core 0 runs a vhost-scsi poller servicing the controller's virtqueues and a bdev-nvme poller servicing the NVMe queue pair allocated for this VM; Logical Core 1 remains free.]
Basic Architecture
Repeat for additional VMs
 pollers spread across available cores
[Diagram: each additional VM gets its own vhost-scsi poller and bdev-nvme poller pair, distributed across Logical Core 0 and Logical Core 1, all in front of the same vhost-scsi controller, bdev, and NVMe SSD.]
Accelerating Virtual Machine Access with the Storage Performance Development Kit (SPDK) – Vhost Deep Dive
Blobstore Design – Design Goals
• Minimalistic for targeted storage
use cases like Logical Volumes
and RocksDB
• Deliver only the basics to enable
another class of application
• Design for fast storage media
Blobstore Design – High Level
Application interacts with chunks of data called blobs
 Mutable array of pages of data, accessible via ID
Asynchronous
 No blocking, queuing or waiting
Fully parallel
 No locks in IO path
Atomic metadata operations
 Depends on SSD atomicity (i.e. NVMe)
 1+ 4KB metadata pages per blob
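To illustrate that asynchronous, lock-free style, here is a hedged sketch of a blob create-then-write chain. The function and type names are simplified stand-ins, not SPDK's exact identifiers (which have changed across releases); the point is the shape: every call returns immediately and progress happens in completion callbacks driven by the poller.

```c
#include <stdint.h>
#include <stdio.h>

/* Hedged, simplified stand-ins for a blobstore-style API: every operation
 * takes a completion callback and returns immediately; nothing blocks and
 * there are no locks in the I/O path. */
typedef uint64_t blob_id_t;
typedef void (*blob_op_done)(void *ctx, blob_id_t id, int bserrno);

/* Toy implementations that complete "immediately"; a real blobstore would
 * queue the operation and invoke the callback later from its poller. */
static void blobstore_create_blob(void *bs, blob_op_done cb, void *ctx)
{
    (void)bs;
    cb(ctx, 42, 0);                        /* new blob ID, success */
}

static void blob_write_pages(void *bs, blob_id_t id, const void *buf,
                             uint64_t page_offset, uint64_t page_count,
                             blob_op_done cb, void *ctx)
{
    (void)bs; (void)buf; (void)page_offset; (void)page_count;
    cb(ctx, id, 0);                        /* pretend the 4KB page landed */
}

/* End of the chain: the write completed. */
static void write_done(void *ctx, blob_id_t id, int bserrno)
{
    (void)ctx;
    printf("blob %llu write completed (status %d)\n",
           (unsigned long long)id, bserrno);
}

/* Creation completed: start the first write from inside the callback. */
static void create_done(void *ctx, blob_id_t id, int bserrno)
{
    static char page[4096];                /* one 4KB page of payload */
    if (bserrno == 0)
        blob_write_pages(ctx, id, page, 0, 1, write_done, ctx);
}

int main(void)
{
    void *bs = NULL;                       /* stands in for an opened blobstore */
    blobstore_create_blob(bs, create_done, bs);
    return 0;
}
```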
Logical Volumes
Blobstore plus:
 UUID xattr for lvolstore, lvols
 Friendly names
– lvol name unique within lvolstore
– lvolstore name unique within application
 Future
– snapshots (requires blobstore support)
[Diagram: an NVMe SSD is exposed as an nvme bdev; a blobstore/lvolstore sits on top of it, and each lvol is in turn exposed as its own bdev.]
Asynchronous Polling
Poller execution
 Reactor on each core
 Iterates through pollers
round-robin
 vhost-scsi poller
– poll for new I/O requests
– submit to NVMe SSD
 bdev-nvme poller
– poll for I/O completions
– complete to guest VM
[Diagram: each logical core runs through its registered pollers; the vhost-scsi pollers watch their virtqueues for new requests and the bdev-nvme pollers watch their NVMe queue pairs for completions, for every VM assigned to that core.]
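A compact sketch of that reactor/poller model, written from scratch rather than against SPDK's headers (the real framework registers pollers with spdk_poller_register and runs one reactor per core); it only mirrors the round-robin iteration described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* A poller is a function the reactor calls over and over; it returns true
 * if it found work on this pass. */
typedef bool (*poller_fn)(void *ctx);

struct poller { poller_fn fn; void *ctx; };

struct reactor {
    struct poller pollers[16];   /* e.g. a vhost-scsi and a bdev-nvme poller per VM */
    size_t        count;
    bool          running;
};

/* One reactor per core: iterate the registered pollers round-robin, forever.
 * No interrupts, no blocking - a vhost-scsi poller checks its virtqueues for
 * new requests, a bdev-nvme poller checks its queue pair for completions. */
void reactor_run(struct reactor *r)
{
    while (r->running)
        for (size_t i = 0; i < r->count; i++)
            r->pollers[i].fn(r->pollers[i].ctx);
}

/* Tiny demo poller so the sketch is self-contained: "finds work" five times,
 * then asks the reactor to stop. */
static bool demo_poller(void *ctx)
{
    struct reactor *r = ctx;
    static int passes;
    if (++passes >= 5)
        r->running = false;
    return true;
}

int main(void)
{
    struct reactor r = { .running = true };
    r.pollers[r.count++] = (struct poller){ demo_poller, &r };
    reactor_run(&r);
    printf("reactor stopped after the demo poller ran\n");
    return 0;
}
```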