Achieving Performance Isolation with Lightweight Co-Kernels
Jiannan Ouyang, Brian Kocoloski, John Lange (The Prognostic Lab @ University of Pittsburgh)
Kevin Pedretti (Sandia National Laboratories)
HPDC 2015
HPC Architecture
—  Move computation to data
—  Improved data locality
—  Reduced power consumption
—  Reduced network traffic
[Figure: "Traditional In Situ Data Processing" — a compute node whose OS/R hosts both the simulation and the analytic/visualization workloads, contrasted with a supercomputer that ships data to a shared storage cluster and a separate processing cluster. Problem: massive data movement over interconnects.]
Challenge: Predictable High Performance
—  Tightly coupled HPC workloads are sensitive to OS noise
and overhead [Petrini SC’03, Ferreira SC’08, Hoefler SC’10]
—  Specialized kernels for predictable performance
—  Tailored from Linux: CNL for Cray supercomputers
—  Lightweight kernels (LWK) developed from scratch: IBM CNK, Kitten
—  Data processing workloads favor Linux environments
—  Cross workload interference
—  Shared hardware (CPU time, cache, memory bandwidth)
—  Shared system software
How can we provide both Linux and specialized kernels on the same node,
while ensuring performance isolation?
Approach: Lightweight Co-Kernels
—  Hardware resources on one node are dynamically composed into
multiple partitions or enclaves
—  Independent software stacks are deployed on each enclave
—  Optimized for certain applications and hardware
—  Performance isolation at both the software and hardware level
[Figure: today, the simulation and the analytic/visualization workloads share one Linux stack on the hardware; with co-kernels, the simulation runs on a dedicated LWK enclave while the analytic/visualization workload stays on Linux, side by side on the same node.]
Agenda
—  Introduction
—  The Pisces Lightweight Co-Kernel Architecture
—  Implementation
—  Evaluation
—  Related Work
—  Conclusion
Building Blocks: Kitten and Palacios
—  The Kitten Lightweight Kernel (LWK)
—  Goal: provide predictable performance for massively parallel HPC applications
—  Simple resource management policies
—  Limited kernel I/O support + direct user-level network access
—  The Palacios Lightweight Virtual Machine Monitor (VMM)
—  Goal: predictable performance
—  Lightweight resource management policies
—  Established history of providing virtualized environments for HPC [Lange et al.
VEE '11, Kocoloski and Lange ROSS '12]
Kitten: https://software.sandia.gov/trac/kitten
Palacios: http://www.prognosticlab.org/palacios http://www.v3vee.org/
The Pisces Lightweight Co-Kernel Architecture
[Figure: one node running Linux alongside two Kitten co-kernels, each launched and managed by Pisces; co-kernel (1) hosts an isolated virtual machine under the Palacios VMM, while co-kernel (2) hosts an isolated application. Applications and virtual machines continue to run on the Linux side.]
http://www.prognosticlab.org/pisces/
Pisces Design Goals
—  Performance isolation at both software and hardware level
—  Dynamic creation of resizable enclaves
—  Isolated virtual environments
Design Decisions
—  Elimination of cross OS dependencies
—  Each enclave must implement its own complete set of supported
system calls
—  No system call forwarding is allowed
—  Internalized management of I/O
—  Each enclave must provide its own I/O device drivers and manage
its hardware resources directly
—  Userspace cross enclave communication
—  Cross enclave communication is not a kernel-provided feature
—  Cross enclave shared memory is explicitly set up at runtime (XEMEM)
—  Using virtualization to provide missing OS features
Cross Kernel Communication
[Figure: two hardware partitions. The Linux partition runs Linux-compatible workloads and a control process; the Kitten co-kernel partition runs isolated processes and virtual machines with its own control process. In kernel context, the two kernels exchange cross-kernel messages over a shared-memory control channel; in user context, applications communicate over shared-memory communication channels.]
XEMEM: Efficient Shared Memory for Composed
Applications on Multi-OS/R Exascale Systems
[Kocoloski and Lange, HPDC‘15]
Challenges & Approaches
—  How to boot a co-kernel?
—  Hot-remove resources from Linux, and load co-kernel
—  Reuse Linux boot code with modified target kernel address
—  Restrict the Kitten co-kernel to access assigned resources only
—  How to share hardware resources among kernels?
—  Hot-remove from Linux + direct assignment and adjustment (e.g.
CPU cores, memory blocks, PCI devices)
—  Managed by Linux and Pisces (e.g. IOMMU)
—  How to communicate with a co-kernel?
—  Kernel level: IPI + shared memory, primarily for Pisces commands
—  Application level: XEMEM [Kocoloski HPDC’15]
—  How to route device interrupts?
I/O Interrupt Routing
[Figure: two interrupt paths. Legacy interrupt forwarding: a legacy device raises INTx through the IO-APIC to the management kernel, whose IRQ forwarder relays the interrupt to the co-kernel's IRQ handler via IPI. Direct device assignment (w/ MSI): an MSI/MSI-X device delivers its MSI directly to the co-kernel's IRQ handler.]
•  Legacy interrupt vectors are potentially shared among multiple devices
•  Pisces provides IRQ forwarding service
•  IRQ forwarding is only used during initialization for PCI devices
•  Modern PCI devices support dedicated interrupt vectors (MSI/MSI-X)
•  Directly route to the corresponding enclave
Implementation
—  Pisces
—  Linux kernel module supports unmodified Linux kernels
(2.6.3x – 3.x.y)
—  Co-kernel initialization and management
—  Kitten (~9000 LOC changes)
—  Manage assigned hardware resources
—  Dynamic resource assignment
—  Kernel level communication channel
—  Palacios (~5000 LOC changes)
—  Dynamic resource assignment
—  Command forwarding channel
Pisces: http://www.prognosticlab.org/pisces/
Kitten: https://software.sandia.gov/trac/kitten
Palacios: http://www.prognosticlab.org/palacios http://www.v3vee.org/
Evaluation
—  8-node Dell R450 cluster
—  Two six-core Intel “Ivy-Bridge” Xeon processors
—  24GB RAM split across two NUMA domains
—  QDR Infiniband
—  CentOS 7, Linux kernel 3.16
—  For performance isolation experiments, the hardware is
partitioned by NUMA domains.
—  i.e. Linux on one NUMA domain, co-kernel on the other
Fast Pisces Management Operations
Operation                     Latency (ms)
Booting a co-kernel           265.98
Adding a single CPU core      33.74
Adding a 128MB memory block   82.66
Adding an Ethernet NIC        118.98
Eliminating Cross Kernel Dependencies
                solitary workloads (µs)   w/ other workloads (µs)
Linux           3.05                      3.48
co-kernel fwd   6.12                      14.00
co-kernel       0.39                      0.36

Execution Time of getpid()
—  Co-kernel has the best average performance
—  Co-kernel has the most consistent performance
—  System call forwarding has longer latency and suffers from
cross stack performance interference
Noise Analysis
[Figure: OS interruption latency (0-20 µs) sampled over 5 seconds, (a) without and (b) with competing workloads, for Linux and the Kitten co-kernel.]
Co-Kernel: less noise + better isolation
* Each point represents the latency of an OS interruption
Single Node Performance
[Figure: CoMD completion time (seconds) and Stream throughput, measured on CentOS, Kitten/KVM, and the co-kernel, each with and without background (bg) workloads.]
Co-Kernel: consistent performance + performance isolation
8 Node Performance
[Figure: HPCCG throughput (GFLOP/s) vs. number of nodes (1-8) for co-VMM, native Linux, and KVM, each with and without background (bg) workloads.]
w/o bg: co-VMM achieves native Linux performance
w/ bg: co-VMM outperforms native Linux
Co-VMM for HPC in the Cloud
[Figure: CDF of HPCCG runtime (44-51 seconds) under Co-VMM, native Linux, and KVM.]
CDF of HPCCG Performance (running with Hadoop, 8 nodes)
co-VMM: consistent performance + performance isolation
Related Work
— Exascale operating systems and runtimes (OS/Rs)
—  Hobbes (SNL, LBNL, LANL, ORNL, U. Pitt, various universities)
—  Argo (ANL, LLNL, PNNL, various universities)
—  FusedOS (Intel / IBM)
—  mOS (Intel)
—  McKernel (RIKEN AICS, University of Tokyo)
Our uniqueness: performance isolation, dynamic
resource composition, lightweight virtualization
Conclusion
—  Design and implementation of the Pisces co-kernel architecture
—  Pisces framework
—  Kitten co-kernel
—  Palacios VMM for the Kitten co-kernel
—  Demonstrated that the co-kernel architecture provides
—  Optimized execution environments for in situ processing
—  Performance isolation
https://software.sandia.gov/trac/kitten
http://www.prognosticlab.org/pisces/
http://www.prognosticlab.org/palacios
Thank You
Jiannan Ouyang
—  Ph.D. Candidate @ University of Pittsburgh
—  ouyang@cs.pitt.edu
—  http://people.cs.pitt.edu/~ouyang/
—  The Prognostic Lab @ U. Pittsburgh
—  http://www.prognosticlab.org
