Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices

Ryousei Takano, Hidemoto Nakada, Takahiro Hirofuchi, Yoshio Tanaka, and Tomohiro Kudoh

Information Technology Research Institute,
National Institute of Advanced Industrial Science and Technology (AIST), Japan

IEEE eScience 2012, Oct. 11, 2012, Chicago
Background
•  HPC cloud is a promising e-Science platform.
   –  HPC users are beginning to take an interest in cloud computing,
      e.g., Amazon EC2 Cluster Compute Instances.
•  Virtualization is a key technology.
   –  Pro: It makes migration of computing elements easy.
      •  VM migration is useful for achieving fault tolerance, server
         consolidation, etc.
   –  Con: It introduces a large overhead that spoils I/O performance.
      •  VMM-bypass I/O technologies, e.g., PCI passthrough and
         SR-IOV, can significantly mitigate the overhead.

  However, VMM-bypass I/O makes it impossible to migrate a VM.
                                                                        2
Contribution
•  Goal:
   –  To realize VM migration and checkpoint/restart on a
      virtualized cluster with VMM-bypass I/O devices.
      •  E.g., VM migration on an Infiniband cluster
•  Contributions:
   –  We propose cooperative VM migration based on the
      Symbiotic Virtualization (SymVirt) mechanism.
   –  We demonstrate reactive/proactive fault tolerant (FT)
      systems.
   –  We show that postcopy migration helps to reduce the
      service downtime in the proactive FT system.

                                                              3
Agenda
•    Background and Motivation
•    SymVirt: Symbiotic Virtualization Mechanism
•    Experiment
•    Related Work
•    Conclusion and Future Work




                                                   4
Motivating Observation
 •  Performance evaluation of HPC cloud
                                   –  (Para-)virtualized I/O incurs a large overhead.
                                   –  PCI passthrough significantly mitigates the overhead.
[Figure: execution time in seconds of the NAS Parallel Benchmarks 3.3.1
(class C, 64 processes; BT, CG, EP, FT, LU) for BMM (IB), BMM (10GbE),
KVM (IB), and KVM (virtio), i.e., the overhead of I/O virtualization.
Side diagrams show the two KVM configurations: with KVM (IB) the guest's
physical driver directly drives the IB QDR HCA via PCI passthrough; with
KVM (virtio) a guest driver forwards I/O to the physical driver in the VMM,
which drives a 10GbE NIC. BMM: Bare Metal Machine]
                                                                                                                      5
Para-virtualized device (virtio_net) vs. VMM-bypass I/O (PCI passthrough, SR-IOV)

[Figure: I/O paths of the three options. With virtio_net, a guest driver
forwards I/O to a software switch (vSwitch) and the physical driver inside the
VMM. With PCI passthrough, the guest's physical driver drives the NIC directly,
bypassing the VMM. With SR-IOV, each VM directly drives its own virtual
function of the NIC, and an embedded switch (VEB) on the NIC shares the device
among VMs. The accompanying table compares the three options on performance,
device sharing, and VM migration; VM migration under VMM-bypass I/O is the
issue we address.]
                                                                                            6
Problem
VMM-bypass I/O technologies make VM migration
and checkpoint/restart impossible.
  1.  The VMM does not know when VMM-bypass I/O devices can be
      detached safely.
     •  To perform migration without losing in-flight data, packet
        transmission to/from the VM must be stopped before detaching.
     •  From the VMM, it is hard to know the communication status of
        an application inside the VM, especially if VMM-bypass I/O
        devices are used.
  2.  The VMM cannot migrate the state of VMM-bypass I/O devices
      from the source to the destination.
     •  With InfiniBand: Local IDs (LIDs), Queue Pair Numbers (QPNs), etc.
                                                                     7
Goal
[Figure: a VM is migrated over Ethernet from Cluster 1 to Cluster 2. The
InfiniBand HCA is detached from the VM before the migration and re-attached at
the destination, while the other VMs (VM1-VM4) keep running.]

   We need a mechanism that combines VM migration with PCI device hot-plugging.

                                                                     8
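On an unmodified QEMU/KVM host, the detach / migrate / re-attach sequence above maps onto standard monitor commands. The following is a minimal sketch of that sequence over QMP; the socket paths, the device id 'vf0', the host PCI address, and the destination URI are illustrative assumptions, not the controller used in this work.

    import json
    import socket

    def qmp_connect(path):
        """Connect to a QEMU QMP UNIX socket and negotiate capabilities."""
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(path)
        f = sock.makefile("rw")
        json.loads(f.readline())                       # QMP greeting banner
        f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
        f.flush()
        json.loads(f.readline())                       # empty "return" on success
        return f

    def qmp_cmd(f, name, args):
        """Issue one QMP command and return its response."""
        f.write(json.dumps({"execute": name, "arguments": args}) + "\n")
        f.flush()
        return json.loads(f.readline())

    # Source host: detach the passthrough device, then start the migration.
    src = qmp_connect("/tmp/qmp-src.sock")                          # assumed socket path
    qmp_cmd(src, "device_del", {"id": "vf0"})                       # hot-remove the HCA/VF
    qmp_cmd(src, "migrate", {"uri": "tcp:dest-host:4444"})          # assumed destination

    # Destination host: once the VM runs there, re-attach the device.
    dst = qmp_connect("/tmp/qmp-dst.sock")
    qmp_cmd(dst, "device_add",
            {"driver": "vfio-pci", "host": "04:00.0", "id": "vf0"})

By itself this sequence is unsafe for an MPI job: the guest must first quiesce its InfiniBand traffic, which is exactly what the SymVirt coordination described next provides.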
SymVirt: Symbiotic Virtualization
Cooperative VM Migration
[Figure: comparison of migration approaches. Existing VM migration (black-box
approach; pro: portability) performs global coordination and device setup
entirely inside the VMM. Cooperative VM migration (gray-box approach; pro:
performance) moves global coordination and device setup up into the guest OS
and application, which cooperate with the VMM so that migration can coexist
with VMM-bypass I/O.]
                                                                  10
SymVirt: Symbiotic Virtualization
•  We focus on MPI programs.
•  We design and implement a symbiotic virtualization (SymVirt) mechanism.
   –  It is a cross-layer mechanism between a VMM and an MPI runtime system.

[Figure: SymVirt sits between the application/MPI layer in the guest OS, which
performs global coordination and device setup, and the VMM, which performs the
migration, enabling cooperation while the VM uses VMM-bypass I/O.]
                                                                    11
SymVirt: Overview
[Figure: SymVirt architecture across nodes #0 ... #n. Each node runs a guest OS
with the MPI application, the MPI library, and a SymVirt coordinator; the VMM
on each node hosts a SymVirt agent, and the SymVirt controller (shown on
node #0) drives the agents and receives triggers from the cloud scheduler.]

 1) The cloud scheduler triggers events (e.g., a failure prediction).
 2) The SymVirt coordinators perform global coordination across the MPI job.
 3) Each coordinator issues a SymVirt wait call, blocking its guest OS.
 4) The SymVirt controller/agents do something on the VMM side, e.g.,
    migration or checkpointing.
 5) The agents issue SymVirt signal calls and the guests resume.
                                                                                                                            12
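The per-node flow can be read as a small loop. The sketch below is only a conceptual rendering of steps 1-5, using mpi4py for the global coordination and a stub in place of the real VMCALL-based wait; apart from mpi4py, the names are assumptions, not the actual SymVirt code.

    from mpi4py import MPI  # global coordination across the MPI job

    def symvirt_wait():
        """Stub for the real SymVirt wait hypercall (VMCALL); it blocks the
        guest until the VMM-side agent issues a SymVirt signal."""
        raise NotImplementedError("issued via a hypercall in the real system")

    def on_trigger(comm=MPI.COMM_WORLD):
        # 2) global coordination: quiesce MPI communication on every rank so
        #    no InfiniBand traffic is in flight when the device is detached.
        comm.Barrier()
        # 3) every rank blocks in the VMM ...
        symvirt_wait()
        # 4) ... while the SymVirt controller/agents detach the HCA, migrate
        #    or checkpoint the VM, and re-attach the device.
        # 5) after SymVirt signal, the ranks resume; the MPI runtime then
        #    re-establishes its InfiniBand connections (new LIDs/QPNs), as in
        #    Open MPI's checkpoint/restart path.
        comm.Barrier()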
SymVirt wait and signal calls
•  SymVirt provides a simple Guest OS-to-VMM
   communication mechanism.
•  When the SymVirt coordinator issues a SymVirt wait call, the guest
   OS is blocked until a SymVirt signal call is issued.
•  In the meantime, the SymVirt agent controls the VM via a VMM
   monitor interface.
[Figure: timeline of one migration. The application confirms quiescence and the
SymVirt coordinator issues a wait call, switching from guest OS mode to VMM
mode; the SymVirt controller/agent then detaches the device, migrates the VM,
and re-attaches the device; a signal call returns control to the guest, which
confirms that the InfiniBand link is up again.]
                                                                                    13
SymVirt: Implementation
•  We implemented SymVirt on top of QEMU/KVM
   and the Open MPI system.
•  User applications and the MPI runtime system can work without any
   modifications.
•  QEMU/KVM is slightly modified to support SymVirt wait and signal calls.
  –  A SymVirt wait call is implemented using the VMCALL Intel VT-x
     instruction.
  –  A SymVirt signal call is implemented as a new QEMU/KVM monitor
     command.

[Figure: the SymVirt coordinator in the guest issues the wait call; the SymVirt
agent on the VMM side issues the signal call.]

                                                                    14
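On the host side, the signal can be delivered from the agent through QEMU's monitor. A minimal sketch, assuming a QMP connection like the one in the earlier sketch; 'symvirt_signal' is a hypothetical name for the new monitor command (the slides do not give the actual command name).

    import json

    def symvirt_signal(qmp_file):
        """Deliver a SymVirt signal to a guest blocked in its wait hypercall.
        The (hypothetical) new monitor command is issued here through QMP's
        human-monitor-command passthrough."""
        msg = {"execute": "human-monitor-command",
               "arguments": {"command-line": "symvirt_signal"}}
        qmp_file.write(json.dumps(msg) + "\n")
        qmp_file.flush()
        return json.loads(qmp_file.readline())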
SymVirt: Implementation (cont’d)
•  The SymVirt coordinator relies heavily on the Open MPI
   checkpoint/restart (C/R) framework.
  –  The global coordination of SymVirt is the same as the coordination
     protocol for MPI programs.
  –  SymVirt executes VM-level migration or C/R instead of the
     process-level C/R that uses the BLCR system.
  –  SymVirt does not need to handle the LIDs and QPNs that change
     after a migration, because Open MPI's BTL modules are
     re-constructed and connections are re-established at the continue
     or restart phases.


                            BTL: Point-to-Point Byte Transfer Layer
                                                                      15
SymVirt: Implementation (cont’d)
•  SymVirt controller and agent are written in Python.
import symvirt

agent_list = [migrate_from]
ctl = symvirt.Controller(agent_list)

# device detach
ctl.wait_all()
kwargs = {'tag': 'vf0'}
ctl.device_detach(**kwargs)
ctl.signal()

# vm migration
ctl.wait_all()
kwargs = {'postcopy': True,
          'uri': 'tcp:%s:%d' % (migrate_to[0], migrate_port)}
ctl.migrate(**kwargs)
ctl.remove_agent(migrate_from)

# device attach
ctl.append_agent(migrate_to)
ctl.wait_all()
kwargs = {'pci_id': '04:00.0', 'tag': 'vf0'}
ctl.device_attach(**kwargs)
ctl.signal()

ctl.close()

                                                                           16
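The Controller interface itself is not documented in the slides; the stub below merely restates, as an inferred signature sketch, the methods the script above uses so that the control flow is easier to follow. It is an assumption, not the actual symvirt module.

    class Controller:
        """Inferred interface of the SymVirt controller used by the script above."""
        def __init__(self, agents): ...            # connect to the listed VMM agents
        def append_agent(self, agent): ...         # start managing another agent
        def remove_agent(self, agent): ...         # stop managing an agent
        def wait_all(self): ...                    # wait until every guest reaches SymVirt wait
        def signal(self): ...                      # resume all waiting guests (SymVirt signal)
        def device_detach(self, tag): ...          # hot-remove a passthrough device
        def device_attach(self, pci_id, tag): ...  # hot-add a passthrough device
        def migrate(self, uri, postcopy=False): ...# migrate the VM to uri
        def close(self): ...                       # disconnect from all agents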
SymPFT: Proactive FT system
•  A VM-level fault tolerant (FT) system is a use
   case of SymVirt.


[Figure: the cloud scheduler allocates a virtualized cluster for the user, who
requires a virtualized cluster consisting of 4 nodes (16 CPUs); VM images are
held on global storage.]
                                                                          17
SymPFT: Proactive FT system
•  A VM-level fault tolerant (FT) system is a use
   case of SymVirt.
•  A VM is migrated from an "unhealthy" node to a "healthy" node
   before the node crashes.

[Figure: on a failure prediction, the cloud scheduler re-allocates the cluster
and the VM on the failing node is migrated to a healthy node; VM images reside
on global storage.]
                                                                          18
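How a cloud scheduler might drive SymPFT on a failure prediction can be sketched by wrapping the controller script shown earlier. The predictor callable, host lists, and polling period below are illustrative assumptions; only the symvirt calls mirror the earlier script.

    import time
    import symvirt

    def evacuate(unhealthy_host, healthy_host, port=4444):
        """One proactive-FT action: detach HCA, migrate the VM, re-attach."""
        ctl = symvirt.Controller([unhealthy_host])
        ctl.wait_all(); ctl.device_detach(tag='vf0'); ctl.signal()   # detach HCA
        ctl.wait_all()
        ctl.migrate(postcopy=True, uri='tcp:%s:%d' % (healthy_host, port))
        ctl.remove_agent(unhealthy_host)
        ctl.append_agent(healthy_host)                               # destination VMM
        ctl.wait_all(); ctl.device_attach(pci_id='04:00.0', tag='vf0'); ctl.signal()
        ctl.close()

    def sympft_loop(active_hosts, spare_hosts, predicted_to_fail, period=30):
        """Poll a (hypothetical) failure predictor and evacuate VMs proactively."""
        while True:
            for host in [h for h in active_hosts if predicted_to_fail(h)]:
                spare = spare_hosts.pop()
                evacuate(host, spare)
                active_hosts.remove(host)
                active_hosts.append(spare)
            time.sleep(period)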
Experiment
Experiment
•  The overhead of SymPFT
  –  We used 8 VMs on an Infiniband cluster.
  –  We migrated a VM once during a benchmark
     execution.
•  Two benchmark programs written in MPI
  –  memtest: a simple memory-intensive benchmark
  –  NAS Parallel Benchmarks (NPB) version 3.3.1


•  Overhead reduction using postcopy migration


                                                    20
Experimental setting
      We used a 16-node InfiniBand cluster, which is a part of the AIST Green Cloud.

Blade server: Dell PowerEdge M610
  CPU          Intel quad-core Xeon E5540/2.53GHz x2
  Chipset      Intel 5520
  Memory       48 GB DDR3
  InfiniBand   Mellanox ConnectX (MT26428)

Blade switch
  InfiniBand   Mellanox M3601Q (QDR, 16 ports)

Host machine environment
  OS             Debian 7.0
  Linux kernel   3.2.18
  QEMU/KVM       1.1-rc3
  MPI            Open MPI 1.6
  OFED           1.5.4.1
  Compiler       gcc/gfortran 4.4.6

VM environment
  VCPU     8
  Memory   20 GB

Only 1 VM runs on each host, and an IB HCA is assigned to the VM by using PCI passthrough.
                                                                                   21
Result: memtest
•  The migration time is dependent on the memory footprint.
   –  The migration throughput is less than 3 Gbps.
•  Both hotplug and link-up times are approximately constant.
   –  The link-up time is not a negligible overhead (cf. Ethernet).
[Figure: execution time breakdown of memtest (migration / hotplug / link-up, in
seconds) for memory footprints of 2, 4, 8, and 16 GB:

                 2GB     4GB     8GB     16GB
  migration     35.9    38.7    44.2    53.7
  hotplug       14.6    13.5    12.5    11.3
  link-up       28.5    28.5    28.5    28.6

This result is not included in our proceedings paper.]
                                                                                           22
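A rough consistency check of the "less than 3 Gbps" claim, assuming the migration segments read from the figure carry the whole memory footprint:

    # Migration throughput per footprint, from (GB transferred, seconds).
    for gb, sec in [(2, 35.9), (4, 38.7), (8, 44.2), (16, 53.7)]:
        print("%2d GB: %.2f Gbps" % (gb, gb * 8 / sec))   # 0.45, 0.83, 1.45, 2.38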
Result: NAS Parallel Benchmarks
[Figure: execution time of NPB class C (BT, CG, FT, LU) for baseline, precopy
migration, and postcopy migration, broken down into application, hotplug,
link-up, and migration time. There is no overhead during normal operations
(baseline); the annotated overheads relative to the baseline are roughly
+105 s (BT), +97 s (CG), +299 s (FT), and +103 s (LU), and the overhead is
proportional to the memory footprint.]

Transferred Memory Size during VM Migration [MB]
  BT      CG      FT       LU
  4417    3394    15678    2348
                                                                                                                                            23
Integration with postcopy migration
 •  In contrast to precopy migration, postcopy migration
    transfers memory pages on demand after the execution
    node is switched to the destination.
 •  Postcopy migration can hide the hot-add and link-up overheads by
    overlapping them with the migration.
 •  We used our postcopy migration implementation for
    QEMU/KVM, Yabusame.

[Figure: SymPFT (precopy) executes a) hot-del, b) migration, c) hot-add, and
d) link-up strictly in sequence; SymPFT (postcopy) overlaps hot-add and link-up
with the background page transfer, mitigating the overhead.]
                                                                     24
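In the controller script, the only visible change between the two modes is the migrate call; the difference lies in when execution switches hosts. A hypothetical side-by-side, assuming 'ctl' is a symvirt.Controller as in the earlier script and that the postcopy keyword can also be set to False:

    def migrate_precopy(ctl, dest_uri):
        # Precopy: memory is copied first; the VM only starts running on the
        # destination (so hot-add and link-up can only begin) after the copy.
        ctl.migrate(uri=dest_uri, postcopy=False)

    def migrate_postcopy(ctl, dest_uri):
        # Postcopy (Yabusame): execution switches to the destination almost
        # immediately; remaining pages are fetched on demand, so hot-add and
        # link-up overlap with the memory transfer.
        ctl.migrate(uri=dest_uri, postcopy=True)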
Result: Effect of postcopy migration
[Figure: execution time of NPB (BT, CG, FT, LU) for baseline, precopy, and
postcopy runs, broken down into application, hotplug, link-up, and migration
time. Relative to precopy, postcopy migration reduces the overhead by about
15% (BT), 14% (CG), and 53% (FT), and by about 13 s (LU): postcopy migration
can hide the overhead of hotplug and link-up by overlapping them with the
migration.]

Transferred Memory Size during VM Migration [MB]
  BT      CG      FT       LU
  4417    3394    15678    2348
                                                                                                                                            25
Related Work
•  Some VM-level reactive and proactive FT systems have
   been proposed for HPC systems.
   –  E.g., VNsnap: distributed snapshots of VMs
      •  The coordination is executed by snooping the traffic of a software
         switch outside the VMs.
   –  They do not support VMM-bypass I/O devices.
•  Mercury: a self-virtualization technique
   –  An OS can turn virtualization on and off on demand.
   –  It lacks a coordination mechanism among distributed VMMs.
•  SymCall: an upcall mechanism from a VMM to the guest
   OS, using a nested VM Exit call
   –  SymVirt is a simple hypercall mechanism from a guest OS to the
      VMM, assuming it works in cooperation with a cloud scheduler.
                                                                              26
Conclusion and Future Work
Conclusion
•  We have proposed a cooperative VM migration
   mechanism that enables us to migrate VMs with VMM-
   bypass I/O devices, using a simple Guest OS-to-VMM
   communication mechanism, called SymVirt.
•  Using the proposed mechanism, we demonstrated a
   proactive FT system in a virtualized Infiniband cluster.
•  We also confirmed that postcopy migration helps to
   reduce the downtime in the proactive FT system.

•  SymVirt can be useful not only for fault tolerance but also for
   load balancing and server consolidation.

                                                                28
Future Work
•  Interconnect-transparent migration, called
   “Ninja migration”
  –  We have submitted another conference paper.
•  Overhead mitigation of SymVirt
  –  Very long link-up time problem
  –  Better integration with postcopy migration
•  A generic communication layer supporting
   cooperative VM migration
  –  It is independent of an MPI runtime system.


                                                   29
Thanks for your attention!




 This work was partly supported by JSPS KAKENHI 24700040
 and ARGO GRAPHICS, Inc.

                                                           30
