The case for Docker in multi-cloud enabled bioinformatics applications
Ahmed Ali, Mohamed M. ElKalioby, Mohamed Abouelhoda
Nile University, Egypt
Presented By
Mohamed M. El-Kalioby, MSc
Introduction
● Next generation sequencing technology has changed the
traditional bioinformatics practice
● Sophisticated multi-step workflows are used to transform the raw
sequence data into knowledge.
● One NGS workflow can include tens of tasks and hundreds of
information sources integrated together to achieve the analysis
goals.
● Medical Variant Detection Workflow is an example of such
workflows.
Medical Variant Detection Workflow (MVDW)
Medical Variant Detection Workflow (2)
● Multiple versions and instances of the workflow are needed.
● Tools and parameters can change:
● per user, where each user may require certain modules, annotation
databases, and special post-processing;
● per experiment type, e.g., whole genome, whole exome, or RNA-seq,
in single or multiplexed mode;
● per sequencing platform, e.g., Illumina, Ion Torrent, or any other.
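The variation listed above can be sketched as a small configuration model. This is only an illustration: the class name, field names, and the specific option values are assumptions made for the example, not part of the MVDW implementation. It shows how quickly the number of workflow instances grows once experiment type, platform, and mode vary independently.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

# Option values are illustrative assumptions drawn from the examples on the slide.
EXPERIMENT_TYPES = ("whole_genome", "whole_exome", "rnaseq")
PLATFORMS = ("illumina", "iontorrent")
MODES = ("single", "multiplexed")


@dataclass(frozen=True)
class WorkflowVariant:
    """One concrete configuration of the workflow (hypothetical model)."""
    user: str
    experiment_type: str
    platform: str
    mode: str = "single"
    annotation_databases: Tuple[str, ...] = ()

    def instance_name(self) -> str:
        # A readable identifier for the deployed workflow instance.
        return f"mvdw-{self.user}-{self.experiment_type}-{self.platform}-{self.mode}"


def variants_for_user(user: str) -> List[WorkflowVariant]:
    """Enumerate every experiment type / platform / mode combination for one user."""
    return [
        WorkflowVariant(user, experiment, platform, mode)
        for experiment, platform, mode in product(EXPERIMENT_TYPES, PLATFORMS, MODES)
    ]


if __name__ == "__main__":
    for variant in variants_for_user("alice"):
        print(variant.instance_name())
```

Even with these few options, a single user already has twelve possible workflow instances, which is the motivation for a dynamic deployment strategy.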
Requirements
● Efficient Dynamic Deployment Strategy
● The deployed system should use HPC resources
● Able to consume cloud computing resources (private and public
clouds)
Virtualization Technology
● The whole system, with all modules, databases, and related
dependencies, is packaged in a virtual machine (VM) image.
● These images can then be used to instantiate a virtual
machine running in a private or public cloud.
● Examples from sequence analysis
● Crossbow for NGS read alignment & SNP calling,
● RSD-Cloud for comparative genomics
● … many more
Virtualization Technology (2)
● The traditional engine for running virtual machine
instances is based on one of:
● Oracle VirtualBox,
● KVM,
● Xen Hypervisor,
● VMware.
Docker
● Docker provides a new level of virtualization:
● The computing machine (including the operating system) is
not virtualized.
● Only the application and its related dependencies are
encapsulated in a ‘virtual’ isolated process.
[Figure: (a) Software stack with virtual machines: infrastructure, operating system, virtual machine hypervisor, VM1 … VMn, running APP1 … APPn. (b) Software stack with containers: infrastructure, operating system, container engine, containers running APP1 … APPn.]
Usage of Docker
[Figure: The Docker client sends commands (pull image, build image, run container, terminate container) to the Docker server (daemon). The daemon downloads and uploads images from and to the Docker public registry, builds and pushes container images to a local registry, and runs containers on top of the host operating system and infrastructure.]
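The client operations named on this slide (build, push to a local registry, pull, run, terminate) map onto standard Docker CLI subcommands. The sketch below only assembles the command lines and does not execute them; the image tag, registry address, and container name used in the example are hypothetical.

```python
from typing import List


def build_cmd(tag: str, context_dir: str = ".") -> List[str]:
    # Build an image from the Dockerfile in context_dir and tag it.
    return ["docker", "build", "-t", tag, context_dir]


def tag_cmd(source: str, target: str) -> List[str]:
    # Give an existing image a second name, e.g. one prefixed by a registry host.
    return ["docker", "tag", source, target]


def push_cmd(image: str) -> List[str]:
    # Upload the image to the registry encoded in its name.
    return ["docker", "push", image]


def pull_cmd(image: str) -> List[str]:
    # Download the image from the registry encoded in its name.
    return ["docker", "pull", image]


def run_cmd(image: str, name: str, command: List[str]) -> List[str]:
    # Start a detached container with a fixed name.
    return ["docker", "run", "-d", "--name", name, image, *command]


def stop_cmd(name: str) -> List[str]:
    # Terminate a running container by name.
    return ["docker", "stop", name]


def lifecycle(tag: str, registry: str, name: str, command: List[str]) -> List[List[str]]:
    """Return the build -> push -> pull -> run -> stop sequence as argument lists."""
    registry_tag = f"{registry}/{tag}"
    return [
        build_cmd(tag),
        tag_cmd(tag, registry_tag),
        push_cmd(registry_tag),
        pull_cmd(registry_tag),
        run_cmd(registry_tag, name, command),
        stop_cmd(name),
    ]


if __name__ == "__main__":
    # "bwa:demo", "localhost:5000", and "bwa-job" are placeholder values.
    for cmd in lifecycle("bwa:demo", "localhost:5000", "bwa-job", ["bwa"]):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run` on a machine where the Docker daemon is available.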
Why Docker
● Reduced execution overhead compared to traditional whole-machine
virtualization.
● Provides an effective solution to the image portability
problem.
● Virtual machine images running in Amazon are not compatible
with those running in Google and vice versa, which directly leads
to duplicated work preparing new images for each
deployment.
Challenges
● Extra layers need to be built on top of Docker to enable the use of HPC resources
(computer clusters) and multi-cloud platforms.
● Deployment in different commercial clouds is not an easy task.
● Each cloud platform has different APIs and a different business model.
● Images need to be compatible with different providers.
Contribution
● Define a use-case scenario for using Docker within a computer cluster for
bioinformatics workflows.
● Evaluate its performance in comparison to native hardware and conventional
virtual machines, in private and public clouds.
● We also present a new version of our multi-cloud elasticHPC, referred to as
elasticHPC-Docker, which can:
1. enable the user to deploy and run whole multi-step analysis workflows,
2. create a computer cluster with Docker-based applications and define a use-case scenario
for it,
3. support the use of private clouds as well as commercial clouds like Amazon and Google.
Containers in the Cloud
Google
● Google Cloud offers a container service in the form of two products:
1. Container-optimized virtual machine images, which include programs to run standard Docker
images according to a user-defined file in YAML format.
2. Google Kubernetes Engine (GKE), which creates a cluster of virtual machines that can run Docker
images. GKE is based on pods.
● Google has established the Google Container Registry (GCR).
● Cost:
● The container-optimized images, and GKE clusters of up to 5 nodes, run at no extra cost; the user
pays the usual price of the virtual machines.
● For clusters larger than 5 nodes, GKE charges an extra fee of $0.15 per hour per cluster on top of
the usual machine price.
● GKE has two limitations:
1. It does not support Docker’s private images.
2. The cluster size in GKE cannot exceed 100 nodes.
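The pricing rule above is simple arithmetic, and a short sketch makes the threshold explicit. The function name is an assumption made for the example; the fee, the 5-node threshold, and the 100-node limit are the figures quoted on the slide and may not reflect current pricing.

```python
GKE_FEE_PER_HOUR = 0.15   # extra fee per cluster per hour, as quoted on the slide
GKE_FREE_NODE_LIMIT = 5   # the fee applies only above this cluster size
GKE_MAX_NODES = 100       # cluster size limit quoted on the slide


def gke_cluster_cost(nodes: int, machine_price_per_hour: float, hours: float) -> float:
    """Estimate the cost of a GKE cluster under the pricing described above."""
    if not 1 <= nodes <= GKE_MAX_NODES:
        raise ValueError(f"cluster size must be between 1 and {GKE_MAX_NODES} nodes")
    if hours < 0:
        raise ValueError("hours must be non-negative")
    machine_cost = nodes * machine_price_per_hour * hours
    cluster_fee = GKE_FEE_PER_HOUR * hours if nodes > GKE_FREE_NODE_LIMIT else 0.0
    return machine_cost + cluster_fee


if __name__ == "__main__":
    # 0.504 is the hourly n1-highmem-8 price quoted later in the slides.
    for size in (5, 10):
        print(size, "nodes for 10 hours:", round(gke_cluster_cost(size, 0.504, 10), 2))
```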
Amazon
● Amazon provides the Elastic Container Service (ECS).
● ECS enables the deployment of Docker containers on Amazon EC2.
● Amazon uses docker-compose to manage Docker containers.
● docker-compose facilitates setting up a multi-container application
by defining the application and all its dependencies in a single file in YAML
format.
● The instantiated machines include programs to automatically configure the
Docker environment.
● Amazon has its own image registry.
● Cost:
● The user pays the same price as for the usual instance types.
● If the load balancing service is selected, the user pays a small extra cost of $0.025 per
hour and $0.008 per GB transferred between instances.
● Limitations:
● It does not support attaching EBS volumes to running containers.
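To make the single-file application definition mentioned above concrete, the sketch below assembles a two-service definition as a Python dictionary and serializes it as JSON, which YAML parsers generally accept. The service names, image names, registry host, and mounted path are hypothetical, and the top-level `services` layout is an assumption based on recent Compose file formats rather than anything shown in the slides.

```python
import json
from typing import Any, Dict


def compose_definition(registry: str) -> Dict[str, Any]:
    """Describe a two-step application (alignment, then variant calling)."""
    return {
        "services": {
            "aligner": {
                "image": f"{registry}/bwa:demo",       # hypothetical image name
                "volumes": ["/data:/data"],            # share input data with the container
                "command": ["bwa", "index", "/data/reference.fa"],
            },
            "variant-caller": {
                "image": f"{registry}/gatk:demo",      # hypothetical image name
                "volumes": ["/data:/data"],
                "depends_on": ["aligner"],             # start after the aligner service
            },
        }
    }


def render(definition: Dict[str, Any]) -> str:
    # JSON output, used here to avoid a third-party YAML dependency.
    return json.dumps(definition, indent=2)


if __name__ == "__main__":
    print(render(compose_definition("registry.example.com")))
```

The point of the single file is that the whole application, including its images, volumes, and start-up order, travels as one artifact between machines.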
ElasticHPC-Docker Features
● Ability to port any Docker image to private or commercial clouds and run it there.
● Creation and management of a cluster of containers. The cluster can use a single machine or
multiple machines.
● The computer cluster can have nodes from different cloud providers; i.e., some
nodes can come from Amazon and some from Google.
● Ability to create and destroy containers at run time. This makes it possible to
run multiple containers on the same machine, one at a time.
● The package supports scaling the number of virtual machines (worker nodes) up or down in a
running cluster.
ElasticHPC-Docker Features (2)
● The package allows mounting virtual disks and establishing a
shared file system for the containers (the default option is NFS). In AWS, we
use EBS volumes, and in Google we use persistent storage disks.
● elasticHPC-Docker automatically configures a job scheduler (including
security settings among the different providers) across the containers. The
default job scheduler is PBS Torque, but SGE is also supported.
● The current package includes many Docker specification files (Dockerfiles)
for the most important tools for NGS data analysis. These include Fastx,
BWA, and GATK.
● It also includes a number of structural bioinformatics tools, including AutoDock,
Frodock, AMBER, and GROMACS, among others.
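Once the job scheduler is configured, work is typically submitted as a job script. The sketch below generates a minimal PBS Torque script for one alignment task. It is an illustration under assumptions: the job name, file names, and resource values are hypothetical, and the slides do not show how elasticHPC-Docker itself submits jobs.

```python
def pbs_script(job_name: str, cores: int, walltime: str, command: str) -> str:
    """Render a minimal PBS Torque job script as text."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",              # job name shown in the queue
        f"#PBS -l nodes=1:ppn={cores}",     # one node, `cores` processors per node
        f"#PBS -l walltime={walltime}",     # maximum run time, HH:MM:SS
        "cd $PBS_O_WORKDIR",                # run from the submission directory
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # File names are placeholders; the script would be submitted with `qsub`.
    print(pbs_script(
        job_name="align-block0",
        cores=8,
        walltime="02:00:00",
        command="bwa mem -t 8 reference.fa block0.fq > block0.sam",
    ))
```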
EHPC-Docker (Use Case)
[Figure: EHPC-Docker architecture. The EHPC-Client, running on the user's PC, communicates with the EHPC-VM Manager on the master node (port 5000) and with the container service (port 5555). Ports 1:4999 and 5001:65535 are used for communication among containerized services. The master node and each worker (slave) node run an EHPC-VM Manager and a container and have an attached data volume. A shared file system (block storage) is attached to the cluster.]
1. User downloads the EHPC-Docker client.
2. User runs the client to create a cluster on a supported cloud:
a. The client starts the master node.
b. The master node creates the rest of the cluster in parallel.
c. The master node distributes the URL of the image registry.
d. Master and worker nodes retrieve the image and start the containers.
e. Once done, the master node sets up the ports and finalizes the configuration by
setting up the job scheduler and the shared storage.
Cluster is ready.
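The ordering of the steps above can be sketched as follows. This is a simulation, not the EHPC client: `start_node` is a hypothetical stand-in for a cloud API call, the image name is a placeholder, and the scheduler and storage values are simply the defaults named in the feature list.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


def start_node(name: str) -> Dict[str, Any]:
    """Hypothetical stand-in for the cloud API call that boots one virtual machine."""
    return {"name": name, "registry": None, "image": None, "container_running": False}


def create_cluster(worker_count: int, registry_url: str) -> List[Dict[str, Any]]:
    # Step a: the client starts the master node first.
    master = start_node("master")

    # Step b: the master creates the remaining nodes in parallel.
    names = [f"worker-{i}" for i in range(worker_count)]
    with ThreadPoolExecutor(max_workers=max(1, worker_count)) as pool:
        workers = list(pool.map(start_node, names))  # map() preserves input order
    nodes = [master] + workers

    # Step c: the master distributes the URL of the image registry.
    for node in nodes:
        node["registry"] = registry_url

    # Step d: every node retrieves the image and starts its container.
    for node in nodes:
        node["image"] = f"{registry_url}/workflow:demo"  # placeholder image name
        node["container_running"] = True

    # Step e: the master finalizes the scheduler and shared storage.
    master["scheduler"] = "PBS Torque"
    master["shared_storage"] = "NFS"
    return nodes


if __name__ == "__main__":
    for node in create_cluster(3, "registry.example.com"):
        print(node["name"], node["image"], node["container_running"])
```

Creating the workers concurrently is what keeps cluster set-up time from growing linearly with the number of nodes, which is the quantity measured in the first experiment.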
Experiments
● We conducted two experiments:
1. Measure the time for establishing container clusters over different cloud platforms.
2. Measure the performance of using Docker when running the variant detection workflow.
Experiment 1
1. GKE is faster than ECS
2. elasticHPC is faster than GKE
3. elasticHPC is close to ECS
Experiment 2
● For this experiment, we used an exome dataset from DePristo et al. of size ~9 GB.
● (An exome dataset is a set of NGS reads sequenced only from the coding regions of a
genome.)
● The workflow was executed three times independently on Google, AWS, and a private
cloud based on OpenStack.
● In each cloud, the 9 GB input data is divided into blocks to be processed in parallel
over the cluster nodes.
● For a fair comparison, we used machines with specifications as similar as possible:
● Amazon: m3.2xlarge (8 cores, Intel 2.5 GHz, 30 GB RAM, SSD disks, $0.532/hour),
● Google: n1-highmem-8 (8 cores, Intel 2.5 GHz, 52 GB RAM, SSD disks, $0.504/hour),
● OpenStack: a local machine with 8 cores and 56 GB RAM.
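The slides do not say how the input is divided, so the following is only one plausible scheme: split the reads into contiguous, near-equal blocks, one per cluster node. The function name and the read counts in the example are assumptions.

```python
from typing import List


def split_into_blocks(total_reads: int, node_count: int) -> List[range]:
    """Partition read indices 0..total_reads-1 into contiguous near-equal blocks."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if total_reads < 0:
        raise ValueError("total_reads must be non-negative")
    base, extra = divmod(total_reads, node_count)
    blocks: List[range] = []
    start = 0
    for i in range(node_count):
        # The first `extra` blocks take one additional read each.
        size = base + (1 if i < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


if __name__ == "__main__":
    for node, block in enumerate(split_into_blocks(10, 4)):
        print(f"node {node}: reads {block.start}..{block.stop - 1}")
```

Keeping the blocks within one read of each other in size means no node is left with a much longer task than the others, so the parallel run time stays close to the time for a single block.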
Experiment 2
Physical Servers
Docker performance is very close to that of the physical servers
Experiment 2
Google Cloud
ElasticHPC is faster than GCE Containers
Experiment 2
Amazon Cloud
ElasticHPC is very close to Amazon ECS
Conclusion
● We introduced elasticHPC-Docker, which is based on container technology.
● Our package enables the creation of a computer cluster with containerized
applications and workflows in private and in different commercial clouds using
a single interface.
● It includes options to run bioinformatics applications and workflows for large
datasets.
● Through container technology, elasticHPC-Docker provides an efficient
solution to interoperability among commercial clouds.
● It is efficient in practice, with reduced overhead, especially on local infrastructure.
● It is available at http://www.elastichpc.org
Thank You
