Cloud Computing:
Safe Haven from the Data Deluge?
Toby Bloom, Ph.D.
Clouds: the solution to all problems?
Agenda
• What is the “cloud”?
• When to use it?
• An example: moving our analysis pipeline to
the cloud
• What works; what doesn’t
What is Cloud Computing?
• Pay-as-you-go compute infrastructure
– Compute servers by the hour
– Storage services by the month
– Network transfers by the byte
• Wide range of other services offered by cloud
providers
• Other definitions:
– Google cloud
• Google apps, pay-as-you-go
• “applications as a service”
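The three billing dimensions above can be sketched as a simple cost function. This is a minimal illustration only: the `Rates` class, the `monthly_bill` function, and every price below are hypothetical placeholders, not actual provider prices, and network volume is expressed per gigabyte rather than per byte for readability.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical prices; real provider prices differ and change often."""
    per_server_hour: float     # compute, billed by the hour
    per_gb_month: float        # storage, billed by the month
    per_gb_transferred: float  # network, billed by volume moved


def monthly_bill(rates, server_hours, stored_gb, transferred_gb):
    """Sum the three pay-as-you-go components for one month."""
    return (rates.per_server_hour * server_hours
            + rates.per_gb_month * stored_gb
            + rates.per_gb_transferred * transferred_gb)


# Made-up numbers, chosen only to show how the components add up.
rates = Rates(per_server_hour=1.0, per_gb_month=0.1, per_gb_transferred=0.05)
bill = monthly_bill(rates, server_hours=200, stored_gb=800, transferred_gb=800)
print(f"{bill:.2f}")
```

Nothing is charged for a component that is not used, which is the point of the model: idle months cost only what remains in storage.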
Why clouds?
• Small research centers
– 1 or 2 Illuminas can overwhelm IT infrastructure
• Spikes in load
– The week before Marco, the compute queues get very long
• Uneven load
– If load goes up and down unpredictably, don’t want to buy
resources to handle the peaks and leave them idle much of
the time
• Large collaborative projects
– Avoid repeatedly transferring data between centers
– make computational resources available in one place
– easier to share all results quickly
The advantage for large projects
1000G Pilot - Fastq lifecycle
Generate Fastq → Fastq to NCBI → Replicate to EBI → Download to Sanger → Upload BAM to EBI → Replicate to NCBI → Mirror to 3+ analysis sites
10+ copies + backups
Goal on Cloud
Generate Fastq → NCBI, EBI → All further processing on Cloud
2 files + replicas
Our Experiment: Analysis on the Cloud
• Implement our current Illumina production
analysis pipeline (Picard) on the Amazon cloud
• Compare performance & cost to local
pipelines.
• Tune architecture for the cloud
– How to change the implementation to work best
on the cloud
– Identify general “rules” for cloud implementations
• Test use on some real projects
The Pipeline
Run Level Pipeline (Lane-Level Analysis)
• Extract Illumina Data to Standard Format
• Align reads with BWA or MAQ
• Mark Duplicate Reads
• Re-align reads around known indels
• Calibrate Quality Scores
• Collect Metrics about Libraries and Run
• Verify Sample Identity
• Summary Report
Aggregation Pipeline (Sample-Level Aggregation)
• Merge all data for each library
• Mark Duplicate Reads per library
• Collect Metrics per library
• Merge all libraries for a sample
• Collect Metrics about the Sample
• Downstream pipelines and analysts
Current Status:
• Pipeline Manager and Picard Alignment
Pipeline are running on the Amazon cloud
• Currently running 1000 Genomes Exomes
through Picard on the cloud
– As a high-volume test case
– But also the actual pipeline for the Exome DCC
– ~110 Exomes processed.
• Still restructuring / optimizing
• Cloud capabilities always changing
Challenges of porting to the cloud
• May require substantial re-architecture of
your application
• Getting the data there
• Security/ privacy issues
• Efficient utilization of cloud resources
• Predicting usage needs and costs
IT Architecture Differences
Broad IT Architecture:
• Isilon Storage – Petabytes in one file system
• Compute Blades: One farm, little local storage
• Load Management Software (LSF/ SGE)
Photos from Chris Dagdigian
Amazon Cloud Virtual Architecture
• Elastic Block Storage (EBS)
• Compute servers
• Simple Storage Service (S3)
• Load Management Software (LSF/ SGE)
Quick Comparison
Broad
• Ease of development
– Data is all in the same place
all the time
– All servers can access all data
uniformly
– LSF does lots of the work
• Very high throughput
• Easy to add more compute
or more storage, but costly
• But
– Heavy network load
– Response time secondary to
throughput
Amazon Cloud
• Can add more compute or
storage as needed
• Don’t pay for what you don’t
use
• Need to explicitly assign
analyses to specific servers
– And move data there
• Faster turnaround
– Local storage
• But
– Need to make sure you have
enough local storage for each
job
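The local-storage caveat on the cloud side can be checked before a job is assigned to a server. The sketch below assumes a hypothetical `has_room` helper and a hypothetical `expansion_factor` (how much scratch space a job needs relative to its input); the right factor depends on the pipeline step and is not given in the slides.

```python
import shutil


def has_room(scratch_dir, input_bytes, expansion_factor=3.0):
    """Return True if scratch_dir has enough free space for a job.

    expansion_factor is a hypothetical multiplier covering intermediate
    and output files; it must be measured for each pipeline step.
    """
    needed = input_bytes * expansion_factor
    return shutil.disk_usage(scratch_dir).free >= needed


# Example: would a 50 GB input fit in the current directory's file system?
print(has_room(".", 50 * 10**9))
```

On a shared file system this check is rarely needed; on per-server local disks it has to happen before data is moved there.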
Why does system architecture matter?
[Figure: disk needed and compute needed at each pipeline step: Extract Illumina Data to Standard Format; Align reads with BWA or MAQ; Mark Duplicate Reads; Re-align reads around known indels; …; Merge all data for each library; Mark Duplicate Reads per library; …]
Possible Solutions
• NFS
• Gluster
• Move EBS drives
• Use S3 for interchange
• Custom inter-node transfer
Moving the Alignment Pipeline to the
Cloud
Components: Pipeline Manager, Compute servers, Elastic Block Storage (EBS), Simple Storage Service (S3)
Lane-level steps:
• Move fastqs from Broad to S3
• Find allocated server with capacity OR request & initialize new server
• Move fastqs to server
• Run lane-level pipeline
• Write BAM results back to S3
• Release server?
Aggregation steps, once ready to aggregate:
• Allocate existing server or request new one
• Copy BAMs from S3 to server
• Run aggregation pipeline
• Move BAMs back to S3
• Release servers as needed
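The control flow on this slide can be sketched as a small in-memory simulation. This is not the actual Pipeline Manager: the class, method names, slot counts, and key layout are all hypothetical, a Python set stands in for S3, and no cloud API is called.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    slots: int                       # jobs the server can run at once
    running: list = field(default_factory=list)

    def has_capacity(self):
        return len(self.running) < self.slots


class PipelineManager:
    """Toy stand-in for the manager on the slide (all names hypothetical)."""

    def __init__(self, slots_per_server=2):
        self.slots_per_server = slots_per_server
        self.servers = []
        self.next_id = 1
        self.s3 = set()                      # stands in for S3 object keys
        self.placement = {}                  # (sample, lane) -> Server
        self.done_lanes = defaultdict(set)   # sample -> finished lanes

    def _acquire(self):
        # Find an allocated server with capacity, or request a new one.
        for server in self.servers:
            if server.has_capacity():
                return server
        server = Server(f"server-{self.next_id}", self.slots_per_server)
        self.next_id += 1
        self.servers.append(server)
        return server

    def upload_fastq(self, sample, lane):
        self.s3.add(f"{sample}/{lane}.fastq")

    def start_lane(self, sample, lane):
        if f"{sample}/{lane}.fastq" not in self.s3:
            raise KeyError(f"fastq for {sample}/{lane} is not in S3")
        server = self._acquire()
        server.running.append((sample, lane))   # fastq moves to this server
        self.placement[(sample, lane)] = server
        return server.name

    def finish_lane(self, sample, lane):
        server = self.placement.pop((sample, lane))
        server.running.remove((sample, lane))
        self.s3.add(f"{sample}/{lane}.bam")     # write BAM back to S3
        self.done_lanes[sample].add(lane)

    def ready_to_aggregate(self, sample, expected_lanes):
        return self.done_lanes[sample] == set(expected_lanes)

    def aggregate(self, sample, expected_lanes):
        if not self.ready_to_aggregate(sample, expected_lanes):
            raise RuntimeError(f"{sample} still has unfinished lanes")
        server = self._acquire()                # BAMs are copied to this server
        self.s3.add(f"{sample}/aggregated.bam")
        return server.name

    def release_idle(self):
        idle = [s for s in self.servers if not s.running]
        self.servers = [s for s in self.servers if s.running]
        return len(idle)


mgr = PipelineManager(slots_per_server=2)
lanes = ["lane1", "lane2", "lane3"]
for lane in lanes:
    mgr.upload_fastq("sampleA", lane)
placements = {lane: mgr.start_lane("sampleA", lane) for lane in lanes}
for lane in lanes:
    mgr.finish_lane("sampleA", lane)
aggregated_on = mgr.aggregate("sampleA", lanes)
released = mgr.release_idle()
print(placements, aggregated_on, released)
```

With two slots per server, the third lane forces a second server to be requested. Every hand-off between steps goes through the stand-in for S3, because the servers share no file system.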
Challenges of porting to the cloud
• May require substantial re-architecture of
your application
• Getting the data there: network issues
• Security/ privacy issues
• Efficient utilization of cloud resources
• Predicting usage needs and costs
Network Capacity and Data Transfer
• Latest test:
– Transfer of 110 exome fastqs, 800 GB zipped
– 15 hours to upload, using 2 cores (and 2 streams)
• Transfer times are very variable
• Pay for transfer in & out, and storage monthly
→ A small center should not have difficulty transferring
data cycle by cycle for a single machine
[Figure: data path from Broad to Amazon S3, labeled 1Gb, S3FTP]
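The reported test implies an average upload rate that can be worked out directly. The arithmetic below assumes 1 GB = 10^9 bytes and reads the "1Gb" label as a 1 Gbit/s link; both are assumptions, and the function name is illustrative.

```python
def throughput_mbit_per_s(gigabytes, hours):
    """Average transfer rate in megabits per second (1 GB = 1e9 bytes)."""
    bits = gigabytes * 1e9 * 8
    return bits / (hours * 3600) / 1e6


rate = throughput_mbit_per_s(800, 15)   # 800 GB uploaded in 15 hours
share_of_link = rate / 1000             # fraction of an assumed 1 Gbit/s link
print(f"{rate:.1f} Mbit/s, {share_of_link:.0%} of the link")
```

Under those assumptions the average works out to roughly 119 Mbit/s, about an eighth of the nominal link capacity, which is consistent with the remark that transfer times vary a lot.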
Security!!
• Neither the Amazon cloud nor any other cloud
is currently approved for storing controlled-
access genomic data
• Okay for 1000 Genomes, not for TCGA
• Major limitation of cloud right now
• Not necessarily a technical issue
Job Times and Node Utilization for BWA Alignment of 4 lanes on 1 CC1 node
[Figure: %user and %iowait (y-axis, 0 to 100) over time (x-axis, 45-minute ticks from 4:43:21PM to 3:58:25AM, spanning about 35 hours)]
Costs??
• Best estimate:
– Cloud is 2-4X the cost of local compute for our
pipeline
• BUT apples to apples comparison is difficult
• Comparison is more favorable for smaller centers
• Potential big savings for big collaborations
• Cloud costs going down more rapidly than local costs
• Much cheaper if you can predict capacity for next 1-3
years.
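One way to read the 2-4X estimate alongside the point about smaller centers is a break-even calculation. The sketch below is a simplification under stated assumptions: it takes the ratio as measured against local hardware that is fully used, and it ignores storage, transfer, and staffing costs. If local servers are busy only a fraction u of the time, each useful local hour effectively costs 1/u as much, so pay-per-use wins when u is below 1/ratio.

```python
def break_even_utilization(cloud_to_local_ratio):
    """Local utilization below which pay-per-use is cheaper.

    Assumes the ratio compares cloud cost per hour with local cost per
    hour at 100% local utilization (a simplification, not a measured
    figure from the slides).
    """
    if cloud_to_local_ratio <= 0:
        raise ValueError("ratio must be positive")
    return min(1.0, 1.0 / cloud_to_local_ratio)


for ratio in (2, 4):
    print(ratio, break_even_utilization(ratio))
```

Under this reading, a center whose own hardware would sit idle more than half to three quarters of the time could come out ahead on the cloud even at the quoted ratios.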
Costs
• Efficient utilization of compute can be difficult
• Noisy neighbors affect utilization & efficiency
• Changing data sizes affect utilization rates and
resource constraints
The Gotchas
• No way to share data among multiple
compute servers at once.
– Need to move data if using different servers for
different steps.
• Network speed variability
• Noisy neighbors
– Need to use the largest machines always
• Security regulations
Conclusions
• Definitely a viable option for small centers using
standard software
• Potential to save costs for large collaborations
• Maybe not cost effective for spikes
• Moving to the cloud is non-trivial
• Large datasets pose challenges
• Security rules need to be resolved
• Costs are hard to predict/ difficult to compare
Acknowledgements
• Zach Leber
• Seva Kashin
• Thaniel Novod
• Frans Lawaetz
• John Hanks
• Matthew Trunnell
• Tim Fennell
• Kathleen Tibbetts
• Alex Wysoker
• Kiran Giramella
• Chris Dagdigian
• Vivien Bonazzi
Funding from NHGRI

Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
