Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011

Cloud Computing:
Safe Haven from the Data Deluge?
Toby Bloom, Ph.D.

Clouds: the solution to all problems?

Agenda
• What is the “cloud”?
• When to use it?
• An example: moving our analysis pipeline to
the cloud
• What works; what doesn’t

What is Cloud Computing?
• Pay-as-you-go compute infrastructure
– Compute servers by the hour
– Storage services by the month
– Network transfers by the byte
• Wide range of other services offered by cloud
providers
• Other definitions:
– Google cloud
• Google apps, pay-as-you-go
• “applications as a service”
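
The pay-as-you-go model above can be sketched as a sum of three usage-based charges. This is only an illustration: the `Rates` class, the `monthly_bill` function, and every price below are hypothetical and do not come from any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """Hypothetical prices, chosen only to make the arithmetic visible."""
    per_server_hour: float     # compute servers by the hour
    per_gb_month: float        # storage services by the month
    per_gb_transferred: float  # network transfers by the byte (quoted per GB here)


def monthly_bill(rates: Rates, server_hours: float, stored_gb: float,
                 transferred_gb: float) -> float:
    """Sum the three usage-based charges for one month."""
    compute = rates.per_server_hour * server_hours
    storage = rates.per_gb_month * stored_gb
    network = rates.per_gb_transferred * transferred_gb
    return compute + storage + network


# Made-up usage: 200 server-hours, 1000 GB stored, 500 GB moved in or out.
example_rates = Rates(per_server_hour=2.0, per_gb_month=0.25, per_gb_transferred=0.125)
example_bill = monthly_bill(example_rates, server_hours=200, stored_gb=1000,
                            transferred_gb=500)
print(example_bill)
```

Nothing is charged when nothing is used, which is the property the later slides weigh against the higher unit prices.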

Why clouds?
• Small research centers
– 1 or 2 Illuminas can overwhelm IT infrastructure
• Spikes in load
– The week before Marco, the compute queues get very long
• Uneven load
– If load goes up and down unpredictably, don’t want to buy
resources to handle the peaks and leave them idle much of
the time
• Large collaborative projects
– Avoid repeatedly transferring data between centers
– Make computational resources available in one place
– Easier to share all results quickly
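
The "uneven load" argument can be made concrete with a small break-even sketch. It assumes a simplified model that is not from the talk: owned servers are sized for the peak and cost the same whether busy or idle, while rented servers are paid for only while used, at some multiple of the local cost. The function names and the demand figures are hypothetical.

```python
def utilization(hourly_demand):
    """Fraction of peak-sized capacity that is actually busy."""
    peak = max(hourly_demand)
    if peak == 0:
        return 0.0
    return sum(hourly_demand) / (peak * len(hourly_demand))


def break_even_utilization(cloud_to_local_cost_ratio):
    """Utilization below which renting costs less than owning for the peak."""
    return 1.0 / cloud_to_local_cost_ratio


# Hypothetical demand: servers needed in each of eight periods, with one spike.
demand = [4, 4, 4, 10, 4, 4, 4, 4]
busy_fraction = utilization(demand)

for ratio in (2.0, 4.0):
    cheaper = "cloud" if busy_fraction < break_even_utilization(ratio) else "local"
    print(f"cost ratio {ratio:.0f}X: {cheaper} is cheaper in this model")
```

With this made-up demand, utilization is 0.475, which is below the break-even at a 2X cost ratio (0.5) but above it at 4X (0.25), so the outcome depends on where the real ratio falls. The sketch ignores data transfer, storage, and porting effort.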

The advantage for large projects
• 1000G Pilot, Fastq lifecycle: Generate Fastq → Fastq to NCBI → Replicate to EBI → Download to Sanger → Upload BAM to EBI → Replicate to NCBI → Mirror to 3+ analysis sites
– 10+ copies + backups
• Goal on Cloud: Generate Fastq → NCBI, EBI → All further processing on Cloud
– 2 files + replicas

Our Experiment: Analysis on the Cloud
• Implement our current Illumina production
analysis pipeline (Picard) on the Amazon cloud
• Compare performance & cost to local
pipelines.
• Tune architecture for the cloud
– How to change the implementation to work best
on the cloud
– Identify general “rules” for cloud implementations
• Test use on some real projects

The Pipeline
• Run Level Pipeline (Lane-Level Analysis)
– Extract Illumina Data to Standard Format
– Align reads with BWA or MAQ
– Mark Duplicate Reads
– Re-align reads around known indels
– Calibrate Quality Scores
– Collect Metrics about Libraries and Run
– Verify Sample Identity
– Summary Report
• Aggregation Pipeline (Sample-Level Aggregation)
– Merge all data for each library
– Mark Duplicate Reads per library
– Collect Metrics per library
– Merge all libraries for a sample
– Collect Metrics about the Sample
– Downstream pipelines and analysts
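
The hand-off from lane-level analysis to sample-level aggregation amounts to grouping lane outputs by library and then by sample. The sketch below shows only that grouping step. It is not Picard code; the record layout, the file names, and `plan_aggregation` are hypothetical.

```python
from collections import defaultdict

# Hypothetical lane-level outputs: (sample, library, lane BAM name).
lane_bams = [
    ("sampleA", "lib1", "laneA1.bam"),
    ("sampleA", "lib1", "laneA2.bam"),
    ("sampleA", "lib2", "laneA3.bam"),
    ("sampleB", "lib3", "laneB1.bam"),
]


def plan_aggregation(records):
    """Group lane BAMs by library, then libraries by sample."""
    by_sample = defaultdict(lambda: defaultdict(list))
    for sample, library, bam in records:
        by_sample[sample][library].append(bam)
    # Convert to plain dicts so the plan is easy to print and compare.
    return {sample: dict(libraries) for sample, libraries in by_sample.items()}


plan = plan_aggregation(lane_bams)
for sample, libraries in plan.items():
    for library, bams in libraries.items():
        print(f"{sample}/{library}: merge {len(bams)} lane BAM(s)")
```

Each inner list corresponds to one "merge all data for each library" job, and each outer entry to one "merge all libraries for a sample" job.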

Current Status:
• Pipeline Manager and Picard Alignment
Pipeline are running on the Amazon cloud
• Currently running 1000 Genomes Exomes
through Picard on the cloud
– As a high-volume test case
– But also the actual pipeline for the Exome DCC
– ~110 Exomes processed.
• Still restructuring / optimizing
• Cloud capabilities always changing

Challenges of porting to the cloud
• May require substantial re-architecture of
your application
• Getting the data there
• Security/ privacy issues
• Efficient utilization of cloud resources
• Predicting usage needs and costs

IT Architecture Differences
Broad IT Architecture:
• Isilon Storage: petabytes in one file system
• Compute Blades: one farm, little local storage
• Load Management Software (LSF/SGE)
Photos from Chris Dagdigian

Amazon Cloud Virtual Architecture
• Compute servers
• Elastic Block Storage (EBS) volumes
• Simple Storage Service (S3)
• Load Management Software (LSF/SGE)

Quick Comparison
Broad
• Ease of development
– Data is all in the same place
all the time
– All servers can access all data
uniformly
– LSF does lots of the work
• Very high throughput
• Easy to add more compute
or more storage, but costly
• But
– Heavy network load
– Response time secondary to
throughput
Amazon Cloud
• Can add more compute or
storage as needed
• Don’t pay for what you don’t
use
• Need to explicitly assign
analyses to specific servers
– And move data there
• Faster turnaround
– Local storage
• But
– Need to make sure you have
enough local storage for each
job

Why does system architecture matter?
[Figure: disk needed and compute needed at each pipeline step]
• Extract Illumina Data to Standard Format
• Align reads with BWA or MAQ
• Mark Duplicate Reads
• Re-align reads around known indels
• …
• Merge all data for each library
• Mark Duplicate Reads per library
• …

Possible Solutions
• NFS
• Gluster
• Move EBS drives
• Use S3 for interchange
• Custom inter-node transfer

Moving the Alignment Pipeline to the
Cloud
Components: Elastic Block Storage (EBS), compute servers, Simple Storage Service (S3), Pipeline Manager
• Lane level
– Move Fastqs from Broad to S3
– Find allocated server with capacity OR request & initialize new server
– Move fastqs to server
– Run lane-level pipeline
– Write BAM results back to S3
– Release Server?
• Ready to aggregate?
– Allocate existing server or request new one
– Copy BAMs from S3 to server
– Run aggregation pipeline
– Move BAMs back to S3
– Release Servers as needed
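
The "find allocated server with capacity OR request & initialize new server" step can be sketched as a first-fit placement by local disk. This is an assumption about how such a step might look, not the actual Pipeline Manager. The `Server` class, the `assign` function, and the disk sizes are hypothetical, and requesting a cloud server is replaced by creating a local object.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    free_disk_gb: float
    jobs: list = field(default_factory=list)


def assign(job_name, disk_needed_gb, servers, new_server_disk_gb):
    """Place a job on the first server with room, else 'request' a new one."""
    if disk_needed_gb > new_server_disk_gb:
        raise ValueError("job does not fit on any single server")
    for server in servers:
        if server.free_disk_gb >= disk_needed_gb:
            break
    else:
        # Stand-in for requesting and initializing a new cloud server.
        server = Server(name=f"server-{len(servers) + 1}",
                        free_disk_gb=new_server_disk_gb)
        servers.append(server)
    server.free_disk_gb -= disk_needed_gb
    server.jobs.append(job_name)
    return server


pool = []
first = assign("lane1", 300, pool, new_server_disk_gb=500)
second = assign("lane2", 300, pool, new_server_disk_gb=500)
third = assign("lane3", 150, pool, new_server_disk_gb=500)
print([(s.name, s.jobs, s.free_disk_gb) for s in pool])
```

Tracking free local disk per server reflects the point made in the Quick Comparison slide: on the cloud, each job needs enough local storage on the server it is assigned to.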

Challenges of porting to the cloud
• May require substantial re-architecture of
your application
• Getting the data there: network issues
• Security/ privacy issues
• Efficient utilization of cloud resources
• Predicting usage needs and costs

Network Capacity and Data Transfer
• Latest test:
– Transfer of 110 exome fastqs, 800 GB zipped
– 15 hours to upload, using 2 cores (and 2 streams)
• Transfer times are very variable
• Pay for transfer in and out, and for storage monthly
• A small center should not have difficulty transferring
data cycle by cycle for a single machine
Broad → Amazon S3 (1Gb, S3FTP)
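
The transfer figures above imply an average throughput that can be worked out directly. The sketch assumes 1 GB means 10**9 bytes and that "1Gb" refers to a 1 gigabit per second link. Neither is stated on the slide, and `throughput_mbit_per_s` is a helper written for this illustration.

```python
def throughput_mbit_per_s(gigabytes: float, hours: float) -> float:
    """Average rate in megabits per second, taking 1 GB as 10**9 bytes."""
    bits = gigabytes * 1e9 * 8
    seconds = hours * 3600
    return bits / seconds / 1e6


rate = throughput_mbit_per_s(800, 15)
link_fraction = rate / 1000  # share of a 1 Gb/s link (assumed meaning of "1Gb")
print(f"{rate:.1f} Mbit/s, {link_fraction:.0%} of the link")
```

Under those assumptions the upload averaged about 118.5 Mbit/s, roughly 12% of the link, which fits the observation that transfer times vary and are not limited by the nominal link speed alone.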

Security!!
• Neither the Amazon cloud nor any other cloud
is currently approved for storing controlled-
access genomic data
• Okay for 1000 Genomes, not for TCGA
• Major limitation of cloud right now
• Not necessarily a technical issue

Job Times and Node Utilization for BWA Alignment of 4 lanes on 1 CC1 node
[Chart: %user and %iowait (0 to 100) over time, from 4:43 PM to 3:58 AM about 35 hours later, with ticks every 45 minutes]

Costs??
• Best estimate:
– Cloud is 2-4X the cost of local compute for our
pipeline
• BUT an apples-to-apples comparison is difficult
• Comparison is more favorable for smaller centers
• Potential big savings for big collaborations
• Cloud costs going down more rapidly than local costs
• Much cheaper if you can predict capacity for next 1-3
years.
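
One way predictable capacity can lower cost is a pricing scheme with an upfront commitment in exchange for a lower hourly rate. That scheme is an assumption made for illustration and is not described on the slide. The `break_even_hours` function and all prices are hypothetical.

```python
def break_even_hours(upfront_fee, committed_rate, on_demand_rate):
    """Hours of use beyond which the commitment is cheaper than on-demand."""
    if committed_rate >= on_demand_rate:
        raise ValueError("the committed rate must be lower than the on-demand rate")
    return upfront_fee / (on_demand_rate - committed_rate)


HOURS_PER_YEAR = 24 * 365

# Made-up prices: 1000 upfront, 0.5 per hour committed, 1.5 per hour on demand.
hours = break_even_hours(upfront_fee=1000.0, committed_rate=0.5, on_demand_rate=1.5)
print(f"break-even after {hours:.0f} hours, "
      f"{hours / HOURS_PER_YEAR:.1%} of one year")
```

The commitment only pays off if the server is actually used for more than the break-even number of hours, which is why the saving depends on being able to predict capacity.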

Costs
• Efficient utilization of compute can be difficult
• Noisy neighbors affect utilization & efficiency
• Changing data sizes affect utilization rates and
resource constraints

The Gotchas
• No way to share data among multiple
compute servers at once.
– Need to move data if using different servers for
different steps.
• Network speed variability
• Noisy neighbors
– Need to use the largest machines always
• Security regulations

Conclusions
• Definitely a viable option for small centers using
standard software
• Potential to save costs for large collaborations
• Maybe not cost effective for spikes
• Moving to the cloud is non-trivial
• Large datasets pose challenges
• Security rules need to be resolved
• Costs are hard to predict/ difficult to compare

Acknowledgements
• Zach Leber
• Seva Kashin
• Thaniel Novod
• Frans Lawaetz
• John Hanks
• Matthew Trunnell
• Tim Fennell
• Kathleen Tibbetts
• Alex Wysoker
• Kiran Giramella
• Chris Dagdigian
• Vivien Bonazzi
Funding from NHGRI
