Collaboration @ Scale
September, 2015: Life Sciences User Group, Cambridge MA
Chris Dwan (cdwan@broadinstitute.org)
Director, Research Computing and Data Services
Acting Director, IT
Conclusions
• Good news: The fundamentals still apply.
• Understand your data.
– Get intense about what you need, why you need it, who is responsible for it,
and how / when you plan to compute against it.
– This will require organizational courage.
• Stop thinking about “moving” data.
– Archive first. After that, all copies are transient.
• Object storage is different from files
– at many weird levels.
• Elasticity in compute is not like elasticity in data
– Availability of CPUs vs. proximity to elastic compute.
– Also, “trash storage?”
• The Broad Institute is a non-profit biomedical
research institute founded in 2004
• Fifty core faculty members and hundreds of associate
members from MIT and Harvard
• ~1000 research and administrative personnel, plus
~2,400+ associated researchers
• ~1.4 x 10^6 genotyped samples
Programs and Initiatives
focused on specific disease or biology areas
Cancer
Genome Biology
Cell Circuits
Psychiatric Disease
Metabolism
Medical and Population Genetics
Infectious Disease
Epigenomics
Platforms
focused on technological innovation and application
Genomics
Therapeutics
Imaging
Metabolite Profiling
Proteomics
Genetic Perturbation
The Broad Institute
“This generation has a historic opportunity and responsibility
to transform medicine by using systematic approaches in the
biological sciences to dramatically accelerate the
understanding and cure of disease”
If a man’s at odds to know his own mind it’s
because he hasn’t got aught but his mind to know
it with.
Cormac McCarthy, Blood Meridian or The Evening
Redness in the West
Broad Genomics Data Production
338 trillion base pairs (PF) in August
At ~1.25 bytes per base:
422 TByte / month ~= 170 MByte / sec
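The throughput arithmetic can be checked directly; a quick sketch, assuming a 30-day month:

```python
# Sanity-check the slide's throughput arithmetic (assuming a 30-day month).
bases_per_month = 338e12   # 338 trillion base pairs passing filter
bytes_per_base = 1.25      # rough storage cost per base, per the slide
bytes_per_month = bases_per_month * bytes_per_base

tbytes_per_month = bytes_per_month / 1e12              # ~422 TByte / month
mbytes_per_sec = bytes_per_month / (30 * 86400) / 1e6  # ~163 MByte / sec

print(f"{tbytes_per_month:.0f} TByte/month ~= {mbytes_per_sec:.0f} MByte/sec")
```

The slide's "~170 MByte / sec" rounds this up slightly.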
Broad Genomics Data Production: Context
[Chart: sequencing data production over time, with callouts:]
• We were all talking about "data tsunamis" here.
• I joined the Broad here.
Under the hood: ~1TB of MongoDB
Organizations which design systems … are
constrained to produce designs which are copies of
the communication structures of those organizations
Melvin Conway, 1968
If you have four groups working on a compiler, you’ll
get a four pass compiler
Eric S. Raymond, The New Hacker's Dictionary, 1996
Never send a human to do a machine’s job.
Agent Smith, The Matrix
Broad IT Services
Traditional IT:
• Globally shared services
• NFS, AD / LDAP, DNS, …
• Many services provided using
public clouds
Responsibility: CIO
Cancer Genome Analysis / Connectivity Map
Billing Support:
• IT provides coordination between internal cost
objects and cloud vendor “projects” or “roles”
• No shared services
Responsibility: User
Cloud / Hybrid Model
• Granular shared services
• VPN used to expose selected
services to particular projects
Responsibility: Project / Service Lead
[Diagram: BITS DevOps, DSDE Dev, and Cloud Pilot projects, each reached over its own VPN]
The future is already here – it's just not very well
distributed
William Gibson
CycleCloud provides straightforward, recognizable cluster
functionality with autoscaling and a clean management UI.
Do not be fooled by the 85-page "quick start guide"; it's just a
cluster.
Instances are provisioned based
on queued jobs
3,000 tasks completed in two hours
(differential dependency on gene sets in R)
5 instances @ 32 cores: $8.54 / hr
This was a $20 analysis
Searching for the right use case …
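The cost figure above is easy to reproduce; a sketch, reading the $8.54/hr as the aggregate rate across all five instances (the slide does not say per-instance or aggregate, but only the aggregate reading matches "a $20 analysis"):

```python
# Reproduce the "$20 analysis" figure from the slide.
rate_per_hour = 8.54   # 5 instances @ 32 cores, read as the aggregate rate
hours = 2.0            # 3,000 tasks completed in two hours

total_cost = rate_per_hour * hours  # ~$17: "a $20 analysis" with headroom
cost_per_task = total_cost / 3000   # well under a cent per task
```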
CycleCloud on Google Preemptible
Instances
50,000+ cores used for ~2 hours
If you want to recruit the best people, you
have to hit home runs from time to time.
My Metal
[Stack diagram: provisioning layers for four deployment models]
Common to all: Network topology (VLANs, et al.); Broad configuration (Puppet); End-user-visible OS and vendor patches (Red Hat, plus Satellite); User or execution environment (Dotkit, Docker, JVM, Tomcat)
• Bare Metal: Hardware provisioning (UCS, xCAT); Boot image provisioning (PXE / Cobbler, Kickstart)
• Private Cloud: Hypervisor OS; Instance provisioning (Openstack)
• Public Cloud: Public cloud infrastructure; Instance provisioning (CycleCloud)
• Containerized Wonderland: … Docker / Mesos / Kubernetes / Cloud Foundry / Workflow Description Language / …
The basics still apply.
A nightmare* of files
[Diagram: the sequencing data flow, spread across the filers bragg, iodine, and argon]
• Sequencer → Flowcell Directories: base calling, paired reads (/seq/illumina); deleted after six weeks
• → Lane BAMs: aligned, not aggregated (/seq/picard)
• → Aggregated BAMs: aligned to a reference (/seq/picard_aggregation); "keep forever"
• → gVCF / VCF
A nightmare of files
[The same diagram, now spanning the filers bragg, iodine, knox, and argon]
• Aggregated BAMs: six months on high performance storage, then migrated to cost effective filers
Over time, these directories become a highly curated forest of symbolic links, spanning several filesystems.
A nightmare of files
[The same diagram again, now spanning bragg, iodine, knox, kiwi, flynn, argon, and mint]
Over time, these directories become a highly curated forest of symbolic links, spanning several filesystems.
A nightmare* of files
[The same diagram, spanning seven filers]
Setting aside the operational issues, meaningful access management is frankly impossible in this architecture.
Caching edge filers for shared references
[Diagram: the Openstack Production Farm and Shared Research Farm (80+ Gb/sec network) reach on premise data stores through a physical Avere Edge Filer (10 Gb/sec network)]
Coherence on small volumes of files is provided by a combination of clever network routing and Avere's caching algorithms.
Cloud-backed, file-based storage
[The same diagram, with cloud backed data stores in multiple public clouds behind the Avere Edge Filer]
We decided to call this fargo. It's cold, sort of far away, and not really where we were planning to go.
Caching edge filers for unlimited expansion space
[The same diagram, adding a virtual Avere Edge Filer in front of the cloud backed data stores]
Eventually we can stand up "cloud pods" that make direct reference to fargo.
[The sequencing pipeline diagram again, with Fargo (Avere backed, file storage) added alongside the on premise filers]
Setting aside the operational issues, meaningful access management is frankly impossible in this architecture.
This is cool, but it's not the answer.
[The same diagram: the pipeline backed by Fargo (Avere backed, file storage)]
This is cool, but it's not the answer.
Data push to “Fargo”
September 1, 2015:
• Sustained 250MB/sec for several weeks
• 646TB of files occupying 579TB of usable space (compression, even at 10%
savings, is totally worth it)
• Client side encryption in-line: Skip the conversation, just click the button.
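The compression claim above checks out; a quick sketch of the arithmetic:

```python
# Check the compression savings on the Fargo push.
logical_tb = 646.0   # TB of files written
physical_tb = 579.0  # TB of usable space actually consumed

savings = 1.0 - physical_tb / logical_tb  # ~10.4%
saved_tb = logical_tb - physical_tb       # 67 TB that never had to be bought
```

At petabyte scale, even a ~10% savings is tens of terabytes of storage you do not have to provision or pay for.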
The edges are still a little rough
The billing API is the best way to get usage
information out of google's cloud offerings.
[Console screenshots, with callouts:]
• "df" can be off by hundreds of TB. Eight exabytes is cool, though.
• I guess it's better than waiting all day for "du" to finish…
• We write ~250 objects, 1MB each, every second of every day. "ls" is not a meaningful tool at this scale.
• Old style dashboards simply won't cut it.
File based storage: The Information Limits
• Single namespace filers hit real-world limits at:
– ~5PB (restriping times, operational hotspots, MTBF headaches)
– ~10^9 files: Directories must either be wider or deeper than human
brains can handle.
• Filesystem paths are presumed to persist forever
– Leads inevitably to forests of symbolic links
• Access semantics are inadequate for the federated world.
– We need complex, dynamic, context sensitive semantics including
consent for research use.
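The wider-or-deeper trade-off for ~10^9 files can be made concrete: the minimum directory depth needed so that no directory exceeds a given fanout.

```python
def min_depth(n_files: int, fanout: int) -> int:
    """Smallest tree depth such that fanout**depth >= n_files.

    Integer arithmetic on purpose: float log ratios can round the
    wrong way at exact powers.
    """
    depth, capacity = 0, 1
    while capacity < n_files:
        capacity *= fanout
        depth += 1
    return depth

# A billion files: either directories of 1000+ entries, or paths five
# levels deep before any project structure. Neither is human-friendly.
print(min_depth(10**9, 100))   # 5
print(min_depth(10**9, 1000))  # 3
```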
Object storage
• It's still made out of disks and servers.
• You get the option of striping across on-premise and cloud in dynamic and sensible ways.
My object storage opinions
• The S3 standard defines object storage
– Any application that uses any special / proprietary features is a
nonstarter – including clever metadata stuff.
• All object storage must be durable to the loss of an entire
data center
– Conversations about sizing / usage need to be incredibly simple
• Must be cost effective at scale
– Throughput and latency are considerations, not requirements
– This breaks the data question into stewardship and usage
• Must not merely re-iterate the failure modes of filesystems
Do not call the tortoise unworthy because she is not
something else.
Walt Whitman, Song of Myself
Object Storage is different
• Filesystems
– I/O errors or stalls are rare, and are usually evidence of
serious problems
– Optimize for throughput by using long streaming reads
and writes.
• Object Storage
– I/O errors are common, with an expectation of several
retries
– Optimize for throughput by parallelizing and reducing the
cost of a retry
– Multipart upload and download are essential
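The pattern above (parallelize, and make a retry cheap) can be sketched generically. This is illustrative, not any particular SDK's API: `fetch` stands in for a ranged GET against an object store, and a transient failure costs one part, not the whole transfer.

```python
from concurrent.futures import ThreadPoolExecutor

def download(fetch, size, part_size=8 * 1024 * 1024, retries=3, workers=8):
    """Fetch an object as parallel ranged reads with cheap per-part retries.

    `fetch(start, end)` returns bytes for [start, end) and may raise
    IOError transiently -- only the failed part is re-requested.
    """
    def get_part(start):
        end = min(start + part_size, size)
        for attempt in range(retries):
            try:
                return start, fetch(start, end)
            except IOError:
                if attempt == retries - 1:
                    raise  # give up only after several tries
    starts = range(0, size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = dict(pool.map(get_part, starts))
    return b"".join(parts[s] for s in sorted(parts))
```

A filesystem-minded client would treat each IOError as fatal; here retries are expected and priced into the design.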
Broad Data Production, 2015:
~100TB/wk of unique information
“Data is heavy: It goes to the cheapest, closest place, and it stays
there”
Jeff Hammerbacher
This means that you should put data in its final resting place as
soon as it is generated. Anything else leads to madness.
[The pipeline diagram once more, now feeding a long term archive]
• Long term archive: object native; archive first
• Archived as CRAMmed, encrypted BAMs (not aligned, not aggregated)
• Must re-tool all pipelines to support object storage stage-in and stage-out.
Our long term archive must be "object native."
Once you have your archive right, all other data is transient.
[The same diagram]
• Long term archive: object native; archive first
Once you have your archive right, all other data is transient.
Once the long term archive is object-native, we can move the
main-line production to the cloud.
The dashboard should look opaque,
because metadata lives elsewhere.
The dashboard should look opaque
• Object “names” should be a bag of UUIDs
• Object storage should be basically unusable without the
metadata index.
• Anything else recapitulates the failure mode of file based
storage.
• This should scare you.
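A minimal sketch of the "bag of UUIDs" idea, with dicts standing in for the bucket and the metadata index (field names are illustrative):

```python
import uuid

def put_object(bucket, index, payload, metadata):
    """Store a payload under a meaningless name; meaning lives in the index."""
    name = str(uuid.uuid4())   # no sample IDs, dates, or paths in the name
    bucket[name] = payload
    index[name] = metadata     # the only route back to the object
    return name

def find(index, **query):
    """Lookups go through the metadata index, never through listing."""
    return [n for n, m in index.items()
            if all(m.get(k) == v for k, v in query.items())]
```

Listing the bucket shows only opaque UUIDs; without the index the store is, by design, unusable.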
Data Deletion @ Scale
Me: "Blah Blah … I think we're cool to delete about
600TB of data from a cloud bucket. What do you
think?"
Ray: "BOOM!"
• This was my first deliberate data deletion at this scale.
• It scared me how fast / easy it was.
• Considering a "pull request" model for large scale deletions.
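One way a "pull request" model for deletions might look: a proposed deletion is inert until a second person approves it. A sketch of the idea, not a product:

```python
def propose_deletion(queue, keys, requested_by):
    """Record intent to delete; nothing is removed yet."""
    request = {"keys": list(keys), "by": requested_by, "approved": False}
    queue.append(request)
    return request

def approve_and_execute(store, request, approver):
    """A second set of eyes is mandatory before anything disappears."""
    if approver == request["by"]:
        raise PermissionError("requester cannot approve their own deletion")
    request["approved"] = True
    for key in request["keys"]:
        store.pop(key, None)
```

The point is friction: deleting 600TB should take at least one more click than writing it did.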
Standards are needed for genomic data
“The mission of the Global Alliance for Genomics
and Health is to accelerate progress in human
health by helping to establish a common framework
of harmonized approaches to enable effective and
responsible sharing of genomic and clinical data,
and by catalyzing data sharing projects that drive
and demonstrate the value of data sharing.”
Regulatory Issues
Ethical Issues
Technical Issues
This stuff is important
We have an opportunity to change lives and health
outcomes, and to realize the gains of genomic medicine, this
year.
We also have an opportunity to waste vast amounts of
money and still not really help the world.
I would like to work together with you to build a better future,
sooner.
cdwan@broadinstitute.org
Conclusions
• Good news: The fundamentals still apply.
• Understand your data.
– Get intense about what you need, why you need it, who is responsible for it,
and how / when you plan to compute against it.
• Stop thinking about “moving” data.
– Archive first. After that, all copies are transient.
• Object storage is different from files
– at many weird levels.
• Elasticity in compute is not like elasticity in data
– Availability of CPUs vs. proximity to elastic compute.
– Also, “trash storage?”
The opposite of play is not work, it’s depression
Jane McGonigal, Reality is Broken
Thank You

More Related Content

PPTX
2017 bio it world
PPTX
2016 09 cxo forum
PPTX
2016 05 sanger
PPTX
2015 04 bio it world
PPTX
2013 bio it world
PPTX
2017 12 lab informatics summit
PPTX
So Long Computer Overlords
PPTX
Rpi talk foster september 2011
2017 bio it world
2016 09 cxo forum
2016 05 sanger
2015 04 bio it world
2013 bio it world
2017 12 lab informatics summit
So Long Computer Overlords
Rpi talk foster september 2011

What's hot (20)

PDF
Advanced Research Computing at York
PDF
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
PPTX
Empowering Transformational Science
PDF
Big Data: an introduction
PPTX
Accelerating data-intensive science by outsourcing the mundane
PPTX
2019 BioIt World - Post cloud legacy edition
PDF
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
PDF
Cloud Accelerated Genomics
PDF
Introduction to Big Data
PDF
IRJET- Systematic Review: Progression Study on BIG DATA articles
PPTX
Big Data - A brief introduction
PPTX
Cloud-native Enterprise Data Science Teams
PPTX
Mapping Life Science Informatics to the Cloud
PDF
Adoption of Cloud Computing in Scientific Research
PDF
Big Data and Bad Analogies
PPTX
Do It Yourself (DIY) Earth Science Collaboratories Using Best Practices and B...
PPTX
Big Data - An Overview
PDF
Briefing Room analyst comments - streaming analytics
PPT
The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...
PDF
The universe of identifiers and how ANDS is using them
Advanced Research Computing at York
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
Empowering Transformational Science
Big Data: an introduction
Accelerating data-intensive science by outsourcing the mundane
2019 BioIt World - Post cloud legacy edition
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Cloud Accelerated Genomics
Introduction to Big Data
IRJET- Systematic Review: Progression Study on BIG DATA articles
Big Data - A brief introduction
Cloud-native Enterprise Data Science Teams
Mapping Life Science Informatics to the Cloud
Adoption of Cloud Computing in Scientific Research
Big Data and Bad Analogies
Do It Yourself (DIY) Earth Science Collaboratories Using Best Practices and B...
Big Data - An Overview
Briefing Room analyst comments - streaming analytics
The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...
The universe of identifiers and how ANDS is using them
Ad

Viewers also liked (11)

PDF
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
 
PPTX
Clinical applications of NGS
ODP
Storage for next-generation sequencing
PPTX
Next generation sequencing
PPTX
Next Gen Sequencing (NGS) Technology Overview
PDF
NGS technologies - platforms and applications
PDF
Ngs intro_v6_public
PPTX
Ngs ppt
PDF
Introduction to next generation sequencing
PDF
BioIT World 2016 - HPC Trends from the Trenches
PPTX
A Comparison of NGS Platforms.
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
 
Clinical applications of NGS
Storage for next-generation sequencing
Next generation sequencing
Next Gen Sequencing (NGS) Technology Overview
NGS technologies - platforms and applications
Ngs intro_v6_public
Ngs ppt
Introduction to next generation sequencing
BioIT World 2016 - HPC Trends from the Trenches
A Comparison of NGS Platforms.
Ad

Similar to 2015 09 emc lsug (20)

PPT
Computing Outside The Box June 2009
ODP
Clouds, Grids and Data
PPT
Computing Outside The Box September 2009
PPT
Cyberinfrastructure and Applications Overview: Howard University June22
ODP
Cloud Experiences
PPT
20120524 cern data centre evolution v2
PPTX
Data-intensive applications on cloud computing resources: Applications in lif...
PDF
Big data and cloud computing 9 sep-2017
PDF
Coates bosc2010 clouds-fluff-and-no-substance
PPTX
Utilising Cloud Computing for Research through Infrastructure, Software and D...
PDF
AWS re:Invent - Accelerating Research
PPTX
Re invent announcements_2016_hcls_use_cases_mchampion
PDF
HPC Cluster Computing from 64 to 156,000 Cores 
KEY
Trends from the Trenches (Singapore Edition)
PPTX
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
ODP
Cloud accounting software uk
PDF
ClassCloud: switch your PC Classroom into Cloud Testbed
PDF
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
ODP
Clouds: All fluff and no substance?
PDF
Datacenter Computing with Apache Mesos - BigData DC
Computing Outside The Box June 2009
Clouds, Grids and Data
Computing Outside The Box September 2009
Cyberinfrastructure and Applications Overview: Howard University June22
Cloud Experiences
20120524 cern data centre evolution v2
Data-intensive applications on cloud computing resources: Applications in lif...
Big data and cloud computing 9 sep-2017
Coates bosc2010 clouds-fluff-and-no-substance
Utilising Cloud Computing for Research through Infrastructure, Software and D...
AWS re:Invent - Accelerating Research
Re invent announcements_2016_hcls_use_cases_mchampion
HPC Cluster Computing from 64 to 156,000 Cores 
Trends from the Trenches (Singapore Edition)
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
Cloud accounting software uk
ClassCloud: switch your PC Classroom into Cloud Testbed
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Clouds: All fluff and no substance?
Datacenter Computing with Apache Mesos - BigData DC

More from Chris Dwan (20)

PPTX
Data and Computing Infrastructure for the Life Sciences
PDF
Somerville Police Staffing Final Report.pdf
PDF
2023 Ward 2 community meeting.pdf
PPTX
One Size Does Not Fit All
PDF
Somerville FY23 Proposed Budget
PPTX
Production Bioinformatics, emphasis on Production
PPTX
#Defund thepolice
PPTX
2009 cluster user training
PPTX
No Free Lunch: Metadata in the life sciences
PDF
Somerville ufc memo tree hearing
PDF
2011 career-fair
PPTX
Advocacy in the Enterprise (what works, what doesn't)
PPTX
"The Cutting Edge Can Hurt You"
PPT
Introduction to HPC
PPT
Intro bioinformatics
PDF
Proposed tree protection ordinance
PDF
Tree Ordinance Change Matrix
PDF
Tree protection overhaul
PDF
Response from newport
PDF
Sacramento underpass bid_docs
Data and Computing Infrastructure for the Life Sciences
Somerville Police Staffing Final Report.pdf
2023 Ward 2 community meeting.pdf
One Size Does Not Fit All
Somerville FY23 Proposed Budget
Production Bioinformatics, emphasis on Production
#Defund thepolice
2009 cluster user training
No Free Lunch: Metadata in the life sciences
Somerville ufc memo tree hearing
2011 career-fair
Advocacy in the Enterprise (what works, what doesn't)
"The Cutting Edge Can Hurt You"
Introduction to HPC
Intro bioinformatics
Proposed tree protection ordinance
Tree Ordinance Change Matrix
Tree protection overhaul
Response from newport
Sacramento underpass bid_docs

Recently uploaded (20)

PPTX
Training Program for knowledge in solar cell and solar industry
PDF
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
PPTX
SGT Report The Beast Plan and Cyberphysical Systems of Control
PPTX
future_of_ai_comprehensive_20250822032121.pptx
PPTX
Internet of Everything -Basic concepts details
PPTX
Module 1 Introduction to Web Programming .pptx
PDF
Planning-an-Audit-A-How-To-Guide-Checklist-WP.pdf
PDF
Introduction to MCP and A2A Protocols: Enabling Agent Communication
PDF
Lung cancer patients survival prediction using outlier detection and optimize...
PDF
Accessing-Finance-in-Jordan-MENA 2024 2025.pdf
PDF
Co-training pseudo-labeling for text classification with support vector machi...
PDF
Connector Corner: Transform Unstructured Documents with Agentic Automation
PDF
4 layer Arch & Reference Arch of IoT.pdf
PDF
Improvisation in detection of pomegranate leaf disease using transfer learni...
PDF
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
PPTX
agenticai-neweraofintelligence-250529192801-1b5e6870.pptx
PPTX
MuleSoft-Compete-Deck for midddleware integrations
PDF
Advancing precision in air quality forecasting through machine learning integ...
PDF
SaaS reusability assessment using machine learning techniques
PDF
Convolutional neural network based encoder-decoder for efficient real-time ob...
Training Program for knowledge in solar cell and solar industry
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
SGT Report The Beast Plan and Cyberphysical Systems of Control
future_of_ai_comprehensive_20250822032121.pptx
Internet of Everything -Basic concepts details
Module 1 Introduction to Web Programming .pptx
Planning-an-Audit-A-How-To-Guide-Checklist-WP.pdf
Introduction to MCP and A2A Protocols: Enabling Agent Communication
Lung cancer patients survival prediction using outlier detection and optimize...
Accessing-Finance-in-Jordan-MENA 2024 2025.pdf
Co-training pseudo-labeling for text classification with support vector machi...
Connector Corner: Transform Unstructured Documents with Agentic Automation
4 layer Arch & Reference Arch of IoT.pdf
Improvisation in detection of pomegranate leaf disease using transfer learni...
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
agenticai-neweraofintelligence-250529192801-1b5e6870.pptx
MuleSoft-Compete-Deck for midddleware integrations
Advancing precision in air quality forecasting through machine learning integ...
SaaS reusability assessment using machine learning techniques
Convolutional neural network based encoder-decoder for efficient real-time ob...

2015 09 emc lsug

  • 1. Collaboration @ Scale September, 2015: Life Sciences User Group, Cambridge MA Chris Dwan ([email protected]) Director, Research Computing and Data Services Acting Director, IT
  • 2. Conclusions • Good news: The fundamentals still apply. • Understand your data. – Get intense about what you need and why you need it who is responsible and how / when you plan to compute against it. – This will require organizational courage. • Stop thinking about “moving” data. – Archive first. After that, all copies are transient. • Object storage is different from files – at many weird levels. • Elasticity in compute is not like elasticity in data – Availability of CPUs vs. proximity to elastic compute. – Also, “trash storage?”
  • 3. • The Broad Institute is a non-profit biomedical research institute founded in 2004 • Fifty core faculty members and hundreds of associate members from MIT and Harvard • ~1000 research and administrative personnel, plus ~2,400+ associated researchers • ~1.4 x 106 genotyped samples Programs and Initiatives focused on specific disease or biology areas Cancer Genome Biology Cell Circuits Psychiatric Disease Metabolism Medical and Population Genetics Infectious Disease Epigenomics Platforms focused technological innovation and application Genomics Therapeutics Imaging Metabolite Profiling Proteomics Genetic Perturbation The Broad Institute
  • 4. • The Broad Institute is a non-profit biomedical research institute founded in 2004 • Fifty core faculty members and hundreds of associate members from MIT and Harvard • ~1000 research and administrative personnel, plus ~2,400+ associated researchers • ~1.4 x 106 genotyped samples Programs and Initiatives focused on specific disease or biology areas Cancer Genome Biology Cell Circuits Psychiatric Disease Metabolism Medical and Population Genetics Infectious Disease Epigenomics Platforms focused technological innovation and application Genomics Therapeutics Imaging Metabolite Profiling Proteomics Genetic Perturbation The Broad Institute “This generation has a historic opportunity and responsibility to transform medicine by using systematic approaches in the biological sciences to dramatically accelerate the understanding and cure of disease”
  • 5. If a man’s at odds to know his own mind it’s because he hasn’t got aught but his mind to know it with. Cormac McCarthy, Blood Meridian or The Evening Redness in the West
  • 6. Broad Genomics Data Production 338 trillion base pairs (PF) in August At ~1.25 bytes per base: 422 TByte / month ~= 170 MByte / sec
  • 7. Broad Genomics Data Production: Context
  • 8. Broad Genomics Data Production: Context We were all talking about “data tsunamis” here.
  • 9. Broad Genomics Data Production: Context I joined the Broad here We were all talking about “data tsunamis” here.
  • 10. Under the hood: ~1TB of MongoDB
  • 11. Organizations which design systems … are constrained to produce designs which are copies of the communication structures of those organizations Melvin Conway, 1968
  • 12. If you have four groups working on a compiler, you’ll get a four pass compiler Eric S Raymond, The New Hacker’s Dictionary, 1996
  • 13. Never send a human to do a machine’s job. Agent Smith, The Matrix
  • 14. Broad IT Services Traditional IT: • Globally shared services • NFS, AD / LDAP, DNS, … • Many services provided using public clouds Responsibility: CIO
  • 15. Broad IT Services Traditional IT: • Globally shared services • NFS, AD / LDAP, DNS, … • Many services provided using public clouds Responsibility: CIO Cancer Genome Analysis Connectivity Map Billing Support: • IT provides coordination between internal cost objects and cloud vendor “projects” or “roles” • No shared services Responsibility: User
  • 16. Broad IT Services Traditional IT: • Globally shared services • NFS, AD / LDAP, DNS, … • Many services provided using public clouds Responsibility: CIO Cancer Genome Analysis Connectivity Map Billing Support: • IT provides coordination between internal cost objects and cloud vendor “projects” or “roles” • No shared services Responsibility: User Cloud / Hybrid Model • Granular shared services • VPN used to expose selected services to particular projects Responsibility: Project / Service Lead BITS DevOps DSDE Dev Cloud Pilot VPN VPN VPN
  • 17. The future is already here – it’s just not very well distributed William Gibson
  • 18. CycleCloud provides straightforward, recognizable cluster functionality with autoscaling and a clean management UI. Do not be fooled by the 85 page “quick start guide,” it’s just a cluster.
  • 19. Instances are provisioned based on queued jobs 3,000 tasks completed in two hours (differential dependency on gene sets in R) 5 instances @ 32 cores:$8.54 / hr This was a $20 analysis Searching for the right use case …
  • 20. Cycle Cloud on Google Pre-emptible Instances 50,000+ cores used for ~2 hours
  • 21. If you want to recruit the best people, you have to hit home runs from time to time.
  • 22. My Metal Boot Image Provisioning (PXE / Cobbler, Kickstart) Hardware Provisioning (UCS, Xcat) Broad configuration (Puppet) User or execution environment (Dotkit, docker, JVM, Tomcat) Hypervisor OS Instance Provisioning (Openstack) Bare Metal End User visible OS and vendor patches (Red Hat, plus satellite) Private Cloud Public Cloud Containerized Wonderland The basics still apply Network topology (VLANS, et al) Public Cloud Infrastructure Instance Provisioning (CycleCloud) … Docker / Mesos Kubernetes / Cloud Foundry / Workflow Description Language / …
• 23. A nightmare* of files. The pipeline flows from the sequencer through Flowcell Directories (base calling, paired reads; /seq/illumina), to Lane BAMs (aligned, not aggregated; /seq/picard; deleted after six weeks), to Aggregated BAMs (aligned to a reference; /seq/picard_aggregation), to gVCF and VCF, spread across filers (bragg, iodine, argon).
• 24. A nightmare of files. Lane BAMs are deleted after six weeks; aggregated BAMs spend six months on high-performance storage (knox), then migrate to cost-effective filers. Over time, these directories become a highly curated forest of symbolic links, spanning several filesystems.
• 25. A nightmare of files. The gVCFs and VCFs land on still more filers (kiwi, flynn, argon, mint).
• 26. A nightmare* of files. Setting aside the operational issues, meaningful access management is frankly impossible in this architecture.
• 27. Caching edge filers for shared references: an Avere Edge Filer (physical) sits between the on-premise data stores, the OpenStack production farm, and the shared research farm (10 Gb/sec and 80+ Gb/sec networks). Coherence on small volumes of files is provided by a combination of clever network routing and Avere’s caching algorithms.
• 28. Cloud-backed, file-based storage: the same Avere Edge Filer (physical) now also fronts cloud-backed data stores on multiple public clouds, alongside the on-premise data stores, the OpenStack production farm, and the shared research farm. We decided to call this fargo. It’s cold, sort of far away, and not really where we were planning to go.
• 29. Caching edge filers for unlimited expansion space: eventually we can stand up “cloud pods” (virtual Avere Edge Filers in the public clouds) that make direct reference to fargo.
• 30. The same nightmare of files, with Fargo (Avere-backed file storage) added as one more tier alongside the existing filers. This is cool, but it’s not the answer.
• 31. Setting aside the operational issues, meaningful access management is still frankly impossible in this architecture. Fargo is cool, but it’s not the answer.
• 32. Data push to “Fargo,” September 1, 2015: • Sustained 250 MB/sec for several weeks • 646 TB of files occupying 579 TB of usable space (compression, even at 10% savings, is totally worth it) • Client-side encryption in-line: skip the conversation, just click the button.
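The compression claim on this slide checks out arithmetically: 646 TB of files in 579 TB of usable space is just over the 10% savings the slide calls worthwhile.

```python
# Arithmetic from the slide: logical data vs. usable space consumed.
logical_tb = 646
usable_tb = 579
savings = 1 - usable_tb / logical_tb
print(f"compression savings: {savings:.1%}")  # -> 10.4%
```

At a sustained 250 MB/sec ingest, a tenth of the capacity bill back is not a rounding error.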
• 33. The edges are still a little rough: the billing API is the best way to get usage information out of cloud providers.
• 34. Out of Google’s cloud offerings in particular: “df” can be off by hundreds of TB.
• 35. Seriously? “df” is off by hundreds of TB. Eight exabytes is cool, though.
• 36. I guess it’s better than waiting all day for “du” to finish …
• 37. We write ~250 objects, 1 MB each, every second of every day. “ls” is not a meaningful tool at this scale.
• 38. Old-style dashboards simply won’t cut it.
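The write rate quoted above is why “ls” stops being meaningful; the arithmetic is worth making explicit:

```python
# ~250 objects of ~1 MB each, every second of every day (from the slide).
objects_per_sec = 250
object_mb = 1

objects_per_day = objects_per_sec * 86_400       # 21,600,000 objects/day
tb_per_day = objects_per_day * object_mb / 1e6   # ~21.6 TB/day

print(objects_per_day, tb_per_day)
```

Over 21 million new objects a day means any tool that enumerates a namespace linearly, whether “ls,” “du,” or a dashboard that walks the bucket, is obsolete on arrival; only aggregate accounting (hence the billing API) keeps up.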
• 39. File-based storage: the information limits. • Single-namespace filers hit real-world limits at: – ~5 PB (restriping times, operational hotspots, MTBF headaches) – ~10^9 files: directories must be either wider or deeper than human brains can handle. • Filesystem paths are presumed to persist forever – this leads inevitably to forests of symbolic links. • Access semantics are inadequate for the federated world – we need complex, dynamic, context-sensitive semantics, including consent for research use.
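The “wider or deeper than human brains can handle” claim at ~10^9 files can be made concrete with a small, hypothetical sizing calculation (the fanout figure is my assumption, not a Broad standard):

```python
# How many directory levels does a billion files need, if each
# directory is held to a humanly scannable ~1,000 entries?
total_files = 10**9
fanout = 1_000  # entries per directory a person can plausibly scan

depth, capacity = 0, 1
while capacity < total_files:
    capacity *= fanout
    depth += 1

print(depth)  # -> 3 levels of 1,000-entry directories
```

Either every path is three or more opaque levels deep, or some directory holds tens of thousands of entries; both ends of that trade-off defeat a human with a shell prompt.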
• 41. Object storage • It’s still made out of disks and servers. • You get the option of striping across on-premise and cloud in dynamic and sensible ways.
• 42. My object storage opinions • The S3 standard defines object storage – any application that uses special / proprietary features is a nonstarter, including clever metadata stuff. • All object storage must be durable to the loss of an entire data center – conversations about sizing / usage need to be incredibly simple. • Must be cost-effective at scale – throughput and latency are considerations, not requirements – this breaks the data question into stewardship and usage. • Must not merely reiterate the failure modes of filesystems.
• 43. “Do not call the tortoise unworthy because she is not something else.” (Walt Whitman, Song of Myself)
  • 44. Object Storage is different • Filesystems – I/O errors or stalls are rare, and are usually evidence of serious problems – Optimize for throughput by using long streaming reads and writes. • Object Storage – I/O errors are common, with an expectation of several retries – Optimize for throughput by parallelizing and reducing the cost of a retry – Multipart upload and download are essential
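The retry and multipart expectations on this slide can be sketched generically. This is a hypothetical illustration of the pattern, not a specific SDK: a backoff-with-jitter retry wrapper, plus an object split into parts so a failure only costs one part's retry rather than the whole transfer (`upload_part` is a placeholder for a real client's per-part PUT).

```python
import random
import time

def with_retries(op, attempts=5, base_delay=0.5):
    """Run op(), retrying transient I/O failures with exponential
    backoff and jitter -- the behavior object stores expect of clients."""
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.random())

def upload_part(part_number, data):
    # Placeholder for one part of a multipart upload; a real client
    # would PUT the bytes and record the returned ETag.
    return (part_number, len(data))

# Split the object so a retry re-sends one 4 MB-ish part, not the blob.
blob = b"x" * 10_000
part_size = 4_000
parts = [with_retries(lambda p=i // part_size, d=blob[i:i + part_size]:
                      upload_part(p, d))
         for i in range(0, len(blob), part_size)]
print(parts)  # -> [(0, 4000), (1, 4000), (2, 2000)]
```

Filesystem clients treat an I/O error as an alarm; object-store clients treat it as Tuesday, and the part-level retry is what keeps throughput high in spite of that.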
• 45. Broad data production, 2015: ~100 TB / wk of unique information. “Data is heavy: it goes to the cheapest, closest place, and it stays there.” (Jeff Hammerbacher) This means that you should put data in its final resting place as soon as it is generated. Anything else leads to madness.
• 46. Our long-term archive must be “object native.” The same pipeline (Sequencer → Flowcell Directories → Lane BAMs → Aggregated BAMs → gVCF / VCF) now also writes CRAMmed, encrypted BAMs (not aligned, not aggregated) straight to a long-term, object-native archive. Archive first: we must re-tool all pipelines to support object storage stage-in and stage-out. Once you have your archive right, all other data is transient.
• 47. Once the long-term archive is object-native, we can move the main-line production to the cloud.
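The stage-in / stage-out re-tooling mentioned above follows one wrapper pattern, sketched here with dictionaries standing in for buckets (the `fetch` / `store` interface, the sample names, and the toy tool are all hypothetical, not Broad's actual pipeline code):

```python
import tempfile
from pathlib import Path

def run_staged(tool, inputs, output_names, fetch, store):
    """Stage inputs from the object store to local scratch, run the
    tool, push outputs back, and discard the scratch copy -- so the
    archive copy, not a filesystem path, is the durable one."""
    with tempfile.TemporaryDirectory() as scratch_dir:
        scratch = Path(scratch_dir)
        local_in = [fetch(key, scratch) for key in inputs]
        tool(local_in, scratch)
        return [store(scratch / name) for name in output_names]

# Local stand-ins for object-store get/put (hypothetical interface).
archive = {"in/reads.bam": b"reads"}

def fetch(key, scratch):
    p = scratch / Path(key).name
    p.write_bytes(archive[key])
    return p

def store(path):
    key = "out/" + path.name
    archive[key] = path.read_bytes()
    return key

def tool(local_in, scratch):
    # Toy "pipeline step": reads.bam -> calls.vcf
    (scratch / "calls.vcf").write_bytes(local_in[0].read_bytes() + b"->vcf")

print(run_staged(tool, ["in/reads.bam"], ["calls.vcf"], fetch, store))
```

Because the scratch directory is deleted on exit, nothing the tool touched survives except what was explicitly stored back: exactly the “archive first, all other copies are transient” discipline.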
  • 48. The dashboard should look opaque, because metadata lives elsewhere.
  • 49. The dashboard should look opaque • Object “names” should be a bag of UUIDs • Object storage should be basically unusable without the metadata index. • Anything else recapitulates the failure mode of file based storage. • This should scare you.
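The “bag of UUIDs” idea on this slide can be sketched in a few lines. This is a hypothetical illustration (the in-memory dict index and the sample metadata are mine; a production index would be a real database): the object key carries no meaning, and every query must go through the metadata index.

```python
import uuid

index = {}  # metadata index; in production, a real database elsewhere

def store_object(bucket, payload, metadata):
    """Name the object with an opaque UUID and record all meaning in
    the index -- the bucket alone should be basically unusable."""
    key = str(uuid.uuid4())
    bucket[key] = payload
    index[key] = metadata
    return key

def find(**query):
    """Resolve metadata back to object keys; there is no other way in."""
    return [k for k, m in index.items()
            if all(m.get(f) == v for f, v in query.items())]

bucket = {}
key = store_object(bucket, b"...bam bytes...",
                   {"sample": "NA12878", "type": "bam"})
print(find(sample="NA12878") == [key])  # -> True
```

Listing the bucket yields nothing a human can read, which is the point: lose the index and the data is gone in every practical sense. That is the part that should scare you.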
  • 50. Data Deletion @ Scale Me: “Blah Blah … I think we’re cool to delete about 600TB of data from a cloud bucket. What do you think?”
  • 51. Data Deletion @ Scale Blah Blah … I think we’re cool to delete about 600TB of data from a cloud bucket Ray: “BOOM!”
  • 52. Data Deletion @ Scale Blah Blah … I think we’re cool to delete about 600TB of data from a cloud bucket • This was my first deliberate data deletion at this scale. • It scared me how fast / easy it was. • Considering a “pull request” model for large scale deletions.
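The “pull request” model for deletions under consideration above can be sketched as a two-phase protocol (all names and the in-memory request store are hypothetical illustrations, not an existing Broad tool): the proposer writes a manifest, a different person must approve it, and only then does anything get removed.

```python
deletions = {}  # pending deletion "pull requests"

def propose_delete(req_id, keys, author):
    """Phase 1: record intent. Nothing is removed yet."""
    deletions[req_id] = {"keys": list(keys), "author": author,
                         "approved_by": None}

def approve(req_id, reviewer):
    """Phase 2: a second set of eyes. Self-approval is forbidden."""
    req = deletions[req_id]
    if reviewer == req["author"]:
        raise PermissionError("author cannot approve their own deletion")
    req["approved_by"] = reviewer

def execute(req_id, bucket):
    """Phase 3: only an approved request can touch the bucket."""
    req = deletions[req_id]
    if req["approved_by"] is None:
        raise RuntimeError("deletion not approved")
    for key in req["keys"]:
        bucket.pop(key, None)

bucket = {"a": b"1", "b": b"2", "c": b"3"}
propose_delete("DR-1", ["a", "b"], author="me")
approve("DR-1", reviewer="ray")
execute("DR-1", bucket)
print(sorted(bucket))  # -> ['c']
```

The value is not the code but the friction: when 600 TB can vanish with one command and a “BOOM!”, the review step is the only thing standing between a typo and an irreversible loss.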
  • 53. Standards are needed for genomic data “The mission of the Global Alliance for Genomics and Health is to accelerate progress in human health by helping to establish a common framework of harmonized approaches to enable effective and responsible sharing of genomic and clinical data, and by catalyzing data sharing projects that drive and demonstrate the value of data sharing.” Regulatory Issues Ethical Issues Technical Issues
• 54. This stuff is important. We have an opportunity to change lives and health outcomes, and to realize the gains of genomic medicine, this year. We also have an opportunity to waste vast amounts of money and still not really help the world. I would like to work together with you to build a better future, sooner. cdwan@broadinstitute.org
• 55. Conclusions • Good news: the fundamentals still apply. • Understand your data. – Get intense about what you need, why you need it, who is responsible, and how / when you plan to compute against it. – This will require organizational courage. • Stop thinking about “moving” data. – Archive first. After that, all copies are transient. • Object storage is different from files – at many weird levels. • Elasticity in compute is not like elasticity in data – availability of CPUs vs. proximity to elastic compute. – Also, “trash storage?”
• 56. “The opposite of play is not work, it’s depression.” (Jane McGonigal, Reality is Broken) Thank You