Collaboration @ Scale
September, 2015: Life Sciences User Group, Cambridge MA
Chris Dwan (cdwan@broadinstitute.org)
Director, Research Computing and Data Services
Acting Director, IT
Conclusions
• Good news: The fundamentals still apply.
• Understand your data.
– Get intense about what you need, why you need it, who is responsible for it,
and how / when you plan to compute against it.
– This will require organizational courage.
• Stop thinking about “moving” data.
– Archive first. After that, all copies are transient.
• Object storage is different from files
– at many weird levels.
• Elasticity in compute is not like elasticity in data
– Availability of CPUs vs. proximity to elastic compute.
– Also, “trash storage?”
• The Broad Institute is a non-profit biomedical
research institute founded in 2004
• Fifty core faculty members and hundreds of associate
members from MIT and Harvard
• ~1000 research and administrative personnel, plus
~2,400+ associated researchers
• ~1.4 x 10^6 genotyped samples
Programs and Initiatives
focused on specific disease or biology areas
Cancer
Genome Biology
Cell Circuits
Psychiatric Disease
Metabolism
Medical and Population Genetics
Infectious Disease
Epigenomics
Platforms
focused on technological innovation and application
Genomics
Therapeutics
Imaging
Metabolite Profiling
Proteomics
Genetic Perturbation
The Broad Institute
“This generation has a historic opportunity and responsibility
to transform medicine by using systematic approaches in the
biological sciences to dramatically accelerate the
understanding and cure of disease”
If a man’s at odds to know his own mind it’s
because he hasn’t got aught but his mind to know
it with.
Cormac McCarthy, Blood Meridian or The Evening
Redness in the West
Broad Genomics Data Production
338 trillion base pairs (PF) in August
At ~1.25 bytes per base:
422 TByte / month ~= 170 MByte / sec
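The throughput arithmetic can be checked directly; a quick sketch, assuming a 30-day month:

```python
# Sanity-check the slide's throughput arithmetic (assuming a 30-day month).
bases_per_month = 338e12   # 338 trillion base pairs passing filter
bytes_per_base = 1.25      # rough storage cost per base, per the slide
bytes_per_month = bases_per_month * bytes_per_base

tbytes_per_month = bytes_per_month / 1e12              # ~422 TByte / month
mbytes_per_sec = bytes_per_month / (30 * 86400) / 1e6  # ~163 MByte / sec

print(f"{tbytes_per_month:.0f} TByte/month ~= {mbytes_per_sec:.0f} MByte/sec")
```

The slide's "~170 MByte / sec" rounds this up slightly.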
Broad Genomics Data Production: Context
[Chart: sequencing data production over time, with callouts:]
• We were all talking about "data tsunamis" here.
• I joined the Broad here.
Under the hood: ~1TB of MongoDB
Organizations which design systems … are
constrained to produce designs which are copies of
the communication structures of those organizations
Melvin Conway, 1968
If you have four groups working on a compiler, you’ll
get a four pass compiler
Eric S. Raymond, The New Hacker's Dictionary, 1996
Never send a human to do a machine’s job.
Agent Smith, The Matrix
Broad IT Services
Traditional IT:
• Globally shared services
• NFS, AD / LDAP, DNS, …
• Many services provided using
public clouds
Responsibility: CIO
Cancer Genome Analysis / Connectivity Map
Billing Support:
• IT provides coordination between internal cost
objects and cloud vendor “projects” or “roles”
• No shared services
Responsibility: User
Cloud / Hybrid Model
• Granular shared services
• VPN used to expose selected
services to particular projects
Responsibility: Project / Service Lead
[Diagram: BITS DevOps, DSDE Dev, and Cloud Pilot projects, each reached over its own VPN]
The future is already here – it's just not very well
distributed
William Gibson
CycleCloud provides straightforward, recognizable cluster
functionality with autoscaling and a clean management UI.
Do not be fooled by the 85-page "quick start guide"; it's just a
cluster.
Instances are provisioned based
on queued jobs
3,000 tasks completed in two hours
(differential dependency on gene sets in R)
5 instances @ 32 cores: $8.54 / hr
This was a $20 analysis
Searching for the right use case …
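The cost figure above is easy to reproduce; a sketch, reading the $8.54/hr as the aggregate rate across all five instances (the slide does not say per-instance or aggregate, but only the aggregate reading matches "a $20 analysis"):

```python
# Reproduce the "$20 analysis" figure from the slide.
rate_per_hour = 8.54   # 5 instances @ 32 cores, read as the aggregate rate
hours = 2.0            # 3,000 tasks completed in two hours

total_cost = rate_per_hour * hours  # ~$17: "a $20 analysis" with headroom
cost_per_task = total_cost / 3000   # well under a cent per task
```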
CycleCloud on Google Preemptible
Instances
50,000+ cores used for ~2 hours
If you want to recruit the best people, you
have to hit home runs from time to time.
My Metal
[Stack diagram: provisioning layers for four deployment models]
Common to all: Network topology (VLANs, et al.); Broad configuration (Puppet); End-user-visible OS and vendor patches (Red Hat, plus Satellite); User or execution environment (Dotkit, Docker, JVM, Tomcat)
• Bare Metal: Hardware provisioning (UCS, xCAT); Boot image provisioning (PXE / Cobbler, Kickstart)
• Private Cloud: Hypervisor OS; Instance provisioning (Openstack)
• Public Cloud: Public cloud infrastructure; Instance provisioning (CycleCloud)
• Containerized Wonderland: … Docker / Mesos / Kubernetes / Cloud Foundry / Workflow Description Language / …
The basics still apply.
A nightmare* of files
[Diagram: the sequencing data flow, spread across the filers bragg, iodine, and argon]
• Sequencer → Flowcell Directories: base calling, paired reads (/seq/illumina); deleted after six weeks
• → Lane BAMs: aligned, not aggregated (/seq/picard)
• → Aggregated BAMs: aligned to a reference (/seq/picard_aggregation); "keep forever"
• → gVCF / VCF
A nightmare of files
[The same diagram, now spanning the filers bragg, iodine, knox, and argon]
• Aggregated BAMs: six months on high performance storage, then migrated to cost effective filers
Over time, these directories become a highly curated forest of symbolic links, spanning several filesystems.
A nightmare of files
[The same diagram again, now spanning bragg, iodine, knox, kiwi, flynn, argon, and mint]
Over time, these directories become a highly curated forest of symbolic links, spanning several filesystems.
A nightmare* of files
[The same diagram, spanning seven filers]
Setting aside the operational issues, meaningful access management is frankly impossible in this architecture.
Caching edge filers for shared references
[Diagram: the Openstack Production Farm and Shared Research Farm (80+ Gb/sec network) reach on premise data stores through a physical Avere Edge Filer (10 Gb/sec network)]
Coherence on small volumes of files is provided by a combination of clever network routing and Avere's caching algorithms.
Cloud-backed, file-based storage
[The same diagram, with cloud backed data stores in multiple public clouds behind the Avere Edge Filer]
We decided to call this fargo. It's cold, sort of far away, and not really where we were planning to go.
Caching edge filers for unlimited expansion space
[The same diagram, adding a virtual Avere Edge Filer in front of the cloud backed data stores]
Eventually we can stand up "cloud pods" that make direct reference to fargo.
[The sequencing pipeline diagram again, with Fargo (Avere backed, file storage) added alongside the on premise filers]
Setting aside the operational issues, meaningful access management is frankly impossible in this architecture.
This is cool, but it's not the answer.
[The same diagram: the pipeline backed by Fargo (Avere backed, file storage)]
This is cool, but it's not the answer.
Data push to “Fargo”
September 1, 2015:
• Sustained 250MB/sec for several weeks
• 646TB of files occupying 579TB of usable space (compression, even at 10%
savings, is totally worth it)
• Client side encryption in-line: Skip the conversation, just click the button.
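The compression claim above checks out; a quick sketch of the arithmetic:

```python
# Check the compression savings on the Fargo push.
logical_tb = 646.0   # TB of files written
physical_tb = 579.0  # TB of usable space actually consumed

savings = 1.0 - physical_tb / logical_tb  # ~10.4%
saved_tb = logical_tb - physical_tb       # 67 TB that never had to be bought
```

At petabyte scale, even a ~10% savings is tens of terabytes of storage you do not have to provision or pay for.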
The edges are still a little rough
The billing API is the best way to get usage
information out of google's cloud offerings.
[Console screenshots, with callouts:]
• "df" can be off by hundreds of TB. Eight exabytes is cool, though.
• I guess it's better than waiting all day for "du" to finish…
• We write ~250 objects, 1MB each, every second of every day. "ls" is not a meaningful tool at this scale.
• Old style dashboards simply won't cut it.
File based storage: The Information Limits
• Single namespace filers hit real-world limits at:
– ~5PB (restriping times, operational hotspots, MTBF headaches)
– ~10^9 files: Directories must either be wider or deeper than human
brains can handle.
• Filesystem paths are presumed to persist forever
– Leads inevitably to forests of symbolic links
• Access semantics are inadequate for the federated world.
– We need complex, dynamic, context sensitive semantics including
consent for research use.
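The wider-or-deeper trade-off for ~10^9 files can be made concrete: the minimum directory depth needed so that no directory exceeds a given fanout.

```python
def min_depth(n_files: int, fanout: int) -> int:
    """Smallest tree depth such that fanout**depth >= n_files.

    Integer arithmetic on purpose: float log ratios can round the
    wrong way at exact powers.
    """
    depth, capacity = 0, 1
    while capacity < n_files:
        capacity *= fanout
        depth += 1
    return depth

# A billion files: either directories of 1000+ entries, or paths five
# levels deep before any project structure. Neither is human-friendly.
print(min_depth(10**9, 100))   # 5
print(min_depth(10**9, 1000))  # 3
```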
Object storage
• It's still made out of disks and servers.
• You get the option of striping across on-premise and cloud in dynamic and sensible ways.
My object storage opinions
• The S3 standard defines object storage
– Any application that uses any special / proprietary features is a
nonstarter – including clever metadata stuff.
• All object storage must be durable to the loss of an entire
data center
– Conversations about sizing / usage need to be incredibly simple
• Must be cost effective at scale
– Throughput and latency are considerations, not requirements
– This breaks the data question into stewardship and usage
• Must not merely re-iterate the failure modes of filesystems
Do not call the tortoise unworthy because she is not
something else.
Walt Whitman, Song of Myself
Object Storage is different
• Filesystems
– I/O errors or stalls are rare, and are usually evidence of
serious problems
– Optimize for throughput by using long streaming reads
and writes.
• Object Storage
– I/O errors are common, with an expectation of several
retries
– Optimize for throughput by parallelizing and reducing the
cost of a retry
– Multipart upload and download are essential
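The pattern above (parallelize, and make a retry cheap) can be sketched generically. This is illustrative, not any particular SDK's API: `fetch` stands in for a ranged GET against an object store, and a transient failure costs one part, not the whole transfer.

```python
from concurrent.futures import ThreadPoolExecutor

def download(fetch, size, part_size=8 * 1024 * 1024, retries=3, workers=8):
    """Fetch an object as parallel ranged reads with cheap per-part retries.

    `fetch(start, end)` returns bytes for [start, end) and may raise
    IOError transiently -- only the failed part is re-requested.
    """
    def get_part(start):
        end = min(start + part_size, size)
        for attempt in range(retries):
            try:
                return start, fetch(start, end)
            except IOError:
                if attempt == retries - 1:
                    raise  # give up only after several tries
    starts = range(0, size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = dict(pool.map(get_part, starts))
    return b"".join(parts[s] for s in sorted(parts))
```

A filesystem-minded client would treat each IOError as fatal; here retries are expected and priced into the design.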
Broad Data Production, 2015:
~100TB/wk of unique information
“Data is heavy: It goes to the cheapest, closest place, and it stays
there”
Jeff Hammerbacher
This means that you should put data in its final resting place as
soon as it is generated. Anything else leads to madness.
[The pipeline diagram once more, now feeding a long term archive]
• Long term archive: object native; archive first
• Archived as CRAMmed, encrypted BAMs (not aligned, not aggregated)
• Must re-tool all pipelines to support object storage stage-in and stage-out.
Our long term archive must be "object native."
Once you have your archive right, all other data is transient.
[The same diagram]
• Long term archive: object native; archive first
Once you have your archive right, all other data is transient.
Once the long term archive is object-native, we can move the
main-line production to the cloud.
The dashboard should look opaque,
because metadata lives elsewhere.
The dashboard should look opaque
• Object “names” should be a bag of UUIDs
• Object storage should be basically unusable without the
metadata index.
• Anything else recapitulates the failure mode of file based
storage.
• This should scare you.
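A minimal sketch of the "bag of UUIDs" idea, with dicts standing in for the bucket and the metadata index (field names are illustrative):

```python
import uuid

def put_object(bucket, index, payload, metadata):
    """Store a payload under a meaningless name; meaning lives in the index."""
    name = str(uuid.uuid4())   # no sample IDs, dates, or paths in the name
    bucket[name] = payload
    index[name] = metadata     # the only route back to the object
    return name

def find(index, **query):
    """Lookups go through the metadata index, never through listing."""
    return [n for n, m in index.items()
            if all(m.get(k) == v for k, v in query.items())]
```

Listing the bucket shows only opaque UUIDs; without the index the store is, by design, unusable.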
Data Deletion @ Scale
Me: "Blah Blah … I think we're cool to delete about
600TB of data from a cloud bucket. What do you
think?"
Ray: "BOOM!"
• This was my first deliberate data deletion at this scale.
• It scared me how fast / easy it was.
• Considering a "pull request" model for large scale deletions.
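One way a "pull request" model for deletions might look: a proposed deletion is inert until a second person approves it. A sketch of the idea, not a product:

```python
def propose_deletion(queue, keys, requested_by):
    """Record intent to delete; nothing is removed yet."""
    request = {"keys": list(keys), "by": requested_by, "approved": False}
    queue.append(request)
    return request

def approve_and_execute(store, request, approver):
    """A second set of eyes is mandatory before anything disappears."""
    if approver == request["by"]:
        raise PermissionError("requester cannot approve their own deletion")
    request["approved"] = True
    for key in request["keys"]:
        store.pop(key, None)
```

The point is friction: deleting 600TB should take at least one more click than writing it did.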
Standards are needed for genomic data
“The mission of the Global Alliance for Genomics
and Health is to accelerate progress in human
health by helping to establish a common framework
of harmonized approaches to enable effective and
responsible sharing of genomic and clinical data,
and by catalyzing data sharing projects that drive
and demonstrate the value of data sharing.”
Regulatory Issues
Ethical Issues
Technical Issues
This stuff is important
We have an opportunity to change lives and health
outcomes, and to realize the gains of genomic medicine, this
year.
We also have an opportunity to waste vast amounts of
money and still not really help the world.
I would like to work together with you to build a better future,
sooner.
cdwan@broadinstitute.org
Conclusions
• Good news: The fundamentals still apply.
• Understand your data.
– Get intense about what you need, why you need it, who is responsible for it,
and how / when you plan to compute against it.
• Stop thinking about “moving” data.
– Archive first. After that, all copies are transient.
• Object storage is different from files
– at many weird levels.
• Elasticity in compute is not like elasticity in data
– Availability of CPUs vs. proximity to elastic compute.
– Also, “trash storage?”
The opposite of play is not work, it’s depression
Jane McGonigal, Reality is Broken
Thank You

More Related Content

PPTX
2017 bio it world
PPTX
2016 09 cxo forum
PPTX
2016 05 sanger
PPTX
2015 04 bio it world
PPTX
2013 bio it world
PPTX
2017 12 lab informatics summit
PPTX
So Long Computer Overlords
PPTX
Rpi talk foster september 2011
2017 bio it world
2016 09 cxo forum
2016 05 sanger
2015 04 bio it world
2013 bio it world
2017 12 lab informatics summit
So Long Computer Overlords
Rpi talk foster september 2011

What's hot (20)

PDF
Advanced Research Computing at York
PDF
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
PPTX
Empowering Transformational Science
PDF
Big Data: an introduction
PPTX
Accelerating data-intensive science by outsourcing the mundane
PPTX
2019 BioIt World - Post cloud legacy edition
PDF
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
PDF
Cloud Accelerated Genomics
PDF
Introduction to Big Data
PDF
IRJET- Systematic Review: Progression Study on BIG DATA articles
PPTX
Big Data - A brief introduction
PPTX
Cloud-native Enterprise Data Science Teams
PPTX
Mapping Life Science Informatics to the Cloud
PDF
Adoption of Cloud Computing in Scientific Research
PDF
Big Data and Bad Analogies
PPTX
Do It Yourself (DIY) Earth Science Collaboratories Using Best Practices and B...
PPTX
Big Data - An Overview
PDF
Briefing Room analyst comments - streaming analytics
PPT
The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...
PDF
The universe of identifiers and how ANDS is using them
Advanced Research Computing at York
Chris Marsden, University of Essex (Plenary): Regulation, Standards, Governan...
Empowering Transformational Science
Big Data: an introduction
Accelerating data-intensive science by outsourcing the mundane
2019 BioIt World - Post cloud legacy edition
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Cloud Accelerated Genomics
Introduction to Big Data
IRJET- Systematic Review: Progression Study on BIG DATA articles
Big Data - A brief introduction
Cloud-native Enterprise Data Science Teams
Mapping Life Science Informatics to the Cloud
Adoption of Cloud Computing in Scientific Research
Big Data and Bad Analogies
Do It Yourself (DIY) Earth Science Collaboratories Using Best Practices and B...
Big Data - An Overview
Briefing Room analyst comments - streaming analytics
The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer We...
The universe of identifiers and how ANDS is using them
Ad

Viewers also liked (11)

PDF
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
 
PPTX
Clinical applications of NGS
ODP
Storage for next-generation sequencing
PPTX
Next generation sequencing
PPTX
Next Gen Sequencing (NGS) Technology Overview
PDF
NGS technologies - platforms and applications
PDF
Ngs intro_v6_public
PPTX
Ngs ppt
PDF
Introduction to next generation sequencing
PDF
BioIT World 2016 - HPC Trends from the Trenches
PPTX
A Comparison of NGS Platforms.
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
 
Clinical applications of NGS
Storage for next-generation sequencing
Next generation sequencing
Next Gen Sequencing (NGS) Technology Overview
NGS technologies - platforms and applications
Ngs intro_v6_public
Ngs ppt
Introduction to next generation sequencing
BioIT World 2016 - HPC Trends from the Trenches
A Comparison of NGS Platforms.
Ad

Similar to 2015 09 emc lsug (20)

PPT
Computing Outside The Box June 2009
ODP
Clouds, Grids and Data
PPT
Computing Outside The Box September 2009
PPT
Cyberinfrastructure and Applications Overview: Howard University June22
ODP
Cloud Experiences
PPT
20120524 cern data centre evolution v2
PPTX
Data-intensive applications on cloud computing resources: Applications in lif...
PDF
Big data and cloud computing 9 sep-2017
PDF
Coates bosc2010 clouds-fluff-and-no-substance
PPTX
Utilising Cloud Computing for Research through Infrastructure, Software and D...
PDF
AWS re:Invent - Accelerating Research
PPTX
Re invent announcements_2016_hcls_use_cases_mchampion
PDF
HPC Cluster Computing from 64 to 156,000 Cores 
KEY
Trends from the Trenches (Singapore Edition)
PPTX
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
ODP
Cloud accounting software uk
PDF
ClassCloud: switch your PC Classroom into Cloud Testbed
PDF
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
ODP
Clouds: All fluff and no substance?
PDF
Datacenter Computing with Apache Mesos - BigData DC
Computing Outside The Box June 2009
Clouds, Grids and Data
Computing Outside The Box September 2009
Cyberinfrastructure and Applications Overview: Howard University June22
Cloud Experiences
20120524 cern data centre evolution v2
Data-intensive applications on cloud computing resources: Applications in lif...
Big data and cloud computing 9 sep-2017
Coates bosc2010 clouds-fluff-and-no-substance
Utilising Cloud Computing for Research through Infrastructure, Software and D...
AWS re:Invent - Accelerating Research
Re invent announcements_2016_hcls_use_cases_mchampion
HPC Cluster Computing from 64 to 156,000 Cores 
Trends from the Trenches (Singapore Edition)
Cloud Computing: Safe Haven from the Data Deluge? AGBT 2011
Cloud accounting software uk
ClassCloud: switch your PC Classroom into Cloud Testbed
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Clouds: All fluff and no substance?
Datacenter Computing with Apache Mesos - BigData DC

More from Chris Dwan (20)

PPTX
Data and Computing Infrastructure for the Life Sciences
PDF
Somerville Police Staffing Final Report.pdf
PDF
2023 Ward 2 community meeting.pdf
PPTX
One Size Does Not Fit All
PDF
Somerville FY23 Proposed Budget
PPTX
Production Bioinformatics, emphasis on Production
PPTX
#Defund thepolice
PPTX
2009 cluster user training
PPTX
No Free Lunch: Metadata in the life sciences
PDF
Somerville ufc memo tree hearing
PDF
2011 career-fair
PPTX
Advocacy in the Enterprise (what works, what doesn't)
PPTX
"The Cutting Edge Can Hurt You"
PPT
Introduction to HPC
PPT
Intro bioinformatics
PDF
Proposed tree protection ordinance
PDF
Tree Ordinance Change Matrix
PDF
Tree protection overhaul
PDF
Response from newport
PDF
Sacramento underpass bid_docs
Data and Computing Infrastructure for the Life Sciences
Somerville Police Staffing Final Report.pdf
2023 Ward 2 community meeting.pdf
One Size Does Not Fit All
Somerville FY23 Proposed Budget
Production Bioinformatics, emphasis on Production
#Defund thepolice
2009 cluster user training
No Free Lunch: Metadata in the life sciences
Somerville ufc memo tree hearing
2011 career-fair
Advocacy in the Enterprise (what works, what doesn't)
"The Cutting Edge Can Hurt You"
Introduction to HPC
Intro bioinformatics
Proposed tree protection ordinance
Tree Ordinance Change Matrix
Tree protection overhaul
Response from newport
Sacramento underpass bid_docs

Recently uploaded (20)

PPTX
Training Program for knowledge in solar cell and solar industry
PDF
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
PPTX
SGT Report The Beast Plan and Cyberphysical Systems of Control
PPTX
future_of_ai_comprehensive_20250822032121.pptx
PPTX
Internet of Everything -Basic concepts details
PPTX
Module 1 Introduction to Web Programming .pptx
PDF
Planning-an-Audit-A-How-To-Guide-Checklist-WP.pdf
PDF
Introduction to MCP and A2A Protocols: Enabling Agent Communication
PDF
Lung cancer patients survival prediction using outlier detection and optimize...
PDF
Accessing-Finance-in-Jordan-MENA 2024 2025.pdf
PDF
Co-training pseudo-labeling for text classification with support vector machi...
PDF
Connector Corner: Transform Unstructured Documents with Agentic Automation
PDF
4 layer Arch & Reference Arch of IoT.pdf
PDF
Improvisation in detection of pomegranate leaf disease using transfer learni...
PDF
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
PPTX
agenticai-neweraofintelligence-250529192801-1b5e6870.pptx
PPTX
MuleSoft-Compete-Deck for midddleware integrations
PDF
Advancing precision in air quality forecasting through machine learning integ...
PDF
SaaS reusability assessment using machine learning techniques
PDF
Convolutional neural network based encoder-decoder for efficient real-time ob...
Training Program for knowledge in solar cell and solar industry
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
SGT Report The Beast Plan and Cyberphysical Systems of Control
future_of_ai_comprehensive_20250822032121.pptx
Internet of Everything -Basic concepts details
Module 1 Introduction to Web Programming .pptx
Planning-an-Audit-A-How-To-Guide-Checklist-WP.pdf
Introduction to MCP and A2A Protocols: Enabling Agent Communication
Lung cancer patients survival prediction using outlier detection and optimize...
Accessing-Finance-in-Jordan-MENA 2024 2025.pdf
Co-training pseudo-labeling for text classification with support vector machi...
Connector Corner: Transform Unstructured Documents with Agentic Automation
4 layer Arch & Reference Arch of IoT.pdf
Improvisation in detection of pomegranate leaf disease using transfer learni...
Transform-Your-Supply-Chain-with-AI-Driven-Quality-Engineering.pdf
agenticai-neweraofintelligence-250529192801-1b5e6870.pptx
MuleSoft-Compete-Deck for midddleware integrations
Advancing precision in air quality forecasting through machine learning integ...
SaaS reusability assessment using machine learning techniques
Convolutional neural network based encoder-decoder for efficient real-time ob...

2015 09 emc lsug

  • 1. Collaboration @ Scale September, 2015: Life Sciences User Group, Cambridge MA Chris Dwan ([email protected]) Director, Research Computing and Data Services Acting Director, IT
  • 2. Conclusions • Good news: The fundamentals still apply. • Understand your data. – Get intense about what you need and why you need it who is responsible and how / when you plan to compute against it. – This will require organizational courage. • Stop thinking about “moving” data. – Archive first. After that, all copies are transient. • Object storage is different from files – at many weird levels. • Elasticity in compute is not like elasticity in data – Availability of CPUs vs. proximity to elastic compute. – Also, “trash storage?”
  • 3. • The Broad Institute is a non-profit biomedical research institute founded in 2004 • Fifty core faculty members and hundreds of associate members from MIT and Harvard • ~1000 research and administrative personnel, plus ~2,400+ associated researchers • ~1.4 x 106 genotyped samples Programs and Initiatives focused on specific disease or biology areas Cancer Genome Biology Cell Circuits Psychiatric Disease Metabolism Medical and Population Genetics Infectious Disease Epigenomics Platforms focused technological innovation and application Genomics Therapeutics Imaging Metabolite Profiling Proteomics Genetic Perturbation The Broad Institute
  • 4. • The Broad Institute is a non-profit biomedical research institute founded in 2004 • Fifty core faculty members and hundreds of associate members from MIT and Harvard • ~1000 research and administrative personnel, plus ~2,400+ associated researchers • ~1.4 x 106 genotyped samples Programs and Initiatives focused on specific disease or biology areas Cancer Genome Biology Cell Circuits Psychiatric Disease Metabolism Medical and Population Genetics Infectious Disease Epigenomics Platforms focused technological innovation and application Genomics Therapeutics Imaging Metabolite Profiling Proteomics Genetic Perturbation The Broad Institute “This generation has a historic opportunity and responsibility to transform medicine by using systematic approaches in the biological sciences to dramatically accelerate the understanding and cure of disease”
  • 5. If a man’s at odds to know his own mind it’s because he hasn’t got aught but his mind to know it with. Cormac McCarthy, Blood Meridian or The Evening Redness in the West
  • 6. Broad Genomics Data Production 338 trillion base pairs (PF) in August At ~1.25 bytes per base: 422 TByte / month ~= 170 MByte / sec
  • 7. Broad Genomics Data Production: Context
  • 8. Broad Genomics Data Production: Context We were all talking about “data tsunamis” here.
  • 9. Broad Genomics Data Production: Context I joined the Broad here We were all talking about “data tsunamis” here.
  • 10. Under the hood: ~1TB of MongoDB
  • 11. Organizations which design systems … are constrained to produce designs which are copies of the communication structures of those organizations Melvin Conway, 1968
  • 12. If you have four groups working on a compiler, you’ll get a four pass compiler Eric S Raymond, The New Hacker’s Dictionary, 1996
  • 13. Never send a human to do a machine’s job. Agent Smith, The Matrix
  • 14. Broad IT Services Traditional IT: • Globally shared services • NFS, AD / LDAP, DNS, … • Many services provided using public clouds Responsibility: CIO
  • 15. Broad IT Services Traditional IT: • Globally shared services • NFS, AD / LDAP, DNS, … • Many services provided using public clouds Responsibility: CIO Cancer Genome Analysis Connectivity Map Billing Support: • IT provides coordination between internal cost objects and cloud vendor “projects” or “roles” • No shared services Responsibility: User
  • 16. Broad IT Services Traditional IT: • Globally shared services • NFS, AD / LDAP, DNS, … • Many services provided using public clouds Responsibility: CIO Cancer Genome Analysis Connectivity Map Billing Support: • IT provides coordination between internal cost objects and cloud vendor “projects” or “roles” • No shared services Responsibility: User Cloud / Hybrid Model • Granular shared services • VPN used to expose selected services to particular projects Responsibility: Project / Service Lead BITS DevOps DSDE Dev Cloud Pilot VPN VPN VPN
  • 17. The future is already here – it’s just not very well distributed William Gibson
  • 18. CycleCloud provides straightforward, recognizable cluster functionality with autoscaling and a clean management UI. Do not be fooled by the 85 page “quick start guide,” it’s just a cluster.
  • 19. Instances are provisioned based on queued jobs 3,000 tasks completed in two hours (differential dependency on gene sets in R) 5 instances @ 32 cores:$8.54 / hr This was a $20 analysis Searching for the right use case …
  • 20. Cycle Cloud on Google Pre-emptible Instances 50,000+ cores used for ~2 hours
  • 21. If you want to recruit the best people, you have to hit home runs from time to time.
  • 22. My Metal Boot Image Provisioning (PXE / Cobbler, Kickstart) Hardware Provisioning (UCS, Xcat) Broad configuration (Puppet) User or execution environment (Dotkit, docker, JVM, Tomcat) Hypervisor OS Instance Provisioning (Openstack) Bare Metal End User visible OS and vendor patches (Red Hat, plus satellite) Private Cloud Public Cloud Containerized Wonderland The basics still apply Network topology (VLANS, et al) Public Cloud Infrastructure Instance Provisioning (CycleCloud) … Docker / Mesos Kubernetes / Cloud Foundry / Workflow Description Language / …
• 23. A nightmare* of files. The pipeline flows from the sequencer through Flowcell Directories (base calling, paired reads; /seq/illumina), to Lane BAMs (aligned, not aggregated; /seq/picard; deleted after six weeks), to Aggregated BAMs (aligned to a reference; /seq/picard_aggregation), to gVCF and VCF, spread across filers (bragg, iodine, argon).
• 24. A nightmare of files. Lane BAMs are deleted after six weeks; aggregated BAMs spend six months on high-performance storage (knox), then migrate to cost-effective filers. Over time, these directories become a highly curated forest of symbolic links, spanning several filesystems.
• 25. A nightmare of files. The gVCFs and VCFs land on still more filers (kiwi, flynn, argon, mint).
• 26. A nightmare* of files. Setting aside the operational issues, meaningful access management is frankly impossible in this architecture.
• 27. Caching edge filers for shared references: an Avere Edge Filer (physical) sits between the on-premise data stores, the OpenStack production farm, and the shared research farm (10 Gb/sec and 80+ Gb/sec networks). Coherence on small volumes of files is provided by a combination of clever network routing and Avere’s caching algorithms.
• 28. Cloud-backed, file-based storage: the same Avere Edge Filer (physical) now also fronts cloud-backed data stores on multiple public clouds, alongside the on-premise data stores, the OpenStack production farm, and the shared research farm. We decided to call this fargo. It’s cold, sort of far away, and not really where we were planning to go.
• 29. Caching edge filers for unlimited expansion space: eventually we can stand up “cloud pods” (virtual Avere Edge Filers in the public clouds) that make direct reference to fargo.
• 30. The same nightmare of files, with Fargo (Avere-backed file storage) added as one more tier alongside the existing filers. This is cool, but it’s not the answer.
• 31. Setting aside the operational issues, meaningful access management is still frankly impossible in this architecture. Fargo is cool, but it’s not the answer.
• 32. Data push to “Fargo,” September 1, 2015: • Sustained 250 MB/sec for several weeks • 646 TB of files occupying 579 TB of usable space (compression, even at 10% savings, is totally worth it) • Client-side encryption in-line: skip the conversation, just click the button.
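The compression claim on this slide checks out arithmetically: 646 TB of files in 579 TB of usable space is just over the 10% savings the slide calls worthwhile.

```python
# Arithmetic from the slide: logical data vs. usable space consumed.
logical_tb = 646
usable_tb = 579
savings = 1 - usable_tb / logical_tb
print(f"compression savings: {savings:.1%}")  # -> 10.4%
```

At a sustained 250 MB/sec ingest, a tenth of the capacity bill back is not a rounding error.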
• 33. The edges are still a little rough: the billing API is the best way to get usage information out of cloud providers.
• 34. Out of Google’s cloud offerings in particular: “df” can be off by hundreds of TB.
• 35. Seriously? “df” is off by hundreds of TB. Eight exabytes is cool, though.
• 36. I guess it’s better than waiting all day for “du” to finish …
• 37. We write ~250 objects, 1 MB each, every second of every day. “ls” is not a meaningful tool at this scale.
• 38. Old-style dashboards simply won’t cut it.
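The write rate quoted above is why “ls” stops being meaningful; the arithmetic is worth making explicit:

```python
# ~250 objects of ~1 MB each, every second of every day (from the slide).
objects_per_sec = 250
object_mb = 1

objects_per_day = objects_per_sec * 86_400       # 21,600,000 objects/day
tb_per_day = objects_per_day * object_mb / 1e6   # ~21.6 TB/day

print(objects_per_day, tb_per_day)
```

Over 21 million new objects a day means any tool that enumerates a namespace linearly, whether “ls,” “du,” or a dashboard that walks the bucket, is obsolete on arrival; only aggregate accounting (hence the billing API) keeps up.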
• 39. File-based storage: the information limits. • Single-namespace filers hit real-world limits at: – ~5 PB (restriping times, operational hotspots, MTBF headaches) – ~10^9 files: directories must be either wider or deeper than human brains can handle. • Filesystem paths are presumed to persist forever – this leads inevitably to forests of symbolic links. • Access semantics are inadequate for the federated world – we need complex, dynamic, context-sensitive semantics, including consent for research use.
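The “wider or deeper than human brains can handle” claim at ~10^9 files can be made concrete with a small, hypothetical sizing calculation (the fanout figure is my assumption, not a Broad standard):

```python
# How many directory levels does a billion files need, if each
# directory is held to a humanly scannable ~1,000 entries?
total_files = 10**9
fanout = 1_000  # entries per directory a person can plausibly scan

depth, capacity = 0, 1
while capacity < total_files:
    capacity *= fanout
    depth += 1

print(depth)  # -> 3 levels of 1,000-entry directories
```

Either every path is three or more opaque levels deep, or some directory holds tens of thousands of entries; both ends of that trade-off defeat a human with a shell prompt.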
• 41. Object storage • It’s still made out of disks and servers. • You get the option of striping across on-premise and cloud in dynamic and sensible ways.
• 42. My object storage opinions • The S3 standard defines object storage – any application that uses special / proprietary features is a nonstarter, including clever metadata stuff. • All object storage must be durable to the loss of an entire data center – conversations about sizing / usage need to be incredibly simple. • Must be cost-effective at scale – throughput and latency are considerations, not requirements – this breaks the data question into stewardship and usage. • Must not merely reiterate the failure modes of filesystems.
• 43. “Do not call the tortoise unworthy because she is not something else.” (Walt Whitman, Song of Myself)
  • 44. Object Storage is different • Filesystems – I/O errors or stalls are rare, and are usually evidence of serious problems – Optimize for throughput by using long streaming reads and writes. • Object Storage – I/O errors are common, with an expectation of several retries – Optimize for throughput by parallelizing and reducing the cost of a retry – Multipart upload and download are essential
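The retry and multipart expectations on this slide can be sketched generically. This is a hypothetical illustration of the pattern, not a specific SDK: a backoff-with-jitter retry wrapper, plus an object split into parts so a failure only costs one part's retry rather than the whole transfer (`upload_part` is a placeholder for a real client's per-part PUT).

```python
import random
import time

def with_retries(op, attempts=5, base_delay=0.5):
    """Run op(), retrying transient I/O failures with exponential
    backoff and jitter -- the behavior object stores expect of clients."""
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.random())

def upload_part(part_number, data):
    # Placeholder for one part of a multipart upload; a real client
    # would PUT the bytes and record the returned ETag.
    return (part_number, len(data))

# Split the object so a retry re-sends one 4 MB-ish part, not the blob.
blob = b"x" * 10_000
part_size = 4_000
parts = [with_retries(lambda p=i // part_size, d=blob[i:i + part_size]:
                      upload_part(p, d))
         for i in range(0, len(blob), part_size)]
print(parts)  # -> [(0, 4000), (1, 4000), (2, 2000)]
```

Filesystem clients treat an I/O error as an alarm; object-store clients treat it as Tuesday, and the part-level retry is what keeps throughput high in spite of that.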
• 45. Broad data production, 2015: ~100 TB / wk of unique information. “Data is heavy: it goes to the cheapest, closest place, and it stays there.” (Jeff Hammerbacher) This means that you should put data in its final resting place as soon as it is generated. Anything else leads to madness.
• 46. Our long-term archive must be “object native.” The same pipeline (Sequencer → Flowcell Directories → Lane BAMs → Aggregated BAMs → gVCF / VCF) now also writes CRAMmed, encrypted BAMs (not aligned, not aggregated) straight to a long-term, object-native archive. Archive first: we must re-tool all pipelines to support object storage stage-in and stage-out. Once you have your archive right, all other data is transient.
• 47. Once the long-term archive is object-native, we can move the main-line production to the cloud.
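The stage-in / stage-out re-tooling mentioned above follows one wrapper pattern, sketched here with dictionaries standing in for buckets (the `fetch` / `store` interface, the sample names, and the toy tool are all hypothetical, not Broad's actual pipeline code):

```python
import tempfile
from pathlib import Path

def run_staged(tool, inputs, output_names, fetch, store):
    """Stage inputs from the object store to local scratch, run the
    tool, push outputs back, and discard the scratch copy -- so the
    archive copy, not a filesystem path, is the durable one."""
    with tempfile.TemporaryDirectory() as scratch_dir:
        scratch = Path(scratch_dir)
        local_in = [fetch(key, scratch) for key in inputs]
        tool(local_in, scratch)
        return [store(scratch / name) for name in output_names]

# Local stand-ins for object-store get/put (hypothetical interface).
archive = {"in/reads.bam": b"reads"}

def fetch(key, scratch):
    p = scratch / Path(key).name
    p.write_bytes(archive[key])
    return p

def store(path):
    key = "out/" + path.name
    archive[key] = path.read_bytes()
    return key

def tool(local_in, scratch):
    # Toy "pipeline step": reads.bam -> calls.vcf
    (scratch / "calls.vcf").write_bytes(local_in[0].read_bytes() + b"->vcf")

print(run_staged(tool, ["in/reads.bam"], ["calls.vcf"], fetch, store))
```

Because the scratch directory is deleted on exit, nothing the tool touched survives except what was explicitly stored back: exactly the “archive first, all other copies are transient” discipline.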
  • 48. The dashboard should look opaque, because metadata lives elsewhere.
  • 49. The dashboard should look opaque • Object “names” should be a bag of UUIDs • Object storage should be basically unusable without the metadata index. • Anything else recapitulates the failure mode of file based storage. • This should scare you.
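The “bag of UUIDs” idea on this slide can be sketched in a few lines. This is a hypothetical illustration (the in-memory dict index and the sample metadata are mine; a production index would be a real database): the object key carries no meaning, and every query must go through the metadata index.

```python
import uuid

index = {}  # metadata index; in production, a real database elsewhere

def store_object(bucket, payload, metadata):
    """Name the object with an opaque UUID and record all meaning in
    the index -- the bucket alone should be basically unusable."""
    key = str(uuid.uuid4())
    bucket[key] = payload
    index[key] = metadata
    return key

def find(**query):
    """Resolve metadata back to object keys; there is no other way in."""
    return [k for k, m in index.items()
            if all(m.get(f) == v for f, v in query.items())]

bucket = {}
key = store_object(bucket, b"...bam bytes...",
                   {"sample": "NA12878", "type": "bam"})
print(find(sample="NA12878") == [key])  # -> True
```

Listing the bucket yields nothing a human can read, which is the point: lose the index and the data is gone in every practical sense. That is the part that should scare you.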
  • 50. Data Deletion @ Scale Me: “Blah Blah … I think we’re cool to delete about 600TB of data from a cloud bucket. What do you think?”
  • 51. Data Deletion @ Scale Blah Blah … I think we’re cool to delete about 600TB of data from a cloud bucket Ray: “BOOM!”
  • 52. Data Deletion @ Scale Blah Blah … I think we’re cool to delete about 600TB of data from a cloud bucket • This was my first deliberate data deletion at this scale. • It scared me how fast / easy it was. • Considering a “pull request” model for large scale deletions.
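The “pull request” model for deletions under consideration above can be sketched as a two-phase protocol (all names and the in-memory request store are hypothetical illustrations, not an existing Broad tool): the proposer writes a manifest, a different person must approve it, and only then does anything get removed.

```python
deletions = {}  # pending deletion "pull requests"

def propose_delete(req_id, keys, author):
    """Phase 1: record intent. Nothing is removed yet."""
    deletions[req_id] = {"keys": list(keys), "author": author,
                         "approved_by": None}

def approve(req_id, reviewer):
    """Phase 2: a second set of eyes. Self-approval is forbidden."""
    req = deletions[req_id]
    if reviewer == req["author"]:
        raise PermissionError("author cannot approve their own deletion")
    req["approved_by"] = reviewer

def execute(req_id, bucket):
    """Phase 3: only an approved request can touch the bucket."""
    req = deletions[req_id]
    if req["approved_by"] is None:
        raise RuntimeError("deletion not approved")
    for key in req["keys"]:
        bucket.pop(key, None)

bucket = {"a": b"1", "b": b"2", "c": b"3"}
propose_delete("DR-1", ["a", "b"], author="me")
approve("DR-1", reviewer="ray")
execute("DR-1", bucket)
print(sorted(bucket))  # -> ['c']
```

The value is not the code but the friction: when 600 TB can vanish with one command and a “BOOM!”, the review step is the only thing standing between a typo and an irreversible loss.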
  • 53. Standards are needed for genomic data “The mission of the Global Alliance for Genomics and Health is to accelerate progress in human health by helping to establish a common framework of harmonized approaches to enable effective and responsible sharing of genomic and clinical data, and by catalyzing data sharing projects that drive and demonstrate the value of data sharing.” Regulatory Issues Ethical Issues Technical Issues
• 54. This stuff is important. We have an opportunity to change lives and health outcomes, and to realize the gains of genomic medicine, this year. We also have an opportunity to waste vast amounts of money and still not really help the world. I would like to work together with you to build a better future, sooner. cdwan@broadinstitute.org
• 55. Conclusions • Good news: the fundamentals still apply. • Understand your data. – Get intense about what you need, why you need it, who is responsible, and how / when you plan to compute against it. – This will require organizational courage. • Stop thinking about “moving” data. – Archive first. After that, all copies are transient. • Object storage is different from files – at many weird levels. • Elasticity in compute is not like elasticity in data – availability of CPUs vs. proximity to elastic compute. – Also, “trash storage?”
• 56. “The opposite of play is not work, it’s depression.” (Jane McGonigal, Reality is Broken) Thank You