Cloud Experiences Guy Coates
Wellcome Trust Sanger Institute
[email_address]
The Sanger Institute. Funded by the Wellcome Trust, the 2nd largest research charity in the world.
~700 employees.
Based on the Hinxton Genome Campus, Cambridge, UK. Large-scale genomic research. Sequenced 1/3 of the human genome (the largest single contribution).
We have active cancer, malaria, pathogen and genomic variation / human health studies. All data is made publicly available: websites, FTP, direct database access, programmatic APIs.
DNA Sequencing: 250 million × 75-108 base fragments vs. the human genome (3 Gbases). Example read fragments:
TCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG AAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA TGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG TGCACTCCAGCTTGGGTGACACAG CAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG AAATAATCAGTTTCCTAAGATTTTTTTCCTGAAAAATACACATTTGGTTTCA ATGAAGTAAATCG ATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC
Moore's Law: compute/disk doubles every 18 months. Sequencing doubles every 12 months.
Economic trends. The Human Genome Project: 13 years,
23 labs,
$500 million. A human genome today: 3 days,
1 machine,
$8,000. The trend will continue: a $500 genome is probable within 3-5 years.
The scary graph: peak yearly capillary sequencing was 30 Gbases; current weekly sequencing is 6,000 Gbases.
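To make the mismatch concrete, here is a back-of-the-envelope comparison of the two doubling rates quoted above (an illustrative sketch, not a figure from the slides): sequencing output doubling every 12 months against compute/disk doubling every 18 months.

```python
# Illustrative only: how fast the gap between sequencing output and
# compute/disk capacity grows, given the doubling times quoted above.
for years in (1, 3, 6):
    sequencing = 2 ** (years / 1.0)   # doubles every 12 months
    compute = 2 ** (years / 1.5)      # doubles every 18 months
    print(years, round(sequencing / compute, 1))
# After 6 years the sequencing curve is ~4x further ahead of Moore's law.
```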
Our Science
UK10K Project: decode the genomes of 10,000 people in the UK.
This will improve the understanding of human genetic variation and disease. (Genome Research Limited: "Wellcome Trust launches study of 10,000 human genomes in UK", 24 June 2010; www.sanger.ac.uk/about/press/2010/100624-uk10k.html)
New scale, new insights... into common disease:
Coronary heart disease
Hypertension
Bipolar disorder
Arthritis
Obesity
Diabetes (types I and II)
Breast cancer
Malaria
Tuberculosis
Cancer Genome Project. Cancer is a disease caused by abnormalities in a cell's genome.
Detailed Changes: Sequencing hundreds of cancer samples
First comprehensive look at cancer genomes: Lung cancer,
Malignant melanoma
Breast cancer. Identify driver mutations for: Improved diagnostics,
Development of novel therapies
Targeting of existing therapeutics. ("Lung cancer and melanoma laid bare", 16 December 2009; www.sanger.ac.uk/about/press/2009/091216.html)
IT Challenges
Managing growth. Analysing the data takes a lot of compute and disk space; finished sequence is the start of the problem, not the end. Growth of compute & storage: storage/compute requirements double every 12 months, reaching ~12 PB of raw data in 2010. Moore's law will not save us.
$1,000 genome*
*Informatics not included
Sequencing data flow: Sequencer -> Processing/QC -> Comparative analysis -> datastore -> Internet. Data volumes shrink at each stage: raw data (10 TB) -> sequence (500 GB) -> alignments (200 GB) -> variation data (1 GB) -> features (3 MB). A mix of structured data (databases) and unstructured data (flat files).
Data centre: 4 × 250 m² data centres, with 2-4 kW/m² of cooling,
1.8 MW power draw,
1.5 PUE. Overhead aircon, power and networking allow counter-current cooling.
Focus on power- and space-efficient storage and compute. Technology refresh: one data centre is kept as an empty shell; rotate into the empty room every 4 years and refurbish (the "fallow field" principle).
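For readers unfamiliar with PUE, the figure above can be unpacked as follows; the split below assumes the 1.8 MW quoted is the total facility draw, which the slides do not state explicitly.

```python
# What the PUE figure means: PUE = total facility power / IT equipment power.
# At PUE 1.5, a third of the electricity goes on cooling and other overheads.
pue = 1.5
total_mw = 1.8                   # assumed to be the total facility draw
it_mw = total_mw / pue           # ~1.2 MW to compute and storage
overhead_mw = total_mw - it_mw   # ~0.6 MW to cooling, power distribution etc.
print(it_mw, overhead_mw)
```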
Our HPC infrastructure. Compute: 8,500 cores,
10GigE / 1GigE networking. High-performance storage: 1.5 PB of DDN 9000 & 10000 storage,
Lustre filesystem, LSF queuing system.
Ensembl: data visualisation / mining web services (www.ensembl.org).
Provides web / programmatic interfaces to genomic data.
10k visitors / 126k page views per day. Compute pipeline (HPTC workload): take a raw genome and run it through a compute pipeline to find genes and other features of interest.
Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes. The software is Open Source (Apache licence).
Data is free for download.
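As a minimal illustration of the programmatic route into Ensembl data, the public MySQL server at ensembldb.ensembl.org accepts anonymous read-only connections; the sketch below simply lists the human core databases. The production pipelines use the Ensembl Perl API, and the pymysql driver and port used here are assumptions.

```python
# A minimal sketch of "direct database access" to Ensembl's public MySQL
# server (anonymous, read-only). Port 5306 serves recent releases.
import pymysql  # assumes the pymysql driver is installed

conn = pymysql.connect(host="ensembldb.ensembl.org", port=5306,
                       user="anonymous", password="")
with conn.cursor() as cur:
    # List the human "core" databases (one per Ensembl release).
    cur.execute("SHOW DATABASES LIKE 'homo_sapiens_core%'")
    for (name,) in cur.fetchall():
        print(name)
conn.close()
```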
Sequencing data flow, mapped onto the infrastructure: Sequencer -> Processing/QC -> Comparative analysis (HPC compute pipeline) -> datastore -> Internet (web / database infrastructure). Raw data (10 TB) -> sequence (500 GB) -> alignments (200 GB) -> variation data (1 GB) -> features (3 MB); structured data (databases) and unstructured data (flat files).
TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG GAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAA TTGGAAAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA TTTAGAGAAGAGAAAGCAAACATATTATAAGTTTAATTCTTATATTTAAAAATAGGAGCC AAGTATGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC TTGAGACCAGGAGTTTGATACCAGCCTGGGCAACATAGCAAGATGTTATCTCTACACAAA ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG AAGCAGGAGGGTTACTTGAGCCCAGGAGTTTGAGGTTGCAGTGAGCTATGATTGTGCCAC TGCACTCCAGCTTGGGTGACACAGCAAAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG AACATCTCATTTTCACACTGAAATGTTGACTGAAATCATTAAACAATAAAATCATAAAAG AAAAATAATCAGTTTCCTAAGAAATGATTTTTTTTCCTGAAAAATACACATTTGGTTTCA GAGAATTTGTCTTATTAGAGACCATGAGATGGATTTTGTGAAAACTAAAGTAACACCATT ATGAAGTAAATCGTGTATATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC
Annotation
Why Cloud?

Web Services. Ensembl has a worldwide audience.
Historically, web site performance was not great, especially for non-European institutes. Pages were quite heavyweight.
Not properly cached, etc. The web team spent a lot of time re-designing the code to make it more streamlined, which greatly improved performance. But coding can only get you so far: there is a 150-240 ms round-trip time from Europe to the US.
We need a set of geographically dispersed mirrors.
Colocation. Real machines in a co-lo facility in California: a traditional mirror. Hardware was initially configured on site (16 servers, SAN storage, SAN switches, SAN management appliance, Ethernet switches, firewall, out-of-band management etc.) and shipped to the co-lo for installation. Sent a person to California for 3 weeks.
Spent 1 week getting stuff into/out of customs (FCC paperwork!). Additional infrastructure work: a VPN between the UK and the US. Incredibly time consuming, and we really don't want to end up having to send someone on a plane to the US to fix things.
Cloud Opportunities. We wanted more mirrors: US East coast, Asia-Pacific. Investigations into AWS were already ongoing.
Many people would like to run the Ensembl webcode to visualise their own data, but it is non-trivial for the non-expert user (MySQL, Apache, Perl). Can we distribute ready-to-run AMIs instead? Can we eat our own dog food and run the mirror site from those AMIs?
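For illustration, launching a ready-to-run AMI from a user's point of view might look like the boto3 sketch below; the AMI ID, key pair and instance type are placeholders, not real Ensembl artefacts.

```python
# Hedged sketch: launch one instance from a (hypothetical) Ensembl AMI.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="m1.large",           # placeholder instance size
    KeyName="my-keypair",              # your own key pair
    MinCount=1,
    MaxCount=1,
)
print(resp["Instances"][0]["InstanceId"])
```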
What we actually did: AWS and Sanger connected by a VPN.
Building a mirror on AWS. Application development was required: significant code changes to make the webcode "mirror aware" (mostly done already for the original co-location site). Some software development / sysadmin work was also needed: preparation of OS images, software stack configuration.
VPN configuration. A significant amount of tuning was required; initial MySQL performance was pretty bad, especially for the large Ensembl databases (~1 TB).
Lots of people run Apache/MySQL on AWS, so there is a good amount of best practice available.
Is it cost effective? Lots of misleading cost statements are made about cloud: "Our analysis only cost $500."
"CPU is only $0.085/hr." What are we comparing against? Doing the analysis once? Continually?
Buying a $2,000 server?
Leasing a $2,000 server for 3 years?
Using $150 of time at your local supercomputing facility?
Buying $2,000 of server but having to build a $1M datacentre to put it in? It requires the dreaded Total Cost of Ownership (TCO) calculation: hardware + power + cooling + facilities + admin/developers etc. Incredibly hard to do.
Breakdown: comparing costs to the "real" co-lo. Power and cooling costs are all included.
Admin costs are the same, so we can ignore them; the same people are responsible for both. Cost for the co-location facility: $120,000 of hardware + $51,000/yr co-lo fees,
i.e. $91,000 per year (over a 3-year hardware lifetime). Cost for the AWS site: $84,000 per year. We can run 3 mirrors for 90% of the cost of 1 mirror.
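A worked version of the comparison, using the figures on the slides; the reading that the $84,000 covers the cloud mirrors being compared is our interpretation of the "3 mirrors for 90% of the cost of 1" claim.

```python
# Figures from the slides; interpretation of the AWS figure is hedged above.
colo_hardware = 120_000        # one-off hardware cost, $
colo_fees_per_year = 51_000    # co-lo facility fees, $/yr
hardware_lifetime = 3          # years

colo_per_year = colo_hardware / hardware_lifetime + colo_fees_per_year
print(colo_per_year)           # 91,000 $/yr for one co-lo mirror

aws_per_year = 84_000          # AWS cost quoted on the slide, $/yr
print(aws_per_year / colo_per_year)  # ~0.92, i.e. roughly "90% of the cost"
```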
It is not free!
Advantages. No physical hardware: work can start as soon as we enter our credit card numbers...
No US customs, FedEx etc. Less hardware: no firewalls, SAN management appliances etc. Much simpler management infrastructure; AWS gives you out-of-band management "for free".
No hardware issues. An easy path for growth: no space constraints, no need to get tin decommissioned/re-installed at the co-lo. Add more machines until we run out of cash.
Downsides. We underestimated the time it would take to make the webcode mirror-ready. Not a cloud-specific problem, but something to be aware of when you take big applications and move them outside your home institution. Curation of software images takes time: there are regular releases of new data and code.
The Ensembl team now has a dedicated person responsible for the cloud.
Somebody has to look after the systems: the management overhead does not necessarily go down, but it does change.
Going forward: change the code to remove all dependencies on Sanger (full DR capability). Make the AMIs publicly available; today we have MySQL servers + data, with the data generously hosted on Amazon Public Datasets. Allow users to simply run their own sites.
Why HPC in the Cloud? We already have a data centre; we are not seeking to replace our existing infrastructure.
That would not be cost effective. But: there are long lead times for installing kit, ~3-6 months from idea to going live.
Longer than the science can wait.
The ability to burst capacity might be useful: test environments, testing at scale.
Large clusters for a short amount of time.
Distributing analysis tools. Sequencing is becoming a commodity.
Informatics / analysis tools need to become commodity too.
Today they require a significant amount of domain knowledge: complicated software installs, relational databases etc. Goal: a researcher with no IT knowledge can take their sequence data, upload it to AWS, get it analysed and view the results.
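The "upload it to AWS" step of that goal is already simple to script; a minimal boto3 sketch, with a hypothetical bucket and file name:

```python
# Minimal sketch of uploading sequence data to S3; names are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file("sample_lane1.fastq",       # local sequence data file
               "my-sequencing-bucket",     # hypothetical bucket you own
               "runs/sample_lane1.fastq")  # object key in S3
```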
Life Sciences HPC workloads sit on two axes: tightly coupled (MPI) vs embarrassingly parallel, and CPU-bound vs IO-bound; modelling/docking, simulation and genomics each occupy different corners of that space.
Our workload: embarrassingly parallel. Lots of single-threaded jobs.
A Perl pipeline manager to generate and manage the workflow.
A batch scheduler to execute jobs on nodes.
A MySQL database to hold results & state. Moderate memory sizes: 3 GB/core. IO bound: fast parallel filesystems.
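The embarrassingly parallel pattern is easy to picture: many independent single-threaded jobs handed to the batch scheduler. The real pipeline manager is written in Perl; the Python loop below is only an illustration, with placeholder chunk files, queue name and alignment command.

```python
# Illustrative sketch of submitting many independent jobs to LSF via bsub.
import subprocess

chunks = [f"chunk_{i:04d}.fa" for i in range(1000)]   # hypothetical inputs
for i, chunk in enumerate(chunks):
    subprocess.run([
        "bsub",                        # LSF job submission
        "-q", "normal",                # queue name (site specific)
        "-J", f"align_{i}",            # job name
        "-o", f"logs/align_{i}.out",   # stdout/stderr log file
        f"align_tool {chunk} > results_{i}.txt",  # hypothetical command
    ], check=True)
```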
On that workload map, genomics sits at the embarrassingly parallel, IO-bound end.
Different architectures. Traditional HPC: CPUs joined by a fat network to a POSIX global filesystem, driven by a batch scheduler. Cloud: CPUs on a thin network, each with local storage, plus S3. Hadoop?
That architectural difference determines which parts of the workload map move easily.
Careful choice of problem: choose a simple part of the pipeline and re-factor all the code that expects a global filesystem to make it use S3. Why not use Hadoop? We have production code that works nicely inside Sanger.
Porting it would be a vast effort, for little benefit.
There are also questions about stability for multi-user systems internally. Instead, build a self-assembling HPC cluster: code which spins up AWS images that assemble themselves into an HPC cluster with a batch scheduler. Cloud allows you to simplify: the Sanger compute cluster is shared, with lots of complexity in ensuring applications/users play nicely together, whereas AWS clusters are unique to a user/application.
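The "make it use S3" refactor boils down to swapping POSIX reads and writes on the shared filesystem for object get/put calls. A hedged sketch, with a placeholder bucket standing in for the Lustre scratch area:

```python
# Sketch: replace shared-filesystem I/O with S3 object get/put (boto3).
import boto3

s3 = boto3.client("s3")
BUCKET = "pipeline-scratch"   # hypothetical bucket standing in for /lustre

def read_input(key: str) -> bytes:
    """Was: open('/lustre/scratch/' + key, 'rb').read()"""
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

def write_output(key: str, data: bytes) -> None:
    """Was: open('/lustre/scratch/' + key, 'wb').write(data)"""
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)
```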
The real problem: the Internet. Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link): Cambridge -> EC2 East coast: 12 Mbytes/s (96 Mbits/s).
Cambridge -> EC2 Dublin: 25 Mbytes/s (200 Mbits/s).
11 hours to move 1 TB to Dublin.
23 hours to move 1 TB to the East coast. What speed should we get? Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible. Do you have fast enough disks at each end to keep the network full?
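Those transfer times follow directly from the measured rates:

```python
# Transfer time per terabyte at the measured rates quoted above.
TERABYTE = 1e12   # bytes

for dest, rate_mb_s in [("EC2 Dublin", 25), ("EC2 East coast", 12)]:
    hours = TERABYTE / (rate_mb_s * 1e6) / 3600
    print(f"{dest}: {hours:.0f} hours per TB")   # ~11 h and ~23 h
```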
Networking. How do we improve data transfers across the public internet? The CERN approach: don't.
CERN runs a dedicated 10 Gbit network between itself and the T1 centres. Can that work for cloud? Buying dedicated bandwidth to a provider ties you in.
Should they pay? What happens when you want to move?
Summary. Moving existing HPC applications is painful.
Small-data / high-CPU applications work really well.
Large-data applications, less well.
Are you allowed to put data on the cloud? The default policy:
"Our data is confidential/important/critical to our business.
We must keep our data on our computers."
"Apart from when we outsource it already."
Reasons to be optimistic: most (all?) data security issues can be dealt with, but the devil is in the details.
Data can be put on the cloud if care is taken; it is probably more secure there than in your own data centre. Can you match AWS data availability guarantees? Are cloud providers any different from other organisations you already outsource to?
Outstanding issues. Audit and compliance: if you need IP agreements above your provider's standard T&Cs, how do you push them through? Geographical boundaries mean little in the cloud: data can be replicated across national boundaries without the end user being aware. Moving personally identifiable data outside of the EU is potentially problematic (it can be problematic within the EU too; privacy laws are not as harmonised as you might think).
More sequencing experiments are trying to link with phenotype data (i.e. personally identifiable medical records).
Private cloud to the rescue? Can we do something different?
Traditional collaboration: a DCC (sequencing centre + archive), with each contributing sequencing centre running its own IT.
Dark archives. Storing data in an archive is not particularly useful; you need to be able to access the data and do something useful with it. Data in current archives is "dark": you can put/get data, but you cannot compute across it.
Is data in an inaccessible archive really useful?
Private cloud collaborations: sequencing centres share private clouds that provide IaaS / SaaS.
Private cloud advantages: small organisations leverage the expertise of big IT organisations.
Academia tends to be linked by fast research networks, so moving data is easier (or move the compute to the data via VMs). The consortium will be signed up to data-access agreements, which simplifies data governance. Problems: a big change in the funding model.
Are big centres set up to provide private cloud services? Selling services is hard if you are a charity. Can we do it as well as the big internet companies?
Summary. Cloud is a useful tool, but it will not replace our local IT infrastructure. Porting existing applications can be hard; do not underestimate the time / people required. You still need IT staff; they just end up doing different things.