Storage and Analysis of Sensitive Large-Scale
Biomedical Data in Sweden
Ola Spjuth
SNIC, UPPMAX and Science for Life Laboratory
Uppsala University, Sweden
ola.spjuth@farmbio.uu.se
Ola Spjuth
• Associate Professor in
Pharmaceutical Bioinformatics
• Guest Researcher
• Co-Director
• Manager of Bioinformatics
Compute and Storage facility
2003: First sequenced human genome -
13 years for $3 billion
2015: Human whole genome sequenced in 3 days for ~$1150
…requires supercomputers
for analysis and storage
Massively parallel sequencing…
2010: Science for Life Laboratory inaugurated
An internationally leading center
that develops and applies
large-scale technologies for
molecular biosciences with a focus
on health and environment.
National platform since 2013
Stockholm node
Uppsala node
Data generation and delivery
1. Sample transfer
2. Data delivery
3. Analysis
Scientists
www.uppmax.uu.se/uppnex
High-performance computers and
large-scale storage for
bioinformatics analysis.
Sequence production 2014:
• Generated > 120 Tbp of sequence data
• 13.7 Gbp/hour, 3.8 Mbp/sec (on average)
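As a rough check, the averages above follow from spreading the 2014 total evenly over a calendar year. The sketch below assumes exactly 120 Tbp (the slide says more than 120) and a 365-day year; the function name is illustrative.

```python
# Average sequencing throughput, assuming the yearly total is spread
# evenly over a 365-day year (an assumption, not stated on the slide).
HOURS_PER_YEAR = 365 * 24
SECONDS_PER_HOUR = 3600


def average_rates(total_bp: float) -> tuple[float, float]:
    """Return (base pairs per hour, base pairs per second)."""
    per_hour = total_bp / HOURS_PER_YEAR
    per_second = per_hour / SECONDS_PER_HOUR
    return per_hour, per_second


per_hour, per_second = average_rates(120e12)  # 120 Tbp generated in 2014
print(f"{per_hour / 1e9:.1f} Gbp/hour, {per_second / 1e6:.1f} Mbp/sec")
# prints "13.7 Gbp/hour, 3.8 Mbp/sec"
```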
Hardware resources
milou: HP cluster of 208 nodes
pica: 6 (7) PB Hitachi storage
halvan: 2 TB high-memory computer
Fast network via SUNET
Backup via SNIC
Long-term storage at SweStore
nestor: 48-node production cluster
meles: 547 TB Hitachi storage
mosler: 24 nodes, 223 TB
Smog: 100 nodes, ~300 TB
2015: 250 nodes
2016: 200 new nodes
+1 PB
+2 PB
A national e-Infrastructure for NGS
Software +
reference data
Support
Education
Compute resources
Storage resources
Efficiency +
automation
What we sequenced at NGI /
Chipster workbench on UPPMAX
UpCloud – smog - (OpenStack)
Managing Virtual Machine Images
• Open catalogue of virtual machine images (VMIs)
• Hosted at Uppsala University
www.bioimg.org
M. Dahlö, F. Haziza, A. Kallio, E. Korpelainen, E. Bongcam-Rudloff,
and O. Spjuth. BioImg.org: A catalogue of virtual machine images for
the life sciences. Accepted in Bioinformatics and Biology Insights.
Mosler overview
• e-Infrastructure for
working with sensitive
data
• Copy of Norwegian
solution (TSD)
• Designed to look like
UPPMAX clusters
Mosler specifications
• High-performance computing in a virtualized
environment (OpenStack)
• 2-factor authentication
• Restricted data transfer in/out
• Only accessible over remote desktop (ThinLinc) via
Mosler dashboard
• Aim: Compliant with all laws and regulations for
analyzing sensitive data in Sweden
Data hosting use case
Actors: Consortia, DBA, Consortium member
Systems: MyResearch; virtual environment (storage, compute) on Mosler
Flows: Data hosting; Data syncing; Access, analysis
Data extraction use case
Actors: Manager, DBA, Scientist
Systems: LifeGene; virtual environment (storage, compute) on Mosler
1. Request for data
2. Approval
3. Data extraction
4. Data transfer
5. Access, analysis
Nov 2014
20M € total grant
4M € IT-infrastructure
X-Ten System
• First system able to deliver the $1000 genome
• Each run: 1.2 TB of data
• 16 human genomes (30X)
• 3 days per run
• Population-scale genomics
• 15K genomes per year
Swedish Genome
Initiative
Call for a reference variation database (1000 genomes)
and for whole human genomes (half price).
Goal: 5,000 genomes in 2015, 10,000 genomes in 2016
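For a sense of the storage these goals imply, the sketch below divides the per-run data volume by the genomes per run and scales the result. It assumes the 1.2 TB and the 16 genomes refer to the same run, and it counts raw run output only (no analysis results, backups, or compression); all names are illustrative.

```python
# Rough raw-storage estimate from the X-Ten figures on this slide.
# Assumption: one run yields 1.2 TB and 16 human genomes at 30X.
TB_PER_RUN = 1.2
GENOMES_PER_RUN = 16


def raw_storage_tb(genomes: int) -> float:
    """Raw sequencing output in TB for a given number of genomes."""
    return genomes * TB_PER_RUN / GENOMES_PER_RUN


for label, genomes in [("Goal 2015", 5_000), ("Goal 2016", 10_000)]:
    print(f"{label}: {genomes} genomes, about {raw_storage_tb(genomes):.0f} TB raw")
```

Under these assumptions each 30X genome accounts for about 75 GB of raw data, so the two goals correspond to roughly 375 TB and 750 TB.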
[Figure: NGI-Stockholm production (Jan-12 to Dec-15). Y-axis: gigabases; x-axis: production date (Aug-11 to Jan-16). Series: data production; conservative prediction (60% of maximum production).]
Whole Genome Sequencing
• Data on a new scale, 80% expected to be sensitive
  → New challenges
• Funding for IT-infrastructure from KAW foundation
– Resources for data production (2 M EUR)
– Resources for scientists (2 M EUR)
• A national security project funded by the Swedish Research
Council (5 M EUR over 4 years): SNIC-Sens
SNIC-Sens
• 4-year project, started Jan 2015
• Project owner: SNIC (Ann-Charlotte Sonnhammer)
• Project leader: Ola Spjuth (until end of this week)
• Aims:
– Specifications for analyzing sensitive data in SNIC
(hardware, legal, contracts, processes etc.)
– Evaluation of the use of public cloud providers (Google,
Amazon)
– Make available e-Infrastructure for production and
research of data generated at NGI, blueprint for other
domains
SNIC-Sens roadmap
• Information classification workshop (21/5)
• Risk/vulnerability analysis (2/6)
• Specifications for hardware procurement
• Public tender (end of this week)
• Installation and testing of production system (Aug-Sept)
• Installation, configuration and testing of research
system (Q3-Q4)
• Research system online (Q1 2016)
Two pilots for clinical data management
CML, Lucia Cavelier
MDR, Åsa Melhus

Editor's Notes

  • #7: Strategic funding to enable: infrastructure for high-throughput analysis; a multi-disciplinary research environment; competence in technology and analysis methodology.
  • #11: Picture of milou: 2 M compute hours. Picture of pica: 7 PB storage. Picture of halvan: high-memory machine. Backup (via SNIC, currently to Linköping). Network (SUNET).
  • #12: Access to computers (many if you need). Access to storage (a lot if you need). Pre-installed software and reference genomes. Free.
  • #21: X-Ten system, some “scary” numbers and some screenshots of what we plan in the following months (the picture shows the first three HiSeq X in our basement).