Big Data Genomics:
Clustering Billions of DNA
Sequences with Apache Spark
Zhong Wang, Ph.D.
Group Lead, Genome Analysis
05/23/2019
DOE JGI: A brief history
1999-2007
2008-now: JGI as the DOE sequencing center dedicated to plants and microbes.
Our Mission
DOE JGI serves as a genomic user facility in support of the DOE missions: bioenergy, carbon cycling, & biogeochemistry
• Walnut Creek 1999-2019
• Berkeley, CA
• 250 employees
• $70M annual budget
Our sequencer lineups
Short-read technologies (Illumina): MiSeq, NextSeq 500, HiSeq 2500, NovaSeq 6000
Long-read technologies: PacBio RSII, PacBio Sequel, Oxford Nanopore MinION and PromethION
200Tb sequencing data in FY18
Genomics big data is not typical big data
• Unstructured
• Volume, variety, and veracity increase during analytics
Metagenome is the genome of a microbial community
A 10-second "intimate kiss" = 80 million bacteria
Metagenomics questions: Who is there? What do they do? How do they interact?
Microbial communities are “dark matter”
Number of species:
• Cow: ~6000
• Human: ~1000
• Soil: >100000
>90% of the species haven’t been seen before
Metagenome sequencing and assembly
Microbial community → Harvest microbes → Extract DNA (metagenome DNA) → Shear & sequence → Short reads → Assembly → Reconstructed genomes
The metagenome assembly problem
Library of books → Shredded library → “Reconstructed” library
Genome ~= Book, Metagenome ~= Library
Sequencing ~= sampling the pieces and reading them
Scale is an enemy
[Figure: data size in gigabases (Gb), on a log scale from 1 to 1,000,000, for Typical, Human, Cow, Ocean, and Soil]
Complexity is another…
• Remove contaminants, sequencing errors
• Overlap graph
• de Bruijn graph
• Contigs or clusters
• Repetitive elements
• Homologous genes
• Horizontally transferred genes
The ideal solution and the failed ones
The ideal solution:
✓ Easy to develop
✓ Robust
✓ Scales to big data
✓ Efficient
BigMem
• Easy to develop
• Expensive
• Does not scale
MPI
• Fast
• Hard to develop
• Not robust
Hadoop
• Easy to develop
• Scales
• Slow
Addressing big data: Apache Spark
• New scalable programming paradigm
• Compatible with Hadoop-supported storage systems
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • Interactive shell
✓ Scales to big data
✓ Efficient
✓ Easy to develop
✓ Robust
Goal: Metagenome read clustering
Read clustering can reduce the metagenome problem to a single-genome problem
• Parallel processing
• Individualized optimization
Reads → Read clusters
Algorithm
Steps: Kmer-mapping reads → Graph Construction and Edge Reduction → Label Propagation Algorithm
Read graph containing all reads
Node: Read
Edge: number of kmers two reads share
A kmer is to a read what a word is to a sentence
Graph Partitioning: LPA
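The first two steps can be sketched in plain Python (not Spark). This is a minimal illustration under stated assumptions: k-mers are matched exactly on one strand only, and edge reduction is a simple threshold on the number of shared k-mers. The function names, `k=4`, and `min_shared` are hypothetical and do not come from the talk.

```python
from collections import defaultdict
from itertools import combinations


def kmers(read, k):
    """Return the set of distinct k-mers in one read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def kmer_to_reads(reads, k):
    """Kmer-mapping reads: map each k-mer to the IDs of reads containing it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for kmer in kmers(seq, k):
            index[kmer].add(read_id)
    return index


def build_edges(index, min_shared=1):
    """Graph construction: edge weight = number of k-mers two reads share.

    Edge reduction is modeled as a weight threshold (an assumption).
    """
    weights = defaultdict(int)
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_shared}


# Toy reads: 0 and 1 overlap, 2 and 3 overlap, the two pairs share nothing.
reads = ["ACGTACGT", "GTACGTTT", "CCCCGGGG", "CCGGGGAA"]
edges = build_edges(kmer_to_reads(reads, k=4), min_shared=2)
print(edges)
```

On real data the pairwise step is the expensive one, because a k-mer shared by many reads generates a quadratic number of candidate pairs; that is one reason an edge-reduction step is needed at all.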
Clustering performance on long reads
Read length = 500-20,000
Short reads? Not so much
Read length = 150
Can long reads come to the rescue?
Hybrid clustering
A tradeoff between cost and performance
[Figure: mean cluster size (K), #reads (M), and #clusters versus percent of long reads used (0% to 100%)]
Short-read only: there is still a way out
More samples, better results: one vs 50
More data, better results: clustering success depends on coverage
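As a rough guide to why more data helps, the expected coverage of a single genome in a metagenome can be estimated from the total bases sequenced, the genome's share of the sampled DNA, and its size. This is a standard back-of-the-envelope estimate rather than a formula from the talk, and all numbers below are illustrative.

```python
def genome_coverage(total_bases, abundance, genome_size):
    """Expected coverage of one genome in a metagenome.

    total_bases: bases sequenced across the whole sample
    abundance:   fraction of the sampled DNA that comes from this genome
    genome_size: length of the genome in bases
    """
    return total_bases * abundance / genome_size


# Hypothetical sample: 10 Gb of sequence, two genomes of 5 Mb each.
total = 10_000_000_000
common = genome_coverage(total, abundance=0.10, genome_size=5_000_000)
rare = genome_coverage(total, abundance=0.001, genome_size=5_000_000)
print(f"common genome: ~{common:.0f}x, rare genome: ~{rare:.0f}x")
```

A rare species gets few overlapping reads, so its reads share few k-mers with one another and are hard to cluster; sequencing deeper or pooling samples raises that coverage.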
Can we scale to big data?
Hardware and software environments
                                  Customized   EMR         Bridge
nodes                             20           20          8
cores per node (total)            8 (160)      8 (160)     28 (224)
memory, GB per node (total)       64 (1280)    61 (1220)   128 (1024)
Hadoop                            2.7.3        2.7.3       2.7.2
Spark                             2.1.1        2.2.0       2.1.0
A quick reminder…
Steps: Kmer-mapping reads (KMR) → Graph Construction and Edge Reduction (Edges) → Label Propagation Algorithm (LPA)
Read graph containing all reads
Node: Read
Edge: number of kmers two reads share
A kmer is to a read what a word is to a sentence
Graph Partitioning: LPA
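The LPA stage can be sketched in plain Python on a small weighted read graph. This is a generic label propagation variant, assuming in-place (asynchronous) updates, a fixed node order, and ties broken toward the smallest label so the result is deterministic; it is not necessarily the variant used in the talk, and the toy graph is made up.

```python
from collections import defaultdict


def label_propagation(edges, max_iter=20):
    """Cluster nodes of a weighted undirected graph by label propagation.

    edges maps (node_a, node_b) -> weight, e.g. the number of shared k-mers.
    """
    neighbors = defaultdict(dict)
    for (a, b), w in edges.items():
        neighbors[a][b] = w
        neighbors[b][a] = w

    labels = {node: node for node in neighbors}  # every read starts alone
    for _ in range(max_iter):
        changed = False
        for node in sorted(neighbors):
            # Each neighbor votes for its current label, weighted by the edge.
            votes = defaultdict(int)
            for other, w in neighbors[node].items():
                votes[labels[other]] += w
            # Heaviest label wins; ties go to the smallest label.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Two tightly connected groups of reads joined by one weak edge.
toy_edges = {
    (0, 1): 3, (0, 2): 3, (1, 2): 3,
    (3, 4): 3, (3, 5): 3, (4, 5): 3,
    (2, 3): 1,
}
labels = label_propagation(toy_edges)
print(labels)
```

Reads that end up with the same label form one cluster. The weak bridge between the two groups is outvoted by the heavier edges inside each group, which is the behavior read clustering relies on.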
Scale to bigger data volume on a 20-node cluster
[Figure: execution time (mins) versus data size (20 to 100 GB) for KMR, Edges, LPA, and Total]
Increasing nodes on a 50G-dataset
[Figure: execution time (mins) versus number of nodes (25 to 100) on the 50G dataset for KMR, Edges, LPA, and Total]
Fine tune parallelism
[Figure: execution time (mins) versus Spark default parallelism (log10) for the 50G and 20G datasets]
Dataset complexity vs performance
[Figure: execution time (mins) by stage (KMR, Edges, LPA) for Human Iso-Seq Alzheimer (PacBio) and Cow Rumen (Illumina); bar labels 146.33 and 44.5]
Platform comparison: Clouds and HPC
                                  Customized   EMR         Bridge
nodes                             20           20          8
cores per node (total)            8 (160)      8 (160)     28 (224)
memory, GB per node (total)       64 (1280)    61 (1220)   128 (1024)
Time (min)                        106          105         126
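One hedged way to read this comparison is to normalize run time by total core count. Core-minutes is a crude measure chosen here for illustration (it ignores memory, storage, and network differences) and is not a metric from the talk; the node counts, cores per node, and times are the ones listed above.

```python
# Values copied from the platform comparison; dictionary layout is illustrative.
platforms = {
    "Customized": {"nodes": 20, "cores_per_node": 8, "minutes": 106},
    "EMR": {"nodes": 20, "cores_per_node": 8, "minutes": 105},
    "Bridge": {"nodes": 8, "cores_per_node": 28, "minutes": 126},
}


def core_minutes(p):
    """Total cores multiplied by wall-clock minutes."""
    return p["nodes"] * p["cores_per_node"] * p["minutes"]


usage = {name: core_minutes(p) for name, p in platforms.items()}
for name, value in usage.items():
    print(f"{name}: {value} core-minutes")
```

By this measure the two 160-core clusters use a similar amount of compute, while the 224-core HPC run uses more cores and still takes longer.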
Now we have a big hammer…
Clustering for identifying genome contaminants
• Russula: 70Mb
• Bradyrhizobium: 7.2Mb
• Collimonas: 5.3Mb
Targeting big metagenome projects
Dr. Morgan-Kiss @ Miami University
Dr. Slonczewski @ Kenyon University
Two lakes, 1.2Tbp
Acknowledgements
Spark Team
Lizhen Shi @ FSU
Xiandong Meng
Kexue Li, Lili Wang, and Li Deng @ Shanghai U
Kurt Labutti
Elizabeth Tseng @ PacBio
Lisa Gerhardt, Evan Racah @ NERSC
Yong Qin, Gary Jung, Greg Kurtzer, Bernard Li @ HPC
Philip Blood, Bryon Gill @ PSC
