Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Zhong Wang, Ph.D.
Group Lead, Genome Analysis
05/23/2019

DOE JGI: A brief history
1999-2007
2008-now: JGI as the DOE sequencing center dedicated to plants and microbes.

Our Mission
DOE JGI, serving as a genomic user facility in support of the DOE missions: bioenergy, carbon cycling, & biogeochemistry
• Walnut Creek 1999-2019
• Berkeley, CA
• 250 employees
• $70M annual budget

Our sequencer lineups
Short-read technologies (Illumina):
• MiSeq
• NextSeq 500
• HiSeq 2500
• NovaSeq 6000
Long-read technologies:
• PacBio RSII
• PacBio Sequel
• Oxford Nanopore MinION, PromethION
200Tb sequencing data in FY18

Genomics big data is not typical big data
• Unstructured
• Volume, variety
• Veracity increases during analytics

A metagenome is the genome of a microbial community
A 10-second "intimate kiss" = 80 million bacteria
Metagenomics questions: Who is there? What do they do? How do they interact?

Microbial communities are “dark matter”
Number of species:
• Cow: ~6,000
• Human: ~1,000
• Soil: >100,000
>90% of the species haven’t been seen before

Metagenome sequencing and assembly
Steps: Harvest microbes, Extract DNA, Shear & Sequencing, Assembly
Products: Microbial Community, Metagenome DNA, Short Reads, Reconstructed genomes

The metagenome assembly problem
[Figure: Library of Books, Shredded Library, “Reconstructed” Library]
Genome ~= Book
Metagenome ~= Library
Sequencing ~= sampling the pieces and reading them

Scale is an enemy
[Chart: dataset size in Gigabases (Gb), axis from 1 to 1,000,000, for Typical, Human, Cow, Ocean, and Soil]

Complexity is another…
• Remove contaminants, sequencing errors
• Overlap graph
• de Bruijn graph
• Contigs or clusters
• Repetitive elements
• Homologous genes
• Horizontally transferred genes

The ideal solution and the failed ones
Ideal solution
✓ Easy to develop
✓ Robust
✓ Scales to big data
✓ Efficient
BigMem
• Easy to develop
• Expensive
• Does not scale
MPI
• Fast
• Hard to develop
• Not robust
Hadoop
• Easy to develop
• Scales
• Slow

Addressing big data: Apache Spark
• New scalable programming paradigm
• Compatible with Hadoop-supported storage systems
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • Interactive shell
✓ Scales to big data
✓ Efficient
✓ Easy to develop
✓ Robust

Goal: Metagenome read clustering
Read clustering can reduce the metagenome problem to a set of single-genome problems
• Parallel Processing
• Individualized optimization
[Figure: Reads grouped into Read clusters]

Algorithm
Read graph containing all reads
• Node: Read
• Edge: number of kmers two reads share
• A kmer is to a read what a word is to a sentence
Steps:
1. Kmer-mapping reads
2. Graph Construction and Edge Reduction
3. Graph Partitioning: Label Propagation Algorithm (LPA)
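The three steps on this slide (kmer-mapping reads, graph construction with edge reduction, label propagation) can be sketched on a single machine in plain Python. This is only an illustration under several assumptions, not the talk's Spark implementation: the function names are hypothetical, edge reduction is simplified to a minimum-shared-kmer threshold, reverse complements are ignored, and ties in label propagation go to the smallest label.

```python
from collections import Counter, defaultdict
from itertools import combinations
import random


def kmers(seq, k):
    """Return the set of distinct k-mers in one read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_to_reads(reads, k):
    """Step 1: map each k-mer to the ids of the reads that contain it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for km in kmers(seq, k):
            index[km].add(read_id)
    return index


def build_edges(index, min_shared):
    """Step 2: edge weight = number of k-mers two reads share.

    Edge reduction is a plain threshold here (an assumption of this sketch).
    """
    weights = Counter()
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_shared}


def label_propagation(n_reads, edges, max_iter=20, seed=0):
    """Step 3: each read adopts the label carrying the largest total edge
    weight among its neighbours; ties go to the smallest label."""
    neighbours = defaultdict(list)
    for (a, b), w in edges.items():
        neighbours[a].append((b, w))
        neighbours[b].append((a, w))
    labels = list(range(n_reads))  # every read starts in its own cluster
    order = list(range(n_reads))
    rng = random.Random(seed)
    for _ in range(max_iter):
        rng.shuffle(order)
        changed = False
        for node in order:
            if not neighbours[node]:
                continue  # isolated reads keep their own label
            votes = Counter()
            for other, w in neighbours[node]:
                votes[labels[other]] += w
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Toy input: three overlapping reads from each of two made-up "genomes"
# that share no k-mers with each other.
genome_a = "AACACCAAACCCACAC"
genome_b = "GGTGTTGGGTTTGTGT"
reads = [g[i:i + 10] for g in (genome_a, genome_b) for i in (0, 3, 6)]

index = kmer_to_reads(reads, k=4)
edges = build_edges(index, min_shared=2)
labels = label_propagation(len(reads), edges)
print(labels)  # reads 0-2 share one label, reads 3-5 share another
```

At real scale the kmer index and the pairwise edge counts are the expensive parts, which is presumably why the slides treat them as separate distributed stages.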

Clustering performance on long reads
Read length = 500-20,000

Short reads? Not so much
Read length = 150

Can long reads come to the rescue?

A tradeoff between cost and performance
[Chart: mean cluster size (K), #reads (M), and #clusters vs. percent of long reads used (0% to 100%)]

Short-read only: there is still a way out

More samples, better results: one vs 50

More data, better results:
clustering success is dependent on coverage
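To make the coverage dependence concrete, coverage can be read as the average number of times each base of a genome is sequenced: total sequenced bases divided by genome size. The sketch below uses that standard definition; the function names, the abundance model (a genome receives a share of the bases equal to its fraction of the community DNA), and the example numbers are assumptions for illustration, not figures from the talk.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def member_coverage(n_reads, read_length, dna_fraction, genome_size):
    """Average depth for one community member, assuming it receives a share
    of the sequenced bases equal to its fraction of the community DNA."""
    return mean_coverage(n_reads, read_length, genome_size) * dna_fraction


# Hypothetical example: 1 million 150 bp reads, one 5 Mb genome.
single = mean_coverage(1_000_000, 150, 5_000_000)
# Same sequencing effort, but the genome is 1% of the community DNA.
rare = member_coverage(1_000_000, 150, 0.01, 5_000_000)
# Fifty times more data for the same rare member.
rare_50x = member_coverage(50_000_000, 150, 0.01, 5_000_000)
print(single, rare, rare_50x)
```

Under these assumptions a rare community member only reaches useful depth when the total data volume grows, which is consistent with the slide's "more data, better results".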

Hardware and software environments
         Customized   EMR          Bridge
nodes    20           20           8
cores    8 (160)      8 (160)      28 (224)
memory   64 (1280)    61 (1220)    128 (1024)
Hadoop   2.7.3        2.7.3        2.7.2
Spark    2.1.1        2.2.0        2.1.0
Cores and memory are listed per node, with cluster totals in parentheses.

A quick reminder…
Read graph containing all reads
• Node: Read
• Edge: number of kmers two reads share
• A kmer is to a read what a word is to a sentence
Steps:
1. Kmer-mapping reads (KMR)
2. Graph Construction and Edge Reduction (Edges)
3. Graph Partitioning: Label Propagation Algorithm (LPA)

Scale to bigger data volume on a 20-node cluster
[Chart: execution time (mins) vs. data size (GB), 20 to 100 GB, for KMR, Edges, LPA, and Total]

Increasing nodes on a 50G dataset
[Chart: execution time (mins) vs. number of nodes (25 to 100) on the 50G dataset, for KMR, Edges, LPA, and Total]

Fine tune parallelism
[Chart: execution time (mins) vs. Spark default parallelism (log10), for the 50G and 20G datasets]
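The "Spark default parallelism" axis presumably refers to Spark's `spark.default.parallelism` property, which sets the default number of partitions for RDD shuffles such as `reduceByKey` and `join`. A tuning sweep like the one charted could be scripted roughly as follows; the script name `cluster_reads.py` and the parallelism values are hypothetical placeholders, and only `spark-submit --conf key=value` is taken from Spark itself.

```python
def spark_submit_args(app, parallelism):
    """Build a spark-submit command line that overrides the default
    number of partitions for one run."""
    return [
        "spark-submit",
        "--conf", f"spark.default.parallelism={parallelism}",
        app,
    ]


# Hypothetical sweep: one command per candidate parallelism value.
candidates = [100, 1000, 10000]
commands = [spark_submit_args("cluster_reads.py", p) for p in candidates]
for cmd in commands:
    print(" ".join(cmd))
```

Each command would be run and timed separately, for example with `subprocess.run(cmd, check=True)`, to find the setting that minimizes execution time for a given dataset size.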

Dataset complexity vs performance
[Chart: execution time (mins) by stage (KMR, Edges, LPA) for Human Iso-Seq Alzheimer (PacBio) and Cow Rumen (Illumina); data labels 146.33 and 44.5]

Platform comparison: Clouds and HPC
             Customized   EMR          Bridge
nodes        20           20           8
cores        8 (160)      8 (160)      28 (224)
memory       64 (1280)    61 (1220)    128 (1024)
Time (min)   106          105          126

Clustering for identifying genome contaminants
Russula: 70Mb
Bradyrhizobium: 7.2Mb
Collimonas: 5.3Mb

Targeting big metagenome projects
Dr. Morgan-Kiss @ Miami University
Dr. Slonczewski @ Kenyon College
Two lakes, 1.2Tbp

Acknowledgements
Spark Team
Lizhen Shi @FSU
Xiandong Meng
Kexue Li, Lili Wang and Li Deng @Shanghai U
Kurt Labutti
Elizabeth Tseng @PacBio
Lisa Gerhardt, Evan Racah @NERSC
Yong Qin, Gary Jung, Greg Kurtzer, Bernard Li @HPC
Philip Blood, Bryon Gill @PSC
