Rethinking data-intensive science using scalable
analytics systems
Frank Austin Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri
Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher,
Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson

AMPLab, University of California, Berkeley, Cloudera, San Francisco, CA, Carl
Icahn School of Medicine, Mount Sinai, New York, NY, Genomebridge,
Cambridge, MA
Abstract
• In this paper, we describe ADAM, an example genomics
pipeline that leverages the open-source Apache Spark
and Parquet systems to achieve a 28x speedup over
current genomics pipelines, while reducing cost by 63%.
From building this system, we were able to distill a set of
techniques for implementing scientific analyses efficiently
using commodity “big data” systems.
Background
Source: NIH National Human Genome Research Institute
Characteristics of science analysis systems
Layering
• Physical Storage coordinates
data writes to physical media.

• Data Distribution manages
access, replication, and
distribution of the files that
have been written to storage
media.

• Materialized Data specifies
how data is encoded and
stored. This layer
determines I/O bandwidth and
compression.
• Data Schema specifies the
representation of data, and forms
the narrow waist of the stack that
separates access from execution.

• Evidence Access provides
primitives for processing data,
and enables the transformation
of data into different views and
traversals.

• Presentation enhances the data
schema with convenience
methods for performing common
tasks and accessing common
derived fields from a single
element.
• Applications use the evidence
access and presentation layers
to compose algorithms for
performing an analysis.
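To make the separation between layers concrete, here is a minimal Python sketch. It is not ADAM's schema or API: the `Read` record, its fields, `ReadView`, `filter_reads`, and `count_overlapping` are hypothetical names invented for this illustration, and the `end` calculation assumes an ungapped alignment.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


# Data schema layer (hypothetical): a plain record with no behaviour.
@dataclass(frozen=True)
class Read:
    contig: str
    start: int
    sequence: str
    mapped: bool


# Presentation layer (hypothetical): convenience methods and derived
# fields computed from a single element, without changing the schema.
class ReadView:
    def __init__(self, read: Read) -> None:
        self.read = read

    @property
    def end(self) -> int:
        # Derived field: exclusive end coordinate (assumes no gaps).
        return self.read.start + len(self.read.sequence)

    def overlaps(self, start: int, end: int) -> bool:
        return self.read.start < end and start < self.end


# Evidence access layer (hypothetical): a primitive that turns a
# collection of records into a different view of the same data.
def filter_reads(reads: Iterable[Read],
                 keep: Callable[[Read], bool]) -> Iterator[Read]:
    return (r for r in reads if keep(r))


# Application layer: composes the two layers above into an analysis.
def count_overlapping(reads: Iterable[Read], contig: str,
                      start: int, end: int) -> int:
    on_contig = filter_reads(reads, lambda r: r.mapped and r.contig == contig)
    return sum(1 for r in on_contig if ReadView(r).overlaps(start, end))


reads: List[Read] = [
    Read("chr1", 100, "ACGTACGT", True),
    Read("chr1", 200, "ACGT", True),
    Read("chr2", 100, "ACGT", True),
    Read("chr1", 104, "ACGT", False),
]
print(count_overlapping(reads, "chr1", 105, 150))
```

Because the application only touches `Read` through the two layers above it, the storage and encoding layers underneath could change without the analysis code changing, which is the point of treating the schema as a narrow waist.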
Case studies
Parquet
• Open-source software created by Twitter and Cloudera,
based on Google Dremel
• Columnar file format
• Limits I/O to only the data that is needed
• Compresses very well: ADAM files are 5-25% smaller than
BAM files, without loss of data
• 3 layers of parallelism: file/row group, column chunk,
page
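The two properties of a columnar layout claimed above (reading only the needed fields, and compressing well) can be illustrated with a small pure-Python sketch. This is a toy model, not the Parquet API or file format: the record fields and every function name are assumptions made for the example, and run-length encoding stands in for whichever encodings a real file would use.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def to_columns(rows: Sequence[Row]) -> Dict[str, List[object]]:
    """Pivot row-oriented records into one list per field."""
    columns: Dict[str, List[object]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_read_row_store(rows: Sequence[Row]) -> int:
    """A row layout scans every field of every row, wanted or not."""
    return sum(len(row) for row in rows)


def values_read_column_store(columns: Dict[str, List[object]],
                             wanted: Sequence[str]) -> int:
    """A column layout only touches the projected columns."""
    return sum(len(columns[name]) for name in wanted)


def run_length_encode(values: Sequence[object]) -> List[Tuple[object, int]]:
    """Collapse runs of equal neighbouring values into (value, count)."""
    runs: List[Tuple[object, int]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


# Hypothetical read records; the field names are illustrative only.
rows: List[Row] = [
    {"contig": "chr1", "start": 100, "sequence": "ACGT", "mapq": 60},
    {"contig": "chr1", "start": 140, "sequence": "TTGA", "mapq": 60},
    {"contig": "chr1", "start": 190, "sequence": "GGCA", "mapq": 37},
]
columns = to_columns(rows)

print(values_read_row_store(rows))                    # 12 values scanned
print(values_read_column_store(columns, ["start"]))   # 3 values scanned
print(run_length_encode(columns["contig"]))           # [('chr1', 3)]
```

Storing similar values next to each other is what makes the `contig` column collapse to a single run here; a row layout would interleave it with unrelated fields.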
Parquet/Spark integration
• 1 row group in Parquet maps
to 1 partition in Spark

• We interact with Parquet via
input/output formats

• Spark builds and executes a
computation directed acyclic
graph (DAG), and manages data
locality, errors, and retries
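A rough sketch of the row-group-to-partition idea, in plain Python rather than Spark: each row group becomes one independently processed partition, and a failed partition task is simply retried. The function names and the retry policy are assumptions for illustration; they are not Spark or Parquet calls, and data locality and DAG scheduling are not modelled.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def split_into_row_groups(rows: Sequence[T],
                          rows_per_group: int) -> List[List[T]]:
    """Chunk rows into fixed-size groups; each group acts as a partition."""
    return [list(rows[i:i + rows_per_group])
            for i in range(0, len(rows), rows_per_group)]


def run_with_retries(task: Callable[[], U], max_attempts: int = 3) -> U:
    """Re-run a failing task up to max_attempts times (hypothetical policy)."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as error:
            last_error = error
    raise RuntimeError("task failed after retries") from last_error


def map_partitions(partitions: Sequence[List[T]],
                   fn: Callable[[List[T]], U]) -> List[U]:
    """Apply fn to every partition independently, one task per partition."""
    return [run_with_retries(lambda p=p: fn(p)) for p in partitions]


partitions = split_into_row_groups(list(range(10)), rows_per_group=4)
partial_sums = map_partitions(partitions, sum)
print(len(partitions), partial_sums, sum(partial_sums))
```

Because no partition depends on another, the per-partition tasks could run on different machines, and only the small partial results need to be combined at the end.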
Performance
Genomics Workloads
• We evaluated ADAM against the GATK [14], SAMtools
[32], Picard [51], and Sambamba [50]. We measured the
performance of base quality score recalibration (BQSR),
INDEL realignment (IR), duplicate marking (DM), sort,
and Flagstat (FS).
• This data is shown in Table 2. Although ADAM is more
expensive than the best legacy tool (Sambamba [50]) for
sorting and duplicate marking, ADAM is less expensive
for all other stages. In total, using ADAM reduces the
end-to-end analysis cost by 63% over a pipeline
constructed solely out of legacy tools.
• Table 3 describes the instance types.
• We achieve near-linear speedup across 128 nodes
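"Near-linear speedup" can be read as a strong-scaling efficiency close to 1. The sketch below shows one common way to compute it; the timings are hypothetical placeholders, not measurements from the paper, and the function names are assumptions for the example.

```python
def speedup(baseline_seconds: float, scaled_seconds: float) -> float:
    """How many times faster the scaled run is than the baseline run."""
    return baseline_seconds / scaled_seconds


def scaling_efficiency(baseline_seconds: float, scaled_seconds: float,
                       baseline_nodes: int, scaled_nodes: int) -> float:
    """Observed speedup divided by the ideal (linear) speedup.

    A value of 1.0 means perfectly linear strong scaling: the problem
    size is fixed and runtime falls in proportion to the node count.
    """
    ideal = scaled_nodes / baseline_nodes
    return speedup(baseline_seconds, scaled_seconds) / ideal


# Hypothetical timings for the same input on 32 and on 128 nodes.
t_32_nodes = 4000.0
t_128_nodes = 1100.0

print(round(speedup(t_32_nodes, t_128_nodes), 2))
print(round(scaling_efficiency(t_32_nodes, t_128_nodes, 32, 128), 2))
```

With these placeholder numbers, a fourfold increase in nodes gives roughly a 3.6x speedup, or about 91% efficiency.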
Conclusion
• By rethinking the architecture of scientific data
management systems, we have been able to achieve
parity on single-node systems, while providing linear
strong scaling out to 128 nodes. By making it easy to
scale scientific analysis across multiple commodity
machines, we enable the use of smaller, less expensive
computers, leading to a 63% cost improvement and a
28x improvement in read preprocessing pipeline latency.
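The link between cheaper machines, lower cost, and lower latency is simple arithmetic. The sketch below shows it with made-up numbers: the node counts, hours, and prices are hypothetical and are deliberately not the paper's figures, and the function names are assumptions for the example.

```python
def run_cost(nodes: int, hours: float, price_per_node_hour: float) -> float:
    """Total bill for a run: every node is paid for the whole duration."""
    return nodes * hours * price_per_node_hour


def relative_saving(old_cost: float, new_cost: float) -> float:
    """Fraction of the old cost that is saved (0.25 means 25% cheaper)."""
    return 1.0 - new_cost / old_cost


# Hypothetical: one large, expensive machine running for a long time...
legacy_cost = run_cost(nodes=1, hours=100.0, price_per_node_hour=4.0)
# ...versus many small, cheap machines running for a short time.
cluster_cost = run_cost(nodes=64, hours=2.0, price_per_node_hour=1.5)

print(legacy_cost, cluster_cost)
print(round(relative_saving(legacy_cost, cluster_cost), 2))
print(100.0 / 2.0)  # latency improvement factor for these placeholder runs
```

The cluster uses more node-hours in total (128 against 100), yet costs less because each node-hour is cheaper, and the wall-clock time drops with the node count.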
Q&A
