Rethinking data intensive science using scalable analytics systems

This paper describes ADAM, a genomics pipeline built on Apache Spark and Parquet that achieves a 28x speedup over current pipelines while reducing cost by 63%. ADAM relies on columnar storage, Spark's distributed execution, and data locality to improve performance. The evaluation compares ADAM against GATK, SAMtools, Picard, and Sambamba on BQSR, INDEL realignment, duplicate marking, sort, and Flagstat; ADAM costs more than Sambamba for sorting and duplicate marking but less for every other stage. The system achieves near-linear speedup across 128 nodes, enabling faster and cheaper genomic analysis on commodity clusters.

Rethinking data-intensive science using scalable
analytics systems
Frank Austin Nothaft, Matt Massie, Timothy Danford, Zhao Zhang, Uri
Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher,
Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson

AMPLab, University of California, Berkeley; Cloudera, San Francisco, CA;
Carl Icahn School of Medicine, Mount Sinai, New York, NY; Genomebridge,
Cambridge, MA

Abstract
• In this paper, we describe ADAM, an example genomics
pipeline that leverages the open-source Apache Spark
and Parquet systems to achieve a 28x speedup over
current genomics pipelines, while reducing cost by 63%.
From building this system, we were able to distill a set of
techniques for implementing scientific analyses efficiently
using commodity “big data” systems.

Background
Source: NIH National Human Genome Research Institute

Characteristics of science analysis systems

Layering
• Physical Storage coordinates
data writes to physical media.

• Data Distribution manages
access, replication, and
distribution of the files that
have been written to storage
media.

• Materialized Data specifies
how data is encoded and
stored. This layer determines
I/O bandwidth and
compression.

Layering
• Data Schema specifies the
representation of data, and forms
the narrow waist of the stack that
separates access from execution.

• Evidence Access provides
primitives for processing data,
and enables the transformation
of data into different views and
traversals.

• Presentation enhances the data
schema with convenience
methods for performing common
tasks and accessing common
derived fields from a single
element.

Layering
• Applications use the evidence
access and presentation layers
to compose algorithms for
performing an analysis.
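
The layering can be illustrated with a small sketch. This is not ADAM's code: the `Read` record, its fields, and the helper functions below are hypothetical names chosen for illustration, and the example assumes only the Python standard library. It shows a schema with a derived convenience field (presentation), generic filter and sort primitives (evidence access), and an application composed only from those layers.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


# Data Schema layer: a hypothetical, heavily simplified read record.
@dataclass(frozen=True)
class Read:
    contig: str
    start: int       # start position on the contig
    sequence: str
    mapq: int        # mapping quality

    # Presentation layer: convenience accessor for a derived field.
    @property
    def end(self) -> int:
        return self.start + len(self.sequence)


# Evidence Access layer: generic primitives that produce views of the data.
def filter_reads(reads: Iterable[Read], keep: Callable[[Read], bool]) -> Iterator[Read]:
    return (r for r in reads if keep(r))


def sort_by_position(reads: Iterable[Read]) -> list[Read]:
    return sorted(reads, key=lambda r: (r.contig, r.start))


# Application layer: an analysis built only from the layers above.
def high_quality_span(reads: Iterable[Read], min_mapq: int) -> dict[str, tuple[int, int]]:
    """Per contig, return (min start, max end) over reads with mapq >= min_mapq."""
    spans: dict[str, tuple[int, int]] = {}
    kept = filter_reads(reads, lambda r: r.mapq >= min_mapq)
    for r in sort_by_position(kept):
        lo, hi = spans.get(r.contig, (r.start, r.end))
        spans[r.contig] = (min(lo, r.start), max(hi, r.end))
    return spans
```

Because the application touches only the schema and the access primitives, the storage and distribution layers underneath could be swapped without changing `high_quality_span`, which is the point of the narrow waist.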

Parquet
• OSS Created by Twitter and Cloudera, based on Google
Dremel
• Columnar File Format
• Limit I/O to only data that is needed
• Compresses very well: ADAM files are 5-25% smaller than
BAM files without loss of data
• 3 layers of parallelism: File/row group, Column chunk,
Page
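
A toy sketch of why a columnar layout limits I/O and compresses well. It does not use Parquet or its real encodings; the records, field names, and helper functions are hypothetical, and run-length encoding stands in for the more elaborate encodings a real columnar format applies. Storing each field contiguously lets a query read one column and skip the rest, and puts similar values next to each other where they collapse into short runs.

```python
from itertools import groupby

# Hypothetical records; the field names are illustrative, not ADAM's schema.
rows = [
    {"contig": "chr1", "start": 100, "mapq": 60},
    {"contig": "chr1", "start": 105, "mapq": 60},
    {"contig": "chr1", "start": 230, "mapq": 60},
    {"contig": "chr2", "start": 17, "mapq": 30},
]


def to_columns(records):
    """Pivot row-oriented records into one list per field (a columnar layout)."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    return [[value, len(list(run))] for value, run in groupby(values)]


def project(columns, names):
    """Return only the requested columns; the others are never touched."""
    return {name: columns[name] for name in names}


columns = to_columns(rows)
starts_only = project(columns, ["start"])          # reads one column of three
encoded_contig = run_length_encode(columns["contig"])
```

In a row-oriented file, answering a question about `start` alone would still mean scanning every full record; here only the `start` list is read.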

Parquet/Spark integration
• 1 row group in Parquet maps
to 1 partition in Spark

• We interact with Parquet via
input/output formats

• Spark builds and executes a
computation directed acyclic
graph (DAG), and manages data
locality, errors, and retries
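
The row-group-to-partition mapping and the retry behaviour can be sketched without Spark. This is a simplified stand-in, not Spark's scheduler: the function names are hypothetical, a thread pool plays the role of the executors, and data locality is not modelled at all. Each row group becomes one unit of parallel work, and a failed task is retried a bounded number of times.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_row_groups(records, rows_per_group):
    """Chunk records into fixed-size row groups (the last one may be smaller)."""
    return [records[i:i + rows_per_group]
            for i in range(0, len(records), rows_per_group)]


def run_with_retries(task, partition, max_attempts=3):
    """Run task on one partition, retrying on failure up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(partition)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt


def map_partitions(row_groups, task, workers=4):
    """Treat each row group as one partition and process partitions in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so result i belongs to row group i.
        return list(pool.map(lambda part: run_with_retries(task, part), row_groups))


# Hypothetical mapping-quality values, split into row groups of three.
mapq_values = [60, 60, 30, 0, 60, 12, 45]
row_groups = split_into_row_groups(mapq_values, rows_per_group=3)
mapped_counts = map_partitions(row_groups, lambda part: sum(1 for q in part if q >= 30))
total_mapped = sum(mapped_counts)
```

The per-partition counts are combined only at the end, which mirrors how a Flagstat-style summary can be computed independently on each partition and then reduced.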

Genomics Workloads
• We evaluated ADAM against the GATK [14], SAMtools
[32], Picard [51], and Sambamba [50]. We evaluated the
performance of BQSR, INDEL realignment (IR), duplicate
marking (DM), sort, and Flagstat (FS).

Genomics Workloads
• This data is shown in Table 2. Although ADAM is more
expensive than the best legacy tool (Sambamba [50]) for
sorting and duplicate marking, ADAM is less expensive
for all other stages. In total, using ADAM reduces the
end-to-end analysis cost by 63% over a pipeline
constructed out of solely legacy tools.
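
A worked sketch of how a cost comparison of this kind can be computed. The cost model (hours times instance count times hourly price) is an assumption, and every number below is a placeholder chosen for round arithmetic; none of them comes from Table 2 or Table 3. The example shows how many cheap machines running briefly can cost less than one expensive machine running for a long time.

```python
def run_cost(hours, instances, price_per_instance_hour):
    """Total cost of a run under a simple pay-per-instance-hour model."""
    return hours * instances * price_per_instance_hour


def relative_reduction(baseline_cost, new_cost):
    """Fraction of the baseline cost that is saved (0.25 means 25% cheaper)."""
    return 1.0 - new_cost / baseline_cost


# Placeholder values only, not measurements from the paper.
legacy_cost = run_cost(hours=100.0, instances=1, price_per_instance_hour=2.0)
cluster_cost = run_cost(hours=4.0, instances=25, price_per_instance_hour=0.5)

saving = relative_reduction(legacy_cost, cluster_cost)
latency_speedup = 100.0 / 4.0
```

Under this model a reported 63% reduction would mean the new pipeline costs 37% of the baseline, whatever the underlying hours and prices are.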

Genomics Workloads
• Table 3 describes the instance types.

Genomics Workloads
• We achieve near-linear speedup across 128 nodes
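
"Near-linear speedup" can be made concrete with the usual strong-scaling definitions: speedup is the baseline runtime divided by the runtime on n nodes, and parallel efficiency is that speedup divided by the ideal speedup. The sketch below uses these standard definitions; the timings are hypothetical placeholders, not results from the evaluation.

```python
def speedup(t_base, t_n):
    """How many times faster the run on n nodes is than the baseline run."""
    return t_base / t_n


def parallel_efficiency(t_base, n_base, t_n, n):
    """Measured speedup divided by the ideal speedup n / n_base (1.0 is linear)."""
    return speedup(t_base, t_n) / (n / n_base)


# Hypothetical runtimes in seconds for a fixed-size input, keyed by node count.
timings = {1: 1280.0, 32: 50.0, 128: 12.5}

efficiencies = {
    n: parallel_efficiency(timings[1], 1, t, n) for n, t in timings.items()
}
```

An efficiency that stays close to 1.0 as the node count grows, for a fixed input size, is what strong scaling being near-linear means.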

Conclusion
• By rethinking the architecture of scientific data
management systems, we have been able to achieve
parity on single-node systems, while providing linear
strong scaling out to 128 nodes. By making it easy to
scale scientific analysis across multiple commodity
machines, we enable the use of smaller, less expensive
computers, leading to a 63% cost improvement and a
28x improvement in read preprocessing pipeline latency.
