Spark: A Unified Engine for Big Data Processing

56 COMMUNICATIONS OF THE ACM | NOVEMBER 2016 | VOL. 59 | NO. 11
contributed articles
DOI:10.1145/2934664
This open source computing framework
unifies streaming, batch, and interactive big
data workloads to unlock new applications.
BY MATEI ZAHARIA, REYNOLD S. XIN, PATRICK WENDELL,
TATHAGATA DAS, MICHAEL ARMBRUST, ANKUR DAVE,
XIANGRUI MENG, JOSH ROSEN, SHIVARAM VENKATARAMAN,
MICHAEL J. FRANKLIN, ALI GHODSI, JOSEPH GONZALEZ,
SCOTT SHENKER, AND ION STOICA
THE GROWTH OF data volumes in industry and research
poses tremendous opportunities, as well as tremendous
computational challenges. As data sizes have outpaced
the capabilities of single machines, users have needed
new systems to scale out computations to multiple
nodes. As a result, there has been an explosion of
new cluster programming models targeting diverse
computing workloads.1,4,7,10 At first, these models were
relatively specialized, with new models developed for
new workloads; for example, MapReduce4 supported
batch processing, but Google also developed Dremel13
for interactive SQL queries and Pregel11 for iterative
graph algorithms. In the open source Apache Hadoop stack,
systems like Storm1 and Impala9 are also specialized.
Even in the relational database world, the trend has been to
move away from “one-size-fits-all” systems.18
Unfortunately, most big data
applications need to combine many
different processing types. The very
nature of “big data” is that it is diverse
and messy; a typical pipeline will need
MapReduce-like code for data load-
ing, SQL-like queries, and iterative
machine learning. Specialized engines
can thus create both complexity and
inefficiency; users must stitch together
disparate systems, and some applica-
tions simply cannot be expressed effi-
ciently in any engine.
In 2009, our group at the Univer-
sity of California, Berkeley, started
the Apache Spark project to design
a unified engine for distributed data
processing. Spark has a programming
model similar to MapReduce but extends
it with a data-sharing abstraction
called “Resilient Distributed Datasets,”
or RDDs.25 Using this simple
extension, Spark can capture a wide
range of processing workloads that
previously needed separate engines,
including SQL, streaming, machine
learning, and graph processing2,26,6
(see Figure 1). These implementations
use the same optimizations as special-
ized engines (such as column-oriented
processing and incremental updates)
and achieve similar performance but
run as libraries over a common en-
gine, making them easy and efficient
to compose. Rather than being specific
to these workloads, we claim this result
is more general; when augmented with
data sharing, MapReduce can emulate
any distributed computation, so
it should also be possible to run many
other types of workloads.24
key insights
• A simple programming model can capture streaming, batch, and interactive workloads and enable new applications that combine them.
• Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.
• In six years, Apache Spark has grown to 1,000 contributors and thousands of deployments.
Spark’s generality has several im-
portant benefits. First, applications
are easier to develop because they use a
unified API. Second, it is more efficient
to combine processing tasks; whereas
prior systems required writing the
data to storage to pass it to another en-
gine, Spark can run diverse functions
over the same data, often in memory.
Finally, Spark enables new applica-
tions (such as interactive queries on a
graph and streaming machine learn-
ing) that were not possible with previ-
ous systems. One powerful analogy for
the value of unification is to compare
smartphones to the separate portable
devices that existed before them (such
as cameras, cellphones, and GPS gad-
gets). In unifying the functions of these
devices, smartphones enabled new
applications that combine their func-
tions (such as video messaging and
Waze) that would not have been pos-
sible on any one device.
Since its release in 2010, Spark
has grown to be the most active open
source project for big data processing,
with more than 1,000 contributors. The
project is in use in more than 1,000 or-
ganizations, ranging from technology
companies to banking, retail, biotech-
nology, and astronomy. The largest
publicly announced deployment has
Analyses performed using Spark of brain activity in a larval zebrafish: (left) matrix factorization to characterize functionally similar
regions (as depicted by different colors) and (right) embedding dynamics of whole-brain activity into lower-dimensional trajectories.
Source: Jeremy Freeman and Misha Ahrens, Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA.

across a cluster that can be manipu-
lated in parallel. Users create RDDs by
applying operations called “transfor-
mations” (such as map, filter, and
groupBy) to their data.
Spark exposes RDDs through a func-
tional programming API in Scala, Java,
Python, and R, where users can simply
pass local functions to run on the clus-
ter. For example, the following Scala
code creates an RDD representing the
error messages in a log file, by search-
ing for lines that start with ERROR, and
then prints the total number of errors:
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(s => s.startsWith("ERROR"))
println("Total errors: " + errors.count())
The first line defines an RDD backed
by a file in the Hadoop Distributed File
System (HDFS) as a collection of lines of
text. The second line calls the filter
transformation to derive a new RDD
from lines. Its argument is a Scala
function literal or closure.a Finally, the
last line calls count, another type of
RDD operation called an “action” that
returns a result to the program (here,
the number of elements in the RDD)
instead of defining a new RDD.
a The closures passed to Spark can call into any
existing Scala or Python library or even reference
variables in the outer program. Spark
sends read-only copies of these variables to
worker nodes.
Spark evaluates RDDs lazily, al-
lowing it to find an efficient plan for
the user’s computation. In particular,
transformations return a new RDD ob-
ject representing the result of a compu-
tation but do not immediately compute
it. When an action is called, Spark looks
at the whole graph of transformations
and uses it to create an execution plan. For example,
if there are multiple filter or
map operations in a row, Spark can fuse
them into one pass, or, if it knows that
the data is partitioned, it can avoid moving
it over the network for groupBy.5 Users
can thus build up programs modularly
without losing performance.
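The lazy evaluation described above can be illustrated with a small, self-contained Python sketch. It is not Spark code: `LazyDataset`, `read_log`, and the sample log lines are hypothetical names invented for illustration. Transformations only record what to do, and an action runs all recorded steps in one fused pass over the base data.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


class LazyDataset:
    """Toy stand-in for an RDD: records pending steps instead of running them."""

    def __init__(self, source: Callable[[], Iterable],
                 ops: Optional[List[Tuple[str, Callable]]] = None):
        self._source = source      # function that (re)reads the base data
        self._ops = ops or []      # pending ("map" | "filter", function) steps

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._ops + [("map", fn)])

    def filter(self, pred):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._ops + [("filter", pred)])

    def _run(self) -> Iterator:
        # All pending steps are applied record by record in a single pass,
        # which is the "fusion" of consecutive map and filter steps.
        for record in self._source():
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    def count(self) -> int:
        # Action: triggers the computation and returns a value.
        return sum(1 for _ in self._run())

    def collect(self) -> list:
        # Action: triggers the computation and returns all records.
        return list(self._run())


reads = []  # one entry per scan of the base data


def read_log():
    reads.append("scan")
    return ["ERROR MySQL down", "INFO ok", "ERROR PHP crash", "ERROR MySQL slow"]


lines = LazyDataset(read_log)
errors = lines.filter(lambda s: s.startswith("ERROR"))
mysql = errors.filter(lambda s: "MySQL" in s).map(str.lower)

scans_before_action = len(reads)  # nothing has been read yet
mysql_count = mysql.count()       # the action runs both filters and the map
scans_after_action = len(reads)   # a single scan served all three steps
```

Because each action rescans the base data, calling `count()` twice in this sketch would read the log twice, which is the "ephemeral" default behavior that persisting a dataset is meant to avoid.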
Finally, RDDs provide explicit sup-
port for data sharing among compu-
tations. By default, RDDs are “ephem-
eral” in that they get recomputed each
time they are used in an action (such
as count). However, users can also
persist selected RDDs in memory
for rapid reuse. (If the data does not
fit in memory, Spark will also spill it
to disk.) For example, a user searching
through a large set of log files in HDFS
to debug a problem might load just the
error messages into memory across the
cluster by calling
errors.persist()
After this, the user can run a variety of
queries on the in-memory data:
// Count errors mentioning MySQL
errors.filter(s => s.contains("MySQL")).count()
// Fetch back the time fields of errors that
// mention PHP, assuming time is field #3:
errors.filter(s => s.contains("PHP"))
  .map(line => line.split('\t')(3))
  .collect()
This data sharing is the main differ-
ence between Spark and previous com-
puting models like MapReduce; other-
wise, the individual operations (such
as map and groupBy) are similar. Data
sharing provides large speedups, often
as much as 100×, for interactive queries
and iterative algorithms.23 It is also
the key to Spark’s generality, as we discuss later.
Fault tolerance. Apart from provid-
ing data sharing and a variety of paral-
more than 8,000 nodes.22
As Spark has
grown, we have sought to keep building
on its strength as a unified engine. We
(and others) have continued to build an
integrated standard library over Spark,
with functions from data import to ma-
chine learning. Users find this ability
powerful; in surveys, we find the major-
ity of users combine multiple of Spark’s
libraries in their applications.
As parallel data processing becomes
common, the composability of process-
ing functions will be one of the most
important concerns for both usability
and performance. Much of data analy-
sis is exploratory, with users wishing to
combine library functions quickly into
a working pipeline. However, for “big
data” in particular, copying data be-
tween different systems is anathema to
performance. Users thus need abstrac-
tions that are general and composable.
In this article, we introduce the Spark
programming model and explain why it
is highly general. We also discuss how
we leveraged this generality to build
other processing tasks over it. Finally,
we summarize Spark’s most common
applications and describe ongoing de-
velopment work in the project.
Programming Model
The key programming abstraction in
Spark is RDDs, which are fault-toler-
ant collections of objects partitioned
Figure 1. Apache Spark software stack, with specialized processing libraries implemented
over the core engine.
[Diagram: SQL, Streaming, ML, and Graph libraries layered over the core engine.]

lel operations, RDDs also automatical-
ly recover from failures. Traditionally,
distributed computing systems have
provided fault tolerance through data
replication or checkpointing. Spark
uses a different approach called “lineage.”25
Each RDD tracks the graph of
transformations that was used to build
it and reruns these operations on base
data to reconstruct any lost partitions.
For example, Figure 2 shows the RDDs in
our previous query, where we obtain the
time fields of errors mentioning PHP by
applying two filters and a map. If any
partition of an RDD is lost (for example,
if a node holding an in-memory partition
of errors fails), Spark will rebuild it by
applying the filter on the corresponding
block of the HDFS file. For “shuffle” operations
that send data from all nodes to
all other nodes (such as reduceByKey),
senders persist their output data locally
in case a receiver fails.
Lineage-based recovery is signifi-
cantly more efficient than replication
in data-intensive workloads. It saves
both time, because writing data over
the network is much slower than writ-
ing it to RAM, and storage space in
memory. Recovery is typically much
faster than simply rerunning the pro-
gram, because a failed node usually
contains multiple RDD partitions, and
these partitions can be rebuilt in paral-
lel on other nodes.
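A minimal Python sketch of lineage-based recovery follows. It is a toy model under stated assumptions, not Spark's implementation: the `Partitioned` class, `read_block`, and the sample blocks are hypothetical, and persisting is done eagerly for simplicity. Each dataset keeps the function that derives a partition from its parent, so a lost partition is rebuilt by rereading only the corresponding base block.

```python
class Partitioned:
    """Toy partitioned dataset that remembers how each partition is derived."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute    # lineage: how to rebuild partition i
        self._persisted = False
        self._cache = {}           # in-memory copies of persisted partitions

    def filter(self, pred):
        parent = self
        # The child partition i depends only on the parent partition i.
        return Partitioned(
            self.num_partitions,
            lambda i: [x for x in parent.partition(i) if pred(x)],
        )

    def persist(self):
        # Materialized eagerly here to keep the sketch short.
        self._persisted = True
        for i in range(self.num_partitions):
            self.partition(i)
        return self

    def partition(self, i):
        if i in self._cache:
            return self._cache[i]
        data = self._compute(i)    # rerun the lineage for this partition only
        if self._persisted:
            self._cache[i] = data
        return data

    def lose_partition(self, i):
        # Simulates the failure of the node holding partition i.
        self._cache.pop(i, None)


blocks = [["ERROR a", "INFO b"], ["ERROR PHP c", "WARN d"], ["ERROR e"]]
block_reads = []  # which base blocks were read, in order


def read_block(i):
    block_reads.append(i)
    return list(blocks[i])


lines = Partitioned(len(blocks), read_block)
errors = lines.filter(lambda s: s.startswith("ERROR")).persist()

block_reads.clear()
errors.lose_partition(1)            # a node fails
recovered = errors.partition(1)     # rebuilt from block 1 alone
reads_after_recovery = list(block_reads)
```

Only block 1 is reread during recovery; the other cached partitions are untouched, which is why recovery can proceed in parallel and without replicating the data up front.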
A longer example. As a longer exam-
ple, Figure 3 shows an implementa-
tion of logistic regression in Spark.
It uses batch gradient descent, a
simple iterative algorithm that
computes a gradient function over
the data repeatedly as a parallel
sum. Spark makes it easy to load the
data into RAM once and run multiple
sums. As a result, it runs faster than
traditional MapReduce. For example,
in a 100GB job (see Figure 4), MapRe-
duce takes 110 seconds per iteration
because each iteration loads the data
from disk, while Spark takes only one
second per iteration after the first load.
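The parallel sum computed in each iteration of Figure 3 can be written out explicitly. The following is a sketch under assumptions the article does not state: labels are taken to be $y_i \in \{-1, +1\}$, the objective is the standard logistic loss, and the step size is 1, matching the update `w -= gradient` in the figure.

```latex
\[
\ell(w) = \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\, w^{\top} x_i}\right),
\qquad
\nabla \ell(w) = \sum_{i=1}^{n}
  \left(\frac{1}{1 + e^{-y_i\, w^{\top} x_i}} - 1\right) y_i\, x_i,
\qquad
w \leftarrow w - \nabla \ell(w).
\]
```

Each term of the gradient depends on a single point $(x_i, y_i)$ and the current $w$, so the map step evaluates the terms independently and the reduce step adds them; only $w$ changes between iterations, which is why keeping the points in memory pays off.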
Integration with storage systems.
Much like Google’s MapReduce,
Spark is designed to be used with
multiple external systems for persistent
storage. Spark is most commonly
used with cluster file systems
like HDFS and key-value stores like
S3 and Cassandra. It can also connect
with Apache Hive as a data catalog.
RDDs usually store only temporary
data within an application, though
some applications (such as the Spark
SQL JDBC server) also share RDDs
across multiple users.2 Spark’s design
as a storage-system-agnostic
engine makes it easy for users to run
computations against existing data
and join diverse data sources.
Higher-Level Libraries
The RDD programming model provides
only distributed collections of
objects and functions to run on them.
Using RDDs, however, we have built
a variety of higher-level libraries on
Spark, targeting many of the use cases
of specialized computing engines.
The key idea is that if we control the
data structures stored inside RDDs,
the partitioning of data across nodes,
and the functions run on them, we can
implement many of the execution techniques
in other engines. Indeed, as we
show in this section, these libraries
often achieve state-of-the-art performance
on each task while offering significant
benefits when users combine
them. We now discuss the four main
libraries included with Apache Spark.
SQL and DataFrames. One of the
most common data processing paradigms
is relational queries. Spark SQL2
and its predecessor, Shark,23 implement
such queries on Spark, using
techniques similar to analytical databases.
For example, these systems
support columnar storage, cost-based
optimization, and code generation for
query execution. The main idea behind
these systems is to use the same data
layout as analytical databases (compressed
columnar storage) inside
RDDs. In Spark SQL, each record in an
RDD holds a series of rows stored in binary
format, and the system generates
Figure 2. Lineage graph for the third query
in our example; boxes represent RDDs, and
arrows represent transformations.
lines → filter(line.startsWith("ERROR")) → errors → filter(line.contains("PHP")) → PHP errors → map(line.split('\t')(3)) → time fields
Figure 3. A Scala implementation of logistic regression via batch gradient descent in Spark.
// Load data into an RDD
// (readPoint, D, and ITERATIONS are assumed to be defined elsewhere)
val points = sc.textFile(...).map(readPoint).persist()
// Start with a random parameter vector
var w = DenseVector.random(D)
// On each iteration, update param vector with a sum
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w.dot(p.x))))-1) * p.y
  }.reduce((a, b) => a+b)
  w -= gradient
}
Figure 4. Performance of logistic regression in Hadoop MapReduce vs. Spark for 100GB of
data on 50 m2.4xlarge EC2 nodes.
[Plot: running time (s) against number of iterations (1, 5, 10, 20) for Spark and Hadoop.]

means model) are easily passed to oth-
er libraries. Apart from compatibility
at the API level, composition in Spark
is also efficient at the execution level,
because Spark can optimize across pro-
cessing libraries. For example, if one li-
brary runs a map function and the next
library runs a map on its result, Spark
will fuse these operations into a single
map. Likewise, Spark’s fault recovery
works seamlessly across these librar-
ies, recomputing lost data no matter
which libraries produced it.
Performance. Given that these librar-
ies run over the same engine, do they
lose performance? We found that by
implementing the optimizations we
just outlined within RDDs, we can often
match the performance of specialized
engines. For example, Figure 6 com-
pares Spark’s performance on three
simple tasks—a SQL query, stream-
ing word count, and Alternating Least
Squares matrix factorization—versus
other engines. While the results vary
across workloads, Spark is generally
comparable with specialized systems
like Storm, GraphLab, and Impala.b For
stream processing, although we show
results from a distributed implementa-
tion on Storm, the per-node through-
put is also comparable to commercial
streaming engines like Oracle CEP.26
Even in highly competitive bench-
marks, we have achieved state-of-the-
art performance using Apache Spark.
In 2014, we entered the Daytona GraySort
benchmark (http://sortbenchmark.org/)
involving sorting 100TB of
data on disk, and tied for a new record
with a specialized system built only
for sorting on a similar number of ma-
chines. As in the other examples, this
was possible because we could imple-
ment both the communication and
CPU optimizations necessary for large-
scale sorting inside the RDD model.
Applications
Apache Spark is used in a wide range
of applications. Our surveys of Spark
b One area in which other designs have outperformed
Spark is certain graph computations.12,16
However, these results are for algorithms with
low ratios of computation to communication
(such as PageRank) where the latency from synchronized
communication in Spark is significant.
In applications with more computation
(such as the ALS algorithm) distributing the application
on Spark still helps.
code to run directly against this layout.
Beyond running SQL queries,
we have used the Spark SQL engine
to provide a higher-level abstraction
for basic data transformations
called DataFrames,2 which are RDDs
of records with a known schema.
DataFrames are a common abstraction
for tabular data in R and Python, with
programmatic methods for filtering,
computing new columns, and aggrega-
tion. In Spark, these operations map
down to the Spark SQL engine and re-
ceive all its optimizations. We discuss
DataFrames more later.
One technique not yet implemented
in Spark SQL is indexing, though other
libraries over Spark (such as IndexedRDDs3)
do use it.
Spark Streaming. Spark Streaming26
implements incremental stream processing
using a model called “discretized
streams.” To implement streaming over
Spark, we split the input data into small
batches (such as every 200 milliseconds)
that we regularly combine with state
stored inside RDDs to produce new results.
Running streaming computations
this way has several benefits over tradi-
tional distributed streaming systems.
For example, fault recovery is less expen-
sive due to using lineage, and it is pos-
sible to combine streaming with batch
and interactive queries.
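The discretized-stream idea can be sketched in plain Python. This is an illustration under assumptions, not the Spark Streaming API: `discretize`, `run_word_count`, and the sample events are hypothetical names, and timestamps are integer milliseconds. The input is cut into fixed 200ms batches, and each batch is folded into running state with an ordinary batch computation.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def discretize(events: Iterable[Tuple[int, str]], batch_ms: int) -> List[List[str]]:
    """Group (timestamp_ms, record) events into consecutive fixed-length batches."""
    by_interval: Dict[int, List[str]] = {}
    for timestamp_ms, record in events:
        by_interval.setdefault(timestamp_ms // batch_ms, []).append(record)
    if not by_interval:
        return []
    # Intervals with no input still produce an (empty) batch.
    return [by_interval.get(i, []) for i in range(max(by_interval) + 1)]


def run_word_count(batches: List[List[str]]) -> List[Counter]:
    """Combine each batch with the running state, one snapshot per interval."""
    state = Counter()
    snapshots = []
    for batch in batches:
        batch_counts = Counter(word for line in batch for word in line.split())
        state = state + batch_counts   # builds a new state object each interval
        snapshots.append(state)
    return snapshots


events = [(50, "spark streaming"), (150, "spark"), (450, "rdd spark"), (700, "rdd")]
batches = discretize(events, batch_ms=200)
snapshots = run_word_count(batches)
```

Because every interval's state is derived from the previous state and one batch, a lost snapshot could be rebuilt from its inputs rather than replicated, which mirrors the lineage-based recovery argument made for RDDs.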
GraphX. GraphX6 provides a graph
computation interface similar to Pregel
and GraphLab,10,11 implementing the
same placement optimizations as these
systems (such as vertex partitioning
schemes) through its choice of partitioning
function for the RDDs it builds.
MLlib. MLlib,14 Spark’s machine
learning library, implements more
than 50 common algorithms for dis-
tributed model training. For example, it
includes the common distributed algo-
rithms of decision trees (PLANET), La-
tent Dirichlet Allocation, and Alternat-
ing Least Squares matrix factorization.
Combining processing tasks. Spark’s
libraries all operate on RDDs as the
data abstraction, making them easy to
combine in applications. For example,
Figure 5 shows a program that reads
some historical Twitter data using
Spark SQL, trains a K-means clustering
model using MLlib, and then applies
the model to a new stream of tweets.
The data sets returned by each library
(here the historic tweet RDD and the K-

users have identified more than 1,000
companies using Spark, in areas from
Web services to biotechnology to fi-
nance. In academia, we have also seen
applications in several scientific do-
mains. Across these workloads, we find
users take advantage of Spark’s gener-
ality and often combine multiple of its
libraries. Here, we cover a few top use
cases. Presentations on many use cases
are also available on the Spark Summit
conference website (http://www.spark-summit.org).
Batch processing. Spark’s most com-
mon applications are for batch proc-
essing on large datasets, including
Extract-Transform-Load workloads to
convert data from a raw format (such
as log files) to a more structured for-
mat and offline training of machine
learning models. Published examples
of these workloads include page per-
sonalization and recommendation at
Yahoo!; managing a data lake at Gold-
man Sachs; graph mining at Alibaba;
financial Value at Risk calculation; and
text mining of customer feedback at
Toyota. The largest published use case
we are aware of is an 8,000-node cluster
at Chinese social network Tencent that
ingests 1PB of data per day.22
While Spark can process data in
memory, many of the applications in
this category run only on disk. In such
cases, Spark can still improve perfor-
mance over MapReduce due to its sup-
port for more complex operator graphs.
Interactive queries. Interactive use of
Spark falls into three main classes. First,
organizations use Spark SQL for relational
queries, often through business-intelligence
tools like Tableau. Examples
include eBay and Baidu. Second, developers
and data scientists can use Spark’s
Scala, Python, and R interfaces interactively
through shells or visual notebook
environments. Such interactive use is
crucial for asking more advanced questions
and for designing models that
eventually lead to production applications
and is common in all deployments.
Third, several vendors have developed
domain-specific interactive applications
that run on Spark. Examples include
Tresata (anti-money laundering), Trifacta
(data cleaning), and PanTera (large-scale
visualization, as in Figure 7).
Stream processing. Real-time processing
is also a popular use case, both
in analytics and in real-time decision-making
applications. Published use
cases for Spark Streaming include
network security monitoring at Cisco,
prescriptive analytics at Samsung
SDS, and log mining at Netflix. Many
of these applications also combine
streaming with batch and interactive
queries. For example, video company
Conviva uses Spark to continuously
maintain a model of content distribution
server performance, querying it
automatically when it moves clients
Figure 7. PanTera, a visualization application built on Spark that can interactively filter data.
Source: PanTera
Figure 5. Example combining the SQL, machine learning, and streaming libraries in Spark.
// Load historical data as an RDD using Spark SQL
val trainingData = sql(
  "SELECT location, language FROM old_tweets")
// Train a K-means model using MLlib
val model = new KMeans()
  .setFeaturesCol("location")
  .setPredictionCol("language")
  .fit(trainingData)
// Apply the model to new tweets in a stream
TwitterUtils.createStream(...)
  .map(tweet => model.predict(tweet.location))
Figure 6. Comparing Spark’s performance with several widely used specialized systems
for SQL, streaming, and machine learning. Data is from Zaharia24
(SQL query and streaming word count) and Sparks et al.17
(alternating least squares matrix factorization).
[Three bar charts. SQL: response time (sec) for Impala (disk), Impala (mem), Redshift, Spark (disk), and Spark (mem). Streaming: throughput (records/s) for Storm and Spark. Machine Learning: response time (hours) for MATLAB, Mahout, GraphLab, and Spark.]

queries during live experiments. Figure
8 shows an example image generated
using Spark.
Spark components used. Because
Spark is a unified data-processing en-
gine, the natural question is how many
of its libraries organizations actually
use. Our surveys of Spark users have
shown that organizations do, indeed,
use multiple components, with over
60% of organizations using at least
three of Spark’s APIs. Figure 9 out-
lines the usage of each component in
a July 2015 Spark survey by Databricks
that reached 1,400 respondents. We
list the Spark Core API (just RDDs)
as one component and the higher-
level libraries as others. We see that
many components are widely used,
with Spark Core and SQL as the most
popular. Streaming is used in 46% of
organizations and machine learning
in 54%. While not shown directly in
Figure 9, most organizations use mul-
tiple components; 88% use at least two
of them, 60% use at least three (such
as Spark Core and two libraries), and
27% use at least four components.
Deployment environments. We also
see growing diversity in where Apache
Spark applications run and what data
sources they connect to. While the first
Spark deployments were generally in
Hadoop environments, only 40% of de-
ployments in our July 2015 Spark sur-
vey were on the Hadoop YARN cluster
manager. In addition, 52% of respon-
dents ran Spark on a public cloud.
Why Is the Spark Model General?
While Apache Spark demonstrates
that a unified cluster programming
model is both feasible and useful, it
would be helpful to understand what
makes cluster programming models
general, along with Spark’s limitations.
Here, we summarize a discussion
on the generality of RDDs from
Zaharia.24 We study RDDs from two
perspectives. First, from an expres-
siveness point of view, we argue that
RDDs can emulate any distributed
computation, and will do so efficient-
ly in many cases unless the computa-
tion is sensitive to network latency.
Second, from a systems point of view,
we show that RDDs give applications
control over the most common bottle-
neck resources in clusters—network and
storage I/O—and thus make it possible
to express the same optimizations
for these resources that characterize
specialized systems.
Expressiveness perspective. To study the
expressiveness of RDDs, we start by comparing
RDDs to the MapReduce model,
which RDDs build on. The first question
is what computations can MapReduce
itself express? Although there have been
numerous discussions about the limitations
of MapReduce, the surprising answer
here is that MapReduce can emulate
any distributed computation.
To see this, note that any distributed
computation consists of nodes that perform
local computation and occasionally
exchange messages. MapReduce offers
the map operation, which allows local
computation, and reduce, which allows
all-to-all communication. Any distributed
computation can thus be emulated,
perhaps somewhat inefficiently, by
breaking down its work into timesteps,
across servers, in an application that
requires substantial parallel work for
both model maintenance and queries.
Scientific applications. Spark has also
been used in several scientific domains,
including large-scale spam detection,19
image processing,27 and genomic data
processing.15 One example that combines
batch, interactive, and stream
processing is the Thunder platform
for neuroscience at Howard Hughes
Medical Institute, Janelia Farm.5 It is
designed to process brain-imaging data
from experiments in real time, scaling
up to 1TB/hour of whole-brain imaging
data from organisms (such as zebrafish
and mice). Using Thunder, researchers
can apply machine learning algorithms
(such as clustering and Principal Com-
ponent Analysis) to identify neurons in-
volved in specific behaviors. The same
code can be run in batch jobs on data
from previous runs or in interactive
Figure 9. Percent of organizations using each Spark component, from the Databricks 2015
Spark survey; https://databricks.com/blog/2015/09/24/.
[Bar chart: fraction of users (0% to 100%) for Core, SQL, Streaming, MLlib, and GraphX.]
Figure 8. Visualization of neurons in the zebrafish brain created with Spark, where each
neuron is colored based on the direction of movement that correlates with its activity.
Source: Jeremy Freeman and Misha Ahrens of Janelia Research Campus.

running maps to perform the local
computation in each timestep, and
batching and exchanging messages at
the end of each step using a reduce. A
series of MapReduce steps will capture
the whole result, as in Figure 10. Re-
cent theoretical work has formalized
this type of emulation by showing that
MapReduce can simulate many computations
in the Parallel Random Access
Machine model.8 Repeated MapReduce
is also equivalent to the Bulk
Synchronous Parallel model.20
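The emulation argument can be made concrete with a short Python sketch. It is an illustration only, with hypothetical names (`map_step`, `reduce_step`, `emulate`) and a made-up example computation: four nodes on a directed ring agree on the maximum of their values. Each timestep is one map phase (local computation per node) followed by one reduce phase (messages grouped by destination).

```python
from collections import defaultdict


def map_step(node_id, state, inbox, neighbors):
    """Local computation: update the state from the inbox and emit messages."""
    new_state = max([state] + inbox)
    outgoing = [(dest, new_state) for dest in neighbors[node_id]]
    return new_state, outgoing


def reduce_step(all_messages):
    """All-to-all communication: group messages by destination node."""
    inboxes = defaultdict(list)
    for dest, message in all_messages:
        inboxes[dest].append(message)
    return inboxes


def emulate(initial, neighbors, timesteps):
    """Run the distributed computation as a chain of map and reduce rounds."""
    states = dict(initial)
    inboxes = defaultdict(list)
    for _ in range(timesteps):
        messages = []
        for node_id in states:
            states[node_id], out = map_step(
                node_id, states[node_id], inboxes[node_id], neighbors)
            messages.extend(out)
        inboxes = reduce_step(messages)   # state passed to the next round
    return states


ring = {0: [1], 1: [2], 2: [3], 3: [0]}      # each node sends to its successor
values = {0: 5, 1: 1, 2: 9, 3: 3}

after_two_rounds = emulate(values, ring, timesteps=2)
after_four_rounds = emulate(values, ring, timesteps=4)
```

The number of rounds needed grows with the ring length, which illustrates the two costs named in the text: the latency of each round and the overhead of carrying state from one round to the next.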
While this line of work shows that
MapReduce can emulate arbitrary
computations, two problems can
make the “constant factor” behind
this emulation high. First, MapReduce
is inefficient at sharing data across
timesteps because it relies on repli-
cated external storage systems for this
purpose. Our emulated system may
thus become slower due to writing
out its state after each step. Second,
the latency of the MapReduce steps
determines how well our emulation
will match a real network, and most
MapReduce implementations were
designed for batch environments with
minutes to hours of latency.
RDDs and Spark address both of
these limitations. On the data-sharing
front, RDDs make data sharing fast by
avoiding replication of intermediate data
and can closely emulate the in-memory
“data sharing” across time that would
happen in a system composed of long-
running processes. On the latency front,
Spark can run MapReduce-like steps
on large clusters with 100ms latency;
nothing intrinsic to the MapReduce model
prevents this. While some applications
need finer-grain timesteps and commu-
nication, this 100ms latency is enough
to implement many data-intensive
workloads, where the amount of com-
putation that can be batched before a
communication step is high.
In summary, RDDs build on Map-
Reduce’s ability to emulate any dis-
tributed computation but make this
emulation significantly more efficient.
Their main limitation is increased
latency due to synchronization in each
communication step, but this latency
is often not a factor.
Systems perspective. Independent
of the emulation approach to characterizing
Spark’s generality, we can
take a systems approach. What are the
bottleneck resources in cluster computations?
And can RDDs use them efficiently?
Although cluster applications
are diverse, they are all bound by the
same properties of the underlying hardware.
Current datacenters have a steep
storage hierarchy that limits most applications
in similar ways. For example,
a typical Hadoop cluster might have the
following characteristics:
Local storage. Each node has local
memory with approximately 50GB/s
of bandwidth, as well as 10 to 20 local
disks, for approximately 1GB/s to
2GB/s of disk bandwidth;
Links. Each node has a 10Gbps
(1.3GB/s) link, or approximately 40×
less than its memory bandwidth and
2× less than its aggregate disk bandwidth; and
Racks. Nodes are organized into racks
of 20 to 40 machines, with 40Gbps–
80Gbps bandwidth out of each rack,
or 2×–5× lower than the in-rack network
performance.
Given these properties, the most
important performance concern in
many applications is the placement of
data and computation in the network.
Fortunately, RDDs provide the facili-
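The ratios quoted for this storage hierarchy can be checked with a few lines of Python arithmetic. This is a back-of-the-envelope sketch: the variable names are made up, decimal units are assumed (1GB = 10^9 bytes), and in-rack capacity is assumed to be the sum of the node links, which the article does not spell out.

```python
BYTES_PER_GBIT = 1e9 / 8            # bytes per second carried by 1Gbps

memory_bw = 50e9                    # about 50GB/s of memory bandwidth per node
disk_bw_range = (1e9, 2e9)          # about 1GB/s to 2GB/s across local disks
link_bw = 10 * BYTES_PER_GBIT       # 10Gbps link, 1.25GB/s (rounded to 1.3GB/s in the text)

memory_vs_link = memory_bw / link_bw                        # about 40x
disk_vs_link = tuple(d / link_bw for d in disk_bw_range)    # below 1x up to about 2x


def rack_oversubscription(machines, uplink_gbps, link_gbps=10):
    """Ratio of summed node link capacity to the bandwidth out of the rack."""
    return machines * link_gbps / uplink_gbps


small_rack_slow_uplink = rack_oversubscription(20, 40)   # 200Gbps in rack vs 40Gbps out
small_rack_fast_uplink = rack_oversubscription(20, 80)   # 200Gbps in rack vs 80Gbps out
```

Under these assumptions a 20-machine rack is oversubscribed by a factor of 2.5 to 5, in line with the range quoted in the text; larger racks with the same uplinks would be more constrained still, which is why data placement matters so much.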
Figure 11. Example of Spark’s DataFrame API in Python. Unlike Spark’s core API, DataFrames
have a schema with named columns (such as age and city) and take expressions in a limited
language (such as age > 20) instead of arbitrary Python functions.
from pyspark.sql.functions import avg, max
(users.where(users["age"] > 20)
      .groupBy("city")
      .agg(avg("age"), max("income")))
Figure 12. Working with DataFrames in Spark’s R API. We load a distributed DataFrame
using Spark’s JSON data source, then filter and aggregate using standard R column ex-
pressions.
people <- read.df(context, "./people.json", "json")
# Filter people by age
adults = filter(people, people$age > 20)
# Count number of people by city
summarize(groupBy(adults, adults$city), count=n(adults$id))
##            city count
## 1     Cambridge     1
## 2 San Francisco     6
## 3      Berkeley     4
Figure 10. Emulating an arbitrary distributed computation with MapReduce.
(a) MapReduce provides primitives for local computation (map) and all-to-all communication (reduce).
(b) By chaining these steps together, we can emulate any distributed computation. The main costs for this emulation are the latency of the rounds and the overhead of passing state across steps.

ity in new libraries. More than 200 third-party
packages are also available.c In the
research community, multiple projects
at Berkeley, MIT, and Stanford build on
Spark, and many new libraries (such
as GraphX and Spark Streaming) came
from research groups. Here, we sketch
four of the major efforts.
DataFrames and more declarative
APIs. The core Spark API was based on
functional programming over distrib-
uted collections that contain arbitrary
types of Scala, Java, or Python objects.
While this approach was highly ex-
pressive, it also made programs more
difficult to automatically analyze and
optimize. The Scala/Java/Python ob-
jects stored in RDDs could have com-
plex structure, and the functions run
over them could include arbitrary
code. In many applications, develop-
ers could get suboptimal performance
if they did not use the right operators;
for example, the system on its own
could not push filter functions
ahead of maps.
To address this problem, we extended
Spark in 2015 to add a more declarative
API called DataFrames² based on
the relational algebra. Data frames are
a common API for tabular data in Py-
thon and R. A data frame is a set of re-
cords with a known schema, essentially
equivalent to a database table, that
supports operations like filtering
and aggregation using a restricted
“expression” API. Unlike working in
the SQL language, however, data frame
operations are invoked as function
calls in a more general programming
language (such as Python and R), al-
lowing developers to easily structure
their program using abstractions in the
host language (such as functions and
classes). Figure 11 and Figure 12 show
examples of the API.
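To make the contrast with arbitrary functions concrete, here is a minimal pure-Python sketch of how a restricted expression API can work. It is not Spark's implementation, and the names `Col`, `Gt`, `columns`, and `evaluate` are hypothetical. The comparison operator on a column object builds a small expression tree rather than computing a boolean, so an engine can inspect which columns the predicate reads.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Col:
    """A reference to a named column; hypothetical stand-in for users["age"]."""
    name: str

    def __gt__(self, other: Any) -> "Gt":
        # Build an expression node instead of computing a boolean.
        return Gt(self, other)


@dataclass(frozen=True)
class Gt:
    """Expression node representing `column > literal`."""
    col: Col
    value: Any

    def columns(self) -> set:
        # Because the expression is data, an engine can see what it reads.
        return {self.col.name}

    def evaluate(self, row: Dict[str, Any]) -> bool:
        return row[self.col.name] > self.value


expr = Col("age") > 20                # an inspectable tree, not True/False
opaque = lambda row: row["age"] > 20  # same logic, but a black box to an engine

rows = [{"age": 25, "city": "Berkeley"}, {"age": 18, "city": "Cambridge"}]
kept = [r for r in rows if expr.evaluate(r)]
```

Both `expr` and `opaque` select the same rows, but only `expr` can report that it touches just the `age` column, which is the kind of information a query planner needs.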
Spark’s DataFrames offer a similar
API to single-node packages but auto-
matically parallelize and optimize the
computation using Spark SQL’s query
planner. User code thus receives op-
timizations (such as predicate push-
down, operator reordering, and join
algorithm selection) that were not
available under Spark’s functional API.
To our knowledge, Spark DataFrames
are the first library to perform such
relational optimizations under a data
frame API.ᵈ
c One package index is available at https://spark-packages.org/
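The following toy sketch illustrates why deferring execution enables operator reordering. It is a simplified illustration under stated assumptions, not Spark SQL's planner: the class `LazyFrame` and its methods `derive`, `where_gt`, `optimized_plan`, and `collect` are hypothetical, and derive functions are assumed to be pure per-row functions. Operations are only recorded until `collect` is called, so the full plan is visible and a filter can be moved ahead of a derived-column step that does not produce the column it reads.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


class LazyFrame:
    """Records operations instead of running them (hypothetical, not Spark's classes)."""

    def __init__(self, rows: List[Row], plan: Tuple = ()):
        self.rows = rows
        self.plan = plan  # tuple of ("derive", name, fn) or ("filter", column, value)

    def derive(self, name: str, fn: Callable[[Row], Any]) -> "LazyFrame":
        return LazyFrame(self.rows, self.plan + (("derive", name, fn),))

    def where_gt(self, column: str, value: Any) -> "LazyFrame":
        return LazyFrame(self.rows, self.plan + (("filter", column, value),))

    def optimized_plan(self) -> list:
        # Move each filter ahead of a preceding derive step, unless that
        # step produces the column the filter reads.
        plan = list(self.plan)
        changed = True
        while changed:
            changed = False
            for i in range(1, len(plan)):
                prev, cur = plan[i - 1], plan[i]
                if cur[0] == "filter" and prev[0] == "derive" and prev[1] != cur[1]:
                    plan[i - 1], plan[i] = cur, prev
                    changed = True
        return plan

    def collect(self) -> List[Row]:
        # Nothing runs until here, so the whole plan can be optimized first.
        rows = [dict(r) for r in self.rows]
        for op in self.optimized_plan():
            if op[0] == "filter":
                rows = [r for r in rows if r[op[1]] > op[2]]
            else:
                for r in rows:
                    r[op[1]] = op[2](r)
        return rows


calls = []

def decade(row: Row) -> int:
    calls.append(row["age"])  # track how many rows the derive step touches
    return row["age"] // 10

users = LazyFrame([{"age": 25}, {"age": 18}, {"age": 41}])
result = users.derive("decade", decade).where_gt("age", 20).collect()
```

Although the user wrote the derive step first, the filter runs first, so `decade` is evaluated only for the two rows that survive it. An eager API would have computed the derived column for all three rows before the filter was even known.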
While DataFrames are still new,
they have quickly become a popular
API. In our July 2015 survey, 60% of
respondents reported using them. Because
of the success of DataFrames,
we have also developed a type-safe interface
over them called Datasetsᵉ that
lets Java and Scala programmers view
DataFrames as statically typed col-
lections of Java objects, similar to the
RDD API, and still receive relational
optimizations. We expect these APIs
to gradually become the standard ab-
straction for passing data between
Spark libraries.
Performance optimizations. Much of
the recent work in Spark has been on
performance. In 2014, the Databricks team
spent considerable effort to optimize
Spark’s network and I/O primitives,
allowing Spark to jointly set a new record
for the Daytona GraySort challenge.ᶠ
Spark sorted 100TB of data 3× faster
than the previous record holder based
on Hadoop MapReduce using 10× fewer
machines. This benchmark was not
executed in memory but rather on
(solid-state) disks. In 2015, one major
effort was Project Tungsten,ᵍ which
removes Java Virtual Machine overhead
from many of Spark’s code paths by
using code generation and
non-garbage-collected memory. One benefit
of doing these optimizations in a general
engine is that they simultaneously
affect all of Spark’s libraries;
machine learning, streaming, and SQL
all became faster from each change.
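As a rough analogy only, the sketch below shows the general idea of keeping records in one contiguous binary buffer instead of allocating one object per record. It is written in Python rather than for the JVM and does not reproduce Tungsten's actual row format; the record layout (`RECORD`) and the helper names `pack_rows` and `sum_income` are assumptions made for illustration.

```python
import struct

# Hypothetical fixed-width record: 32-bit int age, 64-bit float income.
# "<" selects little-endian standard sizes with no padding (4 + 8 bytes).
RECORD = struct.Struct("<id")


def pack_rows(rows):
    """Store all records back to back in a single bytearray."""
    buf = bytearray(RECORD.size * len(rows))
    for i, (age, income) in enumerate(rows):
        RECORD.pack_into(buf, i * RECORD.size, age, income)
    return buf


def sum_income(buf):
    """Scan the shared buffer, decoding fields on demand at fixed offsets."""
    total = 0.0
    for offset in range(0, len(buf), RECORD.size):
        _, income = RECORD.unpack_from(buf, offset)
        total += income
    return total


rows = [(25, 50000.0), (41, 72000.5), (33, 61000.0)]
buf = pack_rows(rows)
total = sum_income(buf)
```

The buffer holds the three records in 36 bytes as one allocation, so the memory manager tracks a single object regardless of the record count. That is the property this analogy is meant to convey.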
R language support. The SparkR
project²¹ was merged into Spark in
2015 to provide a programming inter-
face in R. The R interface is based on
DataFrames and uses almost identical
syntax to R’s built-in data frames. Oth-
er Spark libraries (such as MLlib) are
also easy to call from R, because they
accept DataFrames as input.
Research libraries. Apache Spark
continues to be used to build higher-
d One reason optimization is possible is that
Spark’s DataFrame API uses lazy evaluation
where the content of a DataFrame is not com-
puted until the user asks to write it out. The
data frame APIs in R and Python are eager, pre-
venting optimizations like operator reordering.
e https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
f http://sortbenchmark.org/ApacheSpark2014.pdf
g https://databricks.com/blog/2015/04/28/
ties to control this placement; the in-
terface lets applications place com-
putations near input data (through
an API for “preferred locations” for
input sources²⁵), and RDDs provide
control over data partitioning and co-
location (such as specifying that data
be hashed by a given key). Libraries
(such as GraphX) can thus implement
the same placement strategies used in
specialized systems.⁶
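The benefit of controlling partitioning can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Spark's partitioner: the helper names `partition_of`, `hash_partition`, and `local_join` are hypothetical, and CRC32 is used only because it is stable across runs. When two datasets are hashed by the same key into the same number of partitions, matching keys land in the same partition index, so a join can proceed partition by partition without moving data.

```python
from collections import defaultdict
from zlib import crc32


def partition_of(key: str, num_partitions: int) -> int:
    # crc32 is stable across runs, unlike Python's salted built-in hash() for str.
    return crc32(key.encode("utf-8")) % num_partitions


def hash_partition(pairs, num_partitions):
    """Group (key, value) pairs by the partition their key hashes to."""
    parts = defaultdict(list)
    for key, value in pairs:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def local_join(left_parts, right_parts, num_partitions):
    """Join partition i of one dataset only with partition i of the other."""
    out = []
    for p in range(num_partitions):
        right_index = defaultdict(list)
        for k, v in right_parts.get(p, []):
            right_index[k].append(v)
        for k, v in left_parts.get(p, []):
            for w in right_index.get(k, []):
                out.append((k, (v, w)))
    return out


ranks = [("a", 1.0), ("b", 0.5), ("c", 0.2)]
links = [("a", "b"), ("b", "c"), ("a", "c")]
joined = local_join(hash_partition(ranks, 4), hash_partition(links, 4), 4)
```

Every join result is produced from a single partition index, which is what lets a library keep, for example, a graph's vertex and edge data co-located across iterations.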
Beyond network and I/O bandwidth,
the most common bottleneck tends to be
CPU time, especially if data is in memo-
ry. In this case, however, Spark can run
the same algorithms and libraries used
in specialized systems on each node. For
example, it uses columnar storage and
processing in Spark SQL, native BLAS
libraries in MLlib, and so on. As we
discussed earlier, the only area where
RDDs clearly add a cost is network la-
tency, due to the synchronization at
parallel communication steps.
One final observation from a systems
perspective is that Spark may incur extra
costs over some of today’s special-
ized systems due to fault tolerance.
For example, in Spark, the map tasks
in each shuffle operation save their
output to local files on the machine
where they ran, so reduce tasks can re-
fetch it later. In addition, Spark imple-
ments a barrier at shuffle stages, so the
reduce tasks do not start until all the
maps have finished. This avoids some
of the complexity that would be needed
for fault recovery if one “pushed” re-
cords directly from maps to reduces in
a pipelined fashion. Although removing
some of these features would speed
up the system, Spark often performs
competitively despite them. The main
reason is an argument similar to our
previous one: many applications are
bound by an I/O operation (such as
shuffling data across the network or
reading it from disk) and beyond this
operation, optimizations (such as
pipelining) add only a modest benefit.
We have kept fault tolerance “on” by
default in Spark to make it easy to reason
about applications.
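The two mechanisms described above, saved map output and a barrier before the reduce stage, can be sketched as a small single-process simulation. This is a sketch under stated assumptions rather than Spark's shuffle code: the functions `run_map_tasks` and `run_reduce_task` are hypothetical, a dictionary stands in for the local files, and the byte-sum partitioning rule is an arbitrary deterministic choice.

```python
from collections import defaultdict


def run_map_tasks(input_splits, num_reducers):
    """Each map task saves its output, bucketed by reducer, before any reduce runs."""
    map_outputs = {}  # stands in for local files on the machine that ran each map task
    for task_id, split in enumerate(input_splits):
        buckets = defaultdict(list)
        for word in split:
            reducer_id = sum(word.encode("utf-8")) % num_reducers
            buckets[reducer_id].append((word, 1))
        map_outputs[task_id] = dict(buckets)
    # Returning only after every map task has finished models the barrier.
    return map_outputs


def run_reduce_task(reducer_id, map_outputs):
    """Fetch this reducer's bucket from every saved map output and sum the counts."""
    counts = defaultdict(int)
    for buckets in map_outputs.values():
        for word, n in buckets.get(reducer_id, []):
            counts[word] += n
    return dict(counts)


splits = [["spark", "hadoop", "spark"], ["hadoop", "spark"]]
outputs = run_map_tasks(splits, 2)

first_attempt = run_reduce_task(0, outputs)
retry = run_reduce_task(0, outputs)  # simulated re-execution of a lost reduce task
word_counts = {**run_reduce_task(0, outputs), **run_reduce_task(1, outputs)}
```

Because the map output is retained, the retried reduce task simply fetches the same buckets again and no map task has to be rerun. A pipelined design that pushed records directly to reducers would need extra bookkeeping to achieve the same recovery.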
Ongoing Work
Apache Spark remains a rapidly evolv-
ing project, with contributions from
both industry and research. The code-
base size has grown by a factor of six
since June 2013, with most of the activ-

level data processing libraries. Recent
projects include Thunder for neuroscience,⁵
ADAM for genomics,¹⁵ and Kira
for image processing in astronomy.²⁷
Other research libraries (such as
GraphX) have been merged into the
main codebase.
Conclusion
Scalable data processing will be es-
sential for the next generation of
computer applications but typically
involves a complex sequence of pro-
cessing steps with different com-
puting systems. To simplify this
task, the Spark project introduced
a unified programming model and
engine for big data applications. Our
experience shows such a model can
efficiently support today’s workloads
and brings substantial benefits to users.
We hope Apache Spark highlights the
importance of composability in pro-
gramming libraries for big data and
encourages development of more eas-
ily interoperable libraries.
All Apache Spark libraries described
in this article are open source at
http://spark.apache.org/. Databricks has
also made videos of all Spark Summit
conference talks available for free at
https://spark-summit.org/.
Acknowledgments
Apache Spark is the work of hun-
dreds of open source contributors
who are credited in the release notes
at https://spark.apache.org. Berkeley’s
research on Spark was supported
in part by National Science
Foundation CISE Expeditions Award
CCF-1139158, Lawrence Berkeley
National Laboratory Award 7076018,
and DARPA XData Award FA8750-
12-2-0331, and gifts from Amazon
Web Services, Google, SAP, IBM, The
Thomas and Stacey Siebel Founda-
tion, Adobe, Apple, Arimo, Blue
Goji, Bosch, C3Energy, Cisco, Cray,
Cloudera, EMC2, Ericsson, Face-
book, Guavus, Huawei, Informatica,
Intel, Microsoft, NetApp, Pivotal,
Samsung, Schlumberger, Splunk,
Virdata, and VMware.
References
1. Apache Storm project; http://storm.apache.org
2. Armbrust, M. et al. Spark SQL: Relational data
processing in Spark. In Proceedings of the ACM
SIGMOD/PODS Conference (Melbourne, Australia, May
31–June 4). ACM Press, New York, 2015.
3. Dave, A. Indexedrdd project; http://github.com/amplab/spark-indexedrdd
4. Dean, J. and Ghemawat, S. MapReduce: Simplified
data processing on large clusters. In Proceedings of
the Sixth OSDI Symposium on Operating Systems
Design and Implementation (San Francisco, CA, Dec.
6–8). USENIX Association, Berkeley, CA, 2004.
5. Freeman, J., Vladimirov, N., Kawashima, T., Mu, Y.,
Sofroniew, N.J., Bennett, D.V., Rosen, J., Yang, C.-T.,
Looger, L.L., and Ahrens, M.B. Mapping brain activity
at scale with cluster computing. Nature Methods 11, 9
(Sept. 2014), 941–950.
6. Gonzalez, J.E. et al. GraphX: Graph processing in a
distributed dataflow framework. In Proceedings of the
11th OSDI Symposium on Operating Systems Design
and Implementation (Broomfield, CO, Oct. 6–8).
USENIX Association, Berkeley, CA, 2014.
7. Isard, M. et al. Dryad: Distributed data-parallel
programs from sequential building blocks. In
Proceedings of the EuroSys Conference (Lisbon,
Portugal, Mar. 21–23). ACM Press, New York, 2007.
8. Karloff, H., Suri, S., and Vassilvitskii, S. A model
of computation for MapReduce. In Proceedings
of the ACM-SIAM SODA Symposium on Discrete
Algorithms (Austin, TX, Jan. 17–19). ACM Press,
New York, 2010.
9. Kornacker, M. et al. Impala: A modern, open-source
SQL engine for Hadoop. In Proceedings of the Seventh
Biennial CIDR Conference on Innovative Data
Systems Research (Asilomar, CA, Jan. 4–7, 2015).
10. Low, Y. et al. Distributed GraphLab: A framework
for machine learning and data mining in the cloud.
In Proceedings of the 38th International VLDB
Conference on Very Large Databases (Istanbul,
Turkey, Aug. 27–31, 2012).
11. Malewicz, G. et al. Pregel: A system for large-scale
graph processing. In Proceedings of the ACM
SIGMOD/PODS Conference (Indianapolis, IN, June
6–11). ACM Press, New York, 2010.
12. McSherry, F., Isard, M., and Murray, D.G. Scalability!
But at what COST? In Proceedings of the 15th
HotOS Workshop on Hot Topics in Operating Systems
(Kartause Ittingen, Switzerland, May 18–20). USENIX
Association, Berkeley, CA, 2015.
13. Melnik, S. et al. Dremel: Interactive analysis of
Web-scale datasets. Proceedings of the VLDB Endowment 3
(Sept. 2010), 330–339.
14. Meng, X., Bradley, J.K., Yavuz, B., Sparks, E.R.,
Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B.,
Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J.,
Zadeh, R., Zaharia, M., and Talwalkar, A. MLlib:
Machine learning in Apache Spark. Journal of Machine
Learning Research 17, 34 (2016), 1–7.
15. Nothaft, F.A., Massie, M., Danford, T., Zhang, Z.,
Laserson, U., Yeksigian, C., Kottalam, J., Ahuja, A.,
Hammerbacher, J., Linderman, M., Franklin, M.J.,
Joseph, A.D., and Patterson, D.A. Rethinking
data-intensive science using scalable analytics systems.
In Proceedings of the SIGMOD/PODS Conference
(Melbourne, Australia, May 31–June 4). ACM Press,
New York, 2015.
16. Shun, J. and Blelloch, G.E. Ligra: A lightweight
graph processing framework for shared memory.
In Proceedings of the 18th ACM SIGPLAN PPoPP
Symposium on Principles and Practice of Parallel
Programming (Shenzhen, China, Feb. 23–27). ACM
Press, New York, 2013.
17. Sparks, E.R., Talwalkar, A., Smith, V., Kottalam,
J., Pan, X., Gonzalez, J.E., Franklin, M.J., Jordan,
M.I., and Kraska, T. MLI: An API for distributed
machine learning. In Proceedings of the IEEE ICDM
International Conference on Data Mining (Dallas, TX,
Dec. 7–10). IEEE Press, 2013.
18. Stonebraker, M. and Cetintemel, U. ‘One size fits all’: An
idea whose time has come and gone. In Proceedings
of the 21st International ICDE Conference on Data
Engineering (Tokyo, Japan, Apr. 5–8). IEEE Computer
Society, Washington, D.C., 2005, 2–11.
19. Thomas, K., Grier, C., Ma, J., Paxson, V., and Song,
D. Design and evaluation of a real-time URL spam
filtering service. In Proceedings of the IEEE
Symposium on Security and Privacy (Oakland, CA, May
22–25). IEEE Press, 2011.
20. Valiant, L.G. A bridging model for parallel computation.
Commun. ACM 33, 8 (Aug. 1990), 103–111.
21. Venkataraman, S. et al. SparkR;
http://dl.acm.org/citation.cfm?id=2903740&CFID=687410325&CFTOKEN=83630888
22. Xin, R. and Zaharia, M. Lessons from running
large-scale Spark workloads; http://tinyurl.com/large-scale-spark
23. Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker,
S., and Stoica, I. Shark: SQL and rich analytics at scale.
In Proceedings of the ACM SIGMOD/PODS Conference
(New York, June 22–27). ACM Press, New York, 2013.
24. Zaharia, M. An Architecture for Fast and General Data
Processing on Large Clusters. Ph.D. thesis, Electrical
Engineering and Computer Sciences Department,
University of California, Berkeley, 2014;
https://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
25. Zaharia, M. et al. Resilient distributed datasets: A
fault-tolerant abstraction for in-memory cluster
computing. In Proceedings of the Ninth USENIX
NSDI Symposium on Networked Systems Design and
Implementation (San Jose, CA, Apr. 25–27, 2012).
26. Zaharia, M. et al. Discretized streams: Fault-tolerant
streaming computation at scale. In Proceedings of
the 24th ACM SOSP Symposium on Operating Systems
Principles (Farmington, PA, Nov. 3–6). ACM Press, New
York, 2013.
27. Zhang, Z., Barbary, K., Nothaft, N.A., Sparks, E., Zahn,
O., Franklin, M.J., Patterson, D.A., and Perlmutter, S.
Scientific Computing Meets Big Data Technology:
An Astronomy Use Case. In Proceedings of IEEE
International Conference on Big Data (Santa Clara,
CA, Oct. 29–Nov. 1). IEEE, 2015.
Matei Zaharia (matei@cs.stanford.edu) is an assistant
professor of computer science at Stanford University,
Stanford, CA, and CTO of Databricks, San Francisco, CA.
Reynold S. Xin (rxin@databricks.com) is the chief architect
on the Spark team at Databricks, San Francisco, CA.
Patrick Wendell (patrick@databricks.com) is the vice
president of engineering at Databricks, San Francisco, CA.
Tathagata Das (tdas@databricks.com) is a software
engineer at Databricks, San Francisco, CA.
Michael Armbrust (michael@databricks.com) is a
software engineer at Databricks, San Francisco, CA.
Ankur Dave (ankurd@eecs.berkeley.edu) is a graduate
student in the Real-Time, Intelligent and Secure Systems
Lab at the University of California, Berkeley.
Xiangrui Meng (meng@databricks.com) is a software
Josh Rosen (josh@databricks.com) is a software
Shivaram Venkataraman (shivaram@cs.berkeley.edu)
is a Ph.D. student in the AMPLab at the University of
California, Berkeley.
Michael Franklin (mjfranklin@uchicago.edu) is the Liew
Family Chair of Computer Science at the University of
Chicago and Director of the AMPLab at the University of
Ali Ghodsi (ali@databricks.com) is the CEO of Databricks
and adjunct faculty at the University of California,
Berkeley.
Joseph E. Gonzalez (jegonzal@cs.berkeley.edu) is an
assistant professor in EECS at the University of California,
Berkeley.
Scott Shenker (shenker@icsi.berkeley.edu) is a professor
in EECS at the University of California, Berkeley.
Ion Stoica (shenker@icsi.berkeley.edu) is a professor in
EECS and co-director of the AMPLab at the University of
Copyright held by the authors.
Publication rights licensed to ACM. $15.00
Watch the authors discuss
their work in this exclusive
Communications video.
http://cacm.acm.org/videos/spark
