RESILIENT DISTRIBUTED DATASETS: A
FAULT-TOLERANT ABSTRACTION FOR
IN-MEMORY CLUSTER COMPUTING
MATEI ZAHARIA, MOSHARAF CHOWDHURY, TATHAGATA DAS, ANKUR DAVE, JUSTIN MA, MURPHY MCCAULEY,
MICHAEL J. FRANKLIN, SCOTT SHENKER, ION STOICA.
NSDI'12 PROCEEDINGS OF THE 9TH USENIX CONFERENCE ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION
PAPERS WE LOVE AMSTERDAM
AUGUST 13, 2015
@gabriele_modena
(C) PRESENTATION BY GABRIELE MODENA, 2015
About me
• CS.ML
• Data science & predictive modelling
• with a sprinkle of systems work
• Hadoop & c. for data wrangling &
crunching numbers
• … and Spark
We present Resilient Distributed Datasets
(RDDs), a distributed memory abstraction
that lets programmers perform in-memory
computations on large clusters in a fault-
tolerant manner. RDDs are motivated by two
types of applications that current computing
frameworks handle inefficiently: iterative
algorithms and interactive data mining tools.
How
• Review (concepts from) key related work
• RDD + Spark
• Some critiques
Related work
• MapReduce
• Dryad
• Hadoop Distributed FileSystem (HDFS)
• Mesos
What’s an iterative algorithm anyway?

data = input data               # multiple input scans
w = <target vector>             # a shared data structure
for i in num_iterations:        # at each iteration, do something
    for item in data:
        update(w)               # update the shared data structure
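The pattern above can be made concrete with a small, runnable sketch: estimating the mean of a dataset by repeated gradient steps on the squared error. The names `data`, `w` and `update` mirror the pseudocode and are illustrative, not Spark API.

```python
# Illustrative iterative algorithm: every iteration scans the full input
# and updates a shared value w (here converging towards the data mean).

data = [2.0, 4.0, 6.0, 8.0]    # "input data", scanned once per iteration
w = 0.0                        # the shared state updated at every step
learning_rate = 0.05

def update(w, item):
    # one gradient step on the squared error (w - item)^2
    return w - learning_rate * 2 * (w - item)

for i in range(200):           # num_iterations
    for item in data:          # full scan of the input at each iteration
        w = update(w, item)

# w approaches the mean of the data (5.0), up to learning-rate noise
```

In a MapReduce setting each outer iteration would be a separate job re-reading the input from disk, which is exactly the inefficiency the paper targets.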
HDFS
• GFS paper (2003)
• Distributed storage (with replication)
• Block ops
• NameNode tracks file locations (blocks)
[Diagram: three DataNodes coordinated by a NameNode]
MapReduce
• Google paper (2004)
• Apache Hadoop (~2007)
• Divide-and-conquer functional model
• Goes hand in hand with HDFS
• Structure data as (key, value) pairs
1. Map(): filter and project; emit (k, v) pairs
2. Reduce(): aggregate and summarise; group by key and count
[Diagram: HDFS blocks => Map, Map, Map => Reduce, Reduce => HDFS]

Example, word count over the lines "This is a test", "Yes it is a test":
Map output: (This, 1), (is, 1), (a, 1), (test, 1), (Yes, 1), (it, 1), (is, 1), (a, 1), (test, 1)
Reduce output: (This, 1), (is, 2), (a, 2), (test, 2), (Yes, 1), (it, 1)
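The word-count flow above can be mimicked on a single machine with plain Python, to make the Map(), shuffle and Reduce() phases explicit. No Hadoop API is involved; this is only a sketch of the model.

```python
# Minimal single-machine sketch of MapReduce word count.
from itertools import groupby
from operator import itemgetter

lines = ["This is a test", "Yes it is a test"]

# Map(): project each record into (k, v) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: group pairs by key (Hadoop sorts between map and reduce)
mapped.sort(key=itemgetter(0))

# Reduce(): aggregate each group, here by summing the counts
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
```

The framework's real work is in the shuffle step, which here is a local sort but in Hadoop means moving data across the network between map and reduce tasks.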
(c) Image from Apache Tez http://tez.apache.org
Critiques of MR and HDFS
• Great when records (and jobs) are independent
• In reality expect data to be shuffled across the
network
• Latency measured in minutes
• Performance hit for iterative methods
• Composability monsters
• Meant for batch workflows
Dryad
• Microsoft paper (2007)
• Inspired Apache Tez
• Generalisation of MapReduce via I/O
pipelining
• Applications are directed acyclic graphs
of tasks
Dryad
DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
   .addVertex(summerVertex)
   .addEdge(new Edge(tokenizerVertex,
                     summerVertex,
                     edgeConf.createDefaultEdgeProperty()));
MapReduce and Dryad
SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;
(c) Image from Apache Tez http://tez.apache.org. Modified.
Critiques of Dryad
• No explicit abstraction for data sharing
• Must express data representations as a DAG
• Partial solution: DryadLINQ
• No notion of a distributed filesystem
• How to handle large inputs?
• Local writes / remote reads?
Resilient Distributed Datasets
Read-only, partitioned collection of records
=> a distributed immutable array
Accessed via coarse-grained transformations
=> apply a function (a Scala closure) to all elements of the array
[Diagram: objects laid out across partitions]
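The two properties above can be sketched in plain Python: the collection is a list of partitions, and a coarse-grained transformation ships one function to every partition, producing a new collection instead of mutating elements in place. The names here (`partitions`, `map_partitions`) are illustrative, not the Spark API.

```python
# Sketch of coarse-grained transformations on a read-only,
# partitioned collection ("RDD" model, not Spark itself).

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # a dataset with 3 partitions

def map_partitions(parts, f):
    # coarse-grained: the same function f is applied to every element of
    # every partition; individual elements are never updated in place
    return [[f(x) for x in part] for part in parts]

doubled = map_partitions(partitions, lambda x: x * 2)

# the original collection is untouched (immutability)
```

Because the transformation is a single function applied everywhere, it is cheap to log (one closure) and to replay on a lost partition, which is what makes lineage-based fault tolerance practical.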
Spark
Runtime and API
• Transformations - lazily create RDDs
  wc = dataset.flatMap(tokenize)
              .reduceByKey(add)
• Actions - execute computation
  wc.collect()
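The split between lazy transformations and eager actions can be sketched in plain Python: transformations only record what to do, and the action walks the recorded chain. The class below mirrors the flatMap/reduceByKey/collect shape of the slide but is a toy, not Spark.

```python
# Toy lazy dataset: transformations build up a pipeline description,
# collect() is the only method that actually computes anything.

class LazyDataset:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    # transformations: return a new LazyDataset, compute nothing yet
    def flat_map(self, f):
        return LazyDataset(self.data, self.ops + (("flat_map", f),))

    def reduce_by_key(self, f):
        return LazyDataset(self.data, self.ops + (("reduce_by_key", f),))

    # action: execute the recorded pipeline
    def collect(self):
        records = self.data
        for kind, f in self.ops:
            if kind == "flat_map":
                records = [y for x in records for y in f(x)]
            else:  # reduce_by_key
                acc = {}
                for k, v in records:
                    acc[k] = f(acc[k], v) if k in acc else v
                records = list(acc.items())
        return records

tokenize = lambda line: [(w, 1) for w in line.split()]
add = lambda a, b: a + b

wc = LazyDataset(["a b a"]).flat_map(tokenize).reduce_by_key(add)
# nothing has run yet; collect() triggers the computation
result = dict(wc.collect())
```

Laziness is what lets the scheduler see the whole chain of transformations at once and pipeline or reorder work before anything executes.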
Applications
• Driver code defines RDDs and invokes actions
• Submit to long-lived workers, that store partitions in memory
• Scala closures are serialised as Java objects and passed across the network over HTTP
• Variables bound to the closure are saved in the serialised object
• Closures are deserialised on each worker and applied to the RDD (partition)
• Mesos takes care of resource management
[Diagram: a Driver ships tasks to Workers; each Worker holds input data partitions in RAM and returns results]
Data persistence
1. In memory, as deserialised Java objects
2. In memory, as serialised data
3. On disk
RDD checkpointing
Memory management via an LRU eviction policy
.persist() an RDD for future reuse
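The LRU eviction policy mentioned above can be sketched with an OrderedDict standing in for the in-memory partition cache. Capacity, key names and the `PartitionCache` class are all illustrative, not Spark internals.

```python
# Sketch of LRU eviction for cached partitions.
from collections import OrderedDict

class PartitionCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)           # mark as recently used
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used

cache = PartitionCache(capacity=2)
cache.put("rdd1-part0", [1, 2])
cache.put("rdd1-part1", [3, 4])
cache.get("rdd1-part0")             # touch part0, making part1 the LRU entry
cache.put("rdd2-part0", [5, 6])     # over capacity: part1 is evicted
```

An evicted partition is not lost: it can always be recomputed from lineage, or read back from disk if persisted there.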
Lineage
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()

Lineage graph:
lines
  => filter(_.startsWith("ERROR")) => errors
  => filter(_.contains("HDFS")) => hdfs errors
  => map(_.split('\t')(3)) => time fields

Fault recovery
If a partition is lost, derive it back from the lineage.
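Fault recovery from lineage can be sketched in plain Python: each dataset records its parent and the transformation that produced it, so a lost partition is rebuilt by walking back to the source and re-applying each step. The graph, log lines and names below are illustrative, not Spark's internal classes.

```python
# Sketch of lineage-based recovery for the log-mining example above.

lines = ["ERROR\tdisk\tfull\t12:00", "INFO\tok", "ERROR\tHDFS\tdown\t13:30"]

# lineage recorded as name -> (parent, transformation)
lineage = {
    "errors":      ("lines",  lambda d: [l for l in d if l.startswith("ERROR")]),
    "hdfs_errors": ("errors", lambda d: [l for l in d if "HDFS" in l]),
    "time_fields": ("hdfs_errors", lambda d: [l.split("\t")[3] for l in d]),
}

def recompute(name):
    # walk the lineage back to the source and re-apply each transformation
    if name == "lines":
        return lines                 # stable storage (HDFS) is the root
    parent, f = lineage[name]
    return f(recompute(parent))

# if the "time_fields" partition is lost, derive it back from lineage
recovered = recompute("time_fields")
```

In real Spark a persisted intermediate (`errors.persist()`) would short-circuit this walk, so only the steps after the cached RDD are replayed.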
Representation
Challenge: track lineage across transformations.
Common RDD interface:
1. List of partitions
2. Data locality for a partition p
3. List of dependencies
4. Iterator function to compute a dataset based on its parents
5. Metadata for the partitioning scheme
Narrow dependencies
Allow pipelined execution on one cluster node.
Examples: map, filter, union
Wide dependencies
Require data from all parent partitions to be available and shuffled across the nodes with a MapReduce-like operation.
Examples: groupByKey, join with inputs not co-partitioned
Scheduling
Tasks are allocated based on data locality (delay scheduling)
1. An action is triggered => compute the RDD
2. Based on lineage, build a graph of stages to execute
3. Each stage contains as many pipelined
transformations with narrow dependencies as
possible
4. Launch tasks to compute missing partitions from
each stage until it has computed the target RDD
5. If a task fails => re-run it on another node, as long as
its stage’s parents are still available.
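Steps 2-3 above can be sketched in plain Python: walk the lineage graph back from the target RDD, pipeline narrow dependencies into the current stage, and cut a new stage at every wide dependency. This is a simplification (it cuts at every shuffle edge, so A and B land in separate stages, whereas Spark's figure draws the shuffle producer and its output in one stage); the graph and names are illustrative.

```python
# Sketch of stage construction: narrow deps are pipelined into the same
# stage, wide deps (shuffles) start a new stage.

# each RDD: list of (parent, "narrow" | "wide") dependencies
deps = {
    "A": [],
    "B": [("A", "wide")],                       # groupBy
    "C": [],
    "D": [("C", "narrow")],                     # map
    "E": [],
    "F": [("D", "narrow"), ("E", "narrow")],    # union
    "G": [("B", "wide"), ("F", "wide")],        # join
}

def build_stages(target, stages=None, stage=None):
    if stages is None:
        stage = []
        stages = [stage]
    stage.append(target)
    for parent, kind in deps[target]:
        if kind == "narrow":
            build_stages(parent, stages, stage)      # pipelined: same stage
        else:
            new_stage = []
            stages.append(new_stage)
            build_stages(parent, stages, new_stage)  # shuffle: cut a stage
    return stages

stages = build_stages("G")
# C, D, E and F are pipelined into one stage; each wide edge cuts a new one
```

Note how the whole map/union chain (C, D, E, F) collapses into a single stage: that pipelining is where Spark avoids materialising intermediate results.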
Job execution
B = A.groupBy
D = C.map
F = D.union(E)
G = B.join(F)
G.collect()

[Diagram: A => groupBy => B; C => map => D; D, E => union => F; B, F => join => G.
Stage 1 computes B from A; Stage 2 computes F from C, D and E; Stage 3 joins B and F into G.]
Evaluation
Some critiques (of the paper)
• How general is this approach?
• We are still doing MapReduce
• Concerns wrt iterative algorithms still stand
• CPU-bound workloads?
• Linear algebra?
• How much tuning is required?
• How does the partitioner work?
• What is the cost of reconstructing an RDD from lineage?
• Performance when data does not fit in memory
• E.g. a join between two very large non-co-partitioned RDDs
References (Theory)
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Zaharia et al., Proceedings of NSDI'12. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Spark: cluster computing with working sets. Zaharia et al., Proceedings of HotCloud'10. http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
The Google File System. Ghemawat, Gobioff, Leung. 19th ACM Symposium on Operating Systems Principles, 2003. http://research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters. Dean, Ghemawat. OSDI'04: Sixth Symposium on Operating System Design and Implementation. http://research.google.com/archive/mapreduce.html
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Isard, Budiu, Yu, Birrell, Fetterly. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007. http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
Mesos: a platform for fine-grained resource sharing in the data center. Hindman et al., Proceedings of NSDI'11. https://www.cs.berkeley.edu/~alig/papers/mesos.pdf
References (Practice)
• An overview of the pyspark API through pictures. https://github.com/jkthompson/pyspark-pictures
• Barry Brumitt’s presentation on MapReduce design patterns (UW CSE490). http://courses.cs.washington.edu/courses/cse490h/08au/lectures/MapReduceDesignPatterns-UW2.pdf
• The Dryad Project. http://research.microsoft.com/en-us/projects/dryad/
• Apache Spark. http://spark.apache.org
• Apache Hadoop. https://hadoop.apache.org
• Apache Tez. https://tez.apache.org
• Apache Mesos. http://mesos.apache.org
More Related Content

What's hot (20)

PDF
Resilient Distributed Datasets
Alessandro Menabò
 
PDF
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Spark Summit
 
PPTX
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Vijay Srinivas Agneeswaran, Ph.D
 
PDF
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
DOCX
Big data processing using - Hadoop Technology
Shital Kat
 
PDF
Hadoop ensma poitiers
Rim Moussa
 
PPTX
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
PDF
Python in an Evolving Enterprise System (PyData SV 2013)
PyData
 
PPTX
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Deanna Kosaraju
 
PPTX
MATLAB, netCDF, and OPeNDAP
The HDF-EOS Tools and Information Center
 
PDF
A sql implementation on the map reduce framework
eldariof
 
PDF
Boston Spark Meetup event Slides Update
vithakur
 
PPT
BDAS RDD study report v1.2
Stefanie Zhao
 
PPTX
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
Xiao Qin
 
PPTX
Spark 计算模型
wang xing
 
PDF
Large Scale Math with Hadoop MapReduce
Hortonworks
 
PDF
Hot-Spot analysis Using Apache Spark framework
Supriya .
 
PPTX
A 3 dimensional data model in hbase for large time-series dataset-20120915
Dan Han
 
PPTX
Presentation sreenu dwh-services
Sreenu Musham
 
Resilient Distributed Datasets
Alessandro Menabò
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Spark Summit
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Vijay Srinivas Agneeswaran, Ph.D
 
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
Big data processing using - Hadoop Technology
Shital Kat
 
Hadoop ensma poitiers
Rim Moussa
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Python in an Evolving Enterprise System (PyData SV 2013)
PyData
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Deanna Kosaraju
 
MATLAB, netCDF, and OPeNDAP
The HDF-EOS Tools and Information Center
 
A sql implementation on the map reduce framework
eldariof
 
Boston Spark Meetup event Slides Update
vithakur
 
BDAS RDD study report v1.2
Stefanie Zhao
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
Xiao Qin
 
Spark 计算模型
wang xing
 
Large Scale Math with Hadoop MapReduce
Hortonworks
 
Hot-Spot analysis Using Apache Spark framework
Supriya .
 
A 3 dimensional data model in hbase for large time-series dataset-20120915
Dan Han
 
Presentation sreenu dwh-services
Sreenu Musham
 

Viewers also liked (20)

PDF
Spark in 15 min
Christophe Marchal
 
PPTX
Apache Spark Components
Girish Khanzode
 
PDF
Spark fundamentals i (bd095 en) version #1: updated: april 2015
Ashutosh Sonaliya
 
PDF
Unikernels: in search of a killer app and a killer ecosystem
rhatr
 
PDF
Type Checking Scala Spark Datasets: Dataset Transforms
John Nestor
 
PDF
New Analytics Toolbox DevNexus 2015
Robbie Strickland
 
PDF
臺灣高中數學講義 - 第一冊 - 數與式
Xuan-Chao Huang
 
PPTX
Think Like Spark: Some Spark Concepts and a Use Case
Rachel Warren
 
PDF
Apache Spark: killer or savior of Apache Hadoop?
rhatr
 
PPTX
Apache Spark Introduction @ University College London
Vitthal Gogate
 
PPTX
Think Like Spark
Alpine Data
 
PDF
Hadoop Spark Introduction-20150130
Xuan-Chao Huang
 
PDF
Hadoop to spark_v2
elephantscale
 
PDF
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Piotr Kolaczkowski
 
PDF
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
StampedeCon
 
PPTX
Intro to Spark development
Spark Summit
 
PDF
Beneath RDD in Apache Spark by Jacek Laskowski
Spark Summit
 
PPT
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
PDF
2016 Spark Summit East Keynote: Matei Zaharia
Databricks
 
PDF
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Databricks
 
Spark in 15 min
Christophe Marchal
 
Apache Spark Components
Girish Khanzode
 
Spark fundamentals i (bd095 en) version #1: updated: april 2015
Ashutosh Sonaliya
 
Unikernels: in search of a killer app and a killer ecosystem
rhatr
 
Type Checking Scala Spark Datasets: Dataset Transforms
John Nestor
 
New Analytics Toolbox DevNexus 2015
Robbie Strickland
 
臺灣高中數學講義 - 第一冊 - 數與式
Xuan-Chao Huang
 
Think Like Spark: Some Spark Concepts and a Use Case
Rachel Warren
 
Apache Spark: killer or savior of Apache Hadoop?
rhatr
 
Apache Spark Introduction @ University College London
Vitthal Gogate
 
Think Like Spark
Alpine Data
 
Hadoop Spark Introduction-20150130
Xuan-Chao Huang
 
Hadoop to spark_v2
elephantscale
 
Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra
Piotr Kolaczkowski
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
StampedeCon
 
Intro to Spark development
Spark Summit
 
Beneath RDD in Apache Spark by Jacek Laskowski
Spark Summit
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
2016 Spark Summit East Keynote: Matei Zaharia
Databricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Databricks
 
Ad

Similar to Resilient Distributed Datasets (20)

PPTX
Study Notes: Apache Spark
Gao Yunzhong
 
PDF
Introduction to Apache Spark
Vincent Poncet
 
PPT
Spark training-in-bangalore
Kelly Technologies
 
PDF
Spark cluster computing with working sets
JinxinTang
 
PPTX
Apache Spark
SugumarSarDurai
 
PPT
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
PDF
Spark
newmooxx
 
PPTX
Apache spark - History and market overview
Martin Zapletal
 
PDF
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET Journal
 
PDF
Introduction to apache spark
Muktadiur Rahman
 
PDF
Introduction to apache spark
JUGBD
 
PPTX
2016-07-21-Godil-presentation.pptx
D21CE161GOSWAMIPARTH
 
PDF
Apache Spark and DataStax Enablement
Vincent Poncet
 
PPTX
Real time hadoop + mapreduce intro
Geoff Hendrey
 
PDF
20140614 introduction to spark-ben white
Data Con LA
 
PDF
BDM25 - Spark runtime internal
David Lauzon
 
DOCX
Hadoop Seminar Report
Atul Kushwaha
 
PDF
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Trieu Nguyen
 
PDF
Introduction to Spark Training
Spark Summit
 
Study Notes: Apache Spark
Gao Yunzhong
 
Introduction to Apache Spark
Vincent Poncet
 
Spark training-in-bangalore
Kelly Technologies
 
Spark cluster computing with working sets
JinxinTang
 
Apache Spark
SugumarSarDurai
 
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Spark
newmooxx
 
Apache spark - History and market overview
Martin Zapletal
 
IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET Journal
 
Introduction to apache spark
Muktadiur Rahman
 
Introduction to apache spark
JUGBD
 
2016-07-21-Godil-presentation.pptx
D21CE161GOSWAMIPARTH
 
Apache Spark and DataStax Enablement
Vincent Poncet
 
Real time hadoop + mapreduce intro
Geoff Hendrey
 
20140614 introduction to spark-ben white
Data Con LA
 
BDM25 - Spark runtime internal
David Lauzon
 
Hadoop Seminar Report
Atul Kushwaha
 
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis
Trieu Nguyen
 
Introduction to Spark Training
Spark Summit
 
Ad

Recently uploaded (20)

PDF
Using AI/ML for Space Biology Research
VICTOR MAESTRE RAMIREZ
 
PPTX
SHREYAS25 INTERN-I,II,III PPT (1).pptx pre
swapnilherage
 
PDF
1750162332_Snapshot-of-Indias-oil-Gas-data-May-2025.pdf
sandeep718278
 
PDF
InformaticsPractices-MS - Google Docs.pdf
seshuashwin0829
 
PPTX
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
PPTX
What Is Data Integration and Transformation?
subhashenia
 
PDF
apidays Singapore 2025 - From API Intelligence to API Governance by Harsha Ch...
apidays
 
PPTX
04_Tamás Marton_Intuitech .pptx_AI_Barometer_2025
FinTech Belgium
 
PPTX
Comparative Study of ML Techniques for RealTime Credit Card Fraud Detection S...
Debolina Ghosh
 
PPTX
办理学历认证InformaticsLetter新加坡英华美学院毕业证书,Informatics成绩单
Taqyea
 
PDF
Data Science Course Certificate by Sigma Software University
Stepan Kalika
 
PDF
Business implication of Artificial Intelligence.pdf
VishalChugh12
 
PPTX
apidays Singapore 2025 - The Quest for the Greenest LLM , Jean Philippe Ehre...
apidays
 
PDF
UNISE-Operation-Procedure-InDHIS2trainng
ahmedabduselam23
 
PDF
apidays Singapore 2025 - Building a Federated Future, Alex Szomora (GSMA)
apidays
 
PDF
Research Methodology Overview Introduction
ayeshagul29594
 
PPTX
01_Nico Vincent_Sailpeak.pptx_AI_Barometer_2025
FinTech Belgium
 
PPTX
big data eco system fundamentals of data science
arivukarasi
 
PDF
Technical-Report-GPS_GIS_RS-for-MSF-finalv2.pdf
KPycho
 
PDF
A GraphRAG approach for Energy Efficiency Q&A
Marco Brambilla
 
Using AI/ML for Space Biology Research
VICTOR MAESTRE RAMIREZ
 
SHREYAS25 INTERN-I,II,III PPT (1).pptx pre
swapnilherage
 
1750162332_Snapshot-of-Indias-oil-Gas-data-May-2025.pdf
sandeep718278
 
InformaticsPractices-MS - Google Docs.pdf
seshuashwin0829
 
Aict presentation on dpplppp sjdhfh.pptx
vabaso5932
 
What Is Data Integration and Transformation?
subhashenia
 
apidays Singapore 2025 - From API Intelligence to API Governance by Harsha Ch...
apidays
 
04_Tamás Marton_Intuitech .pptx_AI_Barometer_2025
FinTech Belgium
 
Comparative Study of ML Techniques for RealTime Credit Card Fraud Detection S...
Debolina Ghosh
 
办理学历认证InformaticsLetter新加坡英华美学院毕业证书,Informatics成绩单
Taqyea
 
Data Science Course Certificate by Sigma Software University
Stepan Kalika
 
Business implication of Artificial Intelligence.pdf
VishalChugh12
 
apidays Singapore 2025 - The Quest for the Greenest LLM , Jean Philippe Ehre...
apidays
 
UNISE-Operation-Procedure-InDHIS2trainng
ahmedabduselam23
 
apidays Singapore 2025 - Building a Federated Future, Alex Szomora (GSMA)
apidays
 
Research Methodology Overview Introduction
ayeshagul29594
 
01_Nico Vincent_Sailpeak.pptx_AI_Barometer_2025
FinTech Belgium
 
big data eco system fundamentals of data science
arivukarasi
 
Technical-Report-GPS_GIS_RS-for-MSF-finalv2.pdf
KPycho
 
A GraphRAG approach for Energy Efficiency Q&A
Marco Brambilla
 

Resilient Distributed Datasets

  • 1. RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING MATEI ZAHARIA, MOSHARAF CHOWDHURY, TATHAGATA DAS, ANKUR DAVE, JUSTIN MA, MURPHY MCCAULEY, MICHAEL J. FRANKLIN, SCOTT SHENKER, ION STOICA. NSDI'12 PROCEEDINGS OF THE 9TH USENIX CONFERENCE ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION PAPERS WE LOVE AMSTERDAM AUGUST 13, 2015 @gabriele_modena
  • 2. (C) PRESENTATION BY GABRIELE MODENA, 2015 About me • CS.ML • Data science & predictive modelling • with a sprinkle of systems work • Hadoop & c. for data wrangling & crunching numbers • … and Spark
  • 3. (C) PRESENTATION BY GABRIELE MODENA, 2015
  • 4. (C) PRESENTATION BY GABRIELE MODENA, 2015 We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault- tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.
  • 5. (C) PRESENTATION BY GABRIELE MODENA, 2015 How • Review (concepts from) key related work • RDD + Spark • Some critiques
  • 6. (C) PRESENTATION BY GABRIELE MODENA, 2015 Related work • MapReduce • Dryad • Hadoop Distributed FileSystem (HDFS) • Mesos
  • 7. (C) PRESENTATION BY GABRIELE MODENA, 2015 What’s an iterative algorithm anyway? data = input data w = <target vector> for i in num_iterations: for item in data: update(w) Multiple input scans At each iteration, do something Update a shared data structure
  • 8. (C) PRESENTATION BY GABRIELE MODENA, 2015 HDFS • GFS paper (2003) • Distributed storage (with replication) • Block ops • NameNode hashes file locations (blocks) Data Node Data Node Data Node Name Node
  • 9. (C) PRESENTATION BY GABRIELE MODENA, 2015 HDFS • GFS paper (2003) • Distributed storage (with replication) • Block ops • NameNode hashes file locations (blocks) Data Node Data Node Data Node Name Node
  • 10. (C) PRESENTATION BY GABRIELE MODENA, 2015 HDFS • GFS paper (2003) • Distributed storage (with replication) • Block ops • NameNode hashes file locations (blocks) Data Node Data Node Data Node Name Node
  • 11. (C) PRESENTATION BY GABRIELE MODENA, 2015 MapReduce • Google paper (2004) • Apache Hadoop (~2007) • Divide and conquer functional model • Goes hand-in-hand with HDFS • Structure data as (key, value) 1. Map(): filter and project emit (k, v) pairs 2. Reduce(): aggregate and summarise group by key and count Map Map Map Reduce Reduce HDFS (blocks) HDFS
  • 12. (C) PRESENTATION BY GABRIELE MODENA, 2015 MapReduce • Google paper (2004) • Apache Hadoop (~2007) • Divide and conquer functional model • Goes hand-in-hand with HDFS • Structure data as (key, value) 1. Map(): filter and project emit (k, v) pairs 2. Reduce(): aggregate and summarise group by key and count Map Map Map Reduce Reduce HDFS (blocks) HDFS This is a test Yes it is a test …
  • 13. (C) PRESENTATION BY GABRIELE MODENA, 2015 MapReduce • Google paper (2004) • Apache Hadoop (~2007) • Divide and conquer functional model • Goes hand-in-hand with HDFS • Structure data as (key, value) 1. Map(): filter and project emit (k, v) pairs 2. Reduce(): aggregate and summarise group by key and count Map Map Map Reduce Reduce HDFS (blocks) HDFS This is a test Yes it is a test … (This,1), (is, 1), (a, 1), (test., 1), (Yes, 1), (it, 1), (is, 1)
  • 14. (C) PRESENTATION BY GABRIELE MODENA, 2015 MapReduce • Google paper (2004) • Apache Hadoop (~2007) • Divide and conquer functional model • Goes hand-in-hand with HDFS • Structure data as (key, value) 1. Map(): filter and project emit (k, v) pairs 2. Reduce(): aggregate and summarise group by key and count Map Map Map Reduce Reduce HDFS (blocks) HDFS This is a test Yes it is a test … (This,1), (is, 1), (a, 1), (test., 1), (Yes, 1), (it, 1), (is, 1) (This, 1), (is, 2), (a, 2), (test, 2), (Yes, 1), (it, 1)
  • 15. (C) PRESENTATION BY GABRIELE MODENA, 2015 (c) Image from Apache Tez http://tez.apache.org
  • 16. (C) PRESENTATION BY GABRIELE MODENA, 2015 Critiques of MR and HDFS • Great when records (and jobs) are independent • In reality, expect data to be shuffled across the network • Latency measured in minutes • Performance hit for iterative methods • Composability monsters • Meant for batch workflows
  • 17. (C) PRESENTATION BY GABRIELE MODENA, 2015 Dryad • Microsoft paper (2007) • Inspired Apache Tez • Generalisation of MapReduce via I/O pipelining • Applications are (directed acyclic) graphs of tasks
  • 18. (C) PRESENTATION BY GABRIELE MODENA, 2015 Dryad DAG

DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
   .addVertex(summerVertex)
   .addEdge(new Edge(tokenizerVertex,
                     summerVertex,
                     edgeConf.createDefaultEdgeProperty()));
  • 19. (C) PRESENTATION BY GABRIELE MODENA, 2015 MapReduce and Dryad SELECT a.country, COUNT(b.place_id) FROM place a JOIN tweets b ON (a.place_id = b.place_id) GROUP BY a.country; (c) Image from Apache Tez http://tez.apache.org. Modified.
  • 20. (C) PRESENTATION BY GABRIELE MODENA, 2015 Critiques of Dryad • No explicit abstraction for data sharing • Must express the computation as a DAG • Partial solution: DryadLINQ • No notion of a distributed filesystem • How to handle large inputs? • Local writes / remote reads?
  • 21. (C) PRESENTATION BY GABRIELE MODENA, 2015 Resilient Distributed Datasets
Read-only, partitioned collection of records
=> a distributed immutable array
accessed via coarse-grained transformations
=> apply a function (Scala closure) to all elements of the array
[diagram: objects grouped into partitions]
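The "distributed immutable array" idea can be sketched in a few lines of plain Python (a hypothetical mini-RDD, not Spark's API): a read-only list of partitions, where a coarse-grained transformation applies one closure to every element of every partition and yields a new dataset instead of mutating the old one.

```python
class MiniRDD:
    """Toy stand-in for an RDD: an immutable, partitioned collection."""

    def __init__(self, partitions):
        # Tuples make the partitions effectively read-only.
        self.partitions = [tuple(p) for p in partitions]

    def map(self, f):
        # Coarse-grained transformation: one function over ALL elements,
        # returning a NEW dataset; the original is never mutated.
        return MiniRDD([[f(x) for x in p] for p in self.partitions])

    def collect(self):
        return [x for p in self.partitions for x in p]

rdd = MiniRDD([[1, 2], [3, 4], [5, 6]])
doubled = rdd.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8, 10, 12]
print(rdd.collect())      # original unchanged: [1, 2, 3, 4, 5, 6]
```

Coarse granularity is what makes lineage cheap to log: one record per transformation, not one per element.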
  • 23. (C) PRESENTATION BY GABRIELE MODENA, 2015 Spark Runtime and API
• Transformations - lazily create RDDs
wc = dataset.flatMap(tokenize)
            .reduceByKey(add)
• Actions - execute computation
wc.collect()
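The transformation/action split can be illustrated with a minimal lazy pipeline in plain Python (a hypothetical sketch, not Spark's implementation): transformations only record what to do; nothing runs until an action forces the chain.

```python
class LazyDataset:
    """Toy lazy pipeline: ops are recorded, not executed."""

    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def flat_map(self, f):
        # Transformation: just remember f; no data is touched yet.
        return LazyDataset(self.data, self.ops + (("flat_map", f),))

    def collect(self):
        # Action: now actually run the recorded chain.
        out = list(self.data)
        for kind, f in self.ops:
            out = [y for x in out for y in f(x)]
        return out

ds = LazyDataset(["a b", "b c"]).flat_map(str.split)
# No work has happened yet; the action triggers it:
print(ds.collect())  # ['a', 'b', 'b', 'c']
```

Laziness lets the scheduler see the whole chain before running it, which is what enables pipelining narrow dependencies into a single stage.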
  • 30. (C) PRESENTATION BY GABRIELE MODENA, 2015 Applications • Driver code defines RDDs and invokes actions • Submit to long-lived workers, that store partitions in memory • Scala closures are serialised as Java objects and passed across the network over HTTP • Variables bound to the closure are saved in the serialised object • Closures are deserialised on each worker and applied to the RDD (partition) • Mesos takes care of resource management [diagram: driver sending tasks to workers that hold input-data partitions in RAM and return results]
  • 31. (C) PRESENTATION BY GABRIELE MODENA, 2015 Data persistence 1. in memory as deserialized Java objects 2. in memory as serialized data 3. on disk • RDD checkpointing • Memory management via LRU eviction policy • .persist() an RDD for future reuse
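The LRU eviction policy for cached partitions can be sketched with a small cache in plain Python (a hypothetical model; capacity is counted in partitions here for simplicity, not bytes):

```python
from collections import OrderedDict

class PartitionCache:
    """Toy LRU cache: the least recently used partition is evicted
    when the cache is full; a miss means recomputing from lineage."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # partition_id -> data

    def get(self, pid):
        if pid in self.cache:
            self.cache.move_to_end(pid)  # mark as recently used
            return self.cache[pid]
        return None  # miss: caller recomputes from lineage

    def put(self, pid, data):
        self.cache[pid] = data
        self.cache.move_to_end(pid)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

cache = PartitionCache(capacity=2)
cache.put("p0", [1]); cache.put("p1", [2])
cache.get("p0")           # touch p0, so p1 becomes the LRU entry
cache.put("p2", [3])      # evicts p1
print(list(cache.cache))  # ['p0', 'p2']
```

An eviction is not a failure: the dropped partition can always be rebuilt from its lineage on the next access.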
  • 32. (C) PRESENTATION BY GABRIELE MODENA, 2015 Lineage
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
  • 39. (C) PRESENTATION BY GABRIELE MODENA, 2015 Lineage
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
[lineage graph: lines -> filter(_.startsWith("ERROR")) -> errors -> filter(_.contains("HDFS")) -> hdfs errors -> map(_.split('\t')(3)) -> time fields]
  • 42. (C) PRESENTATION BY GABRIELE MODENA, 2015 Lineage Fault recovery: if a partition is lost, derive it back from the lineage
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
[lineage graph: lines -> errors -> hdfs errors -> time fields]
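Lineage-based recovery can be sketched in plain Python (a hypothetical model, not Spark's internals): each dataset remembers its parent and the closure that produced it, so a lost partition is recomputed rather than restored from a replica.

```python
class LineageRDD:
    """Toy RDD that can rebuild a lost partition from its lineage."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.partitions = partitions  # None => not materialized
        self.parent, self.fn = parent, fn

    def map(self, fn):
        # Record the lineage edge; compute nothing yet.
        return LineageRDD(parent=self, fn=fn)

    def compute(self, i):
        if self.partitions is not None and self.partitions[i] is not None:
            return self.partitions[i]
        # Partition lost (or never built): re-derive it from the parent.
        return [self.fn(x) for x in self.parent.compute(i)]

base = LineageRDD(partitions=[[1, 2], [3, 4]])
derived = base.map(lambda x: x * 10)
derived.partitions = [None, None]  # simulate losing both partitions
print(derived.compute(1))          # recomputed from lineage: [30, 40]
```

Only the lost partition is recomputed, and independent partitions can be rebuilt in parallel on different nodes.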
  • 43. (C) PRESENTATION BY GABRIELE MODENA, 2015 Representation Challenge: track lineage across transformations 1. Set of partitions 2. Data locality for partition p 3. List of dependencies on parent RDDs 4. Iterator function to compute a dataset based on its parents 5. Metadata for the partitioning scheme
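The five-part interface above can be written down as an abstract class (a sketch in Python with illustrative method names, not Spark's internal Scala API):

```python
class AbstractRDD:
    """Sketch of the common interface every RDD exposes to the scheduler."""

    def partitions(self):
        """1. The set of partitions (atomic pieces of the dataset)."""
        raise NotImplementedError

    def preferred_locations(self, p):
        """2. Nodes where partition p can be accessed faster (data locality)."""
        return []

    def dependencies(self):
        """3. The list of dependencies on parent RDDs."""
        return []

    def iterator(self, p, parent_iters):
        """4. Compute the elements of partition p given its parents."""
        raise NotImplementedError

    def partitioner(self):
        """5. Metadata on the partitioning scheme (e.g. hash-partitioned)."""
        return None
```

The scheduler never needs to know what a concrete RDD does; these five methods are enough to place tasks, pipeline narrow dependencies, and rebuild lost partitions.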
  • 44. (C) PRESENTATION BY GABRIELE MODENA, 2015 Narrow dependencies: pipelined execution on one cluster node (e.g. map, filter, union)
  • 45. (C) PRESENTATION BY GABRIELE MODENA, 2015 Wide dependencies: require data from all parent partitions to be available and shuffled across the nodes using a MapReduce-like operation (e.g. groupByKey, join with inputs not co-partitioned)
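The shuffle behind a wide dependency can be sketched in plain Python (a toy model, not Spark's shuffle machinery): every input partition contributes records to every output partition, routed by a hash partitioner on the key.

```python
def shuffle_group_by_key(input_partitions, num_output_partitions):
    """Toy groupByKey: route each (key, value) pair to the output
    partition chosen by hashing the key, grouping values per key."""
    outputs = [{} for _ in range(num_output_partitions)]
    for part in input_partitions:  # data from ALL parents is needed
        for key, value in part:
            bucket = outputs[hash(key) % num_output_partitions]
            bucket.setdefault(key, []).append(value)
    return outputs

parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
grouped = shuffle_group_by_key(parts, 2)
# All values for "a" end up together in exactly one output partition.
merged = {}
for d in grouped:
    merged.update(d)
print(merged)  # {'a': [1, 3], 'b': [2], 'c': [4]} (split across 2 partitions)
```

Because every output partition may need records from every input partition, a wide dependency forces a stage boundary in the scheduler.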
  • 46. (C) PRESENTATION BY GABRIELE MODENA, 2015 Scheduling Tasks are allocated based on data locality (delay scheduling) 1. An action is triggered => compute the RDD 2. Based on lineage, build a graph of stages to execute 3. Each stage contains as many pipelined transformations with narrow dependencies as possible 4. Launch tasks to compute missing partitions from each stage until the target RDD is computed 5. If a task fails => re-run it on another node, as long as its stage's parents are still available
  • 48. (C) PRESENTATION BY GABRIELE MODENA, 2015 Job execution union map groupBy join B C D E F G Stage 3Stage 2 A Stage 1 B = A.groupBy D = C.map F = D.union(E) G = B.join(F) G.collect()
  • 59. (C) PRESENTATION BY GABRIELE MODENA, 2015 Job execution map C union D E join B F G Stage 3Stage 2 groupBy A Stage 1 B = A.groupBy D = C.map F = D.union(E) G = B.join(F) G.collect()
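Stage construction on this DAG can be sketched in plain Python (a simplified model, not Spark's DAGScheduler: walk the lineage graph backwards from the target RDD and cut a new stage at every wide/shuffle edge, keeping narrow dependencies pipelined in one stage; the grouping of shuffle outputs differs slightly from how the figure draws its stage boxes):

```python
def build_stages(rdd_deps, wide, target):
    """rdd_deps: rdd -> list of parent rdds; wide: set of (parent, child)
    edges that require a shuffle. Yields stages, innermost first."""
    stage, frontier = [], [target]
    while frontier:
        rdd = frontier.pop()
        stage.append(rdd)
        for parent in rdd_deps.get(rdd, []):
            if (parent, rdd) in wide:
                # Shuffle boundary: the parent belongs to an earlier stage.
                yield from build_stages(rdd_deps, wide, parent)
            else:
                # Narrow dependency: pipeline the parent into this stage.
                frontier.append(parent)
    yield stage

# The DAG from the slides: groupBy and join are the wide (shuffle) edges.
deps = {"B": ["A"], "D": ["C"], "F": ["D", "E"], "G": ["B", "F"]}
wide = {("A", "B"), ("B", "G"), ("F", "G")}
stages = list(build_stages(deps, wide, "G"))
print(stages)  # [['A'], ['B'], ['F', 'E', 'D', 'C'], ['G']]
```

Only stages whose output partitions are missing actually launch tasks; if B were already persisted in memory, its sub-graph would be skipped entirely.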
  • 60. (C) PRESENTATION BY GABRIELE MODENA, 2015 Evaluation
  • 61. (C) PRESENTATION BY GABRIELE MODENA, 2015 Some critiques (of the paper) • How general is this approach? • We are still doing MapReduce • Concerns wrt iterative algorithms still stand • CPU-bound workloads? • Linear algebra? • How much tuning is required? • How does the partitioner work? • What is the cost of reconstructing an RDD from lineage? • Performance when data does not fit in memory • E.g. a join between two very large non-co-partitioned RDDs
  • 62. (C) PRESENTATION BY GABRIELE MODENA, 2015 References (Theory)
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Zaharia et al., Proceedings of NSDI '12. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Spark: Cluster Computing with Working Sets. Zaharia et al., Proceedings of HotCloud '10. http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
The Google File System. Ghemawat, Gobioff, Leung, 19th ACM Symposium on Operating Systems Principles, 2003. http://research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters. Dean, Ghemawat, OSDI '04: Sixth Symposium on Operating System Design and Implementation. http://research.google.com/archive/mapreduce.html
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Isard, Budiu, Yu, Birrell, Fetterly, European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007. http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. Hindman et al., Proceedings of NSDI '11. https://www.cs.berkeley.edu/~alig/papers/mesos.pdf
  • 63. (C) PRESENTATION BY GABRIELE MODENA, 2015 References (Practice) • An overview of the pyspark API through pictures https://github.com/jkthompson/ pyspark-pictures • Barry Brumitt’s presentation on MapReduce design patterns (UW CSE490) http://courses.cs.washington.edu/courses/cse490h/08au/lectures/ MapReduceDesignPatterns-UW2.pdf • The Dryad Project http://research.microsoft.com/en-us/projects/dryad/ • Apache Spark http://spark.apache.org • Apache Hadoop https://hadoop.apache.org • Apache Tez https://tez.apache.org • Apache Mesos http://mesos.apache.org