Apache Spark
Keys Botzum
Senior Principal Technologist, MapR Technologies
June 2014
Agenda
•  MapReduce
•  Apache Spark
•  How Spark Works
•  Fault Tolerance and Performance
•  Examples
•  Spark and More
MapR: Best Product, Best Business & Best Customers
•  Top ranked; exponential growth; 500+ customers; cloud leaders
•  3X bookings Q1 ‘13 – Q1 ‘14
•  80% of accounts expand 3X
•  90% software licenses
•  <1% lifetime churn
•  >$1B in incremental revenue generated by 1 customer
Review: MapReduce
MapReduce: A Programming Model
•  MapReduce: Simplified Data Processing on Large Clusters (published 2004)
•  Parallel and distributed algorithm:
–  Data Locality
–  Fault Tolerance
–  Linear Scalability
MapReduce Basics
•  Assumes a scalable distributed file system that shards data
•  Map
–  Loading of the data and defining a set of keys
•  Reduce
–  Collects the organized key-based data to process and output
•  Performance can be tweaked based on known details of your source files and cluster shape (size, total number)
MapReduce Processing Model
•  Define mappers
•  Shuffling is automatic
•  Define reducers
•  For complex work, chain jobs together
MapReduce: The Good
•  Built-in fault tolerance
•  Optimized IO path
•  Scalable
•  Developer focuses on Map/Reduce, not infrastructure
•  Simple(?) API
MapReduce: The Bad
•  Optimized for disk IO
–  Doesn’t leverage memory well
–  Iterative algorithms go through the disk IO path again and again
•  Primitive API
–  Developers have to build on a very simple abstraction
–  Key/Value in/out
–  Even basic things like join require extensive code (contrast the one-line Spark join sketched below)
•  Result is often many files that need to be combined appropriately
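
For contrast, a join over two datasets in Spark’s RDD API is a single call. A minimal sketch in Scala, with invented data and assuming the spark-shell’s SparkContext sc:

import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 0.9/1.0 era)

// Hypothetical datasets: (userId, name) and (userId, purchaseAmount)
val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val purchases = sc.parallelize(Seq((1, 9.99), (1, 4.50), (2, 12.00)))

// Inner join on the key – one line, no custom plumbing
val joined = users.join(purchases)   // RDD[(Int, (String, Double))]
joined.collect().foreach(println)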
Apache Spark
•  spark.apache.org
•  github.com/apache/spark
•  user@spark.apache.org
•  Originally developed in 2009 in UC Berkeley’s AMP Lab
•  Fully open sourced in 2010 – now at the Apache Software Foundation
•  Commercial vendor developing/supporting
Spark: Easy and Fast Big Data
•  Easy to Develop
–  Rich APIs in Java, Scala, Python
–  Interactive shell
–  2-5× less code
•  Fast to Run
–  General execution graphs
–  In-memory storage
Resilient Distributed Datasets (RDD)
•  Spark revolves around RDDs
•  Fault-tolerant, read-only collection of elements that can be operated on in parallel
•  Cached in memory or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
RDD Operations - Expressive
•  Transformations
–  Create a new RDD from an existing one
•  map, filter, distinct, union, sample, groupByKey, join, reduce, etc.
•  Actions
–  Return a value after running a computation
•  collect, count, first, takeSample, foreach, etc.
Check the documentation for a complete list:
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
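
To see the transformation/action split in practice, a minimal sketch assuming the spark-shell’s SparkContext sc:

// Transformations are lazy: building them runs nothing yet
val nums = sc.parallelize(1 to 1000)   // create an RDD
val evens = nums.filter(_ % 2 == 0)    // transformation
val doubled = evens.map(_ * 2)         // transformation

// An action triggers execution of the whole chain and returns a value
val total = doubled.reduce(_ + _)      // action
println(total)                         // 501000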
Easy: Clean API
•  Resilient Distributed Datasets
–  Collections of objects spread across a cluster, stored in RAM or on disk
–  Built through parallel transformations
–  Automatically rebuilt on failure
•  Operations
–  Transformations (e.g. map, filter, groupBy)
–  Actions (e.g. count, collect, save)
Write programs in terms of transformations on distributed datasets
Easy: Expressive API
•  map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
•  reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
•  sample, take, first, partitionBy, mapWith, pipe, save, ...
Easy: Example – Word Count
•  Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
•  Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Easy: Works Well With Hadoop
•  Data Compatibility
–  Access your existing Hadoop data
–  Use the same data formats
–  Adheres to data locality for efficient processing
•  Deployment Models
–  “Standalone” deployment
–  YARN-based deployment
–  Mesos-based deployment
–  Deploy on an existing Hadoop cluster or side-by-side
Easy: User-Driven Roadmap
•  Language support
–  Improved Python support
–  SparkR
–  Java 8
–  Integrated schema and SQL support in Spark’s APIs
•  Better ML
–  Sparse data support
–  Model evaluation framework
–  Performance testing
Example: Logistic Regression
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w
Fast: Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]
•  Hadoop: ~110 s per iteration
•  Spark: ~80 s for the first iteration, ~1 s per further iteration
Easy: Multi-language Support
Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("error");
  }
}).count();
Easy: Interactive Shell
Scala-based shell:

% /opt/mapr/spark/spark-0.9.1/bin/spark-shell
scala> val logs = sc.textFile("hdfs:///user/keys/logdata")
scala> logs.count()
...
res0: Long = 232681
scala> logs.filter(l => l.contains("ERROR")).count()
...
res1: Long = 205

Python-based shell as well – pyspark
Fault Tolerance and Performance
Fast: Using RAM, Operator Graphs
•  In-memory Caching
–  Data partitions read from RAM instead of disk
•  Operator Graphs
•  Scheduling Optimizations
•  Fault Tolerance
[Diagram: an operator DAG (map, join, filter, groupBy) split into scheduling stages, with cached partitions marked]
Directed Acyclic Graph (DAG)
•  Directed
–  Only in a single direction
•  Acyclic
–  No looping
•  This supports fault tolerance (lineage can be replayed – see the sketch below)
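
To make the lineage DAG concrete: an RDD can print its own graph via toDebugString (part of the RDD API; output format varies by version). A sketch with a placeholder path:

val lines = sc.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.contains("ERROR"))   // adds a filter node
val pairs = errors.map(s => (s, 1))              // adds a map node

// Print the DAG Spark records – the same graph it replays on failure
println(pairs.toDebugString)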
Easy: Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data:

msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])

[Diagram: HDFS File → Filtered RDD (filter, func = startswith(...)) → Mapped RDD (map, func = split(...))]
RDD Persistence / Caching
•  Variety of storage levels
–  memory_only (default), memory_and_disk, etc.
•  API calls
–  persist(StorageLevel)
–  cache() – shorthand for persist(StorageLevel.MEMORY_ONLY)
•  Considerations
–  Read from disk vs. recompute (memory_and_disk)
–  Total memory storage size (memory_only_ser)
–  Replicate to a second node for faster fault recovery (memory_only_2)
•  Think about this option if supporting a time-sensitive client
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
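
A minimal sketch of these calls, assuming the Spark 0.9/1.0 Scala API and a placeholder path:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://...")

logs.persist(StorageLevel.MEMORY_ONLY)         // same as logs.cache()
// logs.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions that don't fit in RAM
// logs.persist(StorageLevel.MEMORY_ONLY_2)    // replicate to a second node

logs.count()   // the first action materializes and caches the RDD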
PageRank Performance
[Chart: iteration time (s) vs. number of machines]
•  Hadoop: 171 s on 30 machines, 80 s on 60 machines
•  Spark: 23 s on 30 machines, 14 s on 60 machines
Other Iterative Algorithms
[Chart: time per iteration (s)]
•  Logistic Regression: Spark 0.96 s vs. Hadoop 110 s
•  K-Means Clustering: Spark 4.1 s vs. Hadoop 155 s
Fast: Scaling Down
[Chart: execution time (s) vs. % of working set in cache]
•  Cache disabled: 69 s; 25%: 58 s; 50%: 41 s; 75%: 30 s; fully cached: 12 s
Comparison to Storm
•  Higher throughput than Storm
–  Spark Streaming: 670k records/sec/node
–  Storm: 115k records/sec/node
–  Commercial systems: 100-500k records/sec/node
[Charts: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for WordCount and Grep – Spark outperforms Storm in both]
How Spark Works
Working With RDDs
[Diagram: a chain of RDDs built through transformations, ending with an action that returns a value]

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()
74

linesWithSpark.first()
# Apache Spark
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()           # action

The first action drives execution across the driver and workers:
1.  The driver breaks the job into tasks and sends them to the workers.
2.  Each worker reads its HDFS block (Block 1, 2, 3).
3.  Each worker processes its data and caches it (Cache 1, 2, 3).
4.  The workers return results to the driver.

A second query runs entirely against the cache:

messages.filter(lambda s: "php" in s).count()

1.  The driver sends tasks to the workers.
2.  Each worker processes straight from its cache – no HDFS read this time.
3.  The workers return results to the driver.

Cache your data → faster results. Full-text search of Wikipedia:
•  60GB on 20 EC2 machines
•  0.5 sec from cache vs. 20 s on-disk
Example: PageRank
•  Good example of a more complex algorithm
–  Multiple stages of map & reduce
•  Benefits from Spark’s in-memory caching
–  Multiple iterations over the same data
Basic Idea
Give pages ranks (scores) based on links to them:
•  Links from many pages → high rank
•  Link from a high-rank page → high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
Algorithm
1.  Start each page at a rank of 1
2.  On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3.  Set each page’s rank to 0.15 + 0.85 × contribs
[Diagram: four pages in a link graph, each starting at rank 1.0]
Repeating these steps, the ranks of the four example pages converge:
•  Start: 1.0, 1.0, 1.0, 1.0
•  After one iteration: 0.58, 1.0, 1.85, 0.58
•  After two iterations: 0.39, 1.72, 1.31, 0.58
•  . . .
•  Final state: 0.46, 1.37, 1.44, 0.73
Scala Implementation
val links = // load RDD of (url, neighbors) pairs
var ranks = // give each url rank of 1.0

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) =>
      urls.map(dest => (dest, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala
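
For illustration, the two elided loads might be built like this; the one-"url neighbor"-pair-per-line input format is an assumption, not part of the original example:

// Hypothetical input: one "url neighbor" pair per line
val links = sc.textFile("hdfs://.../links.txt")
  .map { line =>
    val parts = line.split("\\s+")
    (parts(0), parts(1))
  }
  .distinct()
  .groupByKey()   // (url, neighbors) pairs
  .cache()        // reused every iteration, so keep it in memory

var ranks = links.mapValues(_ => 1.0)   // give each url a rank of 1.0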
Spark and More
Easy: Unified Platform
Spark (general execution engine), with libraries on top:
•  Spark SQL (SQL)
•  Spark Streaming (streaming)
•  MLlib (machine learning)
•  GraphX (graph computation)
Continued innovation bringing new functionality, e.g.:
•  BlinkDB (approximate queries)
•  SparkR (R wrapper for Spark)
•  Tachyon (off-heap RDD caching)
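
As a taste of the stack, a minimal Spark Streaming word count – a sketch against the 0.9/1.0-era API, with the socket host and port invented:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(1))        // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical text source

// The familiar RDD-style API, applied to each batch
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()           // emit each batch's counts
ssc.start()              // start receiving and processing
ssc.awaitTermination()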
Spark on MapR
•  Certified Spark Distribution
•  Fully supported and packaged by MapR in partnership with Databricks
–  mapr-spark package with Spark, Shark, Spark Streaming today
–  Spark-python, GraphX and MLlib soon
•  YARN integration
–  Spark can then allocate resources from the cluster when needed
References
•  Based on slides from Pat McDonough at Databricks
•  Spark web site: http://spark.apache.org/
•  Spark on MapR:
–  http://www.mapr.com/products/apache-spark
–  http://doc.mapr.com/display/MapR/Installing+Spark+and+Shark
Q&A
Engage with us!
•  @mapr, maprtech
•  kbotzum@mapr.com
•  MapR, maprtech, mapr-technologies