Apache Spark
Fernando Rodriguez Olivera
@frodriguez
Buenos Aires, Argentina, Nov 2014
JAVACONF 2014
Fernando Rodriguez Olivera
Twitter: @frodriguez
Professor at Universidad Austral (Distributed Systems, Compiler
Design, Operating Systems, …)
Creator of mvnrepository.com
Organizer at Buenos Aires High Scalability Group, Professor at
nosqlessentials.com
Apache Spark
Apache Spark is a Fast and General Engine
for Large-Scale data processing
Support for Batch, Interactive and Stream
processing with a unified API
In-Memory computing primitives
Hadoop MR Limits
[diagram: a chain of Jobs communicating through Hadoop HDFS]
- Communication between jobs through the FS
- Fault-Tolerance (between jobs) by persistence to the FS
- Memory not managed (relies on OS caches)
MapReduce was designed for batch processing;
other workloads are compensated for with Storm, Samza, Giraph, Impala, Presto, etc.
Daytona Gray Sort 100TB Benchmark
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

                      Data Size   Time     Nodes   Cores
Hadoop MR (2013)      102.5 TB    72 min   2,100   50,400 physical
Apache Spark (2014)   100 TB      23 min   206     6,592 virtualized

3X faster using 10X fewer machines
Hadoop vs Spark for Iterative Processing
source: https://spark.apache.org/
[chart: Logistic regression running time in Hadoop and Spark]
Apache Spark
Spark SQL, Spark Streaming, MLlib and GraphX, built on top of Apache Spark (Core)
Powered by Scala and Akka
Resilient Distributed Datasets (RDD)
[diagram: an RDD of Strings with elements such as "Hello World", "A New Line", "hello", "The End", …]
Immutable Collection of Objects
Partitioned and Distributed
Stored in Memory
Partitions Recomputed on Failure
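A minimal sketch of these properties (not from the slides), assuming an existing SparkContext named sc:

val lines = sc.parallelize(Seq("Hello World", "A New Line", "hello", "The End"), 2)
println(lines.partitions.size)        // 2 partitions, spread across the cluster
// RDDs are immutable: transformations return new RDDs rather than modifying `lines`
val upper = lines.map(_.toUpperCase)  // lazily defines a new RDD derived from `lines`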
RDD Transformations and Actions
[diagram: an RDD of Strings ("Hello World", "A New Line", "hello", "The End", …) is transformed by a
compute function, e.g. one that counts chars per line, into an RDD of Ints (11, 10, 5, 7, …);
the new RDD depends on the original one; an action then reduces the RDD of Ints to a single Int N]
RDD Implementation:
- Partitions
- Compute Function
- Dependencies
- Partitioner
- Preferred Compute Location (for each partition)
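As a rough sketch of the char-count example above (assuming an RDD[String] named lines is already in scope): transformations only describe the computation, actions trigger it.

val charCounts = lines.map(_.length)  // transformation: RDD[Int], e.g. "Hello World" -> 11
val total = charCounts.count()        // action: runs the job, returns the number of elements
val sum = charCounts.reduce(_ + _)    // action: total number of chars across the RDD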
Spark API
Scala
val spark = new SparkContext()
val lines = spark.textFile("hdfs://docs/") // RDD[String]
val nonEmpty = lines.filter(l => l.nonEmpty) // RDD[String]
val count = nonEmpty.count
Java 8
JavaSparkContext spark = new JavaSparkContext();
JavaRDD<String> lines = spark.textFile("hdfs://docs/");
JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0);
long count = nonEmpty.count();
Python
spark = SparkContext()
lines = spark.textFile("hdfs://docs/")
nonEmpty = lines.filter(lambda line: len(line) > 0)
count = nonEmpty.count()
RDD Operations
Transformations: map(func), flatMap(func), filter(func), mapValues(func),
groupByKey(), reduceByKey(func), …
Actions: count(), collect(), reduce(func), take(N), takeOrdered(N), top(N), …
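An illustrative sketch exercising a few of these operations on small in-memory RDDs (not from the slides; assumes a SparkContext sc):

val nums = sc.parallelize(1 to 10)
val evens = nums.map(_ * 2).filter(_ % 4 == 0)         // transformations (lazy)
println(evens.take(3).mkString(", "))                  // action: 4, 8, 12
println(nums.reduce(_ + _))                            // action: 55

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
println(pairs.reduceByKey(_ + _).collect().toSeq)      // (a,4), (b,2) - order may vary
println(pairs.groupByKey().mapValues(_.sum).collect().toSeq)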
Text Processing Example
Top Words by Frequency
(Step by step)
Create RDD from External Data
// Step 1 - Create RDD from Hadoop Text File
val docs = spark.textFile("/docs/")
Spark can read/write from any data source supported by Hadoop
(FileSystem API, I/O Formats, Codecs): HDFS, S3, HBase, Cassandra, MongoDB, ElasticSearch, …
I/O via Hadoop is optional (e.g.: the Cassandra connector bypasses Hadoop)
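A small sketch of reading and writing through the Hadoop FileSystem layer; the paths below are hypothetical and spark is the SparkContext from Step 1:

val docs = spark.textFile("hdfs://namenode/docs/")   // RDD[String], one element per line
docs.saveAsTextFile("hdfs://namenode/docs-copy/")    // writes one part-* file per partition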
Function map
[diagram: RDD[String] ("Hello World", "A New Line", "hello", …, "The end")
.map(line => line.toLowerCase) → RDD[String] ("hello world", "a new line", "hello", …, "the end")]
.map(line => line.toLowerCase) = .map(_.toLowerCase)
// Step 2 - Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
Functions map and flatMap
[diagram: RDD[String] ("hello world", "a new line", "hello", …, "the end")
.map(_.split("\\s+")) → RDD[Array[String]] ([hello, world], [a, new, line], [hello], …, [the, end])
.flatten * → RDD[String] (hello, world, a, new, line, hello, …)]
.flatMap(line => line.split("\\s+")) combines both steps
* Note: flatten() is not available in Spark, only flatMap
// Step 3 - Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
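A quick sketch of the difference on an in-memory RDD (not from the slides; assumes sc):

val sample = sc.parallelize(Seq("hello world", "a new line", "hello"))
val arrays = sample.map(_.split("\\s+"))      // RDD[Array[String]]: one array per line
val words  = sample.flatMap(_.split("\\s+"))  // RDD[String]: the arrays flattened into words
println(words.collect().toSeq)                // hello, world, a, new, line, hello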
Key-Value Pairs
[diagram: RDD[String] (hello, a, world, new, line, hello, …)
.map(word => (word, 1)) → RDD[(String, Int)] ((hello,1), (a,1), (world,1), (new,1), (line,1), (hello,1), …)]
.map(word => Tuple2(word, 1)) = .map(word => (word, 1))
RDD[(String, Int)] = RDD[Tuple2[String, Int]] (a "Pair RDD")
// Step 4 - Map each word to a (word, 1) pair
val counts = words.map(word => (word, 1))
Shuffling
[diagram: RDD[(String, Int)] ((hello,1), (a,1), (world,1), (new,1), (line,1), (hello,1))
.groupByKey → RDD[(String, Iterable[Int])] ((world,[1]), (a,[1]), (new,[1]), (line,[1]), (hello,[1,1]))
.mapValues(_.reduce((a,b) => a+b)) → RDD[(String, Int)] ((world,1), (a,1), (new,1), (line,1), (hello,2))]
.reduceByKey((a, b) => a + b) performs the same grouping and per-key reduction in a single step
// Step 5 - Count all words
val freq = counts.reduceByKey(_ + _)
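A sketch of the two equivalent formulations above, assuming counts: RDD[(String, Int)]; reduceByKey combines values on the map side before shuffling, so it moves far less data than groupByKey:

val viaGroup  = counts.groupByKey().mapValues(_.reduce(_ + _))  // shuffles every (word, 1) pair
val viaReduce = counts.reduceByKey(_ + _)                       // combines per partition first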
Top N (Prepare data)
[diagram: RDD[(String, Int)] ((world,1), (a,1), (new,1), (line,1), (hello,2))
.map(_.swap) → RDD[(Int, String)] ((1,world), (1,a), (1,new), (1,line), (2,hello))]
// Step 6 - Swap tuples (partial code)
freq.map(_.swap)
Top N (First Attempt)
[diagram: RDD[(Int, String)] ((1,world), (1,a), (1,new), (1,line), (2,hello))
.sortByKey → RDD[(Int, String)] ((2,hello), (1,world), (1,a), (1,new), (1,line))
.take(N) → Array[(Int, String)] ((2,hello), (1,world))]
(sortByKey(false) for descending)
Top N
[diagram: RDD[(Int, String)] ((1,world), (1,a), (1,new), (1,line), (2,hello))
.top(N) → local top N per partition *, followed by a reduction → Array[(Int, String)] ((2,hello), (1,line))]
// Step 6 - Swap tuples (complete code)
val top = freq.map(_.swap).top(N)
* local top N implemented by bounded priority queues
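An alternative sketch (not shown in the slides): takeOrdered with a custom Ordering yields the same top-N result without swapping the tuples, assuming freq: RDD[(String, Int)]:

val topWords = freq.takeOrdered(N)(Ordering.by[(String, Int), Int](-_._2))  // highest counts first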
val spark = new SparkContext()
// RDD creation from external data source
val docs = spark.textFile("hdfs://docs/")
// Split lines into words
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))
// Count all words (automatic map-side combination)
val freq = counts.reduceByKey(_ + _)
// Swap tuples and get top results
val top = freq.map(_.swap).top(N)
top.foreach(println)
Top Words by Frequency (Full Code)
RDD Persistence (in-memory)
[diagram: an RDD whose partitions are kept in memory]
.cache() (memory only)
.persist() (memory only)
.persist(storageLevel) (lazy persistence & caching)
StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, …
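A minimal persistence sketch (assumes the freq RDD from the example). Persistence is lazy: the data is cached the first time an action materializes it, and later actions reuse the cached partitions:

import org.apache.spark.storage.StorageLevel

freq.persist(StorageLevel.MEMORY_AND_DISK)
val distinctWords = freq.count()                   // first action: computes and caches freq
val singletons = freq.filter(_._2 == 1).count()    // second action: served from the cache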
SchemaRDD
[diagram: a SchemaRDD is an RDD of Row objects]
RDD of Row + Column Metadata
Queries with SQL
Support for Reflection, JSON, Parquet, …
SchemaRDD
case class Word(text: String, n: Int)
val wordsFreq = freq.map {
  case (text, count) => Word(text, count)
} // RDD[Word]
wordsFreq.registerTempTable("wordsFreq")
val topWords = sql("select text, n from wordsFreq order by n desc limit 20") // RDD[Row]
topWords.collect().foreach(println)
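The snippet above assumes a Spark SQL context is already in scope; a minimal setup sketch for the Spark 1.x API used in this talk might look like this (the app name is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("TopWordsSQL"))
val sqlContext = new SQLContext(sc)
import sqlContext._   // brings sql(...) and the implicit conversions to SchemaRDD into scope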
RDD Lineage
words = sc.textFile("hdfs://large/file/")   // HadoopRDD
        .map(_.toLowerCase)                  // MappedRDD
        .flatMap(_.split(" "))               // FlatMappedRDD
nums = words.filter(_.matches("[0-9]+"))     // FilteredRDD
alpha = words.filter(_.matches("[a-z]+"))    // FilteredRDD
alpha.count()                                // Action (runs the job on the cluster)
The lineage of RDD transformations is built on the driver; an action runs the job on the cluster
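The lineage can be inspected from the driver; a small sketch assuming the alpha RDD defined above:

println(alpha.toDebugString)   // prints the chain: FilteredRDD <- FlatMappedRDD <- MappedRDD <- HadoopRDD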
Deployment with Hadoop
[diagram: HDFS and Spark co-located - the Name Node runs next to the Spark Master, and each
Data Node (holding blocks A, B, C, D of /large/file, replication factor 3) also runs a Spark Worker]
Client submits the application (mode=cluster)
The Spark Master allocates resources (cores and memory)
The Driver and the Executors run on the Workers, close to the HDFS blocks (DN + Spark)
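A configuration sketch for running against a standalone Spark Master (the URL, app name and resource numbers are hypothetical); in cluster mode these values are usually passed to spark-submit rather than hard-coded:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("TopWords")
  .setMaster("spark://spark-master:7077")
  .set("spark.executor.memory", "2g")   // memory per executor
  .set("spark.cores.max", "8")          // total cores allocated by the Master
val sc = new SparkContext(conf)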
Fernando Rodriguez Olivera
twitter: @frodriguez