Apache Spark
Fernando Rodriguez Olivera
@frodriguez
Buenos Aires, Argentina, Nov 2014
JAVACONF 2014
Fernando Rodriguez Olivera
Twitter: @frodriguez
Professor at Universidad Austral (Distributed Systems, Compiler
Design, Operating Systems, …)
Creator of mvnrepository.com
Organizer at Buenos Aires High Scalability Group, Professor at
nosqlessentials.com
Apache Spark
Apache Spark is a Fast and General Engine
for Large-Scale data processing
Support for Batch, Interactive and Stream
processing with a unified API
In-Memory computing primitives
Hadoop MR Limits
[diagram: a chain of Jobs communicating through Hadoop HDFS]
- Communication between jobs through the FS
- Fault-Tolerance (between jobs) by persistence to the FS
- Memory not managed (relies on OS caches)
MapReduce was designed for batch processing;
other workloads are compensated for with Storm, Samza, Giraph, Impala, Presto, etc.
Daytona Gray Sort 100TB Benchmark
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

                      Data Size   Time     Nodes   Cores
Hadoop MR (2013)      102.5 TB    72 min   2,100   50,400 physical
Apache Spark (2014)   100 TB      23 min   206     6,592 virtualized

3X faster using 10X fewer machines
Hadoop vs Spark for Iterative Processing
source: https://spark.apache.org/
[chart: Logistic regression running time in Hadoop and Spark]
Apache Spark
Spark SQL, Spark Streaming, MLlib and GraphX, built on top of Apache Spark (Core)
Powered by Scala and Akka
Resilient Distributed Datasets (RDD)
[diagram: an RDD of Strings with elements such as "Hello World", "A New Line", "hello", "The End", …]
Immutable Collection of Objects
Partitioned and Distributed
Stored in Memory
Partitions Recomputed on Failure
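A minimal sketch of these properties (not from the slides), assuming an existing SparkContext named sc:

val lines = sc.parallelize(Seq("Hello World", "A New Line", "hello", "The End"), 2)
println(lines.partitions.size)        // 2 partitions, spread across the cluster
// RDDs are immutable: transformations return new RDDs rather than modifying `lines`
val upper = lines.map(_.toUpperCase)  // lazily defines a new RDD derived from `lines`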
RDD Transformations and Actions
[diagram: an RDD of Strings ("Hello World", "A New Line", "hello", "The End", …) is transformed by a
compute function, e.g. one that counts chars per line, into an RDD of Ints (11, 10, 5, 7, …);
the new RDD depends on the original one; an action then reduces the RDD of Ints to a single Int N]
RDD Implementation:
- Partitions
- Compute Function
- Dependencies
- Partitioner
- Preferred Compute Location (for each partition)
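As a rough sketch of the char-count example above (assuming an RDD[String] named lines is already in scope): transformations only describe the computation, actions trigger it.

val charCounts = lines.map(_.length)  // transformation: RDD[Int], e.g. "Hello World" -> 11
val total = charCounts.count()        // action: runs the job, returns the number of elements
val sum = charCounts.reduce(_ + _)    // action: total number of chars across the RDD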
Spark API
Scala
val spark = new SparkContext()
val lines = spark.textFile("hdfs://docs/") // RDD[String]
val nonEmpty = lines.filter(l => l.nonEmpty) // RDD[String]
val count = nonEmpty.count
Java 8
JavaSparkContext spark = new JavaSparkContext();
JavaRDD<String> lines = spark.textFile("hdfs://docs/");
JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0);
long count = nonEmpty.count();
Python
spark = SparkContext()
lines = spark.textFile("hdfs://docs/")
nonEmpty = lines.filter(lambda line: len(line) > 0)
count = nonEmpty.count()
RDD Operations
Transformations: map(func), flatMap(func), filter(func), mapValues(func),
groupByKey(), reduceByKey(func), …
Actions: count(), collect(), reduce(func), take(N), takeOrdered(N), top(N), …
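An illustrative sketch exercising a few of these operations on small in-memory RDDs (not from the slides; assumes a SparkContext sc):

val nums = sc.parallelize(1 to 10)
val evens = nums.map(_ * 2).filter(_ % 4 == 0)         // transformations (lazy)
println(evens.take(3).mkString(", "))                  // action: 4, 8, 12
println(nums.reduce(_ + _))                            // action: 55

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
println(pairs.reduceByKey(_ + _).collect().toSeq)      // (a,4), (b,2) - order may vary
println(pairs.groupByKey().mapValues(_.sum).collect().toSeq)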
Text Processing Example
Top Words by Frequency
(Step by step)
Create RDD from External Data
// Step 1 - Create RDD from Hadoop Text File
val docs = spark.textFile("/docs/")
Spark can read/write from any data source supported by Hadoop
(FileSystem API, I/O Formats, Codecs): HDFS, S3, HBase, Cassandra, MongoDB, ElasticSearch, …
I/O via Hadoop is optional (e.g.: the Cassandra connector bypasses Hadoop)
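A small sketch of reading and writing through the Hadoop FileSystem layer; the paths below are hypothetical and spark is the SparkContext from Step 1:

val docs = spark.textFile("hdfs://namenode/docs/")   // RDD[String], one element per line
docs.saveAsTextFile("hdfs://namenode/docs-copy/")    // writes one part-* file per partition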
Function map
[diagram: RDD[String] ("Hello World", "A New Line", "hello", …, "The end")
.map(line => line.toLowerCase) → RDD[String] ("hello world", "a new line", "hello", …, "the end")]
.map(line => line.toLowerCase) = .map(_.toLowerCase)
// Step 2 - Convert lines to lower case
val lower = docs.map(line => line.toLowerCase)
Functions map and flatMap
[diagram: RDD[String] ("hello world", "a new line", "hello", …, "the end")
.map(_.split("\\s+")) → RDD[Array[String]] ([hello, world], [a, new, line], [hello], …, [the, end])
.flatten * → RDD[String] (hello, world, a, new, line, hello, …)]
.flatMap(line => line.split("\\s+")) combines both steps
* Note: flatten() is not available in Spark, only flatMap
// Step 3 - Split lines into words
val words = lower.flatMap(line => line.split("\\s+"))
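A quick sketch of the difference on an in-memory RDD (not from the slides; assumes sc):

val sample = sc.parallelize(Seq("hello world", "a new line", "hello"))
val arrays = sample.map(_.split("\\s+"))      // RDD[Array[String]]: one array per line
val words  = sample.flatMap(_.split("\\s+"))  // RDD[String]: the arrays flattened into words
println(words.collect().toSeq)                // hello, world, a, new, line, hello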
Key-Value Pairs
[diagram: RDD[String] (hello, a, world, new, line, hello, …)
.map(word => (word, 1)) → RDD[(String, Int)] ((hello,1), (a,1), (world,1), (new,1), (line,1), (hello,1), …)]
.map(word => Tuple2(word, 1)) = .map(word => (word, 1))
RDD[(String, Int)] = RDD[Tuple2[String, Int]] (a "Pair RDD")
// Step 4 - Map each word to a (word, 1) pair
val counts = words.map(word => (word, 1))
Shuffling
[diagram: RDD[(String, Int)] ((hello,1), (a,1), (world,1), (new,1), (line,1), (hello,1))
.groupByKey → RDD[(String, Iterable[Int])] ((world,[1]), (a,[1]), (new,[1]), (line,[1]), (hello,[1,1]))
.mapValues(_.reduce((a,b) => a+b)) → RDD[(String, Int)] ((world,1), (a,1), (new,1), (line,1), (hello,2))]
.reduceByKey((a, b) => a + b) performs the same grouping and per-key reduction in a single step
// Step 5 - Count all words
val freq = counts.reduceByKey(_ + _)
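A sketch of the two equivalent formulations above, assuming counts: RDD[(String, Int)]; reduceByKey combines values on the map side before shuffling, so it moves far less data than groupByKey:

val viaGroup  = counts.groupByKey().mapValues(_.reduce(_ + _))  // shuffles every (word, 1) pair
val viaReduce = counts.reduceByKey(_ + _)                       // combines per partition first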
Top N (Prepare data)
[diagram: RDD[(String, Int)] ((world,1), (a,1), (new,1), (line,1), (hello,2))
.map(_.swap) → RDD[(Int, String)] ((1,world), (1,a), (1,new), (1,line), (2,hello))]
// Step 6 - Swap tuples (partial code)
freq.map(_.swap)
Top N (First Attempt)
[diagram: RDD[(Int, String)] ((1,world), (1,a), (1,new), (1,line), (2,hello))
.sortByKey → RDD[(Int, String)] ((2,hello), (1,world), (1,a), (1,new), (1,line))
.take(N) → Array[(Int, String)] ((2,hello), (1,world))]
(sortByKey(false) for descending)
Top N
[diagram: RDD[(Int, String)] ((1,world), (1,a), (1,new), (1,line), (2,hello))
.top(N) → local top N per partition *, followed by a reduction → Array[(Int, String)] ((2,hello), (1,line))]
// Step 6 - Swap tuples (complete code)
val top = freq.map(_.swap).top(N)
* local top N implemented by bounded priority queues
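An alternative sketch (not shown in the slides): takeOrdered with a custom Ordering yields the same top-N result without swapping the tuples, assuming freq: RDD[(String, Int)]:

val topWords = freq.takeOrdered(N)(Ordering.by[(String, Int), Int](-_._2))  // highest counts first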
val spark = new SparkContext()
// RDD creation from external data source
val docs = spark.textFile("hdfs://docs/")
// Split lines into words
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))
// Count all words (automatic map-side combination)
val freq = counts.reduceByKey(_ + _)
// Swap tuples and get top results
val top = freq.map(_.swap).top(N)
top.foreach(println)
Top Words by Frequency (Full Code)
RDD Persistence (in-memory)
[diagram: an RDD whose partitions are kept in memory]
.cache() (memory only)
.persist() (memory only)
.persist(storageLevel) (lazy persistence & caching)
StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, …
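A minimal persistence sketch (assumes the freq RDD from the example). Persistence is lazy: the data is cached the first time an action materializes it, and later actions reuse the cached partitions:

import org.apache.spark.storage.StorageLevel

freq.persist(StorageLevel.MEMORY_AND_DISK)
val distinctWords = freq.count()                   // first action: computes and caches freq
val singletons = freq.filter(_._2 == 1).count()    // second action: served from the cache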
SchemaRDD
[diagram: a SchemaRDD is an RDD of Row objects]
RDD of Row + Column Metadata
Queries with SQL
Support for Reflection, JSON, Parquet, …
SchemaRDD
case class Word(text: String, n: Int)
val wordsFreq = freq.map {
  case (text, count) => Word(text, count)
} // RDD[Word]
wordsFreq.registerTempTable("wordsFreq")
val topWords = sql("select text, n from wordsFreq order by n desc limit 20") // RDD[Row]
topWords.collect().foreach(println)
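The snippet above assumes a Spark SQL context is already in scope; a minimal setup sketch for the Spark 1.x API used in this talk might look like this (the app name is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("TopWordsSQL"))
val sqlContext = new SQLContext(sc)
import sqlContext._   // brings sql(...) and the implicit conversions to SchemaRDD into scope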
RDD Lineage
words = sc.textFile("hdfs://large/file/")   // HadoopRDD
        .map(_.toLowerCase)                  // MappedRDD
        .flatMap(_.split(" "))               // FlatMappedRDD
nums = words.filter(_.matches("[0-9]+"))     // FilteredRDD
alpha = words.filter(_.matches("[a-z]+"))    // FilteredRDD
alpha.count()                                // Action (runs the job on the cluster)
The lineage of RDD transformations is built on the driver; an action runs the job on the cluster
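The lineage can be inspected from the driver; a small sketch assuming the alpha RDD defined above:

println(alpha.toDebugString)   // prints the chain: FilteredRDD <- FlatMappedRDD <- MappedRDD <- HadoopRDD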
Deployment with Hadoop
[diagram: HDFS and Spark co-located - the Name Node runs next to the Spark Master, and each
Data Node (holding blocks A, B, C, D of /large/file, replication factor 3) also runs a Spark Worker]
Client submits the application (mode=cluster)
The Spark Master allocates resources (cores and memory)
The Driver and the Executors run on the Workers, close to the HDFS blocks (DN + Spark)
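A configuration sketch for running against a standalone Spark Master (the URL, app name and resource numbers are hypothetical); in cluster mode these values are usually passed to spark-submit rather than hard-coded:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("TopWords")
  .setMaster("spark://spark-master:7077")
  .set("spark.executor.memory", "2g")   // memory per executor
  .set("spark.cores.max", "8")          // total cores allocated by the Master
val sc = new SparkContext(conf)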
Fernando Rodriguez Olivera
twitter: @frodriguez