Tuning and Debugging in
Apache Spark
Patrick Wendell @pwendell
February 20, 2015
About Me
Apache Spark committer and PMC, release manager
Worked on Spark at UC Berkeley when the project started
Today, managing Spark efforts at Databricks
2
About Databricks
Founded by creators of Spark in 2013
Donated Spark to ASF and remain largest contributor
End-to-End hosted service: Databricks Cloud
3
Today’s Talk
Help you understand and debug Spark programs
Assumes you know Spark core API concepts, focused on
internals
4
5
Spark’s Execution Model
6
The key to tuning Spark apps is a sound grasp of Spark’s internal mechanisms.
Key Question
How does a user program get translated into units of
physical execution: jobs, stages, and tasks?
7
RDD API Refresher
An RDD is a distributed collection of records
rdd = spark.parallelize(range(10000), 10)
Transformations create new RDDs from existing ones
errors = rdd.filter(lambda line: "ERROR" in line)
Actions materialize a value in the user program
size = errors.count()
8
RDD API Example
// Read input file
val input = sc.textFile("input.txt")
val tokenized = input
  .map(line => line.split(" "))
  .filter(words => words.size > 0) // remove empty lines
val counts = tokenized // frequency of log levels
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b, 2) // 2 = number of output partitions
9
input.txt:
INFO Server started
INFO Bound to port 8080
WARN Cannot find srv.conf
Transformations
sc.textFile().map().filter().map().reduceByKey()
11
DAG View of RDDs
textFile() map() filter() map() reduceByKey()
12
[DAG diagram: Hadoop RDD (3 partitions) → Mapped RDD (3 partitions) → Filtered RDD (3 partitions) → Mapped RDD (3 partitions) → Shuffle RDD (2 partitions); the variables input, tokenized, and counts label successive segments of this chain]
Transformations build up a DAG, but don’t “do anything”
13
Evaluation of the DAG
We mentioned “actions” a few slides ago. Let’s forget them for
a minute.
DAGs are materialized through a method sc.runJob:
def runJob[T, U](
  rdd: RDD[T],              // 1. RDD to compute
  partitions: Seq[Int],     // 2. Which partitions
  func: (Iterator[T]) => U  // 3. Fn to produce results
): Array[U]                 // results for each partition
14
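As a concrete illustration, here is a minimal sketch of submitting a job by hand, assuming a live SparkContext sc and the counts RDD from the earlier example (the real SparkContext exposes several runJob overloads; the simplest one, shown here, runs the function over every partition):

// Count records per partition by calling runJob directly
val partitionSizes: Array[Int] =
  sc.runJob(counts, (it: Iterator[(String, Int)]) => it.size)
println(partitionSizes.mkString(", "))  // one entry per partition of counts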
How runJob Works
runJob needs to compute the target RDD's parents, its parents' parents, and so on,
all the way back to an RDD with no dependencies (e.g. a HadoopRDD).
18
[DAG diagram, as before: Hadoop RDD (3 partitions) → Mapped RDD (3 partitions) → Filtered RDD (3 partitions) → Mapped RDD (3 partitions) → Shuffle RDD (2 partitions), labeled input, tokenized, and counts]
runJob(counts)
Physical Optimizations
1.  Certain types of transformations can be pipelined.
2.  If dependent RDDs have already been cached (or
persisted in a shuffle), the graph can be truncated.
Once pipelining and truncation occur, Spark produces a
set of stages, and each stage is composed of tasks
19
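A minimal sketch of the truncation point, reusing the tokenized and counts values from the running example (exact behavior depends on available cache memory):

tokenized.cache()   // mark the pipelined map/filter output for caching
counts.collect()    // job 1: reads input.txt, fills the cache, then shuffles
counts.collect()    // job 2: the pre-shuffle stage reads the cached partitions,
                    //        so the textFile/HadoopRDD work is skipped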
Stage Graph
23
Stage 1 (Task 1, Task 2, Task 3). Each task will:
1.  Read Hadoop input (input read)
2.  Perform maps and filters
3.  Write partial sums (shuffle write)
Stage 2 (Task 1, Task 2). Each task will:
1.  Read partial sums (shuffle read)
2.  Invoke user function passed to runJob.
Units of Physical Execution
Jobs: Work required to compute the RDD passed to runJob.
Stages: A wave of work within a job, corresponding to
one or more pipelined RDDs.
Tasks: A unit of work within a stage, corresponding to
one RDD partition.
Shuffle: The transfer of data between stages.
24
Seeing this on your own
scala> counts.toDebugString
res84: String =
(2) ShuffledRDD[296] at reduceByKey at <console>:17
+-(3) MappedRDD[295] at map at <console>:17
| FilteredRDD[294] at filter at <console>:15
| MappedRDD[293] at map at <console>:15
| input.text MappedRDD[292] at textFile at <console>:13
| input.text HadoopRDD[291] at textFile at <console>:13
25
(indentations indicate a shuffle boundary)
Example: count() action
class RDD {
  def count(): Long = {
    val results = sc.runJob(
      this,                         // 1. RDD = self
      0 until partitions.size,      // 2. Partitions = all partitions
      (it: Iterator[T]) => it.size  // 3. Function = size of the partition
    )
    results.sum
  }
}
26
Example: take(N) action
class RDD {
  def take(n: Int): Array[T] = {
    val results = new ArrayBuffer[T]
    var partition = 0
    while (results.size < n && partition < partitions.size) {
      // run a one-partition job and append its elements
      results ++= sc.runJob(this, Seq(partition), (it: Iterator[T]) => it.toArray).flatten
      partition = partition + 1
    }
    results.take(n).toArray
  }
}
27
Putting it All Together
28
[Spark UI screenshot: the job is named after the action calling runJob;
each stage is named after the last RDD in its pipeline]
29
Determinants of Performance in Spark
Quantity of Data Shuffled
In general, avoiding shuffle will make your program run
faster.
1.  Use the built-in aggregateByKey() operator instead of
writing your own aggregations (see the sketch after this list).
2.  Filter input earlier in the program rather than later.
3.  Go to this afternoon’s talk!
30
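To make point 1 concrete, a hedged sketch with a hypothetical pair RDD events of (String, Int) records; aggregateByKey combines values map-side before the shuffle, so far less data crosses the network than grouping every value per key and summing by hand:

// Shuffle-heavy: ships every value for a key across the network before summing
val sumsSlow = events.groupByKey().mapValues(_.sum)

// Shuffle-light: per-partition partial sums are computed before the shuffle
val sumsFast = events.aggregateByKey(0)(
  (acc, v) => acc + v,  // fold a value into the per-partition accumulator
  (a, b) => a + b       // merge accumulators from different partitions
)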
Degree of Parallelism
> input = sc.textFile("s3n://log-files/2014/*.log.gz") #matches thousands of files
> input.getNumPartitions()
35154
> lines = input.filter(lambda line: line.startswith("2014-10-17 08:")) # selective
> lines.getNumPartitions()
35154
> lines = lines.coalesce(5).cache() # We coalesce the lines RDD before caching
> lines.getNumPartitions()
5
> lines.count() # occurs on coalesced RDD
31
Degree of Parallelism
If you have a huge number of mostly idle tasks (e.g. tens
of thousands), then it's often good to coalesce.
If you are not using all slots in your cluster, repartition
can increase parallelism.
32
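A brief hedged sketch of both knobs (the RDD names and partition counts below are purely illustrative):

// Too many tiny partitions after a selective filter: shrink without a full shuffle
val fewerParts = manyPartitionRdd.coalesce(100)

// Too few partitions to keep every slot busy: force a shuffle to spread the data
val moreParts = fewPartitionRdd.repartition(500)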
Choice of Serializer
Serialization is sometimes a bottleneck when shuffling
and caching data. Using the Kryo serializer is often faster.
val conf = new SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Be strict about class registration
conf.set("spark.kryo.registrationRequired", "true")
conf.registerKryoClasses(Array(classOf[MyClass],
classOf[MyOtherClass]))
33
Cache Format
By default Spark will cache() data using the MEMORY_ONLY
level, i.e. as deserialized JVM objects
MEMORY_ONLY_SER can help cut down on GC
MEMORY_AND_DISK can avoid expensive
recomputations
34
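As a quick sketch (assuming an existing RDD named lines), alternative levels are chosen through persist() rather than cache(); pick one level per RDD, since changing it later raises an error:

import org.apache.spark.storage.StorageLevel

lines.cache()                                   // default: MEMORY_ONLY, deserialized objects
// lines.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized in memory: less GC pressure
// lines.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk instead of recomputing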
Hardware
Spark scales horizontally, so more is better
Disk/Memory/Network balance depends on workload:
CPU intensive ML jobs vs IO intensive ETL jobs
Good to keep executor heap size at 64GB or less (you can
run multiple executors per node)
35
Other Performance Tweaks
Switching to LZF compression can improve shuffle
performance (sacrifices some robustness for massive
shuffles):
conf.set("spark.io.compression.codec", "lzf")
Turn on speculative execution to help prevent stragglers
conf.set("spark.speculation", "true")
36
Other Performance Tweaks
Make sure to give Spark as many disks as possible to
allow striping shuffle output
SPARK_LOCAL_DIRS in Mesos/Standalone
In YARN mode, inherits YARN’s local directories
37
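A sketch of what this might look like for Standalone or Mesos deployments (the paths below are hypothetical; list one directory per physical disk):

# conf/spark-env.sh — comma-separated list, one directory per disk (illustrative paths)
export SPARK_LOCAL_DIRS="/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark"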
38
One Weird Trick for Great Performance
Use Higher Level APIs!
DataFrame APIs for core processing
Works across Scala, Java, Python and R
Spark ML for machine learning
Spark SQL for structured query processing
39
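For flavor, a hedged DataFrame sketch of the running log-level count (the reader and functions used here arrived in Spark releases after this talk; sqlContext and the column handling are assumptions):

import org.apache.spark.sql.functions.split
import sqlContext.implicits._                            // for the $"col" syntax

val logs = sqlContext.read.text("input.txt")             // one string column named "value"
val levelCounts = logs
  .select(split($"value", " ").getItem(0).as("level"))   // first token = log level
  .groupBy("level")
  .count()
levelCounts.show()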
40
See also
Chapter 8: Tuning and
Debugging Spark.
Come to Spark Summit 2015!
41
June 15-17 in San Francisco
Thank you.
Any questions?
42
Extra Slides
43
Internals of the RDD Interface
44
1)  List of partitions
2)  Set of dependencies on parent RDDs
3)  Function to compute a partition, given parents
4)  Optional partitioning info for k/v RDDs (Partitioner)
[Diagram: an RDD box containing Partition 1, Partition 2, Partition 3]
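To make the four pieces concrete, here is a minimal, hypothetical RDD subclass (a toy range RDD); the overridden methods follow the RDD developer API, but treat this as a sketch rather than production code:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A toy RDD of the numbers [0, n), split into numParts partitions.
class RangeRDD(sc: SparkContext, n: Int, numParts: Int)
  extends RDD[Int](sc, Nil) {               // 2) Nil = no parent dependencies

  // 1) List of partitions
  override def getPartitions: Array[Partition] =
    (0 until numParts).map(i => new Partition { override def index: Int = i }).toArray

  // 3) Function to compute a partition, given parents (none here)
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val start = split.index * n / numParts
    val end = (split.index + 1) * n / numParts
    (start until end).iterator
  }

  // 4) No custom partitioning info, so `partitioner` stays None
}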
Example: Hadoop RDD
45
Partitions = 1 per HDFS block
Dependencies = None
compute(partition) = read corresponding HDFS block
Partitioner = None
> rdd = spark.hadoopFile("hdfs://click_logs/")
Example: Filtered RDD
46
Partitions = parent partitions
Dependencies = a single parent
compute(partition) = call parent.compute(partition) and filter
Partitioner = parent partitioner
> filtered = rdd.filter(lambda x: "ERROR" in x)
Example: Joined RDD
47
Partitions = number chosen by user or heuristics
Dependencies = ShuffleDependency on two or more parents
compute(partition) = read and join data from all parents
Partitioner = HashPartitioner(# partitions)
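A small hedged sketch of where such an RDD arises (the pair RDDs a and b are hypothetical):

import org.apache.spark.SparkContext._   // pair-RDD operations on older Spark versions

// join() produces a shuffle-dependent RDD; the partition count can be chosen explicitly
val a = sc.parallelize(Seq(1 -> "x", 2 -> "y"))
val b = sc.parallelize(Seq(1 -> 1.0, 2 -> 2.0))
val joined = a.join(b, 4)   // 4 output partitions, hash-partitioned by key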
48
A More Complex DAG
[DAG diagram: a Hadoop RDD (2 partitions) and a JDBC RDD (2 partitions) are each
transformed (a Filtered RDD and a Mapped RDD, 2 partitions each), joined into a
Joined RDD (3 partitions), and filtered again (Filtered RDD, 3 partitions) before
.count() is called on the result]
49
A More Complex DAG
[Stage view of the same DAG: Stage 1 (Task 1, Task 2) and Stage 2 (Task 1, Task 2)
each end with a shuffle write; Stage 3 (Task 1, Task 2, Task 3) begins with a
shuffle read of their output]
50
Narrow and Wide Transformations
[Diagram: FilteredRDD (narrow): each of its 3 partitions depends on exactly one
partition of its single 3-partition parent. JoinedRDD (wide): each of its 3
partitions depends on partitions from both of its 2-partition parents]
