Tuning Apache Spark for large-scale workloads
Sital Kedia and Gaoxiang Liu, Facebook
Agenda
• Apache Spark at Facebook
• Scaling Spark Driver
• Scaling Spark Executor
• Scaling External Shuffle
• Application tuning
• Tools
Apache Spark at Facebook
• Used for large-scale batch workloads
• Tens of thousands of jobs per day, and growing
• Runs on the order of thousands of nodes
• Job scalability:
  • Processes hundreds of TBs of compressed input data and shuffle data
  • Runs hundreds of thousands of tasks
Spark Architecture
[Diagram: the cluster manager, the driver, and three executors, each executor paired with a shuffle service.]
Scaling Spark Driver
Dynamic Executor Allocation
[Diagram: the scheduler and its task queue, the cluster manager, and worker nodes that each run a shuffle service; executors are requested and released on demand.]
• Better resource utilization
• Good for multi-tenant environment
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.executorIdleTimeout = 2m
spark.dynamicAllocation.minExecutors = 1
spark.dynamicAllocation.maxExecutors = 2000
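As a rough sketch, the dynamic allocation settings above could be handed to `spark-submit` as repeated `--conf key=value` arguments. The helper name `to_spark_submit_args` is hypothetical and exists only for this illustration; the values are the ones shown on the slide and are not a recommendation for other clusters.

```python
from typing import Dict, List

# Values copied from the slide; tune them for your own cluster.
DYNAMIC_ALLOCATION_CONF: Dict[str, str] = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.executorIdleTimeout": "2m",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "2000",
}


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render settings as repeated `--conf key=value` arguments (hypothetical helper)."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the command line instead of running it, so the sketch has no side effects.
    print(" ".join(["spark-submit"] + to_spark_submit_args(DYNAMIC_ALLOCATION_CONF)))
```

Keeping the settings in one dictionary makes it easy to review them together before they reach a job submission script.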
Multi-threaded Event Processor [SPARK-18838]
[Diagram, single-threaded event processor architecture: the task scheduler posts to one event queue, and a single-threaded event processor delivers each event to the event listeners.]
[Diagram, multi-threaded event processor architecture: the task scheduler feeds a single-threaded executor service per listener.]
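The idea behind the multi-threaded architecture can be sketched in a few lines. This is a toy illustration, not Spark's listener bus: the class name `PerListenerBus` and its methods are made up here, and it assumes each listener is a plain callable. Every listener gets its own queue and its own worker thread, so a slow listener delays only its own queue.

```python
import queue
import threading
from typing import Any, Callable, List

_STOP = object()  # sentinel that tells a worker thread to exit


class PerListenerBus:
    """Toy event bus: one queue and one worker thread per listener."""

    def __init__(self, listeners: List[Callable[[Any], None]]) -> None:
        self._queues: List[queue.Queue] = []
        self._threads: List[threading.Thread] = []
        for listener in listeners:
            q: queue.Queue = queue.Queue()
            t = threading.Thread(target=self._drain, args=(q, listener), daemon=True)
            t.start()
            self._queues.append(q)
            self._threads.append(t)

    @staticmethod
    def _drain(q: queue.Queue, listener: Callable[[Any], None]) -> None:
        # Each worker delivers events to exactly one listener, in FIFO order.
        while True:
            event = q.get()
            if event is _STOP:
                return
            listener(event)

    def post(self, event: Any) -> None:
        # Posting only enqueues, so the poster never waits for a listener.
        for q in self._queues:
            q.put(event)

    def stop(self) -> None:
        # Let every worker finish its backlog, then wait for it to exit.
        for q in self._queues:
            q.put(_STOP)
        for t in self._threads:
            t.join()


if __name__ == "__main__":
    seen: List[Any] = []
    bus = PerListenerBus([seen.append])
    bus.post("task-end")
    bus.stop()
    print(seen)
```

The trade-off in this sketch is one thread per listener in exchange for isolation between listeners; ordering is still preserved within each listener.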
Better Fetch Failure handling
[SPARK-19753] Avoid multiple retries of stages in case of Fetch Failure
[Figure, two panels: "Single fetch failure causing multiple retries" and "Single fetch failure causing a single retry".]
• Avoid duplicate task run in case of Fetch Failure (SPARK-20163)
• Configurable max number of Fetch Failures (SPARK-13369)
• Ongoing effort (SPARK-20178)
spark.max.fetch.failures.per.stage = 10
Tune RPC Server threads
• Frequent driver OOM when running many tasks in parallel
• A huge backlog of RPC requests builds up on the driver's Netty server
• Increase the number of RPC server threads to fix the OOM
spark.rpc.io.serverThreads = 64
Scaling Spark Executor
Executor memory layout
• Shuffle Memory = spark.memory.fraction * (spark.executor.memory - 300 MB)
• User Memory = (1 - spark.memory.fraction) * (spark.executor.memory - 300 MB)
• Reserved Memory = 300 MB
• Memory Buffer: spark.yarn.executor.memoryOverhead = 0.1 * (spark.executor.memory)
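The layout formulas above can be worked through with concrete numbers. The sketch below assumes sizes in MB and uses the slide's region names; the function name `executor_memory_layout` and the example inputs (a 3072 MB executor with `spark.memory.fraction` set to 0.6) are illustrative choices, not values from the talk.

```python
from typing import Dict

RESERVED_MB = 300  # reserved memory shown on the slide


def executor_memory_layout(executor_memory_mb: float, memory_fraction: float) -> Dict[str, float]:
    """Apply the slide's formulas; all sizes are in MB."""
    usable_mb = executor_memory_mb - RESERVED_MB
    return {
        "shuffle_mb": memory_fraction * usable_mb,
        "user_mb": (1 - memory_fraction) * usable_mb,
        "reserved_mb": float(RESERVED_MB),
        # spark.yarn.executor.memoryOverhead, as written on the slide
        "memory_buffer_mb": 0.1 * executor_memory_mb,
    }


if __name__ == "__main__":
    # Example inputs are assumptions chosen for illustration.
    for region, size_mb in executor_memory_layout(3072, 0.6).items():
        print(f"{region}: {size_mb:.0f} MB")
```

With these example inputs, shuffle memory comes to roughly 1663 MB, user memory to roughly 1109 MB, and the memory buffer to roughly 307 MB on top of the heap.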
Tuning memory configurations
Enable off-heap memory
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size = 3g
spark.executor.memory = 3g
spark.yarn.executor.memoryOverhead = 0.1 * (spark.executor.memory + spark.memory.offHeap.size)
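For the off-heap settings above, the overhead formula can be evaluated directly. This is a minimal sketch: `to_mb` and `memory_overhead_mb` are hypothetical helpers, and `to_mb` only understands the `m` and `g` suffixes used here.

```python
def to_mb(size: str) -> int:
    """Convert a size such as '3g' or '512m' to MB (only 'm' and 'g' are handled)."""
    units = {"m": 1, "g": 1024}
    number, unit = size[:-1], size[-1].lower()
    return int(number) * units[unit]


def memory_overhead_mb(executor_memory: str, off_heap_size: str, factor: float = 0.1) -> float:
    """Slide formula: factor * (spark.executor.memory + spark.memory.offHeap.size)."""
    return factor * (to_mb(executor_memory) + to_mb(off_heap_size))


if __name__ == "__main__":
    print(f"{memory_overhead_mb('3g', '3g'):.0f} MB")
```

With 3g of heap and 3g of off-heap memory, the overhead works out to about 614 MB, twice what the heap-only formula would give.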
Garbage collection tuning
• Spark's shuffle internals allocate large contiguous in-memory buffers
• G1GC suffers from fragmentation due to humongous allocations when an object is larger than 32 MB (the maximum G1GC region size)
• Use parallel GC instead of G1GC
spark.executor.extraJavaOptions = -XX:ParallelGCThreads=4 -XX:+UseParallelGC
Eliminating disk I/O bottleneck
[Diagram: in-memory records are sorted and spilled to disk as temporary spill files, which feed the final shuffle files on disk, organized by shuffle partition.]
Tune Shuffle file buffer
• Disk access is 10 - 100K times slower than memory access
• Make write buffer sizes for disk I/O configurable (SPARK-20074)
• Amortize disk I/O cost by doing buffered read/write
spark.shuffle.file.buffer = 1 MB
spark.unsafe.sorter.spill.reader.buffer.size = 1 MB
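The effect of a larger write buffer can be illustrated without Spark. The sketch below uses Python's `io.BufferedWriter` over a counting sink that stands in for a disk file; `CountingSink` and `raw_writes_for` are names invented for this example, and the record count and record size are arbitrary. It shows only the amortization idea, not Spark's shuffle writer.

```python
import io


class CountingSink(io.RawIOBase):
    """Stand-in for a disk file that counts how many raw writes reach it."""

    def __init__(self) -> None:
        super().__init__()
        self.write_calls = 0
        self.bytes_written = 0

    def writable(self) -> bool:
        return True

    def write(self, data) -> int:
        self.write_calls += 1
        self.bytes_written += len(data)
        return len(data)


def raw_writes_for(buffer_size: int, records: int = 10_000, record_size: int = 100) -> int:
    """Write many small records through a buffer and count the raw writes."""
    sink = CountingSink()
    writer = io.BufferedWriter(sink, buffer_size=buffer_size)
    record = b"x" * record_size
    for _ in range(records):
        writer.write(record)
    writer.flush()
    return sink.write_calls


if __name__ == "__main__":
    for size in (32 * 1024, 1024 * 1024):
        print(f"buffer={size} bytes -> raw writes: {raw_writes_for(size)}")
```

The same bytes reach the sink in both runs; the larger buffer simply groups them into far fewer raw writes, which is the cost that buffering amortizes.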

[SPARK-20014] Optimize spill files merging
[Diagram: temporary spill files on disk are read into in-memory spill file buffers, merged in memory, and written per shuffle partition to the final shuffle file on disk.]
spark.file.transferTo = false
spark.shuffle.file.buffer = 1 MB
spark.shuffle.unsafe.file.output.buffer = 5 MB
Tune compression block size
[Chart: compression block size (32kb, 128kb, 512kb, 1mb) vs. size of shuffle files in TB (axis from 130 to 230).]
• The default compression block size of 32 KB is suboptimal
• Up to 20% reduction in shuffle/spill file size by increasing the block size
spark.io.compression.lz4.blockSize = 512KB
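Why a larger block size shrinks the output can be seen with any block-based compressor. The sketch below uses `zlib` from the Python standard library as a stand-in for LZ4, which is a deliberate substitution, so the numbers it prints say nothing about Spark's actual savings. The synthetic rows produced by `sample_rows` are made up for the example.

```python
import zlib


def sample_rows(n: int) -> bytes:
    """Generate repetitive, log-like rows (synthetic data for illustration)."""
    return "".join(f"user_{i % 5000},event_{i % 37},{i}\n" for i in range(n)).encode()


def compressed_size(data: bytes, block_size: int) -> int:
    """Compress `data` in independent blocks and return the total compressed size."""
    total = 0
    for start in range(0, len(data), block_size):
        # Each block is compressed on its own, so it cannot reuse
        # redundancy found in earlier blocks.
        total += len(zlib.compress(data[start:start + block_size]))
    return total


if __name__ == "__main__":
    data = sample_rows(100_000)
    for block_size in (32 * 1024, 512 * 1024):
        print(f"block={block_size} bytes -> compressed={compressed_size(data, block_size)} bytes")
```

Smaller blocks pay the per-block overhead more often and restart with no history each time, which is the effect the chart on this slide measures at much larger scale.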
Various Memory leak fixes and improvements
• Memory leak fixes (SPARK-14363, SPARK-17113, SPARK-18208)
• Snappy optimization (SPARK-14277)
• Reduce update frequency of shuffle bytes written metrics (SPARK-15569)
• Configurable initial buffer size for Sorter (SPARK-15958)
Scaling External Shuffle Service
Cache Index files on Shuffle Server (SPARK-15074)
[Diagram, before: for every shuffle fetch from a reducer, the shuffle service reads the index file and then reads the partition.]
[Diagram, after: the shuffle service reads and caches the index file, then only reads the partition for each shuffle fetch.]
spark.shuffle.service.index.cache.entries = 2048
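The caching idea above can be sketched as a small LRU cache keyed by index file path. This is an illustration only, not the shuffle service's implementation: `IndexFileCache`, its `loader` callback, and the `disk_reads` counter are hypothetical names, and the eviction policy is an assumption.

```python
from collections import OrderedDict
from typing import Callable


class IndexFileCache:
    """Small LRU cache for shuffle index files (illustrative only)."""

    def __init__(self, max_entries: int, loader: Callable[[str], bytes]) -> None:
        self._max_entries = max_entries
        self._loader = loader  # reads an index file from disk
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.disk_reads = 0

    def get(self, index_path: str) -> bytes:
        if index_path in self._entries:
            # Cache hit: mark as most recently used and skip the disk read.
            self._entries.move_to_end(index_path)
            return self._entries[index_path]
        self.disk_reads += 1
        content = self._loader(index_path)
        self._entries[index_path] = content
        if len(self._entries) > self._max_entries:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return content


if __name__ == "__main__":
    cache = IndexFileCache(max_entries=2048, loader=lambda path: path.encode())
    for _ in range(3):  # three reducers fetching from the same map output
        cache.get("shuffle_0_0_0.index")
    print(cache.disk_reads)
```

When many reducers fetch from the same map output, only the first fetch pays for the index read; the rest go straight to the partition data.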
• Tune shuffle service worker threads and backlog
• Configurable shuffle registration timeout and retry (SPARK-20640)
spark.shuffle.io.serverThreads = 128
spark.shuffle.io.backLog = 8192
spark.shuffle.registration.timeout = 2m
spark.shuffle.registration.maxAttempts = 5
https://code.facebook.com/posts/1671373793181703
Application tuning
Motivation
• Improve job latency (with the same amount of resources)
• Improve usability: eliminate manual tuning as much as possible while achieving job performance comparable to manually tuned parameters
Auto tuning of mapper and reducer
• Heuristics-based approach based on table input size
• Max cap due to the scalability constraints of the shuffle service and drivers
• Min cap due to the minimum resource guarantee for the user's job
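A size-based heuristic with a minimum and maximum cap might look like the sketch below. The talk does not give the actual formula or thresholds, so everything here is an assumption: the function name `pick_task_count`, the target bytes per task, and the cap values in the example are placeholders.

```python
def pick_task_count(input_bytes: int, target_bytes_per_task: int,
                    min_tasks: int, max_tasks: int) -> int:
    """Derive a mapper or reducer count from input size, clamped to [min_tasks, max_tasks]."""
    # Ceiling division: enough tasks that none exceeds the target size.
    wanted = (input_bytes + target_bytes_per_task - 1) // target_bytes_per_task
    return max(min_tasks, min(max_tasks, wanted))


if __name__ == "__main__":
    GIB = 2 ** 30
    MIB = 2 ** 20
    # All thresholds below are placeholder values for illustration.
    for size in (1 * MIB, 10 * GIB, 100 * 1024 * GIB):
        print(size, pick_task_count(size, 256 * MIB, min_tasks=10, max_tasks=2000))
```

The upper clamp mirrors the scalability limit of the shuffle service and driver, and the lower clamp mirrors the minimum resource guarantee.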
Tools
Spark UI metrics
Flame Graph
[Diagram: periodic jstack/perf samples are taken from the executors on each worker, filtered to executor threads, and sent to a jstack aggregator service.]
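The aggregation step behind a flame graph can be sketched as counting identical stacks. The sketch assumes each sample is a `(thread_name, frames)` pair with frames listed innermost first, as a jstack thread dump prints them, and it emits the folded `frame;frame;frame count` format that flame graph tools commonly consume. The function name `fold_stacks`, the thread-name prefix, and the frame names in the example are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Sample = Tuple[str, List[str]]  # (thread name, frames listed innermost first)


def fold_stacks(samples: Iterable[Sample],
                thread_prefix: str = "Executor task launch worker") -> List[str]:
    """Keep only executor threads and count identical stacks in folded format."""
    counts: Counter = Counter()
    for thread_name, frames in samples:
        if not thread_name.startswith(thread_prefix):
            continue  # drop non-executor threads
        # Folded format lists the outermost frame first.
        counts[";".join(reversed(frames))] += 1
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


if __name__ == "__main__":
    samples = [
        ("Executor task launch worker-0", ["sort", "runTask", "run"]),
        ("Executor task launch worker-1", ["sort", "runTask", "run"]),
        ("dispatcher-event-loop-1", ["park", "run"]),
    ]
    for line in fold_stacks(samples):
        print(line)
```

Stacks that show up in many samples become wide bars in the rendered graph, which is how hot code paths stand out.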
Analysis of Task metrics using Facebook's Scuba
[Diagram: the task scheduler posts events to the event queue, and event listeners forward them to Scuba.]
• Enables us to run complex queries such as:
  • How many jobs failed because of OOM in ExternalSorter in the last hour?
  • What percentage of total execution time is spent in shuffle read?
  • Did the fetch failure rate go up after the last Spark release?
• Set up monitoring and alerting to catch regressions
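Questions like the ones above reduce to simple aggregations once task metrics are exported as rows. The sketch below runs over an in-memory list of dictionaries rather than Scuba, whose query interface is not shown in the talk. The field names (`job_id`, `run_time_ms`, `shuffle_read_ms`, `error`) and both function names are hypothetical.

```python
from typing import Dict, List, Optional

TaskRow = Dict[str, object]


def shuffle_read_share(tasks: List[TaskRow]) -> float:
    """Percentage of total task run time spent in shuffle read."""
    total = sum(int(t["run_time_ms"]) for t in tasks)
    shuffle = sum(int(t["shuffle_read_ms"]) for t in tasks)
    return 100.0 * shuffle / total if total else 0.0


def jobs_failed_with(tasks: List[TaskRow], needle: str) -> int:
    """Number of distinct jobs with at least one task error containing `needle`."""
    failed = set()
    for t in tasks:
        error: Optional[object] = t.get("error")
        if isinstance(error, str) and needle in error:
            failed.add(t["job_id"])
    return len(failed)


if __name__ == "__main__":
    rows: List[TaskRow] = [
        {"job_id": 1, "run_time_ms": 1000, "shuffle_read_ms": 200, "error": None},
        {"job_id": 2, "run_time_ms": 3000, "shuffle_read_ms": 800,
         "error": "OutOfMemoryError in ExternalSorter"},
    ]
    print(shuffle_read_share(rows), jobs_failed_with(rows, "ExternalSorter"))
```

The same aggregations, evaluated over a time window, are the kind of signal that monitoring and alerting can be built on.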
Resources
• Scuba: Diving into Data at Facebook
• Apache Spark @Scale: A 60 TB+ production use case
Questions?
