SORT & JOIN IN SPARK 2.0
Harsha Tenneti
CONTENTS
● Benchmarking
● Sort and Join
● Shuffle Manager
● GC optimisations
Benchmarking
● Joins
Spark Version | Time for two jobs | Cores | Memory | Data Size
1.6           | 12 min            | 133   | 288 GB | 1 * 12 GB with 12 * 10 MB
2.0           | 11 min            | 70    | 60 GB  | Same as above
● Sort
Spark Version | Time for two jobs | Cores | Memory | Data Size
1.6           | Did not work      | NA    | NA     | 30 GB Parquet (approx. 500 GB raw data)
2.0           | 50-60 min         | 37    | 37 GB  | 30 GB Parquet (approx. 500 GB raw data)
Contd...
● Join with GC Configs
Spark Version | Time for two jobs | Cores | Memory | Data Size
2.0           | 11 min            | 36    | 48 GB  | 1 * 12 GB with 12 * 10 MB
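To make the join rows easier to compare, the short sketch below multiplies run time by core count for each configuration. "Core-minutes" is an assumed metric introduced here for illustration; it is not a figure reported in the slides, and the input numbers are copied from the join tables.

```python
# Join benchmark rows from the slides: (label, minutes, cores, memory in GB).
JOIN_RUNS = [
    ("Spark 1.6", 12, 133, 288),
    ("Spark 2.0", 11, 70, 60),
    ("Spark 2.0 + GC configs", 11, 36, 48),
]


def core_minutes(minutes, cores):
    """Rough resource cost of a run: wall-clock minutes times cores held."""
    return minutes * cores


def summarize(runs):
    """Return {label: (core_minutes, memory_gb)} for each benchmark row."""
    return {label: (core_minutes(m, c), mem) for label, m, c, mem in runs}


if __name__ == "__main__":
    for label, (cost, mem) in summarize(JOIN_RUNS).items():
        print(f"{label}: {cost} core-minutes, {mem} GB memory")
```

By this measure the wall-clock time barely changes between runs, while the cores and memory needed to reach it drop considerably.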
Sort and Join
Both sort and join need records with the same key to be in the same partition.
If they are not, the data must be shuffled so that matching keys end up in the
same partition, which is a costly operation.
The shuffle is carried out by the shuffle manager, which is a service in Spark.
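The idea that a shuffle brings matching keys into the same partition can be sketched in plain Python. This is a minimal illustration, not Spark code: the function names are hypothetical, and hash partitioning is assumed as the placement rule.

```python
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Route each (key, value) record to a partition chosen from its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def join_partition(left, right):
    """Inner-join two co-partitioned lists of (key, value) pairs locally."""
    by_key = defaultdict(list)
    for key, value in left:
        by_key[key].append(value)
    return [(key, (lv, rv)) for key, rv in right for lv in by_key.get(key, [])]


def shuffled_join(left, right, num_partitions=4):
    """Shuffle both sides with the same partitioner, then join partition-wise."""
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    joined = []
    for lp, rp in zip(left_parts, right_parts):
        joined.extend(join_partition(lp, rp))
    return joined


if __name__ == "__main__":
    orders = [(1, "book"), (2, "pen"), (1, "lamp")]
    users = [(1, "alice"), (3, "carol")]
    print(sorted(shuffled_join(orders, users)))
```

Because both inputs use the same partitioner, each partition pair can be joined without looking at any other partition. The expensive part in a real cluster is the data movement that `hash_partition` stands in for.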
Shuffle Manager
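● The driver and each executor create their own shuffle manager instance.
● The driver registers shuffles with the shuffle manager, and executors ask it
to read and write data.
● The setting “spark.shuffle.manager” selects which shuffle manager
implementation is used.
● Two shuffle implementations in Spark are hash and sort (the hash-based
manager was removed in Spark 2.0, so the sort-based one is the one that remains).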
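As a rough illustration of the sort-based approach, the sketch below has a single map task order its records by target partition and keep one contiguous output plus an offset index, so each reducer can read just its own slice. It is a simplified in-memory model with hypothetical function names, not Spark's implementation.

```python
def sort_shuffle_write(records, num_partitions):
    """Model one map task's output: records ordered by partition, plus offsets.

    Returns (data, index) where partition p occupies data[index[p]:index[p + 1]].
    """
    # Tag each record with its target partition, then sort by that tag only.
    tagged = sorted(
        ((hash(key) % num_partitions, key, value) for key, value in records),
        key=lambda t: t[0],
    )
    data = [(key, value) for _, key, value in tagged]

    # Count records per partition, then turn the counts into start offsets.
    index = [0] * (num_partitions + 1)
    for pid, _, _ in tagged:
        index[pid + 1] += 1
    for p in range(num_partitions):
        index[p + 1] += index[p]
    return data, index


def read_partition(data, index, pid):
    """Model a reducer fetching only its own slice of a map task's output."""
    return data[index[pid]:index[pid + 1]]


if __name__ == "__main__":
    data, index = sort_shuffle_write([(0, "a"), (1, "b"), (2, "c"), (4, "d")], 2)
    print(index, read_partition(data, index, 1))
```

The point of the layout is that one map task produces one output and one index regardless of how many reducers there are, whereas a hash-style writer would keep a separate output per reducer.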
Contd...
● In 2.0, LZ4 compression of shuffled data supports appending, which helps
to reduce the number of small files in shuffle spill
● A new “spark.reducer.maxReqsInFlight” property limits the number
of remote requests to fetch blocks at any given point
● Shuffle data can be reused because of whole-stage code generation
● We found that changing our machine disks from magnetic to sd1 increased
the I/O throughput of shuffle read and write
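The effect of capping in-flight fetch requests can be sketched with a semaphore. This is only an analogy for what a limit such as “spark.reducer.maxReqsInFlight” does; the `fetch_all` helper and its parameters are hypothetical and do not reflect how Spark implements the property.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fetch_all(block_ids, fetch, max_reqs_in_flight):
    """Fetch every block, never running more than the allowed requests at once.

    `fetch` is any callable taking a block id and returning its data.
    """
    if max_reqs_in_flight < 1:
        raise ValueError("max_reqs_in_flight must be at least 1")

    gate = threading.BoundedSemaphore(max_reqs_in_flight)

    def guarded(block_id):
        # A worker blocks here until one of the in-flight slots is free.
        with gate:
            return fetch(block_id)

    with ThreadPoolExecutor(max_workers=max(1, len(block_ids))) as pool:
        # pool.map preserves the order of block_ids in its results.
        return list(pool.map(guarded, block_ids))


if __name__ == "__main__":
    print(fetch_all(["b0", "b1", "b2"], lambda b: b.upper(), max_reqs_in_flight=2))
```

A lower limit reduces the load a reducer places on remote hosts at the cost of less fetch parallelism, so the right value depends on the cluster.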
GC optimisations
● -XX:G1HeapRegionSize
● -XX:+AlwaysPreTouch
● -XX:ParallelGCThreads
● -XX:InitiatingHeapOccupancyPercent=0
● -Xms
Contd...
● -XX:InitialTenuringThreshold
● -XX:MaxMetaspaceSize
● -XX:G1MaxNewSizePercent
● --conf "spark.executor.extraJavaOptions=<flags>"
● spark.executor.extraJavaOptions=-XX:SurvivorRatio=16 -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy
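Since the flags have to reach the executor JVMs as one space-separated string, a small helper can assemble the `--conf` argument and catch stray whitespace such as a split "- XX:". This is a minimal sketch: the flag list is the one shown on the slide, while the helper names and the `app.jar` application path are placeholders assumed for illustration.

```python
import shlex

# GC flags taken from the slide's spark.executor.extraJavaOptions example.
GC_FLAGS = [
    "-XX:SurvivorRatio=16",
    "-XX:+UseG1GC",
    "-XX:+PrintGCDetails",
    "-XX:+PrintGCTimeStamps",
    "-XX:+PrintReferenceGC",
    "-XX:+PrintAdaptiveSizePolicy",
]


def executor_java_options(flags):
    """Join JVM flags into the single string the property expects."""
    for flag in flags:
        # Reject fragments like "-" or "XX:+UseG1GC" left by a broken line wrap.
        if not flag.startswith("-") or len(flag) < 2 or any(c.isspace() for c in flag):
            raise ValueError(f"malformed JVM flag: {flag!r}")
    return " ".join(flags)


def spark_submit_args(flags, app="app.jar"):
    """Build an argument list for spark-submit ('app.jar' is a placeholder)."""
    conf = "spark.executor.extraJavaOptions=" + executor_java_options(flags)
    return ["spark-submit", "--conf", conf, app]


if __name__ == "__main__":
    # shlex.join quotes the --conf value so it survives a shell as one argument.
    print(shlex.join(spark_submit_args(GC_FLAGS)))
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting altogether; `shlex.join` is only used here to show the equivalent command line.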
Thank You
