SORT & JOIN IN SPARK 2.0

Harsha Tenneti

CONTENTS
● Benchmarking
● Sort and Join
● Shuffle Manager
● GC optimisations

Benchmarking
● Joins
Spark Version | Time for two jobs | Cores | Memory | Data Size
1.6           | 12 min            | 133   | 288 GB | 1 x 12 GB with 12 x 10 MB
2.0           | 11 min            | 70    | 60 GB  | Same as above
● Sort
Spark Version | Time for two jobs | Cores | Memory | Data Size
1.6           | Did not work      | NA    | NA     | 30 GB Parquet (approx. 500 GB raw data)
2.0           | 50-60 min         | 37    | 37 GB  | 30 GB Parquet (approx. 500 GB raw data)

Contd...
● Join with GC Configs
Spark Version | Time for two jobs | Cores | Memory | Data Size
2.0           | 11 min            | 36    | 48 GB  | 1 x 12 GB with 12 x 10 MB

Sort and Join
Both sort and join need all records with the same key to be in the same partition.
If they are not, Spark has to shuffle the data so that matching keys end up in the
same partition, which is a costly operation.
The shuffle is handled by the shuffle manager, a component inside Spark.
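
To make the co-partitioning requirement concrete, here is a minimal pure-Python sketch (not Spark code). It assumes a hash partitioner built on zlib.crc32, which only stands in for Spark's own key hashing, and the function names are hypothetical. Once both sides are shuffled with the same partitioner, each partition can be joined locally.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Assumption: crc32 is a stand-in for Spark's key hashing.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def shuffle(records, num_partitions):
    # Route every (key, value) record to the partition chosen by its key.
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


def join_partition(left, right):
    # Local inner join of two lists that already share the same partition.
    by_key = defaultdict(list)
    for key, value in left:
        by_key[key].append(value)
    return [(k, (lv, rv)) for k, rv in right for lv in by_key.get(k, [])]


def shuffled_join(left, right, num_partitions):
    # Shuffle both sides with the same partitioner, then join per partition.
    left_parts = shuffle(left, num_partitions)
    right_parts = shuffle(right, num_partitions)
    joined = []
    for pid in range(num_partitions):
        joined.extend(
            join_partition(left_parts.get(pid, []), right_parts.get(pid, []))
        )
    return joined
```

Because both inputs use the same partitioner and partition count, no partition ever needs to look at another partition's records during the join.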

Shuffle Manager
● The driver and every executor each create their own shuffle manager instance.
● The driver registers shuffles with its shuffle manager, and executors ask theirs
to read and write shuffle data.
● The “spark.shuffle.manager” setting selects which shuffle manager implementation is used.
● Spark has shipped two kinds of shuffle: hash and sort. Sort has been the default since Spark 1.2.
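
One practical difference between the two shuffle kinds is the number of files written on the map side. The sketch below is a simplified model, assuming hash shuffle without file consolidation (one file per map task per reduce partition) and sort shuffle with one data file plus one index file per map task. It ignores spills, and the function name is hypothetical.

```python
def shuffle_file_counts(num_map_tasks, num_reduce_partitions):
    # Hash shuffle (no consolidation): each map task writes one file
    # for every reduce partition.
    hash_files = num_map_tasks * num_reduce_partitions
    # Sort shuffle: each map task writes one sorted data file and one
    # index file that records where each reduce partition starts.
    sort_files = 2 * num_map_tasks
    return {"hash": hash_files, "sort": sort_files}


counts = shuffle_file_counts(num_map_tasks=1000, num_reduce_partitions=200)
print(counts)
```

Under these assumptions, 1000 map tasks and 200 reduce partitions give 200000 files for hash shuffle against 2000 for sort shuffle.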

Contd...
● In 2.0, LZ4 compression of shuffled data supports appending, which helps
reduce the number of small files produced by shuffle spills
● Added the “spark.reducer.maxReqsInFlight” property, which limits the number
of remote block-fetch requests in flight at any given point
● Shuffle data can be reused because of whole-stage code generation
● Found that changing our machine disks from magnetic to sd1 increased
shuffle read and write I/O throughput
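
The effect of a cap such as “spark.reducer.maxReqsInFlight” can be pictured with a small pure-Python model. This is only an illustration of the idea, not Spark's fetch logic: it assumes requests complete in the order they were issued, and the function name is hypothetical.

```python
from collections import deque


def simulate_fetches(request_ids, max_reqs_in_flight):
    # Issue requests in order while never exceeding the in-flight cap.
    # Returns (completion_order, peak_in_flight).
    if max_reqs_in_flight < 1:
        raise ValueError("max_reqs_in_flight must be at least 1")
    in_flight = deque()
    completed = []
    peak = 0
    for req in request_ids:
        if len(in_flight) == max_reqs_in_flight:
            # At the cap: wait for the oldest request before sending more.
            completed.append(in_flight.popleft())
        in_flight.append(req)
        peak = max(peak, len(in_flight))
    # Drain whatever is still outstanding at the end.
    completed.extend(in_flight)
    return completed, peak
```

With ten requests and a cap of three, the model never has more than three requests outstanding, which is the behaviour the property is meant to enforce on a reducer.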

GC optimisations
● -XX:G1HeapRegionSize
● -XX:+AlwaysPreTouch
● -XX:ParallelGCThreads
● -XX:InitiatingHeapOccupancyPercent=0
● -Xms

Contd...
● -XX:InitialTenuringThreshold
● -XX:MaxMetaspaceSize
● -XX:G1MaxNewSizePercent
● Passed to executors with --conf "spark.executor.extraJavaOptions=<flags>"
● spark.executor.extraJavaOptions=-XX:SurvivorRatio=16 -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy
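
A small helper can assemble the flags above into a single spark-submit argument. This is a sketch under assumptions: the helper name and the application file app.py are hypothetical, and the flag values are only examples. The -Xmx check reflects the Spark configuration docs, which state that the maximum heap size may not be set through this property.

```python
import shlex


def executor_java_options_conf(flags):
    # Spark rejects -Xmx in extraJavaOptions; executor heap size is set
    # with spark.executor.memory instead.
    for flag in flags:
        if flag.startswith("-Xmx"):
            raise ValueError("set heap size with spark.executor.memory, not -Xmx")
    # All JVM flags go into one space-separated property value.
    return ["--conf", "spark.executor.extraJavaOptions=" + " ".join(flags)]


gc_flags = [
    "-XX:+UseG1GC",
    "-XX:SurvivorRatio=16",
    "-XX:+PrintGCDetails",
    "-XX:+PrintGCTimeStamps",
]
# app.py is a placeholder application name.
command = ["spark-submit"] + executor_java_options_conf(gc_flags) + ["app.py"]
print(shlex.join(command))
```

Building the argument as a list and quoting it with shlex.join keeps the whole flag string inside one --conf value, so the shell does not split it on spaces.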
