APACHE SPARK 3
NEW FEATURES
- APARUP CHATTERJEE
Spark 3.0.0 was released in early June 2020.
With the release of Spark 3.0, many improvements were implemented for
faster execution. Several changes were made to improve SQL performance, such as:
 Adaptive Query Execution (AQE)
 New EXPLAIN Format
 Dataframe tail function
 Join Hints
 Dynamic Partition Pruning
Newly Added Features in Spark 3.0
Source:- SPARK+AI SUMMIT EUROPE 2019,
SPARK 3.0 OFFICIAL DOCS & Google Search
In today's session I will brief the first 3 features; I will continue with the
rest in my next session.
Spark 2.0 based Environment Details:
Hadoop 2.9
Spark 2.3
Python 2.7.14
GCP-based big data components were used
Spark 3.0 based Environment Details:
Hadoop 3.2
Spark 3.0
Python 3.7.4
The Spark Catalyst optimizer is one of the most important layers of Spark SQL;
it does all the query optimisation.
Even though Catalyst does a lot of heavy lifting, it is all done before query
execution. That means once the physical plan is created and execution of the
plan has started, no further optimisation happens. So Catalyst cannot perform
optimisations that depend on metrics observed while the execution is going on.
In 3.0, Spark has introduced an additional layer of optimisation, known as
Adaptive Query Execution (AQE). This layer tries to optimise queries based on
the metrics collected as part of the execution.
AQE is a layer on top of the Catalyst optimizer which can modify the Spark
plan on the fly. This allows Spark to do some of the things which are not
possible in Catalyst today.
Adaptive Query Execution (AQE)
Adaptive Number of Shuffle Partitions or Reducers
In Spark SQL, the number of shuffle partitions is set using spark.sql.shuffle.partitions, which defaults to 200. In most cases this
number is too high for smaller data and too small for bigger data, so selecting the right value is always tricky for the developer.
We therefore need the ability to coalesce the shuffle partitions by looking at the mapper output: if the map stage generates a small
number of non-empty partitions, we want to reduce the overall shuffle partitions, which improves performance.
Shuffle Partitions without AQE:
Before we see how to optimise the shuffle partitions, let's look at the problem we are trying to solve. Take the example below:
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("Spark Adaptive Query Execution")
         .config("spark.some.config.option", "some-value")
         .getOrCreate())
sc = spark.sparkContext

# Read a small file (226 B) and repartition it to 500 partitions. This
# increase forces Spark to use the maximum number of shuffle partitions.
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("gs://aparup-files/sales.csv")
      .repartition(500))
df.show(4, False)

# GroupBy to trigger a shuffle
df.groupBy("customerId").count().count()
# sales_df = df.groupBy("customerId").count()
# sales_df.write.parquet("gs://aparup-files/spark2.parquet")
sc.stop()
Observing the job: Spark 2 does not have AQE
When I run this on the Spark 2 cluster, it throws an error: AQE is disabled by
default, and it cannot be used here because enabling it requires
'spark.sql.adaptive.coalescePartitions.enabled', which determines the required
partitions from runtime result metrics and is not present in Spark 2.
Spark 3 with AQE
Spark 2, observing stages
As you can observe from the image, in stage id 14, 200 tasks ran even though the data was very small.
Spark 2, observing DAGs
From the image, you can observe that there was a lot of shuffle.
Optimising Shuffle Partitions in AQE
Enabling the configuration
To use AQE, we need to set spark.sql.adaptive.enabled to true:
conf.set("spark.sql.adaptive.enabled", "true")
To use the shuffle-partition optimisation, we also need to set
spark.sql.adaptive.coalescePartitions.enabled to true:
conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
Spark 3, observing stages
From the image you can observe that most of the stages are skipped altogether, as Spark figured out that most of the partitions are empty.
Spark 3, observing DAGs
From the image, you can observe that most of the shuffle was skipped. There is a CoalescedShuffleReader which combines all the shuffle partitions into 1. So just by enabling a few configurations, we can dynamically optimise the shuffle partitions with AQE.
New EXPLAIN Format
In Spark, the EXPLAIN function returns the details of the Spark SQL query execution stages, i.e. how the query is
optimized.
Challenge in Spark 2 – it is not easy to understand how a query is optimized; the output is too complex.
Key feature of EXPLAIN in Spark 3 –
an easy-to-read query execution plan, by adding explain mode="formatted"
query="select customerId, max(amountPaid) from spark3.sample_tbl where customerId>0 group by customerId having
max(amountPaid)>0"
Explain in Spark 2
Not easy to understand how a query is optimized;
the output is too complex!
Explain in Spark 3
Easy-to-read query plan,
output with very detailed information
Many times in our code, we would like to read a few rows from a DataFrame.
For this, we use the head function on the DataFrame, which is internally
implemented by reading only the needed number of items, accessing one partition at
a time from the beginning.
But up to Spark 2, there was no straightforward way to access values from the last
partition of a DataFrame.
So in Spark 3, a new function, tail, has been introduced for reading values from the
last partition of a DataFrame.
Dataframe tail function
Spark 2 does not have a tail function.
Spark 3 introduces the new tail function.
Useful Resources
https://spark.apache.org/releases/spark-release-3-0-0.html -
Spark 3 Official Docs
https://www.youtube.com/watch?v=scM_WQMhB3A&t=1s -
SPARK+AI SUMMIT EUROPE 2019
What's New in Apache Spark 3.0 !!