Understanding Query Plans and Spark UIs
Xiao Li @ gatorsmile
Spark + AI Summit @ SF | April 2019
About Me
• Engineering Manager at Databricks
• Apache Spark Committer and PMC Member
• Previously, IBM Master Inventor
• Spark, Database Replication, Information Integration
• Ph.D. from the University of Florida
• Github: gatorsmile
Databricks Customers Across Industries
Financial Services Healthcare & Pharma Media & Entertainment Technology
Public Sector Retail & CPG Consumer Services Energy & Industrial IoT Marketing & AdTech
Data & Analytics Services
DATABRICKS WORKSPACE
Databricks Delta ML Frameworks
DATABRICKS CLOUD SERVICE
DATABRICKS RUNTIME
Reliable & Scalable Simple & Integrated
Databricks Unified Analytics Platform
APIs
Jobs
Models
Notebooks
Dashboards End to end ML lifecycle
Apache Spark 3.x
Catalyst Optimization & Tungsten Execution
SparkSession / DataFrame / Dataset APIs
SQL
Spark ML
Spark Streaming
Spark Graph
3rd-party Libraries
Spark Core
Data Source Connectors
From declarative queries to RDDs
Cypher
Maximize Performance
Read Plan.
Interpret Plan.
Tune Plan.
Track Execution.
Read plans from the SQL tab in either the Spark UI or the Spark History Server
Spark 3.0: Show the actual SQL statement? [SPARK-27045]
Page: In Details for SQL Query
Parsed Plan
Analyzed Plan
Optimized Plan
Physical Plan
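Outside the UI, one way to see these four plans is Spark SQL's EXPLAIN EXTENDED statement. The following is a minimal sketch, not part of the original slides: the helper name `explain_statement` and the table `events` are hypothetical, and the resulting string would be passed to `spark.sql(...)` on a live session.

```python
# Modes accepted by the EXPLAIN statement in Spark 3.0 ("" means plain EXPLAIN).
EXPLAIN_MODES = {"", "EXTENDED", "CODEGEN", "COST", "FORMATTED"}


def explain_statement(query: str, mode: str = "EXTENDED") -> str:
    """Prefix a query with EXPLAIN [mode] (hypothetical helper)."""
    mode = mode.upper()
    if mode not in EXPLAIN_MODES:
        raise ValueError(f"unsupported EXPLAIN mode: {mode}")
    # Skip the empty mode so plain EXPLAIN has no double space.
    return " ".join(part for part in ("EXPLAIN", mode, query.strip()) if part)


stmt = explain_statement("SELECT id FROM events WHERE id > 10")
print(stmt)
# On a live session (assumed to be named `spark`):
#   spark.sql(stmt).show(truncate=False)
```

In the DataFrame API, `df.explain(True)` prints the same four sections for an existing DataFrame.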
Understand and Tune Plans
Different Results!!!
Read the analyzed plan to check the implicit type casting.
Tip: Explicitly cast the types in the queries.
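The tip might be applied as in the sketch below. The table `t`, its INT column `id`, and the helper `cast_expr` are hypothetical; the point is only that the query text states the comparison type, so the analyzed plan contains no cast chosen on your behalf.

```python
def cast_expr(expr: str, sql_type: str) -> str:
    """Wrap a SQL expression in an explicit CAST (hypothetical helper)."""
    return f"CAST({expr} AS {sql_type})"


# Relies on implicit casting: the analyzer decides how to compare INT with STRING.
implicit_query = "SELECT * FROM t WHERE id = '1'"

# States the intended comparison type in the query itself.
literal = cast_expr("'1'", "INT")
explicit_query = "SELECT * FROM t WHERE id = " + literal
print(explicit_query)
```

Comparing the analyzed plans of the two queries shows where the implicit version inserted a cast.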
Create Hive Tables
Syntax to create a Hive Serde table
Hive serde reader
Read Tables
filter pushdown
Native reader/writer performs faster than Hive serde reader/writer
Create Native Tables
Syntax to create a Spark native ORC table
Tip: Create native data source tables for better performance and stability.
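For comparison, the two DDL styles might look like the sketch below. The helper, table, and column names are hypothetical; in Spark SQL, `STORED AS` creates a Hive serde table, while `USING` creates a native data source table.

```python
def create_table_ddl(table, columns, fmt="ORC", native=True):
    """Build a CREATE TABLE statement (hypothetical helper).

    native=True  -> Spark native data source table (USING <fmt>)
    native=False -> Hive serde table (STORED AS <fmt>)
    """
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    clause = f"USING {fmt}" if native else f"STORED AS {fmt}"
    return f"CREATE TABLE {table} ({cols}) {clause}"


schema = [("id", "INT"), ("amount", "DOUBLE")]
print(create_table_ddl("sales_native", schema))              # native ORC table
print(create_table_ddl("sales_hive", schema, native=False))  # Hive serde table
```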
Push Down + Implicit Type Casting
Not pushed down???
Tip: Is the cast needed? Can the constants be updated instead?
Nested Schema Pruning
Not pruned???
Collapse Projects
The UDF is called three times!!!
Cross-session SQL Cache
• If a query is cached in one session, new queries in all
sessions might be affected.
• Check your query plan!
Join Hints in Spark 3.0
• BROADCAST
  - Broadcast Hash/Nested-loop Join
• MERGE
  - Shuffle Sort Merge Join
• SHUFFLE_HASH
  - Shuffle Hash Join
• SHUFFLE_REPLICATE_NL
  - Shuffle-and-Replicate Nested Loop Join
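In SQL, these hints are written as a `/*+ ... */` comment right after SELECT. The sketch below is illustrative only: the helper `with_join_hint` and the tables `facts` and `dims` are hypothetical, while the hint names are the four listed above.

```python
# Hint name -> join strategy it requests (as listed on the slide).
JOIN_HINTS = {
    "BROADCAST": "Broadcast Hash/Nested-loop Join",
    "MERGE": "Shuffle Sort Merge Join",
    "SHUFFLE_HASH": "Shuffle Hash Join",
    "SHUFFLE_REPLICATE_NL": "Shuffle-and-Replicate Nested Loop Join",
}


def with_join_hint(hint: str, relation: str, select_body: str) -> str:
    """Build a SELECT with a join hint on one relation (hypothetical helper)."""
    hint = hint.upper()
    if hint not in JOIN_HINTS:
        raise ValueError(f"unknown join hint: {hint}")
    return f"SELECT /*+ {hint}({relation}) */ {select_body}"


query = with_join_hint(
    "broadcast", "d",
    "f.id, d.name FROM facts f JOIN dims d ON f.dim_id = d.id",
)
print(query)
```

A hint may not be honored when the join type does not support the requested strategy, so it is worth confirming the chosen join in the physical plan.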
Track Execution
From SQL query to Spark jobs
• A SQL query => multiple Spark jobs.
  - For example, broadcast exchange, shuffle exchange, scalar subquery.
  - External data sources: Delta Lake.
  - New adaptive query execution.
• A Spark job => a DAG
  - A chain of RDD dependencies organized in a directed acyclic graph (DAG)
The higher-level SQL physical operators.
[Diagram: Optimized Logical Plan, Planner, Physical Plans, Cost Model, Selected Physical Plan, Query Execution, DAGs]
The low-level Spark RDD primitives.
Job Tab in Spark UI
The amount of time for each job.
Any stage/task failure?
Job Tab
The amount of time for each stage.
• Jobs
• Stages
• Tasks
Stages Tab
• How is the time spent?
• Any outlier in task execution?
• Straggler tasks?
• Skew in data size, compute time?
• Too many/few tasks (partitions)?
• Load balanced? Locality?
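The outlier and straggler questions can also be checked numerically once task durations are copied out of the Stages tab. This is a rough sketch under stated assumptions: the durations are made-up sample values, `find_stragglers` is a hypothetical helper, and the 3x-median threshold is an arbitrary choice rather than a Spark default.

```python
from statistics import median


def find_stragglers(task_seconds, factor=3.0):
    """Return (task_index, duration) pairs slower than factor * median duration."""
    if not task_seconds:
        return []
    typical = median(task_seconds)
    return [(i, d) for i, d in enumerate(task_seconds) if d > factor * typical]


# Hypothetical task durations (seconds) for one stage.
durations = [12.0, 11.5, 13.2, 12.8, 95.0, 12.1]
print(find_stragglers(durations))
```

A task flagged this way is a starting point for checking its input size and shuffle read size in the same table, which helps separate data skew from a slow executor.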
Task-specific info
Balanced? Skew?
Killed?
Which executor’s log should we read?
Executors Tab
Size of data transferred between stages
used/available memory
Are all the problematic executors on the same node?
- Interacting with Hive metastore?
- Slow query planning?
- Slow file listing?
Insert Partitioned Hive Table OR “STORED AS PARQUET”
5000 partitions took almost 8 minutes!!!
Insert Partitioned Native Table
Reduced from almost 8 minutes to less than 1 minute!!!
Insert Partitioned Delta Table
Reduced from almost 8 minutes to 27 seconds!!!
Typical Spark Performance Issues
The table has thousands of partitions
• Hive metastore overhead
This table can have 100s of thousands to millions of files
• File system overhead - listing takes forever!
New data is not immediately visible
• Need to run the “REFRESH TABLE” command in the SQL
engine being used
The above issues can add 10s of minutes to the response time!
Delta Lake + Spark
Scalable metadata handling @ Delta Lake
Store metadata in transaction log file instead of metastore
The table has thousands of partitions
• Zero Hive Metastore overhead
The table can have 100s of thousands to millions of files
• No file listing
New data is not immediately visible
• Delta table state is computed on read
How do I use Delta?
format("parquet") -> format("delta")
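In a write path, that switch could look like the sketch below. It assumes `df` is a Spark DataFrame and that the Delta Lake package is available on the cluster; the function name, the append mode, and the path are hypothetical choices for illustration.

```python
def write_table(df, path, use_delta=True):
    """Append a DataFrame to `path`, changing only the format string."""
    fmt = "delta" if use_delta else "parquet"
    # Standard DataFrameWriter chain: format -> mode -> save.
    df.write.format(fmt).mode("append").save(path)
    return fmt


# On a live session (assumed), reading back uses the same format string:
#   write_table(df, "/tmp/events")
#   spark.read.format("delta").load("/tmp/events")
```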
Delta Lake + Spark
• Full ACID transactions
• Schema management
• Data versioning and time travel
• Unified batch/streaming support
• Scalable metadata handling
• Record update and deletion
• Data expectations
Delta Lake: https://delta.io/
For details, refer to the blog: https://tinyurl.com/yxhbe2lg
Delta Usage Statistics
More than 1 exabyte processed (10^18 bytes) monthly
Manufacturing Public Sector Technology Other
Healthcare and Life Sciences Financial Services Media and Entertainment Retail, CPG, and eCommerce
Additional Resources
• Apache Spark documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html
• Blog: https://databricks.com/blog/category/engineering/spark
• Previous summit: https://databricks.com/sparkaisummit/north-america/sessions
• Delta Lake documentation: https://docs.delta.io
• Databricks documentation: https://docs.databricks.com/
• Books: https://www.amazon.com/s?k=apache+spark
• Databricks academy: https://academy.databricks.com
• Databricks ebooks: https://databricks.com/resources/type/ebooks
Thank you
Xiao Li
(lixiao@databricks.com)
