Deep Dive Into Spark SQL with Advanced Performance Tuning
Xiao Li & Wenchen Fan
Spark Summit | SF | Jun 2018
1
About Us
• Software Engineers at Databricks
• Apache Spark Committers and PMC Members
Xiao Li (GitHub: gatorsmile) Wenchen Fan (GitHub: cloud-fan)
Databricks’ Unified Analytics Platform
[Diagram: Cloud Native Service with Collaborative Notebooks and the Databricks Runtime (Delta, SQL, Streaming), powered by Apache Spark, serving Data Engineers and Data Scientists]
• Unifies Data Engineers and Data Scientists
• Unifies Data and AI Technologies
• Eliminates infrastructure complexity
Spark SQL
A highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance.
4
Run Everywhere
Processes, integrates, and analyzes data from diverse data sources (e.g., Cassandra, Kafka, and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON)
5
The not-so-secret truth...
6
Spark SQL is not only SQL.
7
Not Only SQL
Powers and optimizes other Spark applications and libraries:
• Structured streaming for stream processing
• MLlib for machine learning
• GraphFrame for graph-parallel computation
• Your own Spark applications that use SQL,
DataFrame and Dataset APIs
8
Lazy Evaluation
9
Optimization happens as late as possible; therefore Spark SQL can optimize across functions and libraries.
Holistic optimization when using these libraries and the SQL/DataFrame/Dataset APIs in the same Spark application.
New Features of Spark SQL in Spark 2.3
• PySpark Pandas UDFs [SPARK-22216] [SPARK-21187]
• Stable Codegen [SPARK-22510] [SPARK-22692]
• Advanced pushdown for partition pruning predicates [SPARK-20331]
• Vectorized ORC reader [SPARK-20682] [SPARK-16060]
• Vectorized cache reader [SPARK-20822]
• Histogram support in cost-based optimizer [SPARK-21975]
• Better Hive compatibility [SPARK-20236] [SPARK-17729] [SPARK-4131]
• More efficient and extensible data source API V2
10
Spark SQL
11
A compiler from queries to RDDs.
Performance Tuning for Optimal Plans
Run EXPLAIN Plan.
Interpret Plan.
Tune Plan.
12
13
Get the plans by running the EXPLAIN command/APIs, or from the SQL tab in either the Spark UI or the Spark History Server
14
More statistics from
the Job page
Declarative APIs
15
Declarative APIs
Declare your intentions by
• SQL API: ANSI SQL:2003 and HiveQL.
• Dataset/DataFrame APIs: richer, language-integrated and user-friendly interfaces
16
Declarative APIs
17
When should I use SQL, DataFrames or Datasets?
• The DataFrame API provides untyped relational operations
• The Dataset API provides a typed version, at the cost of
performance due to heavy reliance on user-defined
closures/lambdas.
[SPARK-14083]
• http://dbricks.co/29xYnqR
Metadata Catalog
18
Metadata Catalog
• Persistent Hive metastore [Hive 0.12 - Hive 2.3.3]
• Session-local temporary view manager
• Cross-session global temporary view manager
• Session-local function registry
19
Metadata Catalog
Session-local function registry
• Easy-to-use lambda UDF
• Vectorized PySpark Pandas UDF
• Native UDAF interface
• Support Hive UDF, UDAF and UDTF
• Almost 300 built-in SQL functions
• Next, SPARK-23899 adds 30+ higher-order built-in functions.
• Blog post on higher-order functions: https://dbricks.co/2rR8vAr
20
Performance Tips - Catalog
Reduce the time cost of partition metadata retrieval:
- Upgrade your Hive metastore
- Avoid very high cardinality in partition columns
- Use partition pruning predicates (improved in [SPARK-20331])
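Partition pruning can be sketched in a few lines. This is a conceptual illustration only (not Spark's actual implementation, and the table layout is hypothetical): with a pruning predicate, the engine asks the catalog only for the partitions that can match, instead of listing every partition.

```python
# Conceptual sketch of partition pruning: keep only the partitions
# whose partition-column value satisfies the predicate, so the other
# partitions are never listed or read.
def prune_partitions(partitions, predicate):
    """Keep only partitions whose key satisfies the predicate."""
    return {key: files for key, files in partitions.items() if predicate(key)}

# Hypothetical table partitioned by date.
table = {
    "2018-06-01": ["part-0001.parquet"],
    "2018-06-02": ["part-0002.parquet"],
    "2018-06-03": ["part-0003.parquet"],
}

# A query with WHERE date >= '2018-06-02' touches 2 of 3 partitions.
pruned = prune_partitions(table, lambda d: d >= "2018-06-02")
```

The higher the cardinality of the partition columns, the more metadata the catalog must enumerate before this filter can even run — which is why the tips above matter.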
21
Cache Manager
22
Cache Manager
• Automatically replaces matching plan fragments with cached data
• Cross-session
• Dropping/inserting tables/views invalidates all the caches that depend on them
• Lazy evaluation
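The bullets above can be modeled with a toy cache manager. This is a greatly simplified sketch (assumption: Spark's real CacheManager matches canonicalized plans, not strings): entries are keyed by the query plan, materialized lazily on first use, and invalidated when a referenced table changes.

```python
# Toy model of plan-matching caching with lazy materialization and
# table-based invalidation.
class CacheManager:
    def __init__(self):
        self._cache = {}  # plan -> [referenced tables, compute fn, result]

    def cache(self, plan, tables, compute):
        # Lazy: nothing is computed until the plan is first looked up.
        self._cache[plan] = [frozenset(tables), compute, None]

    def lookup(self, plan):
        entry = self._cache.get(plan)
        if entry is None:
            return None
        if entry[2] is None:          # materialize on first use
            entry[2] = entry[1]()
        return entry[2]

    def invalidate(self, table):
        # Dropping/inserting a table drops every dependent cache entry.
        self._cache = {p: e for p, e in self._cache.items()
                       if table not in e[0]}

cm = CacheManager()
cm.cache("SELECT sum(x) FROM t", {"t"}, lambda: 42)
hit = cm.lookup("SELECT sum(x) FROM t")    # materializes lazily
cm.invalidate("t")                          # table changed
miss = cm.lookup("SELECT sum(x) FROM t")   # entry was dropped
```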
23
Performance Tips
Cache: not always fast, especially if it spills to disk.
- Uncache it if it is not needed.
Next releases:
- A new cache mechanism that builds a snapshot in the cache and allows querying stale data; caches resolved by names instead of by plans. [SPARK-24461]
24
Optimizer
25
Optimizer
Rewrites the query plans using heuristics and cost.
26
• Column pruning
• Predicate push down
• Constant folding
• Outer join elimination
• Constraint propagation
• Join reordering
and many more.
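One of the listed rules, constant folding, can be shown in miniature. This is illustrative only — Catalyst's real rules rewrite logical plan trees, not tuples — but the idea is the same: any subtree whose operands are all literals is evaluated once at optimization time.

```python
# Minimal constant-folding rewrite over a tiny expression tree:
# a tuple (op, left, right) is an operator node, anything else a leaf.
def fold_constants(expr):
    if not isinstance(expr, tuple):        # literal or column leaf
        return expr
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, int) and isinstance(right, int):
        return {"+": left + right, "*": left * right}[op]  # fold now
    return (op, left, right)               # keep: depends on a column

# x * (1 + 2)  is rewritten to  x * 3  before execution.
optimized = fold_constants(("*", "x", ("+", 1, 2)))
```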
Performance Tips
Roll your own Optimizer and Planner rules
• In class ExperimentalMethods:
• var extraOptimizations: Seq[Rule[LogicalPlan]] = Nil
• var extraStrategies: Seq[Strategy] = Nil
• Examples in Herman’s talk Deep Dive into Catalyst Optimizer
• Joining two intervals: http://dbricks.co/2etjIDY
27
Planner
28
Planner
• Turns logical plans into physical plans (from what to how)
• Picks the best physical plan according to cost
29
[Diagram: a Join of table1 and table2 can be planned as either a broadcast hash join or a sort-merge join; the broadcast join has lower cost if one table fits in memory]
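The cost-based choice in the diagram can be sketched as a simple threshold decision. This is a hedged simplification (Spark's planner also weighs statistics, hints, and join keys); the threshold mirrors spark.sql.autoBroadcastJoinThreshold, whose default is 10 MB.

```python
# Simplified join-strategy selection: broadcast the smaller side if its
# estimated size fits under the broadcast threshold, else sort-merge.
def choose_join(t1_bytes, t2_bytes, threshold=10 * 1024 * 1024):
    if min(t1_bytes, t2_bytes) <= threshold:
        return "broadcast hash join"
    return "sort merge join"

small = choose_join(5 * 1024 * 1024, 100 * 1024 * 1024)   # one side fits
large = choose_join(50 * 1024 * 1024, 100 * 1024 * 1024)  # neither fits
```

Because the decision depends on *estimated* sizes, stale table statistics can silently push the planner to the slower strategy — hence the tips on the next slide.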
Performance Tips - Join Selection
30
[Diagram: a broadcast join broadcasts one table to every node holding the other table; a shuffle join shuffles both tables before producing the join result]
Performance Tips - Join Selection
broadcast join vs. shuffle join (broadcast is usually faster)
• Tune spark.sql.autoBroadcastJoinThreshold
• Keep the statistics updated
• Use the broadcast join hint
31
Performance Tips - Equal Join
… t1 JOIN t2 ON t1.id = t2.id AND t1.value < t2.value
… t1 JOIN t2 ON t1.value < t2.value
Put at least one equality predicate in the join condition
32
Performance Tips - Equal Join
… t1 JOIN t2 ON t1.id = t2.id AND t1.value < t2.value → O(n)
… t1 JOIN t2 ON t1.value < t2.value → O(n^2)
33
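The complexity gap comes from the join algorithm the equality predicate enables. A conceptual sketch (toy row format, not Spark's internals): an equi-join can build a hash table on one side and probe it once per row of the other, while a join with only a non-equality predicate must compare every pair of rows.

```python
# Equi-join: build a hash index on t2.id, probe once per t1 row,
# then apply the remaining inequality as a residual filter. ~O(n).
def equi_join(t1, t2):
    index = {}
    for row in t2:                           # build: O(|t2|)
        index.setdefault(row["id"], []).append(row)
    return [(a, b) for a in t1               # probe: O(|t1|)
            for b in index.get(a["id"], ())
            if a["value"] < b["value"]]      # residual predicate

# Non-equi join: nothing to hash on, so compare all pairs. O(n^2).
def theta_join(t1, t2):
    return [(a, b) for a in t1 for b in t2
            if a["value"] < b["value"]]

t1 = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
t2 = [{"id": 1, "value": 15}, {"id": 2, "value": 5}]
pairs = equi_join(t1, t2)
theta_pairs = theta_join(t1, t2)
```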
Query Execution
34
Query Execution
• Memory Manager: tracks memory usage and efficiently distributes memory between tasks/operators.
• Code Generator: compiles the physical plan to optimized Java code.
• Tungsten Engine: efficient binary data format and data structures for CPU and memory efficiency.
35
Performance Tips - Memory Manager
Tune spark.executor.memory and spark.memory.fraction to leave enough space for untracked memory. Some memory usage is NOT tracked by Spark (e.g., Netty buffers, Parquet writer buffers).
Set spark.memory.offHeap.enabled and spark.memory.offHeap.size to enable off-heap memory, and decrease spark.executor.memory accordingly.
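A back-of-the-envelope calculation shows how much of the heap those two settings actually leave for everything else. This is a simplified sketch of Spark's unified memory model (assumption: roughly 300 MB of reserved system memory and the default spark.memory.fraction of 0.6; consult the Spark docs for the exact accounting in your version).

```python
# Rough unified-memory arithmetic: Spark manages
# (heap - ~300 MB reserved) * spark.memory.fraction; the remainder is
# left for user data structures and untracked buffers.
RESERVED = 300 * 1024 * 1024  # reserved system memory (approx.)

def spark_managed_memory(executor_memory_bytes, memory_fraction=0.6):
    usable = executor_memory_bytes - RESERVED
    return int(usable * memory_fraction)

# With a 4 GB executor and the default fraction:
heap = 4 * 1024 ** 3
managed = spark_managed_memory(heap)     # execution + storage memory
unmanaged = heap - managed               # headroom for everything else
```

If the unmanaged headroom is too small for your untracked buffers, raise spark.executor.memory or lower spark.memory.fraction rather than letting the executor hit container limits.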
36
Whole Stage Code Generation
Performance Tip - WholeStage codegen
Tune spark.sql.codegen.hugeMethodLimit to avoid big methods (> 8K bytecode) that can’t be compiled by the JIT compiler.
38
Data Sources
• Spark separates computation and storage.
• Complete data pipeline:
• External storage feeds data to Spark.
• Spark processes the data.
• The data source can be a bottleneck if Spark processes data very fast.
39
Scan Vectorization
• More efficient to read columnar data with vectorization.
• More likely for JVM to generate SIMD instructions.
• ……
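The benefit of columnar, vectorized reading can be illustrated with a toy example (assumption: this is a conceptual analogy, not Spark's vectorized reader): operating on a whole column runs one tight loop over a flat buffer of values instead of touching a field of every row object.

```python
# Row-wise vs. column-wise aggregation over the same data.
from array import array

def sum_row_wise(rows):
    total = 0
    for row in rows:                 # one dict lookup per row
        total += row["price"]
    return total

def sum_column_wise(price_column):
    return sum(price_column)         # one tight loop over a flat buffer

rows = [{"id": i, "price": i % 7} for i in range(1000)]
col = array("l", (r["price"] for r in rows))   # columnar layout

assert sum_row_wise(rows) == sum_column_wise(col)
```

The flat, homogeneous buffer is also what makes it more likely for a JIT (or the JVM, as the slide notes) to emit SIMD instructions.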
Partitioning and Bucketing
• A special file system layout for data skipping and pre-shuffling.
• Can speed up queries a lot by avoiding unnecessary IO and shuffle.
• The summit talk: http://dbricks.co/2oG6ZBL
Performance Tips
• Pick data sources that support vectorized reading (Parquet, ORC).
• For file-based data sources, create partitions/buckets if possible.
42
43
Yet challenges still remain
Raw Data → Data Lake → Insight
Reliability & Performance Problems
• Performance degradation at scale for advanced analytics
• Stale and unreliable data slows analytic decisions
Reliability and Complexity
• Data corruption issues and broken pipelines
• Complex workarounds: tedious scheduling and multiple jobs/staging tables
• Many use cases require updates to existing data, which is not supported by Spark / data lakes
Workloads: big data pipelines / ETL, multiple data sources, batch & streaming data, machine learning / AI, real-time / streaming analytics, (complex) SQL analytics
Streaming magnifies these challenges
44
Databricks Delta addresses these challenges
Raw Data → Insight
Workloads: big data pipelines / ETL, multiple data sources, batch & streaming data, machine learning / AI, real-time / streaming analytics, (complex) SQL analytics
DATABRICKS DELTA
Builds on a Cloud Data Lake
Reliability & Automation
• Transactional guarantees eliminate complexity
• Schema enforcement to ensure clean data
• Upserts/updates/deletes to manage data changes
• Seamless support for streaming and batch
Performance & Reliability
• Automatic indexing & caching
• Fresh data for advanced analytics
• Automated performance tuning
Thank you
Xiao Li (lixiao@databricks.com)
Wenchen Fan (wenchen@databricks.com)
45