Understanding and Improving Code Generation
Michael Chen
Agenda
Spark SQL Basics
Volcano Iterator Model
Whole-Stage Code Generation
Problems
Splitting Generated Code
Performance
Spark SQL
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
Volcano Iterator Model
▪ Each operator can be thought of as an iterator
▪ Simple to compose arbitrary operators
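As a rough illustration of the two points above, the sketch below models each operator as an object with a `next()` method and composes three of them into a plan. The class names (`Scan`, `Filter`, `Project`), the `dict`-based row, and the `collect` helper are assumptions made for this example, written in Python for brevity; they are not Spark's actual operator classes.

```python
from typing import Callable, Iterable, Iterator, Optional

Row = dict  # hypothetical row representation for this sketch


class Operator:
    """Common interface: every operator hands out one row per next() call."""

    def next(self) -> Optional[Row]:
        raise NotImplementedError


class Scan(Operator):
    def __init__(self, rows: Iterable[Row]):
        self._it: Iterator[Row] = iter(rows)

    def next(self) -> Optional[Row]:
        return next(self._it, None)  # None signals end of input


class Filter(Operator):
    def __init__(self, child: Operator, predicate: Callable[[Row], bool]):
        self.child, self.predicate = child, predicate

    def next(self) -> Optional[Row]:
        # Pull from the child until a row passes the predicate.
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None


class Project(Operator):
    def __init__(self, child: Operator, fn: Callable[[Row], Row]):
        self.child, self.fn = child, fn

    def next(self) -> Optional[Row]:
        row = self.child.next()
        return None if row is None else self.fn(row)


def collect(op: Operator) -> list:
    """Drain an operator by calling next() until it is exhausted."""
    out = []
    while (row := op.next()) is not None:
        out.append(row)
    return out


# Operators compose freely because they all share the same interface.
plan = Project(
    Filter(Scan({"id": i} for i in range(6)), lambda r: r["id"] % 2 == 0),
    lambda r: {"doubled": r["id"] * 2},
)
result = collect(plan)
```

Every row that reaches the output passes through one `next()` call per operator, which is the per-row overhead that code generation tries to remove.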
Volcano Iterator Model
https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Interpreted Evaluation
Expression Code Generation
Whole Stage Code Generation
Whole-Stage Code Generation
▪ Inspired by Thomas Neumann’s paper: Efficiently Compiling Efficient Query Plans for Modern Hardware
▪ Collapse entire query into a single operator
▪ Generate one function for the entire query
Whole-Stage Code Generation
▪ Fewer virtual function calls
▪ Intermediate data kept in CPU registers
▪ Compiler optimizations
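To make these benefits concrete, here is a hand-written sketch of the kind of single function a code generator might produce for a scan, a filter on even ids, and a projection that doubles them. The function name `fused_query` and the query itself are assumptions for illustration. Python has no registers or JIT of the kind the slide refers to, so the sketch only shows the shape of the fused code: one loop, no per-row calls between operators, and intermediate values held in local variables.

```python
def fused_query(ids):
    """One loop equivalent to scan -> filter(id % 2 == 0) -> project(id * 2)."""
    out = []
    for id_ in ids:          # scan
        if id_ % 2 != 0:     # filter, inlined into the loop
            continue
        doubled = id_ * 2    # project, held in a local variable
        out.append(doubled)
    return out


fused_result = fused_query(range(6))
```

In the generated Java that Spark emits, a loop of this shape gives the JIT compiler one tight body to optimize instead of a chain of virtual calls.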
Whole-Stage Code Generation
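▪ The produce method is called down through the children until a producer operator is reached
▪ The producer then calls consume on its parents, which generate the code for their own logic
▪ For more detail, see the talk A Deep Dive into Query Execution Engine of Spark SQL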
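The produce/consume handshake can be sketched with a toy code generator. Everything here is a simplified assumption: the class names, the string templates, and the fact that it emits Python source rather than the Java that Spark generates. The flow it shows is the one described above: `produce` is delegated downward to the scan, which emits the loop and then calls `consume` on its parent so each operator can append its own logic to the loop body.

```python
class Op:
    def __init__(self, child=None):
        self.child = child
        self.parent = None
        if child is not None:
            child.parent = self

    def produce(self) -> str:
        # Non-producer operators just delegate downward to their child.
        return self.child.produce()

    def consume(self, var: str, indent: str) -> str:
        raise NotImplementedError


class Scan(Op):
    def produce(self) -> str:
        # The producer emits the loop, then asks its parent for the loop body.
        body = self.parent.consume("x", "        ")
        return ("def run(rows):\n    out = []\n    for x in rows:\n"
                + body + "    return out\n")


class Filter(Op):
    def __init__(self, child, cond_template):
        super().__init__(child)
        self.cond_template = cond_template

    def consume(self, var, indent):
        inner = self.parent.consume(var, indent + "    ")
        return f"{indent}if {self.cond_template.format(var)}:\n{inner}"


class Project(Op):
    def __init__(self, child, expr_template):
        super().__init__(child)
        self.expr_template = expr_template

    def consume(self, var, indent):
        new_var = var + "_p"  # output variable handed to the parent
        code = f"{indent}{new_var} = {self.expr_template.format(var)}\n"
        return code + self.parent.consume(new_var, indent)


class Collect(Op):
    def consume(self, var, indent):
        return f"{indent}out.append({var})\n"


root = Collect(Project(Filter(Scan(), "{} % 2 == 0"), "{} * 2"))
source = root.produce()   # a single generated function for the whole plan
namespace = {}
exec(source, namespace)   # compile the generated source
generated_result = namespace["run"](range(6))
```

Printing `source` shows one `run` function containing the loop, the `if`, and the projection inline, with no operator objects left at run time.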
Problems
Whole-Stage Code Generation Problems
▪ Customers map external data to internal representations using case statements
▪ An accounting use case for validating input produced case expressions with thousands of when branches
▪ The generated functions exceed 1 million lines of code
Whole-Stage Code Generation Problems
▪ The JVM limits the bytecode of a single method to 64 KB
▪ By default, HotSpot does not JIT-compile methods whose bytecode exceeds 8 KB, so they run interpreted
▪ The compiler can throw out-of-memory (OOM) errors on very large methods
Large Generated Code Solutions
▪ Split large functions into many small functions
▪ Implemented in expression code generation
▪ Whole-stage code generation can fall back to the Volcano iterator model
Splitting Code Generated Functions
Splitting Expression Code Generation
▪ Each operator can be thought of as an iterator
▪ The result of every operator shares a common interface
▪ The only input to a generated function is the output of next
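Because the row is the only input, a large expression can be cut into helper functions that all take the same single parameter. The sketch below generates source for a case expression with 3000 branches, split into helpers of 100 branches each. The generator `generate_case`, the `row['code']` lookup, the chunk size, and the `(matched, value)` return convention are all assumptions for illustration, and the output is Python source rather than the Java that Spark emits.

```python
def generate_case(branches, default, chunk_size):
    """Emit source for CASE WHEN row['code'] == key THEN value ... ELSE default.

    The branches are split into helper functions of at most chunk_size
    branches each; every helper takes only the row as input.
    """
    lines = []
    helper_names = []
    for start in range(0, len(branches), chunk_size):
        name = f"case_chunk_{start // chunk_size}"
        helper_names.append(name)
        lines.append(f"def {name}(row):")
        for key, value in branches[start:start + chunk_size]:
            lines.append(f"    if row['code'] == {key!r}:")
            lines.append(f"        return True, {value!r}")
        lines.append("    return False, None")
        lines.append("")
    # The top-level function only chains the helpers, so it stays small.
    lines.append("def eval_case(row):")
    for name in helper_names:
        lines.append(f"    matched, value = {name}(row)")
        lines.append("    if matched:")
        lines.append("        return value")
    lines.append(f"    return {default!r}")
    return "\n".join(lines) + "\n"


branches = [(i, f"internal_{i}") for i in range(3000)]
case_source = generate_case(branches, "unknown", chunk_size=100)
case_namespace = {}
exec(case_source, case_namespace)
eval_case = case_namespace["eval_case"]
```

No helper needs anything beyond `row`, so the split points can be chosen purely by size.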
Splitting Whole-Stage Code Generation
▪ Before calling parent’s consume, store output variables
Tracking Whole-Stage Code Generation Inputs
▪ Evaluated variables from current or child expressions
▪ Rows referenced by current or child expressions
▪ Eliminated subexpressions in child expressions
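In whole-stage generated code, values live in local variables of the fused function, so a split-out helper has to receive everything it reads as parameters. The sketch below illustrates only the first kind of input, already-evaluated variables; rows and eliminated subexpressions would need the same treatment. The helpers `referenced_inputs` and `split_out`, the variable names, and the use of Python's `ast` module to find references are assumptions for this example, not how Spark tracks inputs.

```python
import ast


def referenced_inputs(expr_source, live_vars):
    """Return the live variables that an expression reads, in sorted order."""
    names = {node.id
             for node in ast.walk(ast.parse(expr_source, mode="eval"))
             if isinstance(node, ast.Name)}
    return sorted(names & set(live_vars))


def split_out(name, expr_source, live_vars):
    """Generate a helper whose parameters are exactly the tracked inputs."""
    params = ", ".join(referenced_inputs(expr_source, live_vars))
    helper = f"def {name}({params}):\n    return {expr_source}\n"
    call = f"{name}({params})"  # what the fused function emits in its place
    return helper, call


# Variables assumed to be already evaluated inside the fused loop.
live = ["price", "qty", "region"]
helper_src, call_src = split_out(
    "amount_expr", "price * qty if qty > 0 else 0", live)

helper_namespace = {}
exec(helper_src, helper_namespace)
amount = eval(call_src, helper_namespace, {"price": 3, "qty": 4})
```

Here `region` is live but unused by the expression, so it is left out of the helper's parameter list.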
Split Case Expression Function
Performance
Performance Setup
▪ 1 driver: 12 GB memory, 1 core
▪ 3 executors: 120 GB memory, 28 cores
▪ 50 million input rows
Performance
▪ Project case expression with 3000 branches
