Understanding and Improving Code Generation

The document discusses Spark SQL's code generation techniques, focusing on whole-stage code generation inspired by Thomas Neumann's paper, which improves performance by collapsing queries into single operators. It highlights challenges related to large generated code and Java method size limitations, proposing solutions like splitting large functions. Additionally, the document presents performance setup and results from a case expression project with extensive branches.

2. Understanding and Improving Code Generation Michael Chen

3. Agenda
▪ Spark SQL Basics
▪ Volcano Iterator Model
▪ Whole-Stage Code Generation
▪ Problems
▪ Splitting Generated Code
▪ Performance

4. Spark SQL https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html

5. Volcano Iterator Model

6. Volcano Iterator Model
▪ Each operator can be thought of as an iterator
▪ Simple to compose arbitrary operators
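The iterator idea on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not Spark's implementation: the class names (`Operator`, `Scan`, `Filter`, `Project`), the `dict` row representation, and the sample query are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, Optional

Row = dict  # hypothetical row representation for this sketch


class Operator:
    """Common interface: every operator exposes next(), returning a row or None."""

    def next(self) -> Optional[Row]:
        raise NotImplementedError


class Scan(Operator):
    def __init__(self, rows: Iterable[Row]):
        self._it: Iterator[Row] = iter(rows)

    def next(self) -> Optional[Row]:
        return next(self._it, None)


class Filter(Operator):
    def __init__(self, child: Operator, predicate: Callable[[Row], bool]):
        self.child, self.predicate = child, predicate

    def next(self) -> Optional[Row]:
        # Pull from the child until a row passes the predicate or input ends.
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None


class Project(Operator):
    def __init__(self, child: Operator, fn: Callable[[Row], Row]):
        self.child, self.fn = child, fn

    def next(self) -> Optional[Row]:
        row = self.child.next()
        return None if row is None else self.fn(row)


def collect(op: Operator) -> list:
    """Drain an operator by calling next() until it is exhausted."""
    out = []
    while (row := op.next()) is not None:
        out.append(row)
    return out


# Operators compose freely because they all share the same interface.
rows = [{"id": i, "amount": i * 10} for i in range(5)]
plan = Project(Filter(Scan(rows), lambda r: r["amount"] >= 20),
               lambda r: {"id": r["id"]})
result = collect(plan)
```

Every row passes through one `next()` call per operator, which is the per-row overhead that the later slides on code generation aim to remove.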

7. Volcano Iterator Model https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

8. Interpreted Evaluation

9. Expression Code Generation

10. Whole Stage Code Generation

11. Whole-Stage Code Generation
▪ Inspired by Thomas Neumann’s paper, “Efficiently Compiling Efficient Query Plans for Modern Hardware”
▪ Collapse entire query into a single operator
▪ Generate one function for the entire query
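To make "one function for the entire query" concrete, here is a hand-written Python stand-in for what a collapsed scan, filter, and project pipeline might look like. Spark emits Java rather than Python, and the function name, column names, and threshold below are assumptions chosen for illustration.

```python
def fused_query(rows):
    """Scan -> Filter(amount >= 20) -> Project(id), collapsed into one loop."""
    out = []
    for row in rows:                        # scan
        amount = row["amount"]              # value held in a local variable
        if amount >= 20:                    # filter, inlined
            out.append({"id": row["id"]})   # project, inlined
    return out


rows = [{"id": i, "amount": i * 10} for i in range(5)]
result = fused_query(rows)
```

There are no per-operator calls inside the loop, so intermediate values stay in local variables instead of being handed from one operator object to the next.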

12. Whole-Stage Code Generation
▪ Fewer virtual function calls
▪ Data placed in CPU registers
▪ Compiler optimizations

13. Whole-Stage Code Generation
▪ The produce method is called on children until a producer operator is reached
▪ Each operator calls consume on its parent to generate the code for the parent's logic
▪ A Deep Dive into Query Execution of Spark SQL
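The produce/consume handshake described on this slide can be sketched as a toy code generator. This is a simplified illustration that emits Python source, not Spark's Java codegen; the class names (`ScanGen`, `FilterGen`, `ProjectGen`, `CollectGen`) and the helper `compile_plan` are hypothetical.

```python
def indent(lines):
    return ["    " + line for line in lines]


class CodegenOperator:
    def __init__(self, child=None):
        self.child = child
        self.parent = None
        if child is not None:
            child.parent = self

    def produce(self):
        # Non-producer operators simply forward produce() to their child.
        return self.child.produce()

    def consume(self, row_var):
        raise NotImplementedError


class ScanGen(CodegenOperator):
    def produce(self):
        # Producer: emits the loop, then asks the parent to fill in the body.
        return ["for row in rows:"] + indent(self.parent.consume("row"))


class FilterGen(CodegenOperator):
    def __init__(self, child, condition):
        super().__init__(child)
        self.condition = condition  # source template, e.g. "{row}['amount'] >= 20"

    def consume(self, row_var):
        body = self.parent.consume(row_var)
        return [f"if {self.condition.format(row=row_var)}:"] + indent(body)


class ProjectGen(CodegenOperator):
    def __init__(self, child, column):
        super().__init__(child)
        self.column = column

    def consume(self, row_var):
        line = f"projected = {{{self.column!r}: {row_var}[{self.column!r}]}}"
        return [line] + self.parent.consume("projected")


class CollectGen(CodegenOperator):
    def consume(self, row_var):
        return [f"out.append({row_var})"]


def compile_plan(root):
    """Generate one function for the whole plan and compile it with exec()."""
    src = "\n".join(["def generated(rows):", "    out = []"]
                    + indent(root.produce())
                    + ["    return out"])
    namespace = {}
    exec(src, namespace)
    return namespace["generated"], src


plan = CollectGen(ProjectGen(FilterGen(ScanGen(), "{row}['amount'] >= 20"), "id"))
generated, source = compile_plan(plan)
result = generated([{"id": i, "amount": i * 10} for i in range(5)])
```

Calling `produce()` on the root walks down to the scan, and each `consume()` call walks back up, so the operators' logic ends up nested inside a single loop in `source`.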

14. Whole-Stage Code Generation

15. Whole-Stage Code Generation

16. Whole-Stage Code Generation Problems

17. Whole-Stage Code Generation Problems
▪ Customers map external data to internal representations using case statements
▪ Accounting use case for validating input => case expressions with thousands of when branches
▪ Generated functions exceed 1 million lines of code

18. Whole-Stage Code Generation Problems

19. Whole-Stage Code Generation Problems
▪ The JVM limits the bytecode of a single method to 64KB
▪ By default, HotSpot does not JIT-compile methods whose bytecode exceeds 8KB
▪ The compiler can throw out-of-memory (OOM) errors on large methods

20. Large Generated Code Solutions
▪ Split large functions into many small functions
▪ Implemented in expression code generation
▪ Whole-stage code generation can fall back to the Volcano iterator model
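Splitting one large generated function into many small ones can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Spark's Java code generator: `generate_case`, the `case_part_N` helper names, the `(matched, value)` return convention, and the `EXT_i` keys are all hypothetical.

```python
def generate_case(branches, default, chunk_size):
    """Emit source for case_expr(x), i.e. CASE WHEN x = key THEN value ... END.

    The branches are spread over helper functions holding at most
    chunk_size branches each, so no single generated function grows
    with the total number of branches.
    """
    chunks = [branches[i:i + chunk_size]
              for i in range(0, len(branches), chunk_size)]
    lines = []
    for n, chunk in enumerate(chunks):
        lines.append(f"def case_part_{n}(x):")
        for key, value in chunk:
            lines.append(f"    if x == {key!r}:")
            lines.append(f"        return True, {value!r}")
        lines.append("    return False, None")
        lines.append("")
    # The driver only dispatches to the helpers, in branch order.
    lines.append("def case_expr(x):")
    for n in range(len(chunks)):
        lines.append(f"    matched, value = case_part_{n}(x)")
        lines.append("    if matched:")
        lines.append("        return value")
    lines.append(f"    return {default!r}")
    return "\n".join(lines)


branches = [(f"EXT_{i}", i) for i in range(3000)]
source = generate_case(branches, default=-1, chunk_size=100)
namespace = {}
exec(source, namespace)
case_expr = namespace["case_expr"]
```

With 3000 branches and a chunk size of 100, the generator emits 30 helpers plus a short driver, each small enough to stay under a per-method size limit in a setting where one applies.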

21. Splitting Code Generated Functions

22. Splitting Expression Code Generation
▪ Each operator can be thought of as an iterator
▪ The results of every operator share a common interface
▪ The only input to a generated function is the output of next

23. Splitting Expression Code Generation

24. Splitting Expression Code Generation

25. Splitting Whole-Stage Code Generation
▪ Before calling parent’s consume, store output variables

26. Splitting Whole-Stage Code Generation

27. Tracking Whole-Stage Code Generation Inputs
▪ Evaluated variables from current or child expressions
▪ Rows referenced by current or child expressions
▪ Eliminated subexpressions in child expressions
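Inputs need tracking because code moved into its own function can no longer see the local variables of the function it came from, so everything it reads must be passed as a parameter. The sketch below illustrates that idea in Python using the standard `ast` module. It is an assumption-laden toy, not Spark's mechanism: `free_variables` and `split_into_function` are hypothetical helpers, and the snippet is assumed to be straight-line code.

```python
import ast
import builtins


def free_variables(snippet):
    """Names the snippet reads before assigning them, in order of first use.

    Assumes straight-line code (no branches or loops inside the snippet).
    """
    assigned, free = set(), []
    for stmt in ast.parse(snippet).body:
        names = sorted((n for n in ast.walk(stmt) if isinstance(n, ast.Name)),
                       key=lambda n: (n.lineno, n.col_offset))
        for n in names:
            if (isinstance(n.ctx, ast.Load) and n.id not in assigned
                    and n.id not in free and not hasattr(builtins, n.id)):
                free.append(n.id)
        # Assignments take effect only after the statement's reads.
        assigned.update(n.id for n in names if isinstance(n.ctx, ast.Store))
    return free


def split_into_function(name, snippet, result_var):
    """Wrap a snippet in a function whose parameters are its tracked inputs."""
    params = free_variables(snippet)
    body = "\n".join("    " + line for line in snippet.splitlines())
    src = f"def {name}({', '.join(params)}):\n{body}\n    return {result_var}\n"
    return src, params


snippet = "total = amount * rate\nlabel = code_prefix + str(total)"
source, params = split_into_function("split_0", snippet, "label")
namespace = {}
exec(source, namespace)
value = namespace["split_0"](10, 3, "A")
```

Here `amount`, `rate`, and `code_prefix` play the role of already-evaluated variables that the split function receives as arguments, while `total` is computed inside it and needs no parameter.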

28. Split Case Expression Function

29. Performance

30. Performance Setup
▪ 1 driver
  ▪ 12GB memory
  ▪ 1 core
▪ 3 executors
  ▪ 120GB memory
  ▪ 28 cores
▪ 50 million input rows

31. Performance
▪ Project case expression with 3000 branches
