Materialized Column: An Efficient Way to Optimize Queries on Nested Columns


The document discusses the challenges of using nested data types in Spark SQL at ByteDance and presents materialized columns as a solution to optimize query performance. It outlines the limitations of nested types, such as unnecessary data reads and computational inefficiencies, and compares various potential solutions including separate tables and vectorized reads. Materialized columns are highlighted for significantly improving query performance and reducing data read times.


Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Guo, Jun (jason.guo.vip@gmail.com)
Lead of Data Engine Team, @ByteDance

Who we are
o Data Engine team of ByteDance
o Build a one-stop OLAP platform on which users can analyze PB-level data by writing SQL without caring about the underlying execution engine

What we do
o Manage Spark SQL / Presto / Hive workloads
o Offer an Open API and a self-serve platform
o Optimize the Spark SQL / Presto / Hive engines
o Design the data architecture for most business lines in ByteDance

Agenda
▪ Spark SQL at ByteDance
▪ Why nested types are widely used
▪ What are the main issues of nested types
▪ Optional solutions
▪ How does Materialized Column solve these problems

Spark SQL at ByteDance
Timeline, 2016 to 2020, in order:
▪ Small-scale experiments
▪ Ad-hoc workload
▪ A few ETL pipelines in production
▪ Full production deployment
▪ Main engine in the DW area

Why nested types are widely used
▪ Event log
  ▪ A lot of new tracking events are created every day
  ▪ It is not a good idea to create a new column for each new type of event
▪ Dimension
  ▪ Dimension tables are dumped from the MySQL databases of service backends
  ▪ A service backend may add new fields on demand. These fields may not be useful now, but they may be in the future

Main issues of nested types
▪ Unnecessary data is read, which wastes IO
▪ Vectorized read cannot be exploited when a nested-type column is read
▪ Filter pushdown cannot be used when a nested column is read
▪ Duplicated computation, e.g. JSON parsing, which is CPU-intensive
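
A toy illustration of the first and last points, under heavy simplification: rows are plain Python dicts, the nested column is stored as a JSON string, and "bytes read" is approximated by string length. This does not model Parquet or Spark; the sample rows and the helper names `ages_from_nested` and `ages_from_flat` are hypothetical.

```python
import json

# Hypothetical rows: the nested column is stored as a JSON string.
rows = [
    {"item": "apple", "people": json.dumps({"age": "18", "name": "jack", "gender": "male"})},
    {"item": "pear", "people": json.dumps({"age": "25", "name": "rose", "gender": "female"})},
]


def ages_from_nested(rows):
    """Read and parse the whole nested value just to get one sub-field."""
    bytes_read = 0
    ages = []
    for row in rows:
        raw = row["people"]
        bytes_read += len(raw)                    # the entire document is read
        ages.append(int(json.loads(raw)["age"]))  # and parsed again on every query
    return ages, bytes_read


def ages_from_flat(flat_ages):
    """Read a dedicated integer column: no parsing, only the needed values."""
    bytes_read = sum(len(str(age)) for age in flat_ages)
    return list(flat_ages), bytes_read


nested_ages, nested_bytes = ages_from_nested(rows)
flat_ages, flat_bytes = ages_from_flat(nested_ages)
print(nested_ages, nested_bytes, flat_bytes)
```

Both paths return the same ages, but the nested path touches every byte of every document and repeats the parse for each query that needs the field.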

Optional solutions – A separate table
▪ DW users design their own solution to these problems
▪ Maintain a new table that adds columns extracted from the nested columns
▪ Downstream users query this new table and its new columns for better performance

Optional solutions – A separate table
▪ Pros
  ▪ Queries run on simple types, so all of the problems above are solved
▪ Cons
  ▪ All downstream users must be pushed to migrate their queries / pipelines to the new table and new columns
  ▪ Duplicated storage and computation cost
  ▪ Cannot handle frequent changes to sub-fields

Optional solutions – Vectorized Read on Nested Column
▪ Refactor the Parquet vectorized reader to support vectorized read for nested types
▪ Support predicate pushdown for struct

Optional solutions – Vectorized Read on Nested Column
▪ Pros
  ▪ Enables vectorized read without any storage overhead
▪ Cons
  ▪ The vectorized readers for Parquet and ORC must each be refactored
  ▪ Filter pushdown for Array/Map is still not available
  ▪ Vectorized read on nested types does not perform as well as on simple types
▪ Improves performance with struct by about 100%
▪ Improves performance with map by about 163%

How does Materialized Column solve these problems

Original table:
CREATE TABLE base_table (
  item STRING,
  count INT,
  people MAP<STRING, STRING>,
  date STRING
)
USING parquet
PARTITIONED BY (date);

Add materialized column:
ALTER TABLE base_table ADD COLUMNS (
  age INT MATERIALIZED CAST(people['age'] AS INTEGER)
);

How does Materialized Column solve these problems
Write with materialized column
explain extended insert into base_table partition(date='20201010') select 'appole', 1,
map('age','18','name','jack','gender','male')
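
The EXPLAIN output for this insert is not preserved in the transcript. One plausible reading is that the engine evaluates each materialized expression at write time and stores the result next to the original columns. The sketch below models only that idea in plain Python; `MATERIALIZED_COLUMNS`, `cast_to_int`, and `insert_row` are hypothetical names, not part of Spark or of the implementation described in the talk.

```python
def cast_to_int(value):
    """Mimic a lenient SQL CAST: a missing or non-numeric value becomes None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Hypothetical registry: materialized column name -> expression over the row.
MATERIALIZED_COLUMNS = {
    "age": lambda row: cast_to_int(row["people"].get("age")),
}


def insert_row(table, row):
    """Append a row, filling every materialized column from its expression."""
    stored = dict(row)
    for name, expression in MATERIALIZED_COLUMNS.items():
        stored[name] = expression(row)
    table.append(stored)
    return stored


base_table = []
stored = insert_row(base_table, {
    "item": "appole",
    "count": 1,
    "people": {"age": "18", "name": "jack", "gender": "male"},
    "date": "20201010",
})
print(stored["age"])
```

The writer supplies only the original columns, as in the INSERT statement above; the `age` value is derived, so existing pipelines would not need to change under this reading.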

How does Materialized Column solve these problems
[Two figures: query without materialized column rewrite; query with materialized column rewrite]
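
The query plans from this slide are not preserved in the transcript. As a rough sketch of what such a rewrite could look like, the code below represents expressions as nested tuples and replaces any sub-expression that matches a materialized column's definition with a reference to that column. The tuple encoding and the names `MATERIALIZED`, `rewrite`, and `columns_read` are assumptions made for illustration; this is not Catalyst code.

```python
# Expressions as nested tuples, e.g. ("cast", ("get", "people", "age"), "int").
# Hypothetical catalog: defining expression -> materialized column name.
MATERIALIZED = {
    ("cast", ("get", "people", "age"), "int"): "age",
}


def rewrite(expr):
    """Replace sub-expressions that match a materialized column definition."""
    if expr in MATERIALIZED:
        return ("col", MATERIALIZED[expr])
    if isinstance(expr, tuple):
        return tuple(rewrite(part) for part in expr)
    return expr


def columns_read(expr):
    """Collect the top-level columns an expression needs to read."""
    if isinstance(expr, tuple):
        if expr[0] in ("col", "get"):
            return {expr[1]}
        return set().union(*(columns_read(part) for part in expr[1:]))
    return set()


# WHERE CAST(people['age'] AS INTEGER) > 18
predicate = (">", ("cast", ("get", "people", "age"), "int"), 18)
rewritten = rewrite(predicate)
print(rewritten, columns_read(predicate), columns_read(rewritten))
```

After the rewrite the predicate refers only to the simple `age` column, so the `people` map no longer has to be read for it, and the query text itself stays unchanged.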

How does Materialized Column solve these problems
Test case     Without Materialized Column rewrite   With Materialized Column rewrite   Performance   Read data size
SQL_adhoc_1   6.3 min / 797.6 GB                    3.4 min / 111.8 GB                 85.3% ↑       86% ↓
SQL_adhoc_2   16.5 min / 3.2 TB                     5.0 min / 111.1 GB                 230% ↑        96.6% ↓
SQL_etl_1     24 min / 3.7 TB                       9.1 min / 686.1 GB                 130.8% ↑      82% ↓
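
The slides do not state how the Performance and Read data size columns are derived. A likely reading is speedup = before / after - 1 for run time and reduction = 1 - after / before for data read. The sketch below applies those formulas to the values as transcribed; the formulas and the 1 TB = 1024 GB conversion are assumptions. With them, the first two Performance figures (85.3% and 230%) and all three Read data size figures are reproduced to within rounding, while SQL_etl_1 comes out at about 164% rather than the 130.8% shown, so one of the transcribed values in that row may differ from the original slide.

```python
GB_PER_TB = 1024  # assumption: binary terabytes


def speedup_percent(before_min, after_min):
    """Relative speedup in percent, assuming before / after - 1."""
    return (before_min / after_min - 1) * 100


def read_reduction_percent(before_gb, after_gb):
    """Relative reduction in data read, in percent."""
    return (1 - after_gb / before_gb) * 100


# (minutes before, GB before, minutes after, GB after), as transcribed.
cases = {
    "SQL_adhoc_1": (6.3, 797.6, 3.4, 111.8),
    "SQL_adhoc_2": (16.5, 3.2 * GB_PER_TB, 5.0, 111.1),
    "SQL_etl_1": (24, 3.7 * GB_PER_TB, 9.1, 686.1),
}

for name, (t_before, gb_before, t_after, gb_after) in cases.items():
    print(
        f"{name}: {speedup_percent(t_before, t_after):.1f}% faster, "
        f"{read_reduction_percent(gb_before, gb_after):.1f}% less data read"
    )
```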

