Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Guo, Jun (jason.guo.vip@gmail.com)
Lead of Data Engine Team, @ByteDance
Who we are
o Data Engine team of ByteDance
o Build a one-stop OLAP platform on which users can analyze PB-level data by writing SQL, without caring about the underlying execution engine
What we do
o Manage Spark SQL / Presto / Hive workloads
o Offer an Open API and a self-serve platform
o Optimize the Spark SQL / Presto / Hive engines
o Design the data architecture for most business lines in ByteDance
Agenda
▪ Spark SQL at ByteDance
▪ Why nested types are widely used
▪ Main issues with nested types
▪ Optional solutions
▪ How does Materialized Column solve these problems
Spark SQL at ByteDance
Timeline, 2016 to 2020:
▪ Small-scale experiments
▪ Ad-hoc workload
▪ Few ETL pipelines in production
▪ Full-production deployment
▪ Main engine in the DW area
Why nested types are widely used
▪ Event log
  ▪ A lot of new tracking events are created every day
  ▪ It is not a good idea to create a new column for each new type of event
▪ Dimension
  ▪ Dimension tables are dumped from the MySQL databases of the service backend
  ▪ The service backend may add new fields on demand. These fields may not be helpful now, but they may be useful in the future
Main issues with nested types
▪ Unnecessary data is read, which wastes IO
▪ Vectorized read cannot be exploited when a nested-type column is read
▪ Filter pushdown cannot be utilized when a nested column is read
▪ Duplicated computation, e.g. JSON parsing is CPU-intensive
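The pushdown issue can be illustrated with a toy model. The sketch below is plain Python, not Spark or Parquet code: `RowGroup`, `scan_nested`, and `scan_flat` are hypothetical names, loosely modeled on the per-row-group min/max statistics that columnar formats keep for simple columns. It assumes a flat `age` column stored next to a `people` map of strings. With the flat column, row groups whose range cannot match are skipped; with the value buried in the map, every row group is read and the cast is repeated per row.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """A toy row group: rows plus min/max statistics for the flat `age` column."""
    rows: List[Dict]
    age_min: int
    age_max: int


def make_row_group(ages: List[int]) -> RowGroup:
    # Each row keeps the value twice: inside the nested map (as a string)
    # and as a flat integer column.
    rows = [{"people": {"age": str(a), "name": f"user{a}"}, "age": a} for a in ages]
    return RowGroup(rows=rows, age_min=min(ages), age_max=max(ages))


def scan_nested(groups: List[RowGroup], threshold: int) -> Tuple[int, List[int]]:
    """Filter on people['age']: no statistics apply, so every group is read."""
    groups_read, matches = 0, []
    for g in groups:
        groups_read += 1
        for row in g.rows:
            if int(row["people"]["age"]) > threshold:  # cast repeated for every row
                matches.append(row["age"])
    return groups_read, matches


def scan_flat(groups: List[RowGroup], threshold: int) -> Tuple[int, List[int]]:
    """Filter on flat `age`: skip groups whose max cannot satisfy age > threshold."""
    groups_read, matches = 0, []
    for g in groups:
        if g.age_max <= threshold:
            continue  # pruned using statistics alone, no rows touched
        groups_read += 1
        matches.extend(row["age"] for row in g.rows if row["age"] > threshold)
    return groups_read, matches


groups = [
    make_row_group([10, 12, 15]),
    make_row_group([20, 25, 30]),
    make_row_group([40, 41, 45]),
]
nested_result = scan_nested(groups, 35)
flat_result = scan_flat(groups, 35)
print(nested_result)  # (3, [40, 41, 45])
print(flat_result)    # (1, [40, 41, 45])
```

Both scans return the same rows, but the flat scan touches one row group instead of three. Real readers are considerably more involved; the point here is only where the statistics can and cannot be applied.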
Optional solutions
Optional solutions – A separate table
▪ DW users design their own solution to these problems
▪ Maintain a new table that adds columns extracted from the nested columns
▪ Downstream users should query this new table and its new columns for better performance
▪ Pros
  ▪ Queries run against simple-typed columns, so all of the problems above are solved
▪ Cons
  ▪ All downstream users must be pushed to migrate their queries / pipelines to the new table and new columns
  ▪ Duplicated storage and computation cost
  ▪ Cannot keep up with frequent changes to subfields
Optional solutions – Vectorized Read on Nested Column
▪ Refactor the Parquet vectorized reader to support vectorized reads of nested types
▪ Support predicate pushdown for struct
▪ Pros
  ▪ Enables vectorized read without any storage overhead
▪ Cons
  ▪ The vectorized readers for Parquet and ORC each need to be refactored separately
  ▪ Filter pushdown for Array/Map is still not available
  ▪ Vectorized read on nested types does not perform as well as on simple types
▪ Improves performance on struct by about 100%
▪ Improves performance on map by about 163%
How does Materialized Column solve these problems
Original table:
CREATE TABLE base_table (
  item STRING,
  count INT,
  people MAP<STRING, STRING>,
  date STRING
)
USING parquet
PARTITIONED BY (date);
Add materialized column:
ALTER TABLE base_table ADD COLUMNS
(
  age INT MATERIALIZED CAST(people['age'] AS INTEGER)
);
Write with materialized column
EXPLAIN EXTENDED INSERT INTO base_table PARTITION (date='20201010')
SELECT 'appole', 1, map('age', '18', 'name', 'jack', 'gender', 'male');
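The write path can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the engine's implementation: `MaterializedColumn`, `Table`, and `cast_age` are hypothetical names, and the partition column is passed as an ordinary value for simplicity. The idea it illustrates is that the writer supplies only the ordinary columns, and the materialized column is filled in from its defining expression at insert time.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class MaterializedColumn:
    name: str
    expression: Callable[[Dict[str, Any]], Any]  # evaluated once per inserted row


@dataclass
class Table:
    columns: List[str]
    materialized: List[MaterializedColumn] = field(default_factory=list)
    rows: List[Dict[str, Any]] = field(default_factory=list)

    def insert(self, *values: Any) -> None:
        # The writer supplies only the ordinary columns ...
        if len(values) != len(self.columns):
            raise ValueError("expected one value per ordinary column")
        row = dict(zip(self.columns, values))
        # ... and each materialized column is computed from its expression.
        for mc in self.materialized:
            row[mc.name] = mc.expression(row)
        self.rows.append(row)


def cast_age(row: Dict[str, Any]) -> Any:
    # Simplified stand-in for CAST(people['age'] AS INTEGER);
    # a missing key yields None here, playing the role of NULL.
    raw = row["people"].get("age")
    return int(raw) if raw is not None else None


base_table = Table(columns=["item", "count", "people", "date"])
base_table.materialized.append(MaterializedColumn("age", cast_age))

base_table.insert("appole", 1, {"age": "18", "name": "jack", "gender": "male"}, "20201010")
print(base_table.rows[0]["age"])  # 18
```

The nested `people` value is stored unchanged, so existing readers keep working; the extra `age` value is the storage cost paid for faster reads.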
Figures: query without materialized column rewrite; query with materialized column rewrite
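The rewrite step can be pictured as a substitution over the query's expression tree. The slides do not spell out the matching rules, so the following is a minimal sketch that assumes exact structural matching; the tuple encoding, the `MATERIALIZED` mapping, and `rewrite` are all hypothetical and only stand in for whatever the optimizer actually does.

```python
from typing import Dict, Tuple, Union

Expr = Union[str, int, Tuple]

# Defining expression of each materialized column, encoded as nested tuples.
# This entry stands for CAST(people['age'] AS INTEGER) -> age.
MATERIALIZED: Dict[Tuple, str] = {
    ("cast", ("map_get", ("col", "people"), "age"), "int"): "age",
}


def rewrite(expr: Expr) -> Expr:
    """Replace any subtree that equals a materialized column's definition."""
    if not isinstance(expr, tuple):
        return expr  # literals and operator names are left untouched
    if expr in MATERIALIZED:
        return ("col", MATERIALIZED[expr])
    return tuple(rewrite(child) for child in expr)


# WHERE CAST(people['age'] AS INTEGER) > 18
predicate = (">", ("cast", ("map_get", ("col", "people"), "age"), "int"), 18)
rewritten = rewrite(predicate)
print(rewritten)  # ('>', ('col', 'age'), 18)
```

After the substitution the predicate refers to a simple integer column, so users do not have to change their queries to benefit, and the reader no longer needs to load the whole `people` map to evaluate the filter.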
Test case | Without Materialized Column rewrite | With Materialized Column rewrite | Performance | Read data size
SQL_adhoc_1 | 6.3 min / 797.6 GB | 3.4 min / 111.8 GB | 85.3% ↑ | 86% ↓
SQL_adhoc_2 | 16.5 min / 3.2 TB | 5.0 min / 111.1 GB | 230% ↑ | 96.6% ↓
SQL_etl_1 | 24 min / 3.7 TB | 9.1 min / 686.1 GB | 130.8% ↑ | 82% ↓
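The percentage columns appear to follow from the run times and read sizes. The sketch below recomputes them, assuming that "Performance" means (time without / time with) - 1, that "Read data size" means 1 - (size with / size without), and that 1 TB = 1024 GB; none of these definitions is stated on the slide. Under those assumptions the first two rows reproduce the reported figures, while SQL_etl_1 works out to roughly 163.7% rather than the reported 130.8%, so one of the values in that row may have been garbled somewhere along the way.

```python
# (test case, minutes without, GB without, minutes with, GB with), as listed above.
# TB values are converted with 1 TB = 1024 GB, which is an assumption.
RESULTS = [
    ("SQL_adhoc_1", 6.3, 797.6, 3.4, 111.8),
    ("SQL_adhoc_2", 16.5, 3.2 * 1024, 5.0, 111.1),
    ("SQL_etl_1", 24.0, 3.7 * 1024, 9.1, 686.1),
]


def speedup_pct(before_min: float, after_min: float) -> float:
    """Performance gain in percent: (before / after - 1) * 100."""
    return (before_min / after_min - 1.0) * 100.0


def read_reduction_pct(before_gb: float, after_gb: float) -> float:
    """Reduction in data read, in percent: (1 - after / before) * 100."""
    return (1.0 - after_gb / before_gb) * 100.0


summary = {
    name: (round(speedup_pct(t0, t1), 1), round(read_reduction_pct(s0, s1), 1))
    for name, t0, s0, t1, s1 in RESULTS
}
for name, (speedup, reduction) in summary.items():
    print(f"{name}: {speedup}% faster, {reduction}% less data read")
```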
