Taro L. Saito, Naoki Takezoe, Yukihiro Okada, Takako Shimamoto, Dongmin Yu, Suprith Chandrashekharachar, Kai Sasaki, Shohei Okumiya, Yan Wang, Takashi Kurihara, Ryu Kobayashi, Keisuke Suzuki, Zhenghong Yang, Makoto Onizuka
DBTest '22 on June 17th
Journey of Migrating Millions of Queries on The Cloud
© 2022 Treasure Data
Treasure Data is an enterprise customer data platform (CDP) on the cloud
[Figure: platform architecture diagram with the labels Data Collection (Logs, Device Data, Batch Data), Cloud Storage (PlazmaDB, Table Schema), Distributed Data Processing (Jobs), Job Management (SQL Editor, Scheduler, Workflows, Machine Learning); legend: Treasure Data OSS, Third Party OSS]
Upgrading a query engine is always tough
• Various customer use cases
• There are many edge cases in queries, in data, and in combinations of both
• General benchmarks and test cases are not enough
• Need to minimize customer frustration caused by upgrades
• Keep backward compatibility as much as possible
• If compatibility must break, notify customers in advance of the incompatible queries and how to fix them
• Active OSS development
• In particular, Trino development is very active
• Monthly or more frequent releases with hundreds of commits
• No stable versions
• But staying on the same version for a long time is also painful
• New features and optimizations are unavailable unless backported
• Backporting gets harder over time
Query simulator
Test using production data and queries with security and safety
[Figure: the query simulator builds a Query Set from the Query Log, replays the queries on a Control Cluster and a Test Cluster (reading from the Real Database, writing to a Test Database), and produces a Report from checksums and query metrics]
• Security: Don’t show customer data and query results
• Safety: Don’t cause any side effects on customer data
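One way to compare results without exposing customer data is to compare checksums of the result sets rather than the rows themselves. The following is a minimal sketch under that assumption; the function names and the choice of MD5 are illustrative and are not taken from the slides.

```python
import hashlib
from typing import Iterable, Sequence


def result_checksum(rows: Iterable[Sequence[object]]) -> str:
    """Return an order-insensitive checksum of a result set (illustrative)."""
    # Hash each row on its own, then sort the row digests so that the row
    # order, which SQL does not guarantee without ORDER BY, does not matter.
    row_digests = sorted(
        hashlib.md5(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    # Combine the sorted digests into one checksum for the whole result.
    return hashlib.md5("".join(row_digests).encode("utf-8")).hexdigest()


def same_result(control_rows, test_rows) -> bool:
    """Compare the control and test cluster results by checksum only."""
    return result_checksum(control_rows) == result_checksum(test_rows)
```

Only the checksums need to leave the clusters, so a report can flag a difference without showing any row values.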
Challenges in query simulation
• Query simulation takes a very long time
• A very large number of queries need to be tested
• Not only time, but also the cost of test clusters
• We need to make query simulation faster
• Result verification is not straightforward
• Many false positives and duplicates
• Result analysis tends to rely on personal knowledge
• We need to make result verification easier
How can we make query simulation faster?
• Reduce the number of queries by clustering by query signature
• Reduce the amount of data by narrowing table scan ranges
• Test only specific queries (by period, running time, query type, etc)
Clustering queries by query signature
Reduces the number of queries to test per day by 90%
Query signature    Corresponding SQL statements
S(T)               SELECT ... FROM ...
S[*](T)            SELECT * FROM ... (select all columns)
G(S(T))            SELECT ... FROM ... GROUP BY
S(LJ(T, T))        SELECT ... FROM ... LEFT JOIN ...
WS[A(a,S(T))]      WITH a AS SELECT ... (define aliases to queries)
O(S(T))            SELECT ... ORDER BY
CT(S(T))           CREATE TABLE AS SELECT ...
I(S(T))            INSERT INTO ... SELECT ...
E(S(T))            SELECT DISTINCT ... FROM ... (duplicate elimination)
U(S(T),S(T))       SELECT ... UNION ALL SELECT ...
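The clustering idea can be sketched as follows. This is a toy, assuming a keyword-based signature for flat, single-table queries only; a real implementation would presumably walk a parsed SQL tree. The function names and the nesting order of combined operators are assumptions, and joins, WITH clauses, UNION, and keywords inside string literals are not handled.

```python
import re
from collections import defaultdict


def toy_signature(sql: str) -> str:
    """Derive a signature for a flat query (illustrative, not a real parser)."""
    text = " ".join(sql.split()).upper()  # normalize whitespace and case
    wrapper = None
    if re.match(r"CREATE TABLE .+? AS SELECT ", text):
        wrapper = "CT"
        text = text[text.index("SELECT "):]
    elif re.match(r"INSERT INTO .+? SELECT ", text):
        wrapper = "I"
        text = text[text.index("SELECT "):]
    if not text.startswith("SELECT "):
        return "UNKNOWN"
    distinct = text.startswith("SELECT DISTINCT ")
    prefix = "SELECT DISTINCT " if distinct else "SELECT "
    select_list = text[len(prefix):].split(" FROM ", 1)[0].strip()
    sig = "S[*](T)" if select_list == "*" else "S(T)"
    if distinct:
        sig = f"E({sig})"
    if " GROUP BY " in text:
        sig = f"G({sig})"
    if " ORDER BY " in text:
        sig = f"O({sig})"
    if wrapper:
        sig = f"{wrapper}({sig})"
    return sig


def cluster_by_signature(queries):
    """Group queries by signature; one query per group can then be replayed."""
    clusters = defaultdict(list)
    for query in queries:
        clusters[toy_signature(query)].append(query)
    return dict(clusters)
```

Replaying only the first query of each cluster is the simplest way to shrink the workload; which representative to pick is a design choice the slides do not specify.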
Narrowing scan ranges
[Figure: time distribution of records in a table, marking the original scan range and the narrower range that is actually used]
Use only x% of total records by adding a time range predicate
Original query:
SELECT time, path, user_agent
FROM access
Rewritten query:
SELECT time, path, user_agent
FROM (
  SELECT time, path, user_agent
  FROM access
)
WHERE time >= from AND time < to
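The rewrite above can be sketched as a small helper. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the table is assumed to have a `time` column as in the slide, and choosing the most recent records as the narrowed range is an assumption, since the slide does not say which part of the range is kept.

```python
from typing import Sequence, Tuple


def pick_time_range(timestamps: Sequence[int], fraction: float) -> Tuple[int, int]:
    """Return [time_from, time_to) covering roughly the latest `fraction` of records."""
    if not timestamps or not 0 < fraction <= 1:
        raise ValueError("need at least one timestamp and 0 < fraction <= 1")
    ordered = sorted(timestamps)
    keep = max(1, int(len(ordered) * fraction))
    # Half-open range: the upper bound is one past the newest timestamp.
    return ordered[len(ordered) - keep], ordered[-1] + 1


def narrow_scan_range(columns: Sequence[str], table: str,
                      time_from: int, time_to: int) -> str:
    """Wrap a table scan in a subquery and add a time range predicate."""
    cols = ", ".join(columns)
    return (
        f"SELECT {cols}\n"
        f"FROM (\n"
        f"  SELECT {cols}\n"
        f"  FROM {table}\n"
        f")\n"
        f"WHERE time >= {time_from} AND time < {time_to}"
    )
```

For example, `narrow_scan_range(["time", "path", "user_agent"], "access", *pick_time_range(ts, 0.25))` would produce the rewritten form with concrete bounds for a list of timestamps `ts`.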
Choose options depending on the purpose
• For checking query compatibility
• Group by query signature and narrow scan ranges
• For checking performance differences
• Test only long-running queries without scan range narrowing
• For checking detailed behavior of particular queries
• Test only specified queries, without grouping by query signature or narrowing scan ranges
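The three modes above could be expressed as option presets. The sketch below is hypothetical: the class, field names, and the 60-second threshold are illustrative, and the slide does not say whether signature grouping is used for performance checks, so `False` there is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class SimulationOptions:
    """Hypothetical option set for one simulation run."""
    group_by_signature: bool
    narrow_scan_ranges: bool
    min_running_time_sec: Optional[float] = None  # None: no running-time filter
    query_ids: Optional[Tuple[str, ...]] = None   # None: no explicit query list


# Compatibility check: shrink both the query set and the scanned data.
COMPATIBILITY = SimulationOptions(group_by_signature=True, narrow_scan_ranges=True)

# Performance check: long-running queries on full scan ranges.
# group_by_signature=False and the 60 s threshold are assumptions.
PERFORMANCE = SimulationOptions(
    group_by_signature=False,
    narrow_scan_ranges=False,
    min_running_time_sec=60.0,
)


def detailed(query_ids: Iterable[str]) -> SimulationOptions:
    """Detailed check: replay exactly the given queries, unmodified."""
    return SimulationOptions(
        group_by_signature=False,
        narrow_scan_ranges=False,
        query_ids=tuple(query_ids),
    )
```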
Challenges in query simulation
• Query simulation takes a very long time
• A very large number of queries need to be tested
• Not only time, but also the cost of test clusters
• We need to make query simulation faster
• Result verification is not straightforward
• Many false positives and duplicates
• Result analysis tends to rely on personal knowledge
• We need to make result verification easier
How can we make result verification easier?
• Exclude uncheckable queries as much as possible
• Generate a human-readable report
• Assistance tools for investigation
Reporting for easier result verification
• List problematic queries
• Differences in query results, errors, performance, resource usage, scan ranges, worker distribution, etc.
• Exclude uncheckable queries
• Non-deterministic queries
• Failed queries with the same (or similar) error on both versions
• Finally, humans check the remaining queries
• This reduced the number of queries that need to be checked by more than 90%
• In addition, suggest potential causes of differing results
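The exclusion rules above can be sketched as a triage function. This is a minimal sketch with assumed names; the list of non-deterministic functions is illustrative and incomplete, and the error normalization (masking query IDs and object hashes) is one possible reading of "the same (or similar) error".

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    """Outcome of one query on one cluster (hypothetical structure)."""
    checksum: Optional[str] = None
    error: Optional[str] = None


# Illustrative list only; real detection would need more cases.
NON_DETERMINISTIC = re.compile(r"\b(rand|random|now|uuid|shuffle)\s*\(", re.IGNORECASE)


def normalize_error(message: str) -> str:
    """Mask run-specific parts so similar errors compare equal."""
    message = re.sub(r"#\d{8}_\d{6}_\d{5}_\w+", "#<query_id>", message)
    return re.sub(r"@[0-9a-f]+", "@<hash>", message)


def triage(sql: str, control: RunResult, test: RunResult) -> str:
    """Classify one replayed query for the report."""
    if NON_DETERMINISTIC.search(sql):
        return "excluded: non-deterministic"
    if control.error and test.error:
        if normalize_error(control.error) == normalize_error(test.error):
            return "excluded: same error on both versions"
        return "check: different errors"
    if control.error or test.error:
        return "check: error on one version only"
    if control.checksum != test.checksum:
        return "check: different results"
    return "ok"
```

Only the entries labeled "check" would be left for a person to inspect.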
Assistance tools for investigation
trino-compatibility-checker
https://github.com/takezoe/trino-compatibility-checker
Run the same query on multiple versions of Trino using docker and compare
query results to identify the version that introduced the incompatibility
✅ 317: Right(37a6259cc0c1dae299a7866489dff0bd)
❌ 350: Left(java.sql.SQLException: Query failed (#20210526_154140_00004_yzz4q): Multiple entries with same key: @38f546e: null=expr and @38f546e: null=expr)
✅ 334: Right(37a6259cc0c1dae299a7866489dff0bd)
❌ 342: Left(java.sql.SQLException: Query failed (#20210526_154251_00003_2km75): Multiple entries with same key: @63f11e0f: null=expr and @63f11e0f: null=expr)
✅ 338: Right(37a6259cc0c1dae299a7866489dff0bd)
❌ 340: Left(java.sql.SQLException: Query failed (#20210526_154338_00002_2xtvz): Multiple entries with same key: @76615946: null=expr and @76615946: null=expr)
❌ 339: Left(java.sql.SQLException: Query failed (#20210526_154358_00002_jdnni): Multiple entries with same key: @4aca599d: null=expr and @4aca599d: null=expr)
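The order of versions in the output above (317, 350, 334, 342, 338, 340, 339) is consistent with a bisection between a known-good and a known-bad release, although the slides do not state how the tool chooses versions. The sketch below shows the general technique in Python, not the tool's actual code; `is_ok` is a hypothetical callback that would run the query on one version and report whether it succeeded with the expected checksum.

```python
from typing import Callable, Sequence


def first_bad_version(versions: Sequence[int],
                      is_ok: Callable[[int], bool]) -> int:
    """Bisect sorted versions to find the first one where the query breaks.

    Assumes the oldest version passes, the newest fails, and that every
    version after the first failure also fails.
    """
    lo, hi = 0, len(versions) - 1
    if not is_ok(versions[lo]):
        raise ValueError("oldest version already fails")
    if is_ok(versions[hi]):
        raise ValueError("newest version still passes")
    # Invariant: versions[lo] passes and versions[hi] fails.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_ok(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[hi]
```

With about 34 candidate releases, this needs only a handful of runs after the two endpoint checks, which matters when each run starts a container.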
Trino bugs found by query simulation
• #8027 'Multiple entries with same key' error on duplicated grouping of literal values
• #19764 Missing shallowEquals() implementation for SampledRelation
• #10861 Query fails if IS NOT NULL is used for information_schema
• #10937 Predicate pushdown doesn’t work outside the scope of a subquery
• #11259 TRY should handle invalid value error in cast VARCHAR as TIMESTAMP
• #12199 Fix query planning failure on multiple subqueries due to IllegalStateException at ScopeAware.scopeAwareComparison()
Also, we found many bugs in our target version that had already been fixed in the latest version, so we could backport the fixes
Future work
• More support for investigation
• After finding bugs or compatibility/performance issues, root cause investigation is still tough
• Tools or automation for drill-down investigation would be helpful
• More efficient query simulation
• Better workload compression (e.g. exclude more queries that won’t improve the test coverage)
• Isolate simulation traffic from other production components (e.g. pseudo reproduction of real data by synthetic data)

