Building Merge on Read on Delta Lake
Justin Breese
Senior Solutions Architect
Nick Karpov
Resident Solutions Architect
Who are we?
Justin Breese
justin.breese@databricks.com | Los Angeles
Senior Strategic Solutions Architect
I pester Nick with a lot of questions and thoughts
Nick Karpov
nick.karpov@databricks.com | San Francisco
Senior Resident Solutions Architect
History & Music
Agenda
▪ Background: Copy on Write (COW) & Merge on Read (MOR)
▪ Use case, challenges, & MOR strategies
▪ Testing: choosing the right MOR strategy
▪ Rematerialization?
Problem statement(s)
▪ Dealing with highly random, update-heavy CDC streams
▪ Needing fresh data at any given time
Summary
▪ Using MOR allows for faster writes while still serving reads that can meet SLAs
Building Merge on Read on Delta Lake
▪ What is Merge on Read (MOR) and Copy on Write (COW)?
▪ What is the use case?
▪ Why did we build it?
▪ What is the architecture?
▪ How to test and verify it?
Copy on Write (COW) and Merge on Read (MOR)
Copy on Write (COW)
▪ TL;DR the merge is done during the write
▪ Default config for Delta Lake
▪ Data is “merged” into a Delta table by physically rewriting existing files with modifications before making them available to the reader
▪ In Delta Lake, merge is a three-step process
▪ Great for write-once, read-many scenarios
Delta Lake Merge - Under the hood
▪ source: new data, target: existing data (Delta table)
▪ Phase 1: Find the input files in target that are touched by the rows that
satisfy the condition and verify that no two source rows match with the
same target row [innerJoin]
▪ Phase 2: Read the touched files again and write new files with updated
and/or inserted rows
▪ Phase 3: Use the Delta protocol to atomically remove the touched files and
add the new files (write stuff to object/blob store)
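For reference, a minimal sketch of the same three-phase merge expressed through the Delta Lake Python API; the table path, the key column `id`, and the `updates_df` DataFrame are illustrative, not taken from the deck:

from delta.tables import DeltaTable

# Hypothetical upsert: `target` is the existing Delta table, `updates_df` holds the new data.
target = DeltaTable.forPath(spark, "/mnt/delta/target")   # path is illustrative

(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")   # Phase 1: this condition decides which files are touched
    .whenMatchedUpdateAll()                        # Phase 2: rewrite touched files with updated rows...
    .whenNotMatchedInsertAll()                     # ...and inserted rows
    .execute())                                    # Phase 3: the Delta log atomically removes old files and adds new ones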
COW: What is Delta Lake doing under the hood? (Phase 2 double click)
Phase 2: Read the touched files again and write new files with updated and/or inserted rows.
The type of join can vary depending on the conditions of the merge:
▪ Insert-only merge (e.g. no updates/deletes) → leftAntiJoin on the source to find the inserts
▪ Matched-only clauses (e.g. when matched) → rightOuterJoin
▪ Else (e.g. you have updates, deletes, and inserts) → fullOuterJoin
Merge on Read (MOR)
▪ TL;DR the “merge” is done during the read
▪ Common strategy: don’t logically merge until you NEED the result
▪ Implementation? Two tables and a view
▪ Materialized table
▪ Changelog table (can be a diff, Avro, parquet, etc.)
▪ View that acts as the referee between the two and is the source of truth
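A minimal sketch of that two-table-plus-view pattern, assuming a per-key recency column (the deck later calls it fragno); the paths, names, and the window-based “referee” logic here are illustrative:

from pyspark.sql import functions as F, Window

snapshot  = spark.read.format("delta").load("/mnt/delta/snapshot")    # materialized table
changeset = spark.read.format("delta").load("/mnt/delta/changeset")   # append-only changelog

# The view is the referee: keep only the most recent version of each row across both tables.
w = Window.partitionBy("id").orderBy(F.col("fragno").desc())
(snapshot.unionByName(changeset, allowMissingColumns=True)
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
    .createOrReplaceTempView("current"))   # readers query `current`, not the base tables

Partial updates complicate this simple union, which is why the deck introduces coalesce and the alternative join strategies later on.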
Which one do you pick? Well, it depends...
▪ MOR: write many, read less
▪ COW: write less, read many
Use case
Use case info
▪ 100-200 events/second (6k-12k/minute)
▪ CDC data coming from Kafka
▪ usually 1-3 columns change per event
▪ partial updates
▪ Each row has a unique ID
▪ 200GB of active files; growing at a small rate
▪ SLA: updates visible to point lookups in <5 min
▪ Currently doing daily batch overwrites; data can be up to 24 hours stale
Initial observations and problems encountered
▪ Lots of updates: 96% of events
▪ Matching condition is uniformly distributed across the target
▪ No natural partitioning keys
▪ Sample of 50k events could have 2k different days of updates
▪ Default Delta Lake Merge configs were not performing well
▪ Ended up rewriting almost the entire table each merge
Architecture: what did we settle on? MOR
This is what we will talk about
Snapshot & Changeset
▪ Snapshot: base table
▪ Primary key: id
▪ Most recent data: fragno
▪ Partitioning: optional (depends on use case)
▪ ...many data columns
▪ Changeset: append only (populated as sketched below)
▪ Primary key: id
▪ Most recent data: fragno
▪ Partitioning: Structured Streaming batchId (this is important!)
▪ ...many data columns
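A sketch of how the changeset table above can be kept append-only from the stream; this assumes a Structured Streaming foreachBatch sink, with `cdc_stream` standing in for the Kafka source and the paths illustrative:

from pyspark.sql import functions as F

def append_changes(microbatch_df, batch_id):
    # Stamp each row with its micro-batch id and append to the changeset table.
    # Partitioning by batchId makes the later cleanup a cheap partition delete.
    (microbatch_df
        .withColumn("batchId", F.lit(batch_id))
        .write
        .format("delta")
        .mode("append")
        .partitionBy("batchId")
        .save("/mnt/delta/changeset"))

(cdc_stream                        # streaming DataFrame read from Kafka (not shown here)
    .writeStream
    .foreachBatch(append_changes)
    .option("checkpointLocation", "/mnt/checkpoints/changeset")
    .start())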
Changeset
▪ Get the unique values in the changeset: the latest record per primaryKey
▪ As I have partial updates, I need to coalesce(changes, baseline)
▪ Check whether the dataframe can be broadcast*
▪ If I believe I can broadcast 1GB of data and each row is 364 bytes, then I can broadcast anything up to 2.8M rows. If the changeset is > 2.8M rows ⇒ do not broadcast -- because memory!
* if your changeset is small enough
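A sketch of that changeset preparation; the ordering by batchId, the paths, and the 1GB / 364-byte estimates follow the bullets above and are illustrative:

from pyspark.sql import functions as F, Window
from pyspark.sql.functions import broadcast

changes = spark.read.format("delta").load("/mnt/delta/changeset")   # illustrative path

# Keep only the latest record per primary key, ordering by batchId (newest first).
w = Window.partitionBy("id").orderBy(F.col("batchId").desc())
ranked_changeset = (changes
    .withColumn("rank", F.row_number().over(w))
    .filter("rank = 1")
    .drop("rank"))

# Rough broadcast check: ~1 GB budget / ~364 bytes per row ≈ 2.8M rows.
MAX_BROADCAST_ROWS = 2_800_000
broadcastable = ranked_changeset.count() <= MAX_BROADCAST_ROWS
maybe_broadcast = broadcast(ranked_changeset) if broadcastable else ranked_changeset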
View: Methods to join rankedChangeset into the baseline
▪ Now that we have our changeset… we still need to compare these values to the baseline table to get the latest by id
▪ There are several methods to do this (leftJoinUnionInserts is sketched below):
▪ doubleRankOver
▪ fullOuterJoin
▪ leftJoinAntiJoin: broadcastable!
▪ leftJoinUnionInserts: broadcastable! Great if you are guaranteed that your inserts are not upserts!
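One plausible shape for the leftJoinUnionInserts method, the eventual winner; a hedged sketch that assumes the changeset shares the baseline's schema (unchanged columns arrive as nulls) and reuses ranked_changeset from the previous sketch:

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

baseline  = spark.read.format("delta").load("/mnt/delta/snapshot")   # illustrative path
data_cols = [c for c in baseline.columns if c != "id"]

# Update path: every baseline row, overlaid with any newer (possibly partial) change.
updated = (baseline.alias("b")
    .join(broadcast(ranked_changeset.alias("c")), "id", "left")
    .select("id", *[F.coalesce(F.col(f"c.{c}"), F.col(f"b.{c}")).alias(c) for c in data_cols]))

# Insert path: change rows whose id does not exist in the baseline yet.
inserts = (ranked_changeset.alias("c")
    .join(baseline.select("id"), "id", "left_anti")
    .select("id", *data_cols))

updated.unionByName(inserts).createOrReplaceTempView("current")   # the source-of-truth view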
How to pick the right view - perfTesting!
Testing [normally] takes a long time… but it doesn’t have to!
▪ Things to consider:
▪ How many tests are sufficient?
▪ How can I make them as even as possible?
▪ What do you actually want to test?
▪ Why is this part so hard and manual?
▪ Databricks has a `/runs/submit` API - starts a fresh cluster for each run
▪ Databricks notebooks have widgets which act as params
▪ Let’s do 3 tests for each viewType (method) and each operation (read/write) ⇒ 3 * 4 * 2 = 24 tests!
Create the widgets in your Notebooks
Create your results payload (note: we are calling the widgets as params)
Create a timer function
Save to Delta table (note: payload)
Operation to test
Case statement to match the method and supply the correct view - send it to the stopwatch utility
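The widget, timer, and results code lives in the slide screenshots; a minimal sketch of the same pattern in a Databricks notebook (dbutils is Databricks-specific; the view and table names are illustrative):

import time
from pyspark.sql import Row

# Widgets act as parameters when the notebook is launched via runs/submit.
dbutils.widgets.text("method", "leftJoinUnionInserts")
dbutils.widgets.text("operation", "read")
dbutils.widgets.text("run", "0")
method, operation, run = (dbutils.widgets.get(k) for k in ("method", "operation", "run"))

def stopwatch(fn):
    # Time an operation and return elapsed seconds.
    start = time.time()
    fn()
    return time.time() - start

# Case statement: match the method and supply the correct view to the stopwatch utility.
operations = {
    "leftJoinUnionInserts":  lambda: spark.table("current_leftJoinUnionInserts").count(),
    "outerJoined":           lambda: spark.table("current_outerJoined").count(),
    "antiJoinLeftJoinUnion": lambda: spark.table("current_antiJoinLeftJoinUnion").count(),
    "doubleRankOver":        lambda: spark.table("current_doubleRankOver").count(),
}
elapsed = stopwatch(operations[method])

# Results payload (the widget values become the payload), appended to a Delta table.
payload = spark.createDataFrame([Row(run=run, operation=operation, method=method, seconds=elapsed)])
payload.write.format("delta").mode("append").saveAsTable("perf_results")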
Configuring the API
Check out my GitHub [https://github.com/justinbreese/databricks-gems#perftestautomationpy]
Made a simple script that leverages the Databricks runs/submit API
Run info
Cluster info
Here is what we will create:
Run Operation Method
0 Read leftJoinUnionInserts
1 Read leftJoinUnionInserts
2 Read leftJoinUnionInserts
0 Read outerJoined
1 Read outerJoined
2 Read outerJoined
0 Read antiJoinLeftJoinUnion
1 Read antiJoinLeftJoinUnion
2 Read antiJoinLeftJoinUnion
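The runs/submit payload pairs the run info with the cluster info; a hedged sketch of submitting one test combination using the standard Jobs API fields (the workspace URL, token, cluster sizing, and notebook path are placeholders, and the actual artifacts/perfTest.json format is defined in the linked repo):

import requests

token = "<personal-access-token>"                              # placeholder
workspace = "https://<your-workspace>.cloud.databricks.com"    # placeholder

payload = {
    "run_name": "perfTest-read-leftJoinUnionInserts-0",
    "new_cluster": {                       # cluster info: a fresh cluster per run
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 10,
    },
    "notebook_task": {                     # run info: which notebook and which widget params
        "notebook_path": "/Users/<you>/perfTest",
        "base_parameters": {"method": "leftJoinUnionInserts", "operation": "read", "run": "0"},
    },
}

resp = requests.post(f"{workspace}/api/2.0/jobs/runs/submit",
                     headers={"Authorization": f"Bearer {token}"},
                     json=payload)
print(resp.json())   # contains the run_id of the submitted run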
Calling the API
Check out GitHub [https://github.com/justinbreese/databricks-gems#perftestautomationpy]
Made a simple script that leverages the Databricks runs/submit API
python3 perfTestAutomation.py -t <userAccessToken> -s 0 -j artifacts/perfTest.json
View the results
leftJoinUnionInserts is the winner for the view
Recap thus far
Now we will talk about this part
Periodic Rematerialization
Periodic Rematerialization
▪ If changes are getting appended consistently, then you’ll have more and more rows to compare against
▪ This makes your read performance degrade over time
▪ Therefore, you need a periodic job that resets your baseline table (for read perf)
▪ And yes, there are some choices you have for this:
▪ Merge: easy; very helpful if you have many larger partitions and only a smaller subset of partitions needs to be changed; built into Delta Lake
▪ Overwrite: easy; great if you do not have or cannot partition, or if all/most partitions need to be changed
▪ replaceWhere: moderate; can only be used if you have partitions; built into Delta Lake
Periodic Rematerialization
▪ Now that we’ve materialized the new changes into the baseline, we want to delete those batches that we
don’t need
▪ Since we partitioned by batchId, when we delete those previous batches, this is a metadata only
operation and super fast/cheap - line 68
▪ We do this so we don’t duplicate changes and because we don’t need them anymore
▪ Remember: we have an initial bronze table that has all of our changes so we always have this if we ever need them
Code! Remember that we said that the batchId is important?
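That code isn't captured in this transcript; a hedged sketch of one rematerialization pass using the overwrite method (the one the deck ultimately chose), then dropping the changeset batches that have been folded in; the paths and the batch cutoff are illustrative:

from delta.tables import DeltaTable

# Rebuild the baseline table from the merged view.
(spark.table("current")                     # the MOR view defined earlier
    .write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/delta/snapshot"))           # illustrative path

# Drop the changeset batches that are now folded into the baseline.
# Because the changeset is partitioned by batchId, this delete is a cheap, metadata-only operation.
last_folded_batch = 123                     # illustrative: the max batchId that was materialized
changeset = DeltaTable.forPath(spark, "/mnt/delta/changeset")
changeset.delete(f"batchId <= {last_folded_batch}")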
Periodic Rematerialization
▪ Yes, you can even do some perfTesting on this to understand which
method fits your use case best
▪ Our use case ended up using overwrite as it was a better fit
▪ Changes happened very randomly, going back 2,000+ days
▪ Dataset was ~200GB; partitioning could not be effective
▪ 200GB is small, so we can overwrite the complete table in <10 min with 80 cores
Final recap
▪ Talked about the use case
▪ Introduced the MOR architecture
▪ Talked about the two tables
▪ Different views and understanding their differences
▪ How to test the different view methods
▪ Periodic rematerialization
This wouldn’t have been possible without help
from:
Chris Fish
Daniel Tomes
Tathagata Das (TD)
Burak Yavuz
Joe Widen
Denny Lee
Paul Roome
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.