Matthew Powers, Prognos Health
Optimizing Delta / Parquet Data Lakes
#UnifiedDataAnalytics #SparkAISummit
Agenda
• Why Delta?
• Delta basics and transaction log
• Compacting Delta lake
• Vacuuming old files
• Partitioning Delta lakes
• Deleting rows
• Persisting transformations in columns
About
MungingData
• Time travel
• Compacting
• Vacuuming
• Update columns
Contact me
• GitHub: MrPowers
• Email: matthewkevinpowers@gmail.com
• Delta Slack channel
• Open source hacking
What is Delta lake?
• Parquet + transaction log
• Provides awesome features for free!
Delta Lake =!= Databricks Delta
https://github.com/delta-io/delta/issues/49
TL;DR
• 1 GB files
• No nested directories
Delta Lake Slack says 1GB files
Databricks Delta autoOptimize
Why does compaction speed up lakes?
• Parquet: files need to be listed before they are read. Listing is expensive in object stores.
• Delta: Data is read via the transaction log.
• Easier for Spark to read partitioned lakes into memory partitions.
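A minimal sketch of why reading via the transaction log avoids file listing, assuming each commit file in `_delta_log/` is a series of JSON lines carrying `add` and `remove` actions. The file names and the `live_files` helper are hypothetical, and real commits carry more fields and action types than shown here.

```python
import json


def live_files(commit_bodies):
    """Replay commit files in order and return the data files in the latest snapshot."""
    files = set()
    for body in commit_bodies:
        for line in body.splitlines():
            if not line.strip():
                continue  # skip blank lines
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
            # other action types (commit info, metadata, ...) are ignored here
    return files


# Hypothetical commit contents: three small files, then a compaction commit
# that adds one larger file and marks the small ones as removed.
commit_0 = "\n".join(
    json.dumps({"add": {"path": f"part-0000{i}.snappy.parquet"}}) for i in range(3)
)
commit_1 = "\n".join(
    [json.dumps({"add": {"path": "part-compacted.snappy.parquet"}})]
    + [json.dumps({"remove": {"path": f"part-0000{i}.snappy.parquet"}}) for i in range(3)]
)

print(sorted(live_files([commit_0, commit_1])))
```

The reader learns which files to open from the log alone, so no directory listing is needed, and after compaction there are fewer files to open.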
Sample Data
Create Delta Data Lake
Delta Lake on Disk
_delta_log/00000000000000000000.json
Code examples
Compact Delta Data Lake
Files post-compaction
_delta_log/00000000000000000001.json
Compacting Delta lakes without breaking downstream apps
https://github.com/delta-io/delta/issues/146
Delta Lake Vacuum
• Deletes files that were marked for removal and are older than the retention period
• Default retention period is 7 days
• Vacuuming is not going to improve performance
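The retention rule above can be sketched as a simple timestamp comparison. This is a simplified model, not the actual vacuum implementation: the `vacuum_candidates` helper, the file names, and the path-to-timestamp mapping are assumptions made for illustration.

```python
RETENTION_HOURS_DEFAULT = 7 * 24  # the 7-day default retention period
HOUR_MS = 60 * 60 * 1000


def vacuum_candidates(removed_files, now_ms, retention_hours=RETENTION_HOURS_DEFAULT):
    """Return paths marked for removal longer ago than the retention period.

    removed_files maps a file path to the time it was marked for removal,
    in milliseconds since the epoch (hypothetical input shape).
    """
    cutoff_ms = now_ms - retention_hours * HOUR_MS
    return sorted(path for path, ts in removed_files.items() if ts < cutoff_ms)


# Hypothetical example: one file removed 200 hours ago, one removed 24 hours ago.
now = 1_000 * HOUR_MS
removed = {
    "old.snappy.parquet": now - 200 * HOUR_MS,
    "recent.snappy.parquet": now - 24 * HOUR_MS,
}
print(vacuum_candidates(removed, now))
```

Only the file removed more than 168 hours ago qualifies. Files still referenced by the current snapshot never appear in the input, which is why vacuuming frees storage but leaves query performance unchanged.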
Vacuum Delta Data Lake
Files post-vacuum
Optimal number of partitions (Delta)
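One way to turn the 1 GB rule of thumb into a number, assuming the total size of the lake in bytes is already known (for example by summing file sizes). The `target_file_count` helper is hypothetical and is not the code from the slide.

```python
ONE_GB = 1024 ** 3  # target file size from the 1 GB rule of thumb


def target_file_count(total_bytes, target_file_bytes=ONE_GB):
    """Return how many files of roughly target_file_bytes hold total_bytes."""
    # Integer ceiling division, with a floor of one file for tiny lakes.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


print(target_file_count(5 * ONE_GB + 1))
```

The result is the kind of value that could be passed to Spark's `repartition` before rewriting the data.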
spark-daria helps!
spark-daria on GitHub
Optimal number of partitions (Parquet)
https://github.com/MrPowers/spark-daria/blob/master/src/main/scala/com/github/mrpowers/spark/daria/utils/DirHelpers.scala
Why partition data lakes?
• Data skipping
• Massively improve query performance
• I’ve seen queries run 50-100 times faster on partitioned lakes
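A minimal sketch of data skipping on a partitioned lake, assuming Hive-style `column=value` directory names. The file paths and both helpers are hypothetical; Spark does this through partition filters rather than string matching.

```python
def partition_values(path):
    """Extract key=value directory segments from a relative file path."""
    values = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def prune(paths, column, wanted):
    """Keep only the files whose partition directory matches the filter."""
    return [p for p in paths if partition_values(p).get(column) == wanted]


# Hypothetical layout of a lake partitioned by country.
files = [
    "country=Russia/part-00000.snappy.parquet",
    "country=France/part-00000.snappy.parquet",
    "country=Russia/part-00001.snappy.parquet",
]
print(prune(files, "country", "Russia"))
```

A filter on the partition column is answered from the paths alone, so files for other countries are never opened.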
Sample data
Filtering unpartitioned lake
== Physical Plan ==
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
        Batched: false,
        Format: CSV,
        Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/people.csv],
        PartitionFilters: [],
        PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
        ReadSchema: struct
Partitioning the data lake
Partitioned lake on disk
_delta_log/00000000000000000000.json
Filtering partitioned lake
== Physical Plan ==
*(1) Project [first_name#662, last_name#663, country#664]
+- *(1) Filter (isnotnull(first_name#662) && StartsWith(first_name#662, M))
   +- *(1) FileScan parquet [first_name#662,last_name#663,country#664]
        Batched: true,
        Format: Parquet,
        Location: TahoeLogFileIndex[file:/…/tmp/europe_partitioned1],
        PartitionCount: 1,
        PartitionFilters: [isnotnull(country#664), (country#664 = Russia)],
        PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
        ReadSchema: struct<first_name:string,last_name:string>
Comparing physical plans
Unpartitioned
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
        Batched: false,
        Format: CSV,
        Location: InMemoryFileIndex[….],
        PartitionFilters: [],
        PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
        ReadSchema: struct
Partitioned
Project [first_name#662, last_name#663, country#664]
+- Filter (isnotnull(first_name#662) && StartsWith(first_name#662, M))
   +- FileScan parquet [first_name#662,last_name#663,country#664]
        Batched: true,
        Format: Parquet,
        Location: TahoeLogFileIndex[file:/…/tmp/europe_partitioned1],
        PartitionCount: 1,
        PartitionFilters: [isnotnull(country#664), (country#664 = Russia)],
        PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
        ReadSchema: struct<first_name:string,last_name:string>
Directly grabbing the partitions is faster for Parquet lakes…
Directly grabbing partitions was 83 times faster than relying on partition filters for a simple query
Real partitioned data lake
• Updates every 3 hours
• Has 5 million files
• 15,000 files are being added every day
• Still great for a lot of queries
Creating partitioned lake (2/3)
Partitioned lake on disk (2/3)
Creating partitioned lake (3/3)
Incrementally updating partitioned lakes
• Small file problem grows quickly
• Compaction is hard
Filtering data from a lake
We can delete rows in Delta lakes
Deleting under the hood
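A simplified model of a copy-on-write delete, assuming the log records it as `remove` actions for the affected files plus `add` actions for their rewritten copies. The `plan_delete` helper, the in-memory rows, and the `-rewritten` naming are hypothetical; this sketch is not Delta Lake's actual implementation.

```python
def plan_delete(files, should_delete):
    """Plan a delete over files, a dict mapping path -> list of row dicts.

    Returns (actions, rewritten): the log actions to commit and the contents
    of the rewritten files. Files without matching rows are left untouched.
    """
    actions = []
    rewritten = {}
    for path, rows in files.items():
        kept = [row for row in rows if not should_delete(row)]
        if len(kept) == len(rows):
            continue  # no matching rows, so the file stays as it is
        actions.append({"remove": {"path": path}})
        if kept:  # a file whose rows all match is removed without a replacement
            new_path = path.replace(".parquet", "-rewritten.parquet")
            rewritten[new_path] = kept
            actions.append({"add": {"path": new_path}})
    return actions, rewritten


# Hypothetical lake with three small files.
lake = {
    "a.parquet": [{"country": "Russia"}, {"country": "France"}],
    "b.parquet": [{"country": "France"}],
    "c.parquet": [{"country": "Russia"}],
}
actions, rewritten = plan_delete(lake, lambda row: row["country"] == "Russia")
for action in actions:
    print(action)
```

The old files stay on disk after the commit, which is what keeps time travel possible until they are vacuumed.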
Append a column on the fly
Resulting DataFrame
Append a column in Delta
Delta lake downsides… not many