Presented By:
Raviyanshu Singh
Software Consultant
Getting Started With Delta Lake
Lack of etiquette and manners is a huge turn-off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Feedback
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode, and feel free to step out of the
session if you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during
the session.
Our Agenda
01 Why Delta Lake?
02 Data Warehouse
03 Data Lake
04 Possible Solution
05 Delta Lake
06 Demo
Why Delta Lake?
Streaming Systems
Data arrives through streaming systems
such as Apache Kafka or Amazon Kinesis.
Data Lakes
Data is stored for long periods in a
data lake, which is optimized for large
scale and low cost.
Data Warehouse
The most valuable data is stored in a data warehouse,
which is optimized for high concurrency and reliability.
A modern data architecture uses a
blend of at least these three
types of systems.
Data Architecture
Data Warehouse
● A data management system that stores current and
historical data from multiple sources in a business-friendly
manner for easier insights and reporting.
● Data warehouses are typically used for business
intelligence (BI), reporting, and data analysis.
Limitations
➔ No support for video, audio, or text
➔ No support for data science and ML
➔ Limited support for streaming
➔ Closed and proprietary formats
[Diagram labels: Data Source, ETL (Extract, Transform, Load)]
Data Lake
● A central location that holds a large amount of data in its
native, raw format.
● Stores unstructured and semi-structured data such as photos, video,
audio, and documents, which are essential for today’s machine
learning and advanced analytics use cases.
Limitations
➔ Poor BI support
➔ Complex to set up
➔ Poor performance
➔ Lack of security features
➔ Reliability issues
What’s the Solution?
A combination of a data warehouse (DW) and a data lake (DL)
[Diagram labels: Structured & Unstructured Data; Data Lake; ETL; Metadata, Caching & Indexing Layer; Data Validation; Data Warehousing; Reports, BI & Data Science]
Data Lakehouse
A system that merges the flexibility, low cost, and scale of
a data lake with the data management and ACID
transactions of a data warehouse, addressing the limitations
of both.
Benefits
➔ No need to keep one copy of the data in a data lake and
another in a data warehouse
➔ Cost savings in infrastructure, staff, and
consulting overhead
➔ Scalability through the underlying cloud storage
➔ Reliability through ACID transactions
What is Delta Lake?
● Delta Lake is a file-based, open-source metadata layer
that enables building a Lakehouse architecture on top of
data lakes.
● It can run on existing data lakes and is fully compatible
with processing engines like Apache Spark.
With Delta Lake:
➔ Scalable metadata handling
➔ ACID transactions
➔ Streaming and batch unification
➔ Time travel (query an older snapshot of a Delta table)
➔ Schema enforcement
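Time travel follows naturally from keeping a transaction log: if every commit records which files were added or removed, replaying the log up to a given version reproduces the table as of that version. The sketch below is a toy, pure-Python model of that idea and not Delta Lake's actual implementation; `ToyDeltaLog`, `commit`, and `snapshot` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaLog:
    """Toy append-only commit log (hypothetical; not the Delta Lake API)."""
    commits: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Append one commit and return its version number."""
        self.commits.append({"add": tuple(add), "remove": tuple(remove)})
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay commits 0..version and return the set of live files."""
        if version is None:
            version = len(self.commits) - 1
        if not 0 <= version < len(self.commits):
            raise ValueError(f"unknown version {version}")
        live = set()
        for entry in self.commits[: version + 1]:
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return live


log = ToyDeltaLog()
v0 = log.commit(add=["part-0.parquet"])
v1 = log.commit(add=["part-1.parquet"])
v2 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])

latest_files = log.snapshot()            # table as of the newest commit
files_at_v0 = log.snapshot(version=v0)   # "time travel" back to version 0
```

Because old commits are never rewritten, reading an earlier version is just a shorter replay. In Spark, the corresponding read is typically expressed with the `versionAsOf` reader option on a Delta table.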
The Medallion Architecture
Ingestion Tables
● No business rules or transformations of any kind
● Should be fast and easy to get new data to this layer
Refined Tables
● Prioritize speed to market and write performance, with just
enough transformations
● Quality data expected
Feature/Agg Data Store
● Prioritize business use cases and user experience
● Precalculated, business-specific transformations
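The three layers, commonly called bronze, silver, and gold, can be pictured as three small functions chained together. This is a minimal pure-Python sketch under assumed data: the comma-separated record format and the function names `to_bronze`, `to_silver`, and `to_gold` are hypothetical and stand in for real tables and Spark jobs.

```python
def to_bronze(raw_lines):
    """Ingestion layer: store records exactly as received, no business rules."""
    return [{"raw": line} for line in raw_lines]


def to_silver(bronze_rows):
    """Refined layer: parse records and keep only the well-formed ones."""
    silver = []
    for row in bronze_rows:
        parts = row["raw"].split(",")
        if len(parts) != 2:
            continue  # malformed record stays in bronze only
        customer, amount = parts[0].strip(), parts[1].strip()
        try:
            silver.append({"customer": customer, "amount": float(amount)})
        except ValueError:
            continue  # non-numeric amount is kept out of the refined layer
    return silver


def to_gold(silver_rows):
    """Aggregate layer: precalculated, business-specific totals per customer."""
    totals = {}
    for row in silver_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


raw = ["alice, 10.5", "bob, 4", "broken-line", "alice, 2.5", "carol, n/a"]
bronze = to_bronze(raw)
silver = to_silver(bronze)
gold = to_gold(silver)
```

The raw records are never modified, so a bug in the refinement rules can be fixed and the later layers rebuilt from the ingestion layer.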
Features of Delta Lake
ACID Transactions
Data lake transactions performed through a processing
engine are committed durably and
exposed to other readers atomically.
Audit History
The transaction log enables a full audit trail
of any changes made to the data.
Schema Enforcement
Automatically enforces the schema
when writing data to and reading data
from the lake.
Unification of batch and streaming
A table in Delta Lake is a batch table as well
as a streaming source and sink.
Full DML Support
Supports DML operations such as deletes and updates,
as well as complex merge (upsert)
scenarios.
Metadata Support & Scaling
Leverages Spark's distributed processing
power to handle all the metadata for
petabyte-scale tables with billions of files
with ease.
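Two of these features, schema enforcement and merge (upsert), are easy to picture on plain Python dictionaries. The sketch below is a toy model rather than Delta Lake's API: `SCHEMA`, `enforce_schema`, and `merge_upsert` are hypothetical names, and the table is just a list of rows.

```python
SCHEMA = {"id": int, "name": str, "amount": float}  # hypothetical table schema


def enforce_schema(row, schema=SCHEMA):
    """Reject rows whose columns or types do not match the schema."""
    if set(row) != set(schema):
        raise ValueError(f"column mismatch: {sorted(row)} vs {sorted(schema)}")
    for column, expected in schema.items():
        if not isinstance(row[column], expected):
            raise TypeError(f"{column!r} must be {expected.__name__}")
    return row


def merge_upsert(target, updates, key="id"):
    """Return a new table: matching keys are updated, new keys are inserted."""
    # Validate every incoming row first so a bad row changes nothing.
    validated = [enforce_schema(u) for u in updates]
    merged = {row[key]: row for row in target}
    for row in validated:
        merged[row[key]] = row
    return [merged[k] for k in sorted(merged)]


table = [
    {"id": 1, "name": "alice", "amount": 10.0},
    {"id": 2, "name": "bob", "amount": 5.0},
]
table_v2 = merge_upsert(table, [
    {"id": 2, "name": "bob", "amount": 7.5},    # existing key: update
    {"id": 3, "name": "carol", "amount": 1.0},  # new key: insert
])

try:
    merge_upsert(table_v2, [{"id": 4, "name": "dave", "amount": "oops"}])
    rejected = False
except TypeError:
    rejected = True  # wrong type for "amount", so the write is refused
```

Validating the whole batch before applying any row mirrors the all-or-nothing behaviour described under ACID Transactions: a rejected write leaves the existing table untouched.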
Getting Started With
1 Delta Lake with Spark-Shell
2 Delta Lake in PySpark
3 Delta Lake on Databricks
4 Hello Delta Lake
Demo
Delta Lake Best Practices
Choose the right partition column:
If the cardinality of a column will be very high, do
not use that column for partitioning.
Amount of data in each partition: avoid partitioning by a column
if each partition would hold less than about 1 GB.
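Both partitioning rules can be checked on a sample of the data before a column is chosen. The following is a rough pure-Python sketch; `partition_report` is a hypothetical helper, and the `max_distinct` threshold and the per-row size are illustrative assumptions rather than recommended values.

```python
ONE_GB = 1024 ** 3


def partition_report(rows, column, avg_row_bytes, max_distinct=1000):
    """Summarise whether `column` looks like a reasonable partition column.

    `max_distinct` and `avg_row_bytes` are illustrative assumptions.
    """
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    distinct = len(counts)
    smallest_bytes = min(counts.values()) * avg_row_bytes if counts else 0
    return {
        "column": column,
        "distinct_values": distinct,
        "smallest_partition_bytes": smallest_bytes,
        "low_cardinality": distinct <= max_distinct,
        "partitions_at_least_1gb": smallest_bytes >= ONE_GB,
    }


# Ten sample rows; the row size is deliberately exaggerated so the tiny
# sample still crosses the 1 GB line.
rows = [{"country": "IN" if i % 2 else "US", "user_id": i} for i in range(10)]
by_country = partition_report(rows, "country", avg_row_bytes=ONE_GB // 4)
by_user = partition_report(rows, "user_id", avg_row_bytes=ONE_GB // 4, max_distinct=5)
```

Here `country` passes both checks, while `user_id` fails both: it has one distinct value per row, so every partition would be tiny.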
Improve performance of Delta Lake merge
Compact Files
A large number of small files should be rewritten
into a smaller number of larger files on a regular
basis. This is known as compaction.
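Compaction amounts to deciding which small files should be rewritten together. A simple way to picture the planning step is first-fit-decreasing bin packing, sketched below in pure Python; `plan_compaction` and the 128 MB target are hypothetical choices for illustration and do not describe how Delta Lake itself selects files.

```python
def plan_compaction(file_sizes, target_bytes):
    """Group small files into bins of at most `target_bytes` (first-fit decreasing).

    Files at or above the target size are left alone.
    """
    small = sorted(
        ((name, size) for name, size in file_sizes.items() if size < target_bytes),
        key=lambda item: item[1],
        reverse=True,
    )
    bins = []  # each bin: {"files": [names], "bytes": total size}
    for name, size in small:
        for b in bins:
            if b["bytes"] + size <= target_bytes:
                b["files"].append(name)
                b["bytes"] += size
                break
        else:
            bins.append({"files": [name], "bytes": size})
    # A bin holding a single file gains nothing from being rewritten.
    return [b for b in bins if len(b["files"]) > 1]


MB = 1024 ** 2
files = {"a": 10 * MB, "b": 20 * MB, "c": 30 * MB, "d": 200 * MB, "e": 90 * MB}
plan = plan_compaction(files, target_bytes=128 * MB)
```

With these sizes, four small files become two larger ones, and the 200 MB file is not touched. On Databricks, the rewrite itself is usually triggered with the `OPTIMIZE` command rather than planned by hand.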
Enhanced checkpoints for low-latency
queries
Replace the content or schema of the
table.
Sometimes you may want to replace a Delta table.
Spark Caching
Difference between Delta Lake and
Parquet on Apache Spark
Thank You!

Getting Started with Delta Lake on Databricks
