Delta Lake with Azure Databricks

Dustin Vannoy
Data Engineer
Cloud + Streaming
Azure Databricks with
Delta Lake

Dustin Vannoy
Data Engineering Consultant
Co-founder Data Engineering San Diego
/in/dustinvannoy
@dustinvannoy
dustin@dustinvannoy.com
Technologies
• Azure & AWS
• Spark
• Kafka
• Python
Modern Data Systems
• Data Lakes
• Analytics in Cloud
• Streaming

© Microsoft Azure + AI Conference All rights reserved.
Agenda
• Intro to Spark + Azure Databricks
• Delta Lake Overview
• Delta Lake in Action
• Schema Enforcement
• Time Travel
• MERGE, DELETE, OPTIMIZE

Intro to Spark & Azure Databricks
Overview and Databricks workspace walk through

Why Spark?
Big data and the cloud changed our mindset. We want tools that scale easily as data size grows.
Spark is a leader in data processing that scales across many machines. It can run on Hadoop but is faster and easier to use than MapReduce.

Benefit of horizontal scaling
[Figure: Traditional vs. Distributed (Parallel)]

What is Spark?
• Fast, general-purpose engine for large-scale data processing
• Replaces MapReduce as the Hadoop parallel programming API
• Many options:
   • YARN / Spark Cluster / Local
   • Scala / Python / Java / R
   • Spark Core / SQL / Streaming / ML / Graph

Simple code, parallel compute
Spark consists of a programming API and execution engine
[Figure: one Master node coordinating four Worker nodes]
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Read the tab-separated sample files and let Spark infer column types
song_df = (
    spark.read
    .option("sep", "\t")
    .option("inferSchema", "true")
    .csv("/databricks-datasets/songs/data-001/part-0000*")
)

# Keep only the artist name and tempo columns, with readable names
tempo_df = song_df.select(
    col("_c4").alias("artist_name"),
    col("_c14").alias("tempo"),
)

# Average tempo per artist, highest first
avg_tempo_df = (
    tempo_df
    .groupBy("artist_name")
    .avg("tempo")
    .orderBy("avg(tempo)", ascending=False)
)

avg_tempo_df.show(truncate=False)

Spark’s Strengths
• Data pipelines and analytics
• Batch or streaming
• Spark SQL
• Machine learning
• Uses memory to speed up processing
• Large community, many examples and tutorials

Delta Lake Overview
Why use it and how to start

Spark is powerful, but...
• Not ACID compliant – too easy to end up with corrupted data
• Schema mismatches – no validation on write
• Writes many small files, which are inefficient to read
• Reads too much data (no indexes, only partitions)

ACID
• Atomicity – all or nothing
• Consistency – data always in valid state
• Isolation – uncommitted operations don’t impact other reads/writes
• Durability – committed data is never lost
ACID compliance would give us the ability to update and delete!

Small File Problem
• Too much metadata
• Too many file open/close operations
• Compression not as effective
• Bad if using MapReduce to read
We fix this with scheduled file compaction jobs; the difficulty is avoiding interference with new write operations.
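The planning step of a scheduled compaction job can be sketched in plain Python. This is only an illustration of the idea, not how Spark or Delta Lake implements it; the function name `plan_compaction`, the file names, and the sizes are all hypothetical.

```python
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], target_bytes: int) -> List[List[str]]:
    """Group small files into batches whose combined size stays within target_bytes."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    # Walk files in name order so the plan is deterministic.
    for name in sorted(file_sizes):
        size = file_sizes[name]
        if size >= target_bytes:
            continue  # already large enough, leave it alone
        if current and current_size + size > target_bytes:
            # Rewriting a single file gains nothing, so only keep real merges.
            if len(current) > 1:
                batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if len(current) > 1:
        batches.append(current)
    return batches


# Hypothetical file listing (sizes in MB for readability).
sizes = {"part-0": 10, "part-1": 20, "part-2": 60,
         "part-3": 30, "part-4": 200, "part-5": 40}
plan = plan_compaction(sizes, target_bytes=100)
```

Each batch would then be rewritten as one larger file. The hard part noted above, not colliding with writers that are adding files at the same time, is not addressed by this sketch.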

Partitions
• Typically Spark reads all data in a table/directory before applying filters
• Folder partitioning is used to allow some filter pushdowns
• Limited to one fixed partition scheme to allow skipping reads
• Must use low-cardinality columns for partitioning
In a relational database we would just add indexes and update statistics to improve seeks
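How folder partitioning lets a reader skip data can be shown with a small, self-contained sketch. It assumes Hive-style `key=value` folder names; the paths and the helper names `partition_values` and `prune` are hypothetical, and real Spark pruning happens inside the query planner rather than in user code.

```python
from typing import Dict, List


def partition_values(path: str) -> Dict[str, str]:
    """Parse Hive-style key=value folder names out of a file path."""
    values: Dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


def prune(paths: List[str], column: str, wanted: str) -> List[str]:
    """Keep only files whose partition folder matches the filter value."""
    return [p for p in paths if partition_values(p).get(column) == wanted]


# Hypothetical table layout, partitioned by date.
paths = [
    "events/date=2020-01-01/part-0.parquet",
    "events/date=2020-01-02/part-0.parquet",
    "events/date=2020-01-02/part-1.parquet",
]
to_read = prune(paths, "date", "2020-01-02")
```

A filter on any column that is not part of the folder layout gets no such shortcut, which is the "one fixed partition scheme" limitation listed above.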

Delta Lake Concepts
Reference: delta.io

ACID Transactions
Atomicity, Consistency, and Isolation all improved

Reminder: ACID
• Atomicity – all or nothing
• Consistency – data always in valid state
• Isolation – uncommitted operations don’t impact other reads/writes
• Durability – committed data is never lost

ACID Transaction Support
“Serializable isolation levels ensure that readers never see inconsistent data”
- Delta Lake Documentation

Schema Enforcement
How to use schema validation and schema merge

Schema validation by default
• Delta validates the schema on write by default
• The write fails on a mismatch
• Or, set the schema merge option to let the table schema evolve
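A minimal sketch of the two behaviours, assuming `df` is a Spark DataFrame and `path` points at an existing Delta table. The helper name `append_to_delta` is hypothetical; `mergeSchema` is the Delta Lake writer option for schema merge.

```python
def append_to_delta(df, path, merge_schema=False):
    """Append a Spark DataFrame to the Delta table stored at `path`."""
    writer = df.write.format("delta").mode("append")
    if merge_schema:
        # Opt in to schema merge: columns present in df but missing
        # from the table are added instead of failing the write.
        writer = writer.option("mergeSchema", "true")
    # Without the option, a schema mismatch makes this call raise an error.
    writer.save(path)
```

Leaving `merge_schema` off by default mirrors the slide: validation is the default, and schema merge is a deliberate choice per write.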

Time Travel
Data version history in Delta
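Reading an older version of a table can be sketched as below, assuming `spark` is an active SparkSession with Delta Lake available. The helper name `read_delta_version` is hypothetical; `versionAsOf` and `timestampAsOf` are the Delta Lake reader options for time travel.

```python
def read_delta_version(spark, path, version=None, timestamp=None):
    """Load a Delta table as of a version number or timestamp (latest if neither is given)."""
    reader = spark.read.format("delta")
    if version is not None:
        # Pin the read to a specific commit version of the table.
        reader = reader.option("versionAsOf", version)
    elif timestamp is not None:
        # Pin the read to the table state at a point in time.
        reader = reader.option("timestampAsOf", timestamp)
    return reader.load(path)
```

For example, `read_delta_version(spark, path, version=0)` would return the table as it was after its first commit, which is handy for comparing against the current data after a bad write.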

Delta Log
“The transaction log is the mechanism through which
Delta Lake is able to offer the guarantee of atomicity.”
Reference: Databricks Blog: Unpacking the Transaction Log
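To make the idea concrete, here is a toy transaction log in plain Python. It is a heavily simplified sketch, not Delta Lake's implementation: it keeps only `add` and `remove` actions in numbered JSON commit files, and the function names `commit` and `files_at` and the file names are hypothetical.

```python
import json
import os
import tempfile


def commit(log_dir, actions):
    """Write the next numbered commit file; fail if that version already exists."""
    version = len([f for f in os.listdir(log_dir) if f.endswith(".json")])
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Mode "x" refuses to overwrite, so each version number is claimed
    # by exactly one writer.
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return version


def files_at(log_dir, version=None):
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        if version is not None and int(name.split(".")[0]) > version:
            break
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live


log_dir = tempfile.mkdtemp()
commit(log_dir, [{"add": {"path": "part-0.parquet"}}])
commit(log_dir, [{"add": {"path": "part-1.parquet"}}])
# A compaction: two files replaced by one, recorded in a single commit.
commit(log_dir, [{"remove": {"path": "part-0.parquet"}},
                 {"remove": {"path": "part-1.parquet"}},
                 {"add": {"path": "part-2.parquet"}}])
```

Because readers only trust what the log says, the compaction in the last commit becomes visible all at once or not at all. Replaying the log up to an earlier version (`files_at(log_dir, 1)`) is the same mechanism that makes time travel possible.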

Final thoughts
Delta Lake delivers some powerful capabilities

Delta Lake addresses
• ACID compliance
• Schema enforcement
• Compacting files
• Performance optimizations
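As one example of the compaction and performance points, the sketch below issues the Databricks `OPTIMIZE` command through `spark.sql`. It assumes `spark` is an active SparkSession on Azure Databricks; the helper name `optimize_table` and the table and column names in the usage note are hypothetical.

```python
def optimize_table(spark, table_name, zorder_columns=()):
    """Compact a Delta table's files, optionally co-locating rows by the given columns."""
    statement = f"OPTIMIZE {table_name}"
    if zorder_columns:
        # ZORDER BY clusters related values in the same files so that
        # filters on these columns can skip more data.
        statement += " ZORDER BY (" + ", ".join(zorder_columns) + ")"
    return spark.sql(statement)
```

Calling `optimize_table(spark, "events", ["event_date"])` would run `OPTIMIZE events ZORDER BY (event_date)`.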

References
• Video - Simplify and Scale Data Engineering Pipelines with Delta Lake - Amanda Moran
• Video - Building Data Intensive Application on Top of Delta Lakes
• Video - Why do we need Delta Lake for Spark? - Learning Journal
• Databricks Blog: Unpacking the Transaction Log
• Databricks Delta Lake - James Serra
• Databricks Delta Technical Guide - Jan 2019
• Productionizing Machine Learning with Delta Lake

Please use EventsXD to fill out a session evaluation.
Thank you!
