Accelerating Data Ingestion with Databricks Autoloader
Simon Whiteley
Director of Engineering, Advancing Analytics
Agenda
▪ Why Incremental is Hard
▪ Autoloader Components
▪ Implementation
▪ Evolution
▪ Lessons
Why Incremental is Hard
Incremental Ingestion
[Diagram: LANDING → ? → BRONZE → SILVER]
Incremental Ingestion
▪ Only Read New Files
▪ Don’t Miss Files
▪ Trigger Immediately
▪ Repeatable Pattern
▪ Fast over large directories
Existing Patterns – 1) ETL Metadata
etl batch read
{"lastRead":"2021/05/26"}
Contents:
• /2021/05/24/file 1
• /2021/05/25/file 2
• /2021/05/26/file 3
• /2021/05/27/file 4
.load(f"/{loadDate}/")
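A minimal sketch of the metadata pattern above, in plain Python with no Spark dependency. The helper names `parse_folder_date` and `folders_to_load` and the in-memory `state` dictionary are illustrative assumptions, not part of the talk. The point is only that the job stores a `lastRead` watermark and builds load paths for the dated folders that come after it.

```python
from datetime import date, datetime

def parse_folder_date(path: str) -> date:
    """Extract the yyyy/MM/dd prefix from a path like '/2021/05/27/file 4'."""
    year, month, day = path.strip("/").split("/")[:3]
    return date(int(year), int(month), int(day))

def folders_to_load(paths, last_read: str):
    """Return the dated folders strictly newer than the stored watermark."""
    watermark = datetime.strptime(last_read, "%Y/%m/%d").date()
    newer = {parse_folder_date(p) for p in paths if parse_folder_date(p) > watermark}
    return [f"/{d:%Y/%m/%d}/" for d in sorted(newer)]

# Hypothetical stored metadata and landing contents, mirroring the slide.
state = {"lastRead": "2021/05/26"}
contents = [
    "/2021/05/24/file 1",
    "/2021/05/25/file 2",
    "/2021/05/26/file 3",
    "/2021/05/27/file 4",
]

load_paths = folders_to_load(contents, state["lastRead"])
print(load_paths)
```

Each returned path would be passed to a batch `.load(...)`, after which the job updates `lastRead`. Nothing runs until something polls for new folders, which is the weakness of this pattern.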
Existing Patterns – 2) Spark File Streaming
file stream read
Contents:
• File 1
• File 2
• File 3
• File 4
Checkpoint:
• File 1
• File 2
• File 3
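As a rough illustration of the checkpoint comparison above, here is a plain-Python sketch. The function name `discover_new_files` is hypothetical and this is not how Spark implements its file source internally; it only shows why the whole directory has to be listed on every run.

```python
def discover_new_files(directory_listing, checkpoint):
    """Compare a full directory listing against the checkpoint of processed files."""
    seen = set(checkpoint)
    # Every run walks the whole listing, so the cost grows with the directory.
    return [f for f in directory_listing if f not in seen]

listing = ["File 1", "File 2", "File 3", "File 4"]
checkpoint = ["File 1", "File 2", "File 3"]

new_files = discover_new_files(listing, checkpoint)
checkpoint.extend(new_files)  # record the new files as processed
print(new_files)
```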
Existing Patterns – 3) DIY
[Diagram: Blob File Trigger → Logic App / Azure Function → Databricks Job API → triggered batch read]
Incremental Ingestion Approaches
Approach         | Good At               | Bad At
Metadata ETL     | Repeatable            | Not immediate, requires polling
File Streaming   | Repeatable, immediate | Slows down over large directories
DIY Architecture | Immediate triggering  | Not repeatable
Databricks Autoloader
"Auto Loader is an optimized cloud file source for Apache Spark that loads data continuously and efficiently from cloud storage as new data arrives."
Prakash Chockalingam, Databricks Engineering Blog
What is Autoloader?
Essentially, Autoloader combines our three approaches of:
• Storing Metadata about what has been read
• Using Structured Streaming for immediate processing
• Utilising Cloud-Native Components to optimise identifying
arriving files
There are two parts to the Autoloader job:
• CloudFiles DataReader
• CloudNotification Services (optional)
Cloudfiles Reader
[Diagram: Blob Storage → Blob Storage Queue → Dataframe]
Blob Storage contents:
• File 1.json
• File 2.json
• File 3.json
• File 4.json
Queue message: {"fileAdded":"/landing/file 4.json"}
• Check files in queue
• Read specific files from source
CloudFiles DataReader
df = (
    spark.readStream
    .format("cloudFiles")                           # tells Spark to use Autoloader
    .option("cloudFiles.format", "json")            # tells Autoloader to expect JSON files
    .option("cloudFiles.useNotifications", "true")  # whether Autoloader should use the notification queue
    .schema(mySchema)
    .load("/mnt/landing/")
)
Cloud Notification Services - Azure
[Diagram: Blob Storage → Event Grid Topic → three Event Grid Subscriptions, each feeding its own Blob Storage Queue]
Cloud Notification Services - Azure
Blob Storage
• New file arrives, triggers Event Topic
• Subscription checks message filters, inserts into queue
{"fileAdded":"/file 4/"}
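The flow above can be sketched as a small in-memory simulation. Everything here (`Topic`, `Subscription`, `drain`, the path-prefix filter) is a hypothetical stand-in written for illustration; it does not call Event Grid or Queue Storage, and real message filters are richer than a prefix check.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Subscription:
    path_prefix: str                              # message filter for this subscription
    queue: deque = field(default_factory=deque)   # the queue this subscription feeds

    def offer(self, event: dict) -> None:
        # Only events that pass the filter are inserted into the queue.
        if event["fileAdded"].startswith(self.path_prefix):
            self.queue.append(event)

@dataclass
class Topic:
    subscriptions: list = field(default_factory=list)

    def publish(self, path: str) -> None:
        # A new file arriving raises one event, offered to every subscription.
        event = {"fileAdded": path}
        for sub in self.subscriptions:
            sub.offer(event)

def drain(queue: deque) -> list:
    """Read and remove every pending message, returning the file paths."""
    paths = []
    while queue:
        paths.append(queue.popleft()["fileAdded"])
    return paths

landing = Subscription("/landing/")
other = Subscription("/other/")
topic = Topic([landing, other])

topic.publish("/landing/file 4.json")
files_to_read = drain(landing.queue)
print(files_to_read)
```

The reader only ever touches the files named in the queue, so the size of the directory no longer matters.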
NotificationServices Config
cloudFiles
.useNotifications – Directory Listing vs Notification Queue
.queueName – Use an Existing Queue
.connectionString – Queue Storage Connection
Service Principal for Queue Creation:
.subscriptionId
.resourceGroup
.tenantId
.clientId
.clientSecret
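One way to keep these settings together is a small helper that builds the option dictionary. This is a sketch under assumptions: the function `notification_options` is hypothetical, the placeholder values are not real credentials, and treating "existing queue" and "service principal" as two alternatives follows the grouping on the slide. Which options are required together can vary by runtime version, so check the Auto Loader documentation for your environment.

```python
def notification_options(queue_name=None, connection_string=None, service_principal=None):
    """Build cloudFiles options for notification mode.

    Either point at an existing queue, or supply service principal details
    so that the queue and subscription can be created.
    """
    options = {"cloudFiles.useNotifications": "true"}
    if queue_name and connection_string:
        options["cloudFiles.queueName"] = queue_name
        options["cloudFiles.connectionString"] = connection_string
    elif service_principal:
        required = ["subscriptionId", "resourceGroup", "tenantId", "clientId", "clientSecret"]
        missing = [k for k in required if k not in service_principal]
        if missing:
            raise ValueError(f"missing service principal settings: {missing}")
        for key in required:
            options[f"cloudFiles.{key}"] = service_principal[key]
    else:
        raise ValueError("provide an existing queue or service principal details")
    return options

# Placeholder values only; in practice these would come from a secret store.
opts = notification_options(
    service_principal={
        "subscriptionId": "<subscription-id>",
        "resourceGroup": "<resource-group>",
        "tenantId": "<tenant-id>",
        "clientId": "<client-id>",
        "clientSecret": "<client-secret>",
    }
)
print(sorted(opts))
```

The resulting dictionary could then be passed to the reader with `.options(**opts)`.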
Implementing Autoloader
▪ Setup Steps
▪ Reading New Files
▪ A Basic ETL Setup
Delta Implementation
Practical Implementations
BRONZE SILVER
LANDING
Autoloader
Low Frequency Streams
[Diagram: one file per day → Autoloader on a 24/7 cluster]
Low Frequency Streams
[Diagram: one file per day → Autoloader on a 1/7 cluster]
(df
    .writeStream
    .trigger(once=True)
    .start(path)
)
Autoloader can be combined with trigger.Once – each run finds only files not processed since last run
Delta Merge
Autoloader
Merge?
Delta Merge
Autoloader
def runThis(df, batchId):
    # Runs once per micro-batch with an ordinary batch DataFrame
    (df
        .write
        .save(path)
    )

(df
    .writeStream
    .foreachBatch(runThis)
    .start()
)
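Because the function passed to `foreachBatch` receives an ordinary batch DataFrame, a Delta merge can go where the plain write is. The sketch below is one possible shape, not the speaker's implementation: `make_upsert`, `start_merge_stream`, the `ID` key, and the paths are assumptions, and `batch_df.sparkSession` assumes a reasonably recent PySpark. It needs a Spark environment with Delta Lake to actually run.

```python
def make_upsert(target_path: str, key: str = "ID"):
    """Return a foreachBatch handler that merges each micro-batch into a Delta table."""
    def upsert_batch(batch_df, batch_id):
        # Imported lazily so the sketch can be loaded without delta-spark installed.
        from delta.tables import DeltaTable

        # MERGE needs at most one source row per key, so de-duplicate the batch first.
        # Note: dropDuplicates keeps an arbitrary row for each key.
        source = batch_df.dropDuplicates([key]).alias("s")
        target = DeltaTable.forPath(batch_df.sparkSession, target_path).alias("t")
        (target
            .merge(source, f"t.{key} = s.{key}")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())
    return upsert_batch

def start_merge_stream(df, target_path: str, checkpoint_path: str):
    """Wire the handler into a run-once stream; both paths are hypothetical."""
    return (df.writeStream
        .foreachBatch(make_upsert(target_path))
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)
        .start())
```

The checkpoint location is what lets each run pick up only the files that arrived since the previous run.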
Delta Implementation
▪ Batch ETL Pattern
▪ Merge Statements
▪ Logging State
Evolving Schemas
New Features since Databricks Runtime 8.2
What is Schema Evolution?
{"ID":1,"ProductName":"Belt"}
{"ID":2,"ProductName":"T-Shirt","Size":"XL"}
{"ID":3,"ProductName":"Shirt","Size":"14",
 "Care":{"DryClean":"Yes",
         "MachineWash":"Don’t you dare"
        }
}
How do we handle Evolution?
1. Fail the Stream
2. Manually Intervene
3. Automatically Evolve
In order to manage schema evolution, we need to know:
• What the schema is expected to be
• What the schema is now
• How we want to handle any changes in schema
Schema Inference
From Databricks Runtime 8.2 onwards, simply don’t provide a schema to enable schema inference. Autoloader infers the schema once when the stream is started and stores it as metadata.
cloudFiles
.schemaLocation – where to store the schema
.inferColumnTypes – sample data to infer types
.schemaHints – manually specify data types for certain columns
Schema Metastore
_schemas
File 0, written on first read of:
{"ID":1,"ProductName":"Belt"}
{
  "type": "struct",
  "fields": [
    {
      "name": "ID",
      "type": "string",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "ProductName",
      "type": "string",
      "nullable": true,
      "metadata": {}
    }
  ]
}
Schema Metastore – DataType Inference
.option("cloudFiles.inferColumnTypes","true")
_schemas
File 0, written on first read of:
{"ID":1,"ProductName":"Belt"}
{
  "type": "struct",
  "fields": [
    {
      "name": "ID",
      "type": "int",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "ProductName",
      "type": "string",
      "nullable": true,
      "metadata": {}
    }
  ]
}
Schema Metastore – Schema Hints
.option("cloudFiles.schemaHints","ID long")
_schemas
File 0, written on first read of:
{"ID":1,"ProductName":"Belt"}
{
  "type": "struct",
  "fields": [
    {
      "name": "ID",
      "type": "long",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "ProductName",
      "type": "string",
      "nullable": true,
      "metadata": {}
    }
  ]
}
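The three behaviours above (default inference, type inference, hints) can be mimicked with a toy function in plain Python. This is only a model of the idea: `infer_schema` is a hypothetical helper, the type names simply mirror the slides, and the real inference in Autoloader samples files and may choose different types.

```python
import json

def infer_schema(sample_lines, infer_column_types=False, schema_hints=None):
    """Toy schema inference over JSON lines, returning {column: type name}."""
    type_names = {bool: "boolean", int: "int", float: "double", str: "string"}
    schema = {}
    for line in sample_lines:
        for name, value in json.loads(line).items():
            if infer_column_types:
                schema[name] = type_names.get(type(value), "string")
            else:
                schema[name] = "string"  # without type inference everything is a string
    # Hints are applied last, overriding whatever was inferred.
    for hint in (schema_hints or "").split(","):
        if hint.strip():
            name, type_name = hint.split()
            schema[name] = type_name
    return schema

sample = ['{"ID":1,"ProductName":"Belt"}']

print(infer_schema(sample))
print(infer_schema(sample, infer_column_types=True))
print(infer_schema(sample, infer_column_types=True, schema_hints="ID long"))
```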
Schema Evolution
To allow for schema evolution, we can include a schema evolution mode option:
cloudFiles.schemaEvolutionMode
• addNewColumns – Fail the job, update the schema metastore
• failOnNewColumns – Fail the job, no updates made
• rescue – Do not fail, pull all unexpected data into _rescued_data
Evolution Reminder
Record 1: {"ID":1,"ProductName":"Belt"}
Record 2: {"ID":2,"ProductName":"T-Shirt","Size":"XL"}
Record 3: {"ID":3,"ProductName":"Shirt","Size":"14","Care":{"DryClean":"Yes","MachineWash":"Don’t you dare"}}
Schema Evolution - Rescue
Record | ID | Product Name | _rescued_data
1      | 1  | Belt         |
2      | 2  | T-Shirt      | {"Size":"XL"}
3      | 3  | Shirt        | {"Size":"14","Care":{"DryC…
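A plain-Python sketch of what rescue mode does to each record, assuming the expected schema is just `ID` and `ProductName`. The `rescue` function is hypothetical and simplified; the real `_rescued_data` column may carry extra details beyond the unexpected fields.

```python
import json

def rescue(record: dict, expected_columns):
    """Split a record into expected columns plus a _rescued_data JSON string."""
    row = {col: record.get(col) for col in expected_columns}
    unexpected = {k: v for k, v in record.items() if k not in expected_columns}
    # Anything not in the schema is kept as JSON instead of failing the stream.
    row["_rescued_data"] = json.dumps(unexpected) if unexpected else None
    return row

expected = ["ID", "ProductName"]
records = [
    {"ID": 1, "ProductName": "Belt"},
    {"ID": 2, "ProductName": "T-Shirt", "Size": "XL"},
    {"ID": 3, "ProductName": "Shirt", "Size": "14",
     "Care": {"DryClean": "Yes", "MachineWash": "Don't you dare"}},
]

rows = [rescue(r, expected) for r in records]
for row in rows:
    print(row)
```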
Schema Evolution – Add New Columns
On arrival of record 2:
{"ID":2,"ProductName":"T-Shirt","Size":"XL"}
_schemas: 0, 1
{
  "type": "struct",
  "fields": [
    {
      "name": "ID",
      "type": "string"
    },
    {
      "name": "ProductName",
      "type": "string"
    },
    {
      "name": "Size",
      "type": "string"
    }…
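The fail-then-restart behaviour of the evolution modes can be modelled in a few lines of plain Python. This is a toy under assumptions: `check_batch` and `SchemaChanged` are hypothetical names, the list of dictionaries stands in for the versioned files under `_schemas`, and new columns are simply typed as strings.

```python
class SchemaChanged(Exception):
    """Raised to stop the simulated stream when unexpected columns arrive."""

def check_batch(records, schema_versions, mode="addNewColumns"):
    """schema_versions is a list of {column: type} dicts, newest last."""
    current = schema_versions[-1]
    new_columns = sorted({k for r in records for k in r} - set(current))
    if not new_columns or mode == "rescue":
        return current  # nothing new, or rescue mode: carry on with the stored schema
    if mode == "addNewColumns":
        evolved = dict(current)
        evolved.update({name: "string" for name in new_columns})
        schema_versions.append(evolved)  # the new version is stored before failing
    # Both addNewColumns and failOnNewColumns stop the job at this point.
    raise SchemaChanged(f"new columns: {new_columns}")

versions = [{"ID": "string", "ProductName": "string"}]
batch = [{"ID": 2, "ProductName": "T-Shirt", "Size": "XL"}]

failure = None
try:
    check_batch(batch, versions)
except SchemaChanged as err:
    failure = str(err)

# Restarting the job picks up the latest stored schema and now succeeds.
restarted_schema = check_batch(batch, versions)
print(failure, restarted_schema)
```

In practice this means a job using addNewColumns needs something to restart it after the failure, such as job retries.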
Schema Evolution
▪ Inference & The Schema Metastore
▪ Schema Hints
▪ Schema Evolution
Lessons from an Autoloader Life
Autoloader Lessons
▪ EventGrid Quotas & Settings
▪ Streaming Best Practices
▪ Batching Best Practices
EventGrid Quota Lessons
• You can have at most 500 event subscriptions (one per Autoloader stream) on a single storage account when using the system topic
• Deleting checkpoint will reset the stream ID and create
a new Subscription/Queue, leaving an orphan set
• Use the CloudNotification Libraries to manage this
more closely with custom topics
Streaming Optimisation
• cloudFiles.maxBytesPerTrigger / cloudFiles.maxFilesPerTrigger
Manages the size of the streaming microbatch
• cloudFiles.fetchParallelism
Manages the workload on your queue
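To show what limiting the micro-batch size means, here is a toy planner in plain Python. The function `plan_microbatches` and the file sizes are made up for illustration, and the exact semantics of the real options (for example, how strictly a byte limit is enforced) may differ from this model.

```python
def plan_microbatches(files, max_files=None, max_bytes=None):
    """Group (name, size_in_bytes) pairs into micro-batches under the given limits."""
    batches, current, current_bytes = [], [], 0
    for name, size in files:
        over_files = max_files is not None and len(current) >= max_files
        over_bytes = (max_bytes is not None and len(current) > 0
                      and current_bytes + size > max_bytes)
        if over_files or over_bytes:
            # Close the current batch before adding this file.
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# Hypothetical backlog of files waiting to be processed.
backlog = [("a", 60), ("b", 60), ("c", 30), ("d", 30), ("e", 200)]

by_count = plan_microbatches(backlog, max_files=2)
by_size = plan_microbatches(backlog, max_bytes=100)
print(by_count)
print(by_size)
```

In this model a single file larger than the byte limit still forms its own batch, so the limit acts as a target rather than a hard cap.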
Batch Lessons – Look for Lost Messages
Default message time-to-live: 7 days!
Databricks Autoloader
▪ Reduces complexity of ingesting files
▪ Has some quirks in implementing ETL processes
▪ Growing number of schema evolution features
Simon Whiteley
Director of
Engineering
hello@advancinganalytics.co.uk
@MrSiWhiteley
www.youtube.com/c/AdvancingAnalytics
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
