3D: DBT, Databricks and Delta
Fokko Driesprong
Principal Code Connaisseur
whoami
▪ Fokko Driesprong
▪ Master in Distributed Systems & Software Engineering
▪ Code Connaisseur at GoDataDriven
▪ Mostly doing {data,software} engineering
▪ Open source enthusiast
▪ ASF Member
▪ Committer + PMC member on Apache {Airflow, Avro, Druid}
▪ Committer on Apache Parquet
GoDataDriven
▪ Amsterdam-based consultancy
▪ Data {Engineering,Science,Strategy}
▪ And now also analytics engineering!
▪ Around 50 consultants
▪ Used to do Hadoop, now Cloud
Agenda
What is DBT?
DBT + Delta Lake
DBT + Azure Databricks
What’s DBT? And why I ❤ it so much
Data Build Tool
But everyone just calls it DBT
▪ Tool for building data pipelines following the DataOps principles
▪ Simple tool to build complex pipelines
▪ Best practices from software engineering
▪ Linting / Peer reviews / DRY principle / Data testing
▪ SQL First
▪ Encodes organisational knowledge into the pipeline
Try it yourself: https://godatadriven.com/blog/tutorial-for-dbt-analytics-engineering-made-easy/
Data Build Tool
But everyone just calls it DBT
▪ Main focus on integration with DWH platforms
▪ Postgres, Redshift, Snowflake and BigQuery
▪ Support for Spark / Databricks
▪ Created by Fishtown Analytics
▪ Huge open source community
▪ Apache 2.0 open source license
DBT
The T in Extract-Load-Transform (ELT)
Analysts using dbt can transform their data by simply writing select statements, while
dbt handles turning these statements into tables and views in a data warehouse.
Let’s go through a small example
Combine orders and order_lines into a revenue table
SQL with some Jinja2 sauce
Revenue table
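As a rough illustration of what such a revenue model computes, the sketch below joins orders and order_lines and sums the line amounts per order. It runs against an in-memory SQLite database rather than Spark, and the column names (order_id, created_at, quantity, unit_price) are assumptions for the example, not taken from the talk.

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption: the talk uses Spark).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, created_at TEXT);
    CREATE TABLE order_lines (order_id INTEGER, quantity INTEGER, unit_price REAL);
    INSERT INTO orders VALUES (1, '2021-05-01'), (2, '2021-05-02');
    INSERT INTO order_lines VALUES (1, 2, 10.0), (1, 1, 5.0), (2, 3, 4.0);
""")

# The model is just a SELECT statement; the runner wraps it in CREATE TABLE AS.
REVENUE_SQL = """
    SELECT o.order_id, o.created_at, SUM(l.quantity * l.unit_price) AS revenue
    FROM orders AS o
    JOIN order_lines AS l ON l.order_id = o.order_id
    GROUP BY o.order_id, o.created_at
"""
conn.execute("CREATE TABLE revenue AS " + REVENUE_SQL)

revenue_rows = conn.execute(
    "SELECT order_id, revenue FROM revenue ORDER BY order_id"
).fetchall()
print(revenue_rows)
```

The point of the split is that the analyst only writes the SELECT, while the tool decides how the result is materialized.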
DBT as a SQL Runner
Executes the pipeline from the command line
Seamless integration with the Databricks Metastore
▪ Requires a Hive Metastore
▪ Analyze table
DBT as a SQL Compiler
Compiled SQL: target/compiled/dbtpreprocessing/models/revenue.sql
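The compile step can be pictured as template rendering: references to other models are resolved into concrete table names before the SQL is sent to the engine. dbt does this with Jinja2; the sketch below uses a regular expression from the standard library as a simplified stand-in, and the function name compile_model and the schema name analytics are hypothetical.

```python
import re

# Matches {{ ref('model_name') }} with optional whitespace (simplified stand-in for Jinja2).
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def compile_model(model_sql: str, schema: str) -> str:
    """Replace every ref('name') placeholder with a schema-qualified table name."""
    return REF_PATTERN.sub(lambda match: f"{schema}.{match.group(1)}", model_sql)


model_sql = (
    "SELECT * FROM {{ ref('orders') }} AS o "
    "JOIN {{ ref('order_lines') }} AS l ON l.order_id = o.order_id"
)
compiled_sql = compile_model(model_sql, schema="analytics")
print(compiled_sql)
```

Because the references are explicit, a tool can also derive the dependency order of the models from them.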
Next to the SQL there is documentation
Give meaning to the columns and add constraints
Looking at the docs
dbt docs generate
dbt docs serve
▪ Columns including types
▪ Test constraints
▪ Statistics
▪ The compiled query
Testing
dbt test
▪ Not-null
▪ Uniqueness
▪ Accepted values
▪ Referential constraints
▪ Custom tests
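Each of these tests boils down to a query that selects the rows violating the rule; the test passes when the query returns no rows. The sketch below shows that idea for three of the checks on an in-memory SQLite table. The table contents and the accepted status values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, status TEXT);
    INSERT INTO orders VALUES (1, 'paid'), (2, 'shipped'), (2, 'paid'), (NULL, 'lost');
""")

# Each check selects the offending rows; zero rows means the test passes.
CHECKS = {
    "not_null": "SELECT * FROM orders WHERE order_id IS NULL",
    "unique": (
        "SELECT order_id FROM orders WHERE order_id IS NOT NULL "
        "GROUP BY order_id HAVING COUNT(*) > 1"
    ),
    "accepted_values": "SELECT * FROM orders WHERE status NOT IN ('paid', 'shipped')",
}

failures = {name: len(conn.execute(sql).fetchall()) for name, sql in CHECKS.items()}
for name, count in failures.items():
    print(f"{name}: {'PASS' if count == 0 else 'FAIL'} ({count} offending rows)")
```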
How does DBT communicate with Spark?
▪ SQL over HTTP
▪ Authenticate using the token
▪ Parallel execution
DBT with Delta Lake
Switch to incremental ingestion
Using the Delta format
▪ ACID data format by Databricks
▪ Linux Foundation
▪ Allows MERGE INTO
▪ Enables incremental imports
Switch to incremental Delta
If the table doesn’t exist (yet)
Switch to incremental Delta
Incremental MERGE INTO if the table exists
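The effect of an incremental merge is an upsert: rows whose key already exists are updated, new keys are inserted. The sketch below imitates that behaviour with SQLite's INSERT ... ON CONFLICT, which is only a stand-in for Delta's MERGE INTO; the table layout and the values are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (order_id INTEGER PRIMARY KEY, revenue REAL)")

# First run: the table did not exist yet, so everything is loaded.
conn.executemany("INSERT INTO revenue VALUES (?, ?)", [(1, 25.0), (2, 12.0)])

# Later run: order 2 changed and order 3 is new.
changed = [(2, 15.0), (3, 8.0)]
conn.executemany(
    "INSERT INTO revenue (order_id, revenue) VALUES (?, ?) "
    "ON CONFLICT(order_id) DO UPDATE SET revenue = excluded.revenue",
    changed,
)

merged_rows = conn.execute(
    "SELECT order_id, revenue FROM revenue ORDER BY order_id"
).fetchall()
print(merged_rows)
```

Order 1 is left untouched, which is what makes the second run cheaper than rebuilding the whole table.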
History
DESCRIBE HISTORY
In practice
Incremental imports
▪ Watermark column
▪ Only load the changed orders
▪ Also interesting for Users table
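A watermark column lets the pipeline ask the source only for rows that are newer than what the target already holds. The following sketch shows that filter on SQLite; the column name updated_at and the table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (order_id INTEGER, updated_at TEXT);
    CREATE TABLE target_orders (order_id INTEGER, updated_at TEXT);
    INSERT INTO source_orders VALUES
        (1, '2021-05-01'), (2, '2021-05-02'), (3, '2021-05-03');
    INSERT INTO target_orders VALUES (1, '2021-05-01'), (2, '2021-05-02');
""")

# Highest timestamp already loaded; empty string when the target is still empty.
watermark = conn.execute(
    "SELECT COALESCE(MAX(updated_at), '') FROM target_orders"
).fetchone()[0]

# Only rows changed after the watermark need to be processed.
new_rows = conn.execute(
    "SELECT order_id, updated_at FROM source_orders "
    "WHERE updated_at > ? ORDER BY order_id",
    (watermark,),
).fetchall()
print(watermark, new_rows)
```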
DBT Macros
Running it a second time
▪ Don’t Repeat Yourself
▪ Write a Macro instead
DBT with Azure Databricks
Observability is king
Keeping track of your pipelines
▪ Building trust
▪ Track aggregated metrics computed by Spark
▪ Application Insights
▪ Centralized system
Very simple Hive UDF
Keep track of stats over time
Small snippet of Scala
Sends the metrics to Application Insights
Use the UDF in DBT
Sends the metrics to Application Insights
▪ Register the UDF
▪ Keep track of
▪ Seconds since last order
▪ Number of orders
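The two metrics themselves are simple aggregates. The sketch below computes them in plain Python from a list of order timestamps; the function name order_metrics is hypothetical, and sending the values to Application Insights is deliberately left out.

```python
from datetime import datetime, timezone


def order_metrics(order_timestamps, now):
    """Return the number of orders and the seconds elapsed since the latest one."""
    latest = max(order_timestamps)
    return {
        "number_of_orders": len(order_timestamps),
        "seconds_since_last_order": (now - latest).total_seconds(),
    }


orders = [
    datetime(2021, 5, 3, 10, 0, tzinfo=timezone.utc),
    datetime(2021, 5, 3, 11, 0, tzinfo=timezone.utc),
]
metrics = order_metrics(orders, now=datetime(2021, 5, 3, 12, 0, tzinfo=timezone.utc))
print(metrics)
```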
Be proactive
Before there are angry managers at your desk
▪ Keep track of the metric
▪ Send alerts on business rules
▪ Outlier detection based on historical distributions
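One simple way to flag an outlier against a historical distribution is a z-score: compare the latest value with the mean of earlier values, measured in standard deviations. This is a minimal sketch of that idea, not the rule used in the talk; the threshold of 3 and the sample numbers are assumptions.

```python
from statistics import mean, stdev


def is_outlier(history, latest, threshold=3.0):
    """Flag `latest` when it lies more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All historical values are identical: any deviation is suspicious.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


daily_orders = [100, 104, 98, 101, 97, 103, 99]
print(is_outlier(daily_orders, 102))  # a normal day
print(is_outlier(daily_orders, 40))   # a sudden drop
```

An alert can then be sent whenever the check returns True, before anyone has to come asking.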
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
▪ Code available at:
▪ https://github.com/godatadriven/dbt-data-ai-summit
▪ https://github.com/godatadriven/azure-dbt-logger