Data Lake Demonstration
Building Data Lakes with Apache Airflow
Gary A. Stafford
Twitter/LinkedIn: GaryStafford
Blog: garystafford.medium.com
Agenda
What is a Data Lake?
Dataset
Architecture
Source Code
Demonstration
What is a Data Lake?
“A data lake is a central location that holds a large amount of data in its native, raw
format. Compared to a hierarchical data warehouse, which stores data in files or
folders, a data lake uses a flat architecture and object storage to store the data.”
- Databricks
“A centralized repository that allows you to store all your structured and
unstructured data at any scale. You can store your data as-is, without having to
first structure the data, and run different types of analytics—from dashboards and
visualizations to big data processing, real-time analytics, and machine learning to
guide better decisions.” - AWS
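The "flat architecture" in the Databricks definition can be illustrated with a toy model. This is only a sketch: the in-memory dictionary stands in for an object store such as Amazon S3, and the key names and the `list_keys` helper are hypothetical, not taken from the demo. The point is that objects live in one flat key space, and what looks like a folder is just a shared key prefix.

```python
# Toy stand-in for an object store: a single flat mapping from key to bytes.
# All keys below are hypothetical examples, not paths from the demo.
store: dict[str, bytes] = {
    "raw/sales/part-0001.csv": b"salesid,qtysold\n1,2\n",
    "raw/sales/part-0002.csv": b"salesid,qtysold\n2,4\n",
    "raw/venue/part-0001.csv": b"venueid,venuename\n1,Hall A\n",
}


def list_keys(objects: dict[str, bytes], prefix: str) -> list[str]:
    """Return every key that starts with the prefix, in sorted order.

    There is no directory tree to walk; a "folder" listing is only a
    prefix filter over the flat key space.
    """
    return sorted(key for key in objects if key.startswith(prefix))


sales_keys = list_keys(store, "raw/sales/")
print(sales_keys)
```

Data is stored as-is (here, raw CSV bytes); any structure is applied later by whatever reads the objects.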
Dataset
TICKIT database
E-commerce platform
Bringing together buyers and sellers of tickets to entertainment events
Designed to demonstrate Amazon Redshift Cloud Data Warehouse
Small database consisting of seven tables: two fact tables and five dimension tables
Tables: Categories, Events, Venues, Users, Listings, Sales, Dates
docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html
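To make the fact/dimension split concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes a heavily simplified subset of the TICKIT schema (three tables, a few columns each), and the sample rows are invented purely for illustration. A fact table (sales) is joined through a dimension (event) to another dimension (category) to aggregate revenue.

```python
import sqlite3

# In-memory database with a simplified, assumed subset of the TICKIT tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE category (catid INTEGER PRIMARY KEY, catname TEXT);
    CREATE TABLE event (eventid INTEGER PRIMARY KEY, catid INTEGER, eventname TEXT);
    CREATE TABLE sales (salesid INTEGER PRIMARY KEY, eventid INTEGER,
                        qtysold INTEGER, pricepaid REAL);
    """
)

# Invented sample rows, for illustration only.
conn.executemany("INSERT INTO category VALUES (?, ?)", [(1, "Musicals"), (2, "Pop")])
conn.executemany(
    "INSERT INTO event VALUES (?, ?, ?)",
    [(10, 1, "Show A"), (11, 2, "Concert B")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(100, 10, 2, 200.0), (101, 10, 1, 100.0), (102, 11, 4, 400.0)],
)


def revenue_by_category(connection: sqlite3.Connection) -> list[tuple[str, float]]:
    """Sum the price paid per category by joining the fact table to its dimensions."""
    query = """
        SELECT c.catname, SUM(s.pricepaid) AS revenue
        FROM sales AS s
        JOIN event AS e ON e.eventid = s.eventid
        JOIN category AS c ON c.catid = e.catid
        GROUP BY c.catname
        ORDER BY c.catname
    """
    return connection.execute(query).fetchall()


result = revenue_by_category(conn)
print(result)
```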
Table    | Simulated Datasource                            | Demo Datasource
Category | Software as a Service (SaaS) 3rd Party Provider | Amazon RDS for PostgreSQL
Event    | Software as a Service (SaaS) 3rd Party Provider | Amazon RDS for PostgreSQL
Venue    | Software as a Service (SaaS) 3rd Party Provider | Amazon RDS for PostgreSQL
Listing  | COTS E-commerce Platform                        | Amazon RDS for MySQL
Sales    | COTS E-commerce Platform                        | Amazon RDS for MySQL
Date     | COTS E-commerce Platform                        | Amazon RDS for MySQL
Users    | Custom Customer Relationship Management (CRM)   | Amazon RDS for SQL Server
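The table-to-datasource mapping above could be captured as plain configuration. The following is a sketch, not code from the demo repository: the lowercase table names, the engine labels, and the `tables_by_engine` helper are all hypothetical. Grouping tables by engine is one way an ingestion step might decide which connection to use for each table.

```python
from collections import defaultdict

# Hypothetical configuration mirroring the datasource table:
# table name -> database engine of the demo datasource.
TABLE_SOURCES = {
    "category": "postgresql",
    "event": "postgresql",
    "venue": "postgresql",
    "listing": "mysql",
    "sales": "mysql",
    "date": "mysql",
    "users": "sqlserver",
}


def tables_by_engine(sources: dict[str, str]) -> dict[str, list[str]]:
    """Group table names by engine, with each group sorted alphabetically."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for table, engine in sources.items():
        grouped[engine].append(table)
    return {engine: sorted(tables) for engine, tables in grouped.items()}


groups = tables_by_engine(TABLE_SOURCES)
print(groups)
```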
Architecture
Architecture: AWS Services Used
Amazon Simple Storage Service (Amazon S3)
AWS Glue Studio (alt. AWS Glue DataBrew)
AWS Glue Data Catalog (alt. Apache Hive on EMR)
AWS Glue Crawlers (alt. CDC with AWS DMS or Kafka Connect)
AWS Glue Jobs (alt. AWS Glue DataBrew, or Apache Spark or Presto on EMR)
Amazon Athena (alt. Presto on EMR)
Amazon Managed Workflows for Apache Airflow (MWAA) (alt. AWS Step Functions)
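How an orchestrator sequences these services can be sketched as a dependency graph. The example below deliberately does not use the Airflow API; it uses the standard-library `graphlib` module to show the underlying idea of a directed acyclic graph (DAG) of tasks. The task names and the ordering of steps are assumptions made for illustration (crawl the sources, ingest per engine, crawl the ingested data, transform, then query) and are not taken from the slides or the demo repository.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "crawl_sources": set(),                # e.g. Glue Crawlers catalog the source tables
    "ingest_postgresql": {"crawl_sources"},  # e.g. Glue Jobs copy tables into S3
    "ingest_mysql": {"crawl_sources"},
    "ingest_mssql": {"crawl_sources"},
    "crawl_raw": {"ingest_postgresql", "ingest_mysql", "ingest_mssql"},
    "transform_refined": {"crawl_raw"},    # e.g. Glue Jobs reshape the raw data
    "query_athena": {"transform_refined"},  # e.g. Athena queries the refined data
}


def run_order(pipeline: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


order = run_order(PIPELINE)
print(order)
```

The three ingest tasks have no dependencies on each other, so a scheduler is free to run them in parallel; only their position relative to the crawl steps is fixed.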
Architecture: Out of Scope (but critically important)
Change Data Capture (CDC): Handling changes to systems of record (SoR)
Transactional Storage Layer: Managing changes to the systems of record in the data lake
Streaming Data: Data continuously generated by different sources
Fine-grained Authorization: database-, table-, column-, and row-level access
Data Lineage: Tracking data’s lifecycle as it flows from sources to consumption
Data Discovery/Inspection: Scanning data for sensitive or unexpected content (PII)
DataOps: Automating testing, deployment, job execution
Infrastructure as Code (IaC): Infrastructure provisioning automation
Data Warehousing (Lake House architecture)
Data Lake Storage Tiering, Archival, and Backup
Source Code
github.com/garystafford/tickit-data-lake-demo
Demonstration
