Introduction to Apache Airflow
Programmatically Manage Your Workflows
for Data Engineering
Xiaodong DENG
XD-DENG.com
August 2018
SHORT-BIO
• Education
  • M.Sc. in Mathematics, National University of Singapore, Singapore (2014-2016)
  • B.Sc. in Applied Mathematics, Beijing Forestry University, China (2010-2014)
• Work Experience
  • July 2018 - Present: Data Engineer, DBS Bank
  • May 2017 - June 2018: Assistant Manager, Advanced Analytics, Manulife Insurance
  • April 2016 - April 2017: Data Analytics Specialist, AXA Insurance
LET’S IMAGINE - A VERY SIMPLE USE CASE
Query your metadata database to decide if the batch job should be run today.
You have 5 external data sources.
For each data source, the data will be passed to you via S3. Two of them are
expected to arrive at 3AM, and three of them are expected to arrive at 4AM.
If the SLA is missed, send a notification to an email list.
If the data arrives on time, move it to your Hive storage. If not, retry until 7AM
before failing the whole batch job and sending out a failure notification.
When all files are in place, submit a pre-defined Spark job.
When a SUCCESS signal is returned from Spark, write a record to your log (a MySQL
database).
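The "retry until 7AM" rule is the kind of logic a hand-written script would have to carry itself. Below is a minimal sketch of it, assuming a simple polling approach. The function name `wait_until_deadline`, the 5-minute interval, and the injected `now`/`sleep` callables are hypothetical choices for illustration and come from neither the slides nor Airflow.

```python
from datetime import datetime, timedelta
import time
from typing import Callable


def wait_until_deadline(
    has_arrived: Callable[[], bool],
    deadline: datetime,
    poll_interval: timedelta = timedelta(minutes=5),
    now: Callable[[], datetime] = datetime.now,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll has_arrived() until it returns True or no further poll fits before the deadline."""
    while True:
        if has_arrived():
            return True
        # Give up when the next poll would fall after the deadline.
        if now() + poll_interval > deadline:
            return False
        sleep(poll_interval.total_seconds())


# Usage with a fake clock, so the example finishes instantly.
clock = {"t": datetime(2018, 8, 1, 4, 0)}


def fake_now() -> datetime:
    return clock["t"]


def fake_sleep(seconds: float) -> None:
    clock["t"] += timedelta(seconds=seconds)


checks = iter([False, False, True])  # the file shows up on the third check
arrived = wait_until_deadline(
    lambda: next(checks),
    deadline=datetime(2018, 8, 1, 7, 0),
    now=fake_now,
    sleep=fake_sleep,
)
print(arrived, clock["t"].time())
```

Passing the clock and sleep functions in as parameters keeps the sketch testable without waiting. A real script would use the defaults and would still need to send the SLA and failure notifications.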
LET’S IMAGINE - A VERY SIMPLE USE CASE
“Scripting + Cron would do!”
LET’S IMAGINE - A VERY “SIMPLE” USE CASE
What if you have hundreds of workflows to manage?
LET’S IMAGINE - A NO LONGER SIMPLE USE CASE
• Scalability in terms of management
  • How do you manually manage scripts and cron expressions for hundreds of workflows?
• Scalability in terms of execution
  • For performance, you may want to run your jobs on multiple worker nodes. How do you manage them?
• Environment dependencies
  • Different jobs may have different dependencies, e.g., Spark or a network proxy.
• Connections to different systems (RDBMS, AWS, Hive, HDFS, etc.)
  • Each comes with configuration such as host address, port, ID, password, and schema. How do you manage them in a centralised fashion?
• Monitoring
  • How do you monitor the status of each step? Which batch job failed? At which step? For what reason?
• Re-running
  • How can you re-run a specific step? Run it manually, or make an ad-hoc change to the script? Neither is ideal.
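To make the connection question concrete, here is a rough sketch of what "centralised" could mean: one registry that maps a short ID to host, port, and schema, so individual scripts refer to the ID only. The class names, field names, and host names below are hypothetical and do not describe any particular tool's connection model.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Connection:
    """Settings needed to reach one external system (hypothetical fields)."""
    conn_id: str
    host: str
    port: int
    login: Optional[str] = None
    schema: Optional[str] = None


class ConnectionRegistry:
    """Single place where all connection settings are kept and looked up."""

    def __init__(self) -> None:
        self._connections: Dict[str, Connection] = {}

    def register(self, conn: Connection) -> None:
        if conn.conn_id in self._connections:
            raise ValueError(f"duplicate connection id: {conn.conn_id}")
        self._connections[conn.conn_id] = conn

    def get(self, conn_id: str) -> Connection:
        try:
            return self._connections[conn_id]
        except KeyError:
            raise KeyError(f"unknown connection id: {conn_id}") from None


registry = ConnectionRegistry()
registry.register(
    Connection("log_mysql", host="mysql.example.internal", port=3306,
               login="etl", schema="batch_log")
)
registry.register(Connection("hive_default", host="hive.example.internal", port=10000))

# A job asks for a connection by ID instead of hard-coding host and port.
print(registry.get("log_mysql").port)
```

Secrets such as passwords are left out of the sketch on purpose. In practice they would come from a secured store rather than from source code.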
APACHE AIRFLOW (INCUBATING)
• Started in 2014 at Airbnb
• Became an Apache incubator project in 2016
• Written in Python
• 500+ contributors (according to GitHub history)
• A platform to programmatically author, schedule and monitor workflows
• Workflows are defined as directed acyclic graphs (DAGs) and configured as Python scripts.
• Supports distributed execution
• Friendly interface
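The reason workflows are modelled as directed acyclic graphs is that a valid execution order exists exactly when the dependencies contain no cycle. The sketch below illustrates this with Python's standard-library `graphlib` (Python 3.9+) rather than Airflow itself. The task names are hypothetical and loosely follow the use case from the earlier slide.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
workflow = {
    "check_metadata_db": set(),
    "wait_for_s3_files": {"check_metadata_db"},
    "load_into_hive": {"wait_for_s3_files"},
    "submit_spark_job": {"load_into_hive"},
    "write_log_record": {"submit_spark_job"},
}

# A topological order runs every task after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)

# Making the first task depend on the last one introduces a cycle,
# so no execution order exists any more.
cyclic = dict(workflow, check_metadata_db={"write_log_record"})
try:
    list(TopologicalSorter(cyclic).static_order())
    is_acyclic = True
except CycleError:
    is_acyclic = False
print(is_acyclic)
```

Because this particular graph is a simple chain, the order is unique. With branching dependencies, several orders can be valid and independent tasks could run in parallel.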
APACHE AIRFLOW (INCUBATING)
[Architecture diagram with local executors: Webserver, Scheduler, DAGs, Metadata database, Local Executors]
Reference: https://www.slideshare.net/sumitmaheshwari007/apache-airflow
APACHE AIRFLOW (INCUBATING)
[Architecture diagram with distributed workers: Webserver, Scheduler, Broker, Distributed Workers, DAGs, Metadata database]
Reference: https://www.slideshare.net/sumitmaheshwari007/apache-airflow
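The broker-and-workers layout can be pictured with a small in-process analogy: a producer puts tasks on a queue and several workers pull from it. This is only a sketch under loud assumptions. Threads stand in for worker nodes, `queue.Queue` stands in for the broker, and `run_with_workers` is a hypothetical helper, not part of Airflow.

```python
import queue
import threading
from typing import Any, Callable, Dict, List, Tuple

Task = Tuple[str, Callable[[], Any]]


def run_with_workers(tasks: List[Task], num_workers: int = 3) -> Dict[str, Tuple[str, Any]]:
    """Distribute tasks over worker threads through a shared queue."""
    broker: "queue.Queue" = queue.Queue()  # stand-in for the message broker
    results: Dict[str, Tuple[str, Any]] = {}
    lock = threading.Lock()

    def worker(name: str) -> None:
        while True:
            item = broker.get()
            if item is None:  # sentinel: no more work for this worker
                broker.task_done()
                return
            task_id, func = item
            try:
                outcome = func()
            except Exception as exc:  # record the failure instead of killing the worker
                outcome = exc
            with lock:
                results[task_id] = (name, outcome)
            broker.task_done()

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for item in tasks:  # the "scheduler" side: enqueue the work
        broker.put(item)
    for _ in threads:  # one sentinel per worker
        broker.put(None)
    broker.join()
    for t in threads:
        t.join()
    return results


tasks = [(f"task_{i}", (lambda i=i: i * i)) for i in range(5)]
results = run_with_workers(tasks)
print(sorted((task_id, outcome) for task_id, (_, outcome) in results.items()))
```

Which worker picks up which task is not deterministic, so the results record the worker name alongside each outcome.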
DEMO
WHERE IS YOUR DEMO?!
Thanks!

