Airflow
Workflow management system
Ilias OKACHA
Index
- Workflow Management Systems
- Architecture
- Building blocks
- More features
- User Interface
- Security
- CLI
- Demo
WTH is a Workflow Management System?
A workflow management system is data-centric software (a framework) for:
- Setting up
- Performing
- Monitoring
of a defined sequence of processes and tasks
Popular Workflow Management Systems
Airflow Architecture
SequentialExecutor / LocalExecutor
CeleryExecutor
HA + CeleryExecutor
● MesosExecutor : already available in contrib package
● KubernetesExecutor ??
Building blocks
DAGs:
- DAG stands for Directed Acyclic Graph
- A DAG is a collection of all the tasks you want to run
- DAGs describe how to run a workflow
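To make "directed acyclic graph" concrete, here is a small sketch that uses Python's standard-library `graphlib` (Python 3.9+) instead of Airflow itself. The task names and the `run_order` / `has_cycle` helpers are hypothetical; the sketch only shows that a DAG yields a valid run order and that a cycle is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_order(deps):
    """Return one valid execution order for the dependency mapping."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Return True if the dependencies cannot form a DAG."""
    try:
        run_order(deps)
    except CycleError:
        return True
    return False


order = run_order(dependencies)
```

Because this example is a simple chain, only one order is possible. With branching dependencies several orders can be valid, and any of them respects the graph.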
Operators:
- An operator describes a single task in a workflow
- Operators determine what actually gets done
- Operators generally run independently (atomic)
- The DAG makes sure that operators run in the correct order
- They may run on completely different machines
Operators: there are 3 main types of operators:
● Operators that perform an action, or tell another system to perform an action
● Transfer operators, which move data from one system to another
● Sensors, a type of operator that keeps running until a certain criterion is met.
○ Examples include a specific file landing in HDFS or S3.
○ A partition appearing in Hive.
○ A specific time of the day.
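The "keep running until a criterion is met" behaviour of a sensor can be pictured as a polling loop. The sketch below is plain Python, not Airflow code: the names `poke`, `poke_interval` and `timeout` are borrowed from Airflow's sensor vocabulary, while `wait_for` and `file_has_landed` are hypothetical helpers. Returning `False` on timeout is a simplification chosen for this sketch.

```python
import time


def wait_for(poke, poke_interval=0.01, timeout=1.0):
    """Call poke() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


# Hypothetical criterion: the "file" shows up on the third check.
attempts = {"count": 0}


def file_has_landed():
    attempts["count"] += 1
    return attempts["count"] >= 3


landed = wait_for(file_has_landed)
```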
Operators :
- Operators :
- BashOperator
- PythonOperator
- EmailOperator
- SimpleHttpOperator
- MySqlOperator
- SqliteOperator
- PostgresOperator
- MsSqlOperator
- OracleOperator
- JdbcOperator
- DockerOperator
- HiveOperator
- SlackOperator
Operators :
- Transfers :
- S3FileTransformOperator
- PrestoToMySqlTransfer
- MySqlToHiveTransfer
- S3ToHiveTransfer
- BigQueryToCloudStorageOperator
- GenericTransfer
- HiveToDruidTransfer
- HiveToMySqlTransfer
Operators :
- Sensors :
- ExternalTaskSensor
- HdfsSensor
- HttpSensor
- MetastorePartitionSensor
- HivePartitionSensor
- S3KeySensor
- S3PrefixSensor
- SqlSensor
- TimeDeltaSensor
- TimeSensor
- WebHdfsSensor
Tasks : a parameterized instance of an operator
Task Instance: DAG + Task + point in time
- A specific run of a task
- A task assigned to a DAG
- Has a state associated with a specific run of the DAG
- Possible states include:
- running
- success
- failed
- skipped
- up for retry
- …
Workflows :
● DAG: a description of the order in which work should take place
● Operator: a class that acts as a template for carrying out some work
● Task: a parameterized instance of an operator
● Task Instance: a task that
○ Has been assigned to a DAG
○ Has a state associated with a specific run of the DAG
● By combining DAGs and Operators to create TaskInstances, you can build complex workflows.
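The operator / task / task instance vocabulary can be mirrored with a few plain Python classes. This is a toy model under stated assumptions, not Airflow's classes: `EchoOperator`, the simplified `TaskInstance`, and the initial `"scheduled"` state are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


class EchoOperator:
    """Hypothetical operator: a template for 'produce a message' work."""

    def __init__(self, task_id, message):
        # Supplying parameters turns the template into a concrete task.
        self.task_id = task_id
        self.message = message

    def execute(self):
        return f"{self.task_id}: {self.message}"


@dataclass
class TaskInstance:
    """A task bound to one run of a DAG, carrying the state of that run."""

    dag_id: str
    task: EchoOperator
    execution_date: datetime
    state: str = "scheduled"

    def run(self):
        self.state = "running"
        try:
            result = self.task.execute()
        except Exception:
            self.state = "failed"
            raise
        self.state = "success"
        return result


# Two tasks created from the same operator class.
hello = EchoOperator(task_id="say_hello", message="hello")
goodbye = EchoOperator(task_id="say_goodbye", message="goodbye")

# One task instance: DAG + task + point in time.
ti = TaskInstance("demo_dag", hello, datetime(2018, 1, 1))
output = ti.run()
```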
More features
- Features :
- Hooks
- Connections
- Variables
- XComs
- SLA
- Pools
- Queues
- Trigger Rules
- Branching
- SubDAGs
Hooks :
- Interface to external platforms and databases :
- Hive
- S3
- MySQL
- PostgreSQL
- HDFS
- Pig
- …
- Act as building blocks for operators
- Use Connections to retrieve authentication information
- Keep authentication details out of pipelines
Connections:
Connection information for external systems is stored in the Airflow metadata database and managed in the UI
Example of Hook + Connection:
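One way to picture how a hook and a connection fit together is the toy sketch below. It does not use Airflow's API: `Connection`, `CONNECTIONS`, `DatabaseHook`, the `db://` URI scheme and all credential values are hypothetical stand-ins. The only point is that pipeline code names a connection id, while the credentials live elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    conn_id: str
    host: str
    login: str
    password: str


# Stand-in for the metadata database; the values are made up.
CONNECTIONS = {
    "reporting_db": Connection("reporting_db", "db.example.com", "etl", "s3cret"),
}


class DatabaseHook:
    """Hypothetical hook: resolves a connection id into usable settings."""

    def __init__(self, conn_id):
        self.conn_id = conn_id

    def get_connection(self):
        return CONNECTIONS[self.conn_id]

    def get_uri(self):
        conn = self.get_connection()
        return f"db://{conn.login}:{conn.password}@{conn.host}"


def pipeline_step():
    # The pipeline only knows the connection id, never the credentials.
    hook = DatabaseHook(conn_id="reporting_db")
    return hook.get_uri()


uri = pipeline_step()
```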
Variables:
- A generic way to store and retrieve arbitrary content or settings as a simple key-value store within Airflow.
- Variables can be listed, created, updated and deleted from the UI (Admin -> Variables), code or CLI.
- While your pipeline code definition and most of your constants and variables should be defined in code and stored in source control, it can be useful to have some variables or configuration items accessible and modifiable through the UI.
XCom, or cross-communication:
● Lets tasks exchange messages, allowing shared state.
● Defined by a key, value, and timestamp.
● Also tracks attributes like the task/DAG that created the XCom and when it should become visible.
● Any object that can be pickled can be used as an XCom value.
XComs can be:
● Pushed (sent):
○ By calling xcom_push()
○ By returning a value from an operator's execute() method or from a PythonOperator's python_callable
● Pulled (received): by calling xcom_pull()
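The push/pull idea can be sketched with an in-memory store. This is a toy stand-in, not Airflow's implementation: `XComStore` and its method signatures are hypothetical, although the method names echo `xcom_push()` / `xcom_pull()` and the default key `"return_value"` mirrors the key Airflow uses for returned values.

```python
from datetime import datetime, timezone


class XComStore:
    """Toy in-memory stand-in for XCom storage."""

    def __init__(self):
        self._rows = {}

    def xcom_push(self, dag_id, task_id, execution_date, key, value):
        # Each message is identified by who created it, for which run, and a key.
        self._rows[(dag_id, task_id, execution_date, key)] = {
            "value": value,
            "timestamp": datetime.now(timezone.utc),
        }

    def xcom_pull(self, dag_id, task_id, execution_date, key="return_value"):
        row = self._rows.get((dag_id, task_id, execution_date, key))
        return None if row is None else row["value"]


store = XComStore()
run = datetime(2018, 1, 1)

# An upstream task shares a list of files with downstream tasks.
store.xcom_push("demo_dag", "list_files", run, "return_value", ["a.csv", "b.csv"])

files = store.xcom_pull("demo_dag", "list_files", run)
missing = store.xcom_pull("demo_dag", "other_task", run)
```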
SLA:
- Service Level Agreement: the time by which a task or DAG should have succeeded.
- Can be set at the task level as a timedelta.
- An alert email is sent detailing the list of tasks that missed their SLA.
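A rough sketch of an SLA check expressed with a timedelta follows. It is a simplification: the deadline is measured from a hypothetical `started_at` timestamp, which is an assumption of this sketch, and Airflow's own reference point differs in detail. `missed_sla` is not an Airflow function.

```python
from datetime import datetime, timedelta


def missed_sla(started_at, sla, finished_at=None, now=None):
    """Return True if the task did not succeed within `sla` of `started_at`."""
    deadline = started_at + sla
    if finished_at is not None:
        return finished_at > deadline
    # Still running: the SLA is missed once the deadline has passed.
    current = now if now is not None else datetime.now()
    return current > deadline


start = datetime(2018, 1, 1, 0, 0)
one_hour = timedelta(hours=1)

on_time = missed_sla(start, one_hour, finished_at=datetime(2018, 1, 1, 0, 45))
late = missed_sla(start, one_hour, finished_at=datetime(2018, 1, 1, 1, 30))
still_running = missed_sla(start, one_hour, now=datetime(2018, 1, 1, 2, 0))
```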
Pools :
- Some systems can get overwhelmed when too many processes hit them at the same time.
- Limit the execution parallelism on arbitrary sets of tasks.
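The effect of a pool can be imitated with a bounded semaphore: however many tasks are started, only `slots` of them touch the protected system at once. This is an illustration with standard-library threads, not how Airflow schedules work; the `Pool` class and `query_database` function are hypothetical.

```python
import threading
import time


class Pool:
    """Toy pool: at most `slots` callables run at the same time."""

    def __init__(self, slots):
        self._slots = threading.BoundedSemaphore(slots)
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0

    def run(self, work):
        with self._slots:  # blocks until a slot is free
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return work()
            finally:
                with self._lock:
                    self.running -= 1


def query_database():
    time.sleep(0.01)  # stands in for work against a fragile system


db_pool = Pool(slots=2)
threads = [
    threading.Thread(target=db_pool.run, args=(query_database,))
    for _ in range(6)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

After all six tasks finish, `db_pool.peak` records the highest concurrency observed, which never exceeds the two configured slots.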
Queues (CeleryExecutor only):
- Every task can be assigned a specific queue name
- By default, both workers and tasks use the queue named by the default_queue setting
- Workers can listen to multiple queues
- Very useful when specialized workers are needed (GPU, Spark…)
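Queue-based routing boils down to matching a task's queue name against the queues each worker listens to. The sketch below is plain Python with hypothetical worker names and a hypothetical `eligible_workers` helper; it does not involve Celery or Airflow.

```python
def eligible_workers(task_queue, workers):
    """Return the names of workers that listen to the given queue."""
    return sorted(
        name for name, queues in workers.items() if task_queue in queues
    )


# Hypothetical fleet: each worker maps to the set of queues it listens to.
workers = {
    "worker-1": {"default"},
    "worker-gpu": {"default", "gpu"},
    "worker-spark": {"spark"},
}

gpu_targets = eligible_workers("gpu", workers)
default_targets = eligible_workers("default", workers)
```

A task sent to a queue that no worker listens to has no eligible worker, so in this model `eligible_workers("missing", workers)` returns an empty list.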
Trigger Rules:
Though the normal workflow behavior is to trigger tasks when all their directly upstream tasks have succeeded, Airflow allows for more complex dependency settings.
All operators have a trigger_rule argument, which defines the rule by which the generated task gets triggered. The default value for trigger_rule is all_success, which means "trigger this task when all directly upstream tasks have succeeded". All other rules described here are based on direct parent tasks and are values that can be passed to any operator while creating tasks:
● all_success: (default) all parents have succeeded
● all_failed: all parents are in a failed or upstream_failed state
● all_done: all parents are done with their execution
● one_failed: fires as soon as at least one parent has failed; it does not wait for all parents to be done
● one_success: fires as soon as at least one parent succeeds; it does not wait for all parents to be done
● dummy: dependencies are just for show, trigger at will
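The rules above can be read as predicates over the states of a task's direct parents. The function below is a simplified sketch, not Airflow's scheduler logic: `should_trigger` is hypothetical, the set of "done" states is an assumption, and counting `upstream_failed` as a failure for `one_failed` is also an assumption of this sketch.

```python
# States treated as "finished" in this sketch (an assumption).
DONE_STATES = {"success", "failed", "upstream_failed", "skipped"}
FAILED_STATES = {"failed", "upstream_failed"}


def should_trigger(rule, parent_states):
    """Decide whether a task may run, given its direct parents' states."""
    states = list(parent_states)
    if rule == "all_success":
        return all(s == "success" for s in states)
    if rule == "all_failed":
        return all(s in FAILED_STATES for s in states)
    if rule == "all_done":
        return all(s in DONE_STATES for s in states)
    if rule == "one_failed":
        # Fires early: other parents may still be running.
        return any(s in FAILED_STATES for s in states)
    if rule == "one_success":
        # Fires early: other parents may still be running.
        return any(s == "success" for s in states)
    if rule == "dummy":
        return True
    raise ValueError(f"unknown trigger rule: {rule}")


cleanup_can_run = should_trigger("all_done", ["success", "failed", "skipped"])
alert_can_run = should_trigger("one_failed", ["running", "failed"])
```

A cleanup task is a typical use of `all_done`: it should run whether its parents succeeded or failed, but only after all of them have finished.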
User Interface
- DAGs view
- Tree view
- Graph view
- Gantt view
- Task duration
- Data Profiling: SQL queries
- Data Profiling: charts
CLI
https://airflow.apache.org/cli.html
airflow variables [-h] [-s KEY VAL] [-g KEY] [-j] [-d VAL] [-i FILEPATH] [-e FILEPATH] [-x KEY]
airflow connections [-h] [-l] [-a] [-d] [--conn_id CONN_ID]
[--conn_uri CONN_URI] [--conn_extra CONN_EXTRA]
[--conn_type CONN_TYPE] [--conn_host CONN_HOST]
[--conn_login CONN_LOGIN] [--conn_password CONN_PASSWORD]
[--conn_schema CONN_SCHEMA] [--conn_port CONN_PORT]
airflow pause [-h] [-sd SUBDIR] dag_id
airflow test [-h] [-sd SUBDIR] [-dr] [-tp TASK_PARAMS] dag_id task_id execution_date
airflow backfill [-h] [-t TASK_REGEX] [-s START_DATE] [-e END_DATE] dag_id
airflow clear DAG_ID
airflow resetdb [-h] [-y]
...
Security
By default, all access is open.
Supported:
● Web authentication with:
○ Password
○ LDAP
○ Custom auth
○ Kerberos
○ OAuth
■ GitHub Enterprise authentication
■ Google Authentication
● Impersonation (run as other $USER)
● Secure access via SSL
Demo
1. Facebook Ads insights data pipeline.
2. Run a PySpark script on an ephemeral Dataproc cluster only when S3 input data is available.
3. Useless workflow: Hook + Connection + Operators + Sensors + XCom + (SLA):
○ List s3 files (hooks)
○ Share state with the next task (xcom)
○ Write content to s3 (hooks)
○ Resume the workflow when an S3 DONE.FLAG file is ready (sensor)
Resources
https://airflow.apache.org
http://www.clairvoyantsoft.com/assets/whitepapers/GuideToApacheAirflow.pdf
https://speakerdeck.com/artwr/apache-airflow-at-airbnb-introduction-and-lessons-learned
https://www.slideshare.net/sumitmaheshwari007/apache-airflow
Thanks
