Running Apache Airflow Workflows as ETL Processes on Hadoop
By: Robert Sanders
Page 2:
Agenda
• What is Apache Airflow?
• Features
• Architecture
• Terminology
• Operator Types
• ETL Best Practices
• How they’re supported in Apache Airflow
• Executing Airflow Workflows on Hadoop
• Use Cases
• Q&A
Page 3:
Robert Sanders
• Big Data Manager, Engineer, Architect, etc.
• Work for Clairvoyant LLC
• 5+ Years of Big Data Experience
• Email: robert.sanders@clairvoyantsoft.com
• LinkedIn: https://www.linkedin.com/in/robert-sanders-61446732
• Slide Share: http://www.slideshare.net/RobertSanders49
Page 4:
What’s the problem?
• As a Big Data Engineer you work to create jobs that will
perform various operations
• Ingest data from external data sources
• Transform data
• Run Predictions
• Export data
• Etc.
• You need to have some mechanism to schedule and run
these jobs
• Cron
• Oozie
• Existing Scheduling Services have a number of limitations
that make them difficult to work with and not usable in all
instances
Page 5:
What is Apache Airflow?
• Airflow is an Open Source platform to programmatically
author, schedule and monitor workflows
• Workflows as Code
• Schedules Jobs through Cron Expressions
• Provides monitoring tools like alerts and a web interface
• Written in Python
• As well as user defined Workflows and Plugins
• Was started in the fall of 2014 by Maxime Beauchemin at
Airbnb
• Apache Incubator Project
• Joined Apache Foundation in early 2016
• https://github.com/apache/incubator-airflow/
Page 6:
Why use Apache Airflow?
• Define Workflows as Code
• Makes workflows more maintainable, versionable, and
testable
• More flexible execution and workflow generation
• Lots of Features
• Feature Rich Web Interface
• Worker Processes Scale Horizontally and Vertically
• Can be a cluster or single node setup
• Lightweight Workflow Platform
Page 7:
Apache Airflow Features (Some of them)
• Automatic Retries
• SLA monitoring/alerting
• Complex dependency rules: branching, joining, sub-
workflows
• Defining ownership and versioning
• Resource Pools: limit concurrency + prioritization
• Plugins
• Operators
• Executors
• New Views
• Built-in integration with other services
• Many more…
Page 10:
What is a DAG?
• Directed Acyclic Graph
• A finite directed graph that doesn’t have any cycles
• A collection of tasks to run, organized in a way that reflects
their relationships and dependencies
• Defines your Workflow
Page 11:
What is an Operator?
• An operator describes a single task in a workflow
• Operators allow for generation of certain types of tasks that
become nodes in the DAG when instantiated
• All operators derive from BaseOperator and inherit many
attributes and methods that way
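As a rough illustration of that last point, a custom operator only needs to subclass BaseOperator and implement execute(). The class below is a hypothetical sketch, not part of the deck; import paths assume the Airflow 1.8-era API.

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class PrintMessageOperator(BaseOperator):
    """Hypothetical operator that just logs a message when its task runs."""

    @apply_defaults
    def __init__(self, message, *args, **kwargs):
        super(PrintMessageOperator, self).__init__(*args, **kwargs)
        self.message = message

    def execute(self, context):
        # execute() is what the worker calls when the task instance runs;
        # retries, logging, scheduling, etc. are inherited from BaseOperator
        self.log.info('Message: %s', self.message)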
Page 12:
Workflow Operators (Sensors)
• A type of operator that keeps running until a certain criterion is met (a minimal sketch follows this list)
• Periodically pokes
• Parameterized poke interval and timeout
• Example
• HdfsSensor
• HivePartitionSensor
• NamedHivePartitionSensor
• S3KeySensor
• WebHdfsSensor
• Many More…
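A minimal sketch of how a sensor works: Airflow calls poke() every poke_interval seconds until it returns True or timeout is exceeded. The class, file path, and intervals below are hypothetical; import paths assume the Airflow 1.x API, and the dag object is the one defined on the "Defining a DAG" slide.

import os
from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults

class LocalFileSensor(BaseSensorOperator):
    """Hypothetical sensor that waits for a file to appear on local disk."""

    @apply_defaults
    def __init__(self, filepath, *args, **kwargs):
        super(LocalFileSensor, self).__init__(*args, **kwargs)
        self.filepath = filepath

    def poke(self, context):
        # Called repeatedly by the framework; the task succeeds once this returns True
        return os.path.exists(self.filepath)

wait_for_file = LocalFileSensor(
    task_id='wait_for_file',
    filepath='/data/landing/input.csv',  # hypothetical path
    poke_interval=60,                    # poke every minute
    timeout=60 * 60,                     # fail the task after one hour
    dag=dag)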
Page 13:
Workflow Operators (Transfer)
• Operator that moves data from one system to another (see the example after this list)
• Data will be pulled from the source system, staged on the machine where the executor is running, and then transferred to the target system
• Example:
• HiveToMySqlTransfer
• MySqlToHiveTransfer
• HiveToMsSqlTransfer
• MsSqlToHiveTransfer
• S3ToHiveTransfer
• Many More…
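For example, a hedged sketch of a MySqlToHiveTransfer task: the connection IDs, SQL, and table names are placeholders, the parameter names come from the Airflow 1.x operator, and the dag object is the one from the "Defining a DAG" slide.

from airflow.operators.mysql_to_hive import MySqlToHiveTransfer

orders_to_hive = MySqlToHiveTransfer(
    task_id='orders_to_hive',
    sql='SELECT * FROM orders',        # pulled from the source MySQL system
    hive_table='staging.orders',       # loaded into the target Hive table
    mysql_conn_id='mysql_default',
    hive_cli_conn_id='hive_cli_default',
    dag=dag)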
Page 14:
Defining a DAG

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'Airflow',
    'start_date': datetime(2016, 1, 1),  # added: the scheduler needs a start_date (placeholder value)
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *')

# Define the Tasks
task1 = BashOperator(task_id='task1', bash_command="echo 'Task 1'", dag=dag)
task2 = BashOperator(task_id='task2', bash_command="echo 'Task 2'", dag=dag)
task3 = BashOperator(task_id='task3', bash_command="echo 'Task 3'", dag=dag)

# Define the task relationships
task1.set_downstream(task2)
task2.set_downstream(task3)

Resulting DAG: task1 → task2 → task3
Page 15:
Defining a DAG (Dynamically)

dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *')

last_task = None
for i in range(1, 4):  # creates task1, task2, task3
    task = BashOperator(
        task_id='task' + str(i),
        bash_command="echo 'Task " + str(i) + "'",
        dag=dag)
    # Chain each task to the previous one
    if last_task is not None:
        last_task.set_downstream(task)
    last_task = task

Resulting DAG: task1 → task2 → task3
Page 16:
ETL Best Practices (Some of Them)
• Load Data Incrementally
• Operators will receive an execution_date which you can use to pull in data for that date (see the sketch after this list)
• Process Historical Data
• Backfill operations are supported
• Enforce Idempotency (retry safe)
• Execute Conditionally
• Branching, Joining
• Understand SLAs and Alerts
• Alert on Failures
• Sense when to start a task
• Sensor Operators
• Build Validation into your Workflows
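A hedged sketch of the incremental-load point above: bash_command is templated, so {{ ds }} and {{ tomorrow_ds }} render as each run's execution date and the following day, and backfills automatically re-process the right slice. The source database, table, and paths are placeholders; BashOperator and dag are the ones from the "Defining a DAG" slide.

extract_orders_daily = BashOperator(
    task_id='extract_orders_daily',
    # {{ ds }} is the execution date (YYYY-MM-DD), so each DAG run and each
    # backfilled run only pulls its own day of data
    bash_command=(
        "sqoop import --connect jdbc:mysql://source-db/sales --table orders "
        "--target-dir /data/raw/sales/orders/{{ ds }} "
        "--where \"updated_at >= '{{ ds }}' AND updated_at < '{{ tomorrow_ds }}'\""
    ),
    dag=dag)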
Page 17:
Executing Airflow Workflows on Hadoop
• Airflow Workers should be installed on edge/gateway nodes
• Allows Airflow to interact with Hadoop related commands
• Utilize the BashOperator to run command line functions and interact with Hadoop services (see the sketch after this list)
• Put all necessary scripts and Jars in HDFS and pull the files down from HDFS during the execution of the script
• Avoids requiring you to keep copies of the scripts on every machine where the executors are running
• Support for Kerberized Clusters
• Airflow can renew Kerberos tickets for itself and store them in the ticket cache
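A hedged sketch of that pattern (the HDFS paths and script name are hypothetical; BashOperator and dag come from the "Defining a DAG" slide): the task pulls the script from HDFS onto the worker's local disk and then runs it.

run_transform = BashOperator(
    task_id='run_transform',
    # Fetch the script from HDFS at run time so workers don't need local copies,
    # then execute it with the Hive CLI
    bash_command=(
        'hdfs dfs -get -f /apps/etl/scripts/transform.hql /tmp/transform.hql && '
        'hive -f /tmp/transform.hql'
    ),
    dag=dag)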
Page 18:
Use Case (BPV)
• Daily ETL Batch Process to Ingest data into Hadoop
• Extract
• 23 databases total
• 1226 tables total
• Transform
• Impala scripts to join and transform data
• Load
• Impala scripts to load data into common final tables
• Other requirements
• Make it extensible to allow the client to import more databases and
tables in the future
• Status emails to be sent out after daily job to report on success and
failures
• Solution
• Create a DAG that dynamically generates the workflow based on data in a Metastore (see the sketch below)
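A hedged sketch of that solution (the metastore connection, metadata table, and Sqoop command are placeholders; imports assume the Airflow 1.x API, and BashOperator/dag come from the "Defining a DAG" slide): the DAG file queries the metastore when it is parsed and emits one ingest task per registered source table, so adding a row to the metastore adds a task to the workflow.

from airflow.hooks.mysql_hook import MySqlHook

metastore = MySqlHook(mysql_conn_id='etl_metastore')  # hypothetical connection
tables = metastore.get_records('SELECT db_name, table_name FROM ingest_tables')

for db_name, table_name in tables:
    # One Sqoop ingest task per table registered in the metastore
    cmd = ('sqoop import --connect jdbc:mysql://source-db/{db} '
           '--table {tbl} --target-dir /data/raw/{db}/{tbl}').format(db=db_name, tbl=table_name)
    BashOperator(
        task_id='ingest_{}_{}'.format(db_name, table_name),
        bash_command=cmd,
        dag=dag)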
Page 19:
Use Case (BPV) (Architecture)
Page 20:
Use Case (BPV) (DAG)
100-foot view / 10,000-foot view
Page 21:
Use Case (Kogni)
• New product being built by Clairvoyant, consisting of:
• kogni-inspector – Sensitive Data Analyzer
• kogni-ingestor – Ingests Data
• kogni-guardian – Sensitive Data Masking (Encrypt and Tokenize)
• Other components coming soon
• Utilizes Airflow for Data Ingestion and Masking
• Dynamically creates a workflow based off what is in the
Metastore
• Learn More: http://kogni.io/
Page 22:
Use Case (Kogni) (Architecture)
Page 23:
References
• https://pythonhosted.org/airflow/
• https://gtoonstra.github.io/etl-with-airflow/principles.html
• https://github.com/apache/incubator-airflow
• https://media.readthedocs.org/pdf/airflow/latest/airflow.pdf
Q&A