DA332
Lace Lofranco
Orchestrating Big Data Pipelines
with Azure Data Factory
Session Objective
Learn to design, build and manage big data
orchestration pipelines using Azure Data Factory
Agenda
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data Factory
Types of analytics
Top-down (Deductive)
Bottom-up (Inductive)
What happened?
Why did it happen?
Big Data Pipelines Examples
Velocity, Variety, Volume
Big Data Pipelines
Data Sources, Ingest, Analyze, Publish
Data Lake
Lambda Architecture
Data architecture for designing Big Data applications
Three layers: Batch, Speed, Serving
Popularized by Nathan Marz
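A minimal sketch of how the three layers fit together, assuming a simple per-sensor count as the metric. The function names, sensor IDs and numbers are hypothetical and are not taken from the session. The batch view is recomputed from the full history, the speed view covers only events the batch run has not yet seen, and the serving layer merges the two at query time.

```python
from collections import defaultdict


def aggregate(events):
    """Sum counts per sensor; used here for both the batch and the speed view."""
    totals = defaultdict(int)
    for sensor_id, count in events:
        totals[sensor_id] += count
    return dict(totals)


def serve(batch_view, speed_view):
    """Serving layer: combine the precomputed batch view with the recent speed view."""
    merged = dict(batch_view)
    for sensor_id, count in speed_view.items():
        merged[sensor_id] = merged.get(sensor_id, 0) + count
    return merged


# Hypothetical data: history already processed by the batch layer,
# plus recent events that only the speed layer has seen so far.
historical = [("sensor-a", 120), ("sensor-b", 80), ("sensor-a", 30)]
recent = [("sensor-a", 5), ("sensor-c", 7)]

batch = aggregate(historical)
speed = aggregate(recent)
result = serve(batch, speed)
print(result)
```

Once the next batch run absorbs the recent events, the speed view for that period can be discarded, which keeps the speed layer small.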
Lambda Architecture (Azure components)
IoT Hub, Event Hubs
HDInsight, Data Factory
Speed layer: Storm Streaming
HDInsight (LLAP/Spark), SQL Database, Power BI
Predicting tram load based on foot traffic
Data sources:
https://data.melbourne.vic.gov.au
https://www.ptv.vic.gov.au
Sensor locations
Solution Architecture
Data Factory
HTTP: data.melbourne.vic.gov.au
Storage blob
Data Lake
Batch
HDInsight
Machine Learning
SQL DW
Power BI
Data Factory Concepts
Data Store: where the data lives
Linked Service: connection from Data Factory to a data store
Dataset: the data within a data store that an activity reads or writes
Activity: consumes input datasets and produces output datasets
Data Store, Linked Service, Dataset, Activity, Dataset, Linked Service, Data Store
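The relationship between these concepts can be sketched as a chain of references: an activity names its input and output datasets, each dataset names a linked service, and each linked service points at a data store. The snippet below is an illustration only. The dictionary layout is simplified from the JSON that Data Factory uses, and every name in it (BlobStore, Warehouse, RawFootTraffic, FootTrafficTable, CopyFootTraffic) is hypothetical.

```python
# Simplified, hypothetical definitions; not a complete Data Factory JSON schema.
linked_services = {
    "BlobStore": {"type": "AzureStorage"},
    "Warehouse": {"type": "AzureSqlDW"},
}

datasets = {
    "RawFootTraffic": {"linkedServiceName": "BlobStore"},
    "FootTrafficTable": {"linkedServiceName": "Warehouse"},
}

activity = {
    "name": "CopyFootTraffic",
    "type": "Copy",
    "inputs": ["RawFootTraffic"],
    "outputs": ["FootTrafficTable"],
}


def resolve(activity, datasets, linked_services):
    """Follow activity -> dataset -> linked service for every input and output."""

    def lookup(dataset_name):
        service_name = datasets[dataset_name]["linkedServiceName"]
        store_type = linked_services[service_name]["type"]
        return (dataset_name, service_name, store_type)

    return {
        "inputs": [lookup(name) for name in activity["inputs"]],
        "outputs": [lookup(name) for name in activity["outputs"]],
    }


chain = resolve(activity, datasets, linked_services)
print(chain)
```

Keeping connection details in the linked service and not in the dataset means several datasets can share one connection.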
Scheduling and Execution
Source Dataset, Sink Dataset
Demo: Copy Activity
Data Factory Design Considerations
• Ideal for time series data
• Design repeatable activity windows
• Handle ‘late’ / out-of-order runs
• Try to finalize pipeline schedules in advance
• Currently no first-class support for on-demand / event-driven pipelines or control flow activities
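The idea of repeatable activity windows can be sketched as follows, assuming fixed-length, non-overlapping windows between a pipeline start and end time. The helper names and dates are hypothetical. A late record is assigned to a window by its event time, so re-running that window produces the same slice regardless of when the run happens.

```python
from datetime import datetime, timedelta


def activity_windows(start, end, interval):
    """Split [start, end) into consecutive windows of length `interval`."""
    windows = []
    window_start = start
    while window_start + interval <= end:
        windows.append((window_start, window_start + interval))
        window_start += interval
    return windows


def window_for(event_time, start, interval):
    """Find the window a record belongs to, based on its event time."""
    index = (event_time - start) // interval  # timedelta // timedelta gives an int
    window_start = start + index * interval
    return (window_start, window_start + interval)


# Hypothetical schedule: three hourly windows.
start = datetime(2017, 2, 14, 0, 0)
end = datetime(2017, 2, 14, 3, 0)
hour = timedelta(hours=1)

windows = activity_windows(start, end, hour)
late = window_for(datetime(2017, 2, 14, 1, 45), start, hour)
print(windows)
print(late)
```

If each run writes its output to a location keyed by the window start, a re-run overwrites the earlier result and does not append duplicates.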
Agenda
Monitor pipeline health
Developer tools
Data Factory vs Oozie
Lambda Architecture
Intro to Data Factory
Considerations
Data Movement Activity: Copy
Cloud to Cloud
On-premises to Cloud
Data Management Gateway
Client agent installed in the on-premises environment to
copy data between cloud and on-premises data stores
Supported Data Stores
Category   Data store                     Source  Sink
Azure      Azure Blob storage             ✔       ✔
           Azure Data Lake Store          ✔       ✔
           Azure SQL Database             ✔       ✔
           Azure SQL Data Warehouse       ✔       ✔
           Azure Table storage            ✔       ✔
           Azure DocumentDB               ✔       ✔
           Azure Search Index             ✔       ✔
Databases  SQL Server*                    ✔       ✔
           Oracle*                        ✔       ✔
           MySQL*                         ✔
           DB2*                           ✔
           Teradata*                      ✔
           PostgreSQL*                    ✔
           Sybase*                        ✔
           Cassandra*                     ✔
           MongoDB*                       ✔
           Amazon Redshift                ✔
File       File System*                   ✔       ✔
           HDFS*                          ✔
           Amazon S3                      ✔
           FTP                            ✔
Others     Salesforce                     ✔
           Generic ODBC*                  ✔
           Generic OData                  ✔
           Web Table (table from HTML)    ✔
           GE Historian*                  ✔
Data stores marked with * can be on-premises or on Azure IaaS, and require Data Management Gateway to be installed on an on-premises/Azure IaaS machine.
Copy Wizard
Considerations
• Copy service typically runs in the region closest to the sink data store
• Performance & Tuning
Data Transformation Activities
Stored Procedure Activity
Azure SQL Database
Azure SQL Data Warehouse
Use PolyBase for loading large datasets
SQL Server Database (on-premises or IaaS)
Needs Data Management Gateway
PolyBase and CTAS
PolyBase to access non-relational sources
'CREATE TABLE AS SELECT' (CTAS) is a fully parallelized operation that
creates and loads a table from a SELECT statement
A super-charged version of the SELECT INTO statement
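As a rough illustration of the CTAS pattern, the helper below assembles a statement that loads a warehouse table from an external (PolyBase) table. The function name, the table names and the choice of ROUND_ROBIN distribution are assumptions made for this sketch; the session does not show the statement used in the demo. Identifiers are interpolated without escaping, so this is suitable for illustration only.

```python
def build_ctas(target_table, external_table, distribution="ROUND_ROBIN"):
    """Build a CREATE TABLE AS SELECT statement (hypothetical helper)."""
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (DISTRIBUTION = {distribution})\n"
        f"AS SELECT * FROM {external_table};"
    )


# Hypothetical table names: ext.FootTraffic is assumed to be an external
# table defined over files in blob storage or Data Lake.
statement = build_ctas("dbo.FootTraffic", "ext.FootTraffic")
print(statement)
```

A statement like this could sit inside the stored procedure that the Stored Procedure Activity invokes, so each scheduled run reloads the table in parallel.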
Demo: Stored Procedure Activity
HDInsight Activities
Hive Activity Advanced Properties
Property                  Hadoop config
coreConfiguration         core-site.xml
hBaseConfiguration        hbase-site.xml
hdfsConfiguration         hdfs-site.xml
hiveConfiguration         hive-site.xml
mapReduceConfiguration    mapred-site.xml
oozieConfiguration        oozie-site.xml
stormConfiguration        storm-site.xml
yarnConfiguration         yarn-site.xml
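The mapping above can be expressed as a small lookup that reports which Hadoop configuration files a given set of properties would affect. This is a sketch under assumptions: the helper name and the example property values are hypothetical, and where exactly these properties sit in the Data Factory JSON is not shown in the session.

```python
# Property name -> Hadoop config file, as listed on the slide.
HADOOP_CONFIG_FILES = {
    "coreConfiguration": "core-site.xml",
    "hBaseConfiguration": "hbase-site.xml",
    "hdfsConfiguration": "hdfs-site.xml",
    "hiveConfiguration": "hive-site.xml",
    "mapReduceConfiguration": "mapred-site.xml",
    "oozieConfiguration": "oozie-site.xml",
    "stormConfiguration": "storm-site.xml",
    "yarnConfiguration": "yarn-site.xml",
}


def config_files_touched(properties):
    """Return the Hadoop config files affected by the given properties."""
    return sorted(
        HADOOP_CONFIG_FILES[name]
        for name in properties
        if name in HADOOP_CONFIG_FILES
    )


# Hypothetical property block; the values are example settings only.
example_properties = {
    "hiveConfiguration": {"hive.exec.parallel": "true"},
    "yarnConfiguration": {"yarn.nodemanager.resource.memory-mb": "8192"},
    "clusterSize": 4,  # unrelated property, ignored by the lookup
}

touched = config_files_touched(example_properties)
print(touched)
```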
Demo: Hive Activity
AzureML Activities
Batch Execution Activity
Scoring datasets
Retraining models
Update Resource Activity
Updating ML models
Batch Execution Activity
1. Use Web service inputs/outputs
2. Use AzureML import/export modules
Retraining your model
Machine Learning: publish web services (training and scoring)
Data for training: Training web service (Batch Execution Service) produces an .iLearner file
Update Trained Model: the .iLearner file updates the Scoring web service
Data for scoring: Updated Scoring web service (Batch Execution Service) produces Predictions
Batch Execution Activity: runs the training and scoring web services
Update Resource Activity: updates the Scoring web service with the retrained model
Demo: AzureML Batch Scoring Activity
.NET Custom Activity
Use when a data source or sink is not supported by ADF
Compute: Azure Batch or HDInsight
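using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactories.Models;
using Microsoft.Azure.Management.DataFactories.Runtime;

public class ApiDownloadActivity : IDotNetActivity
{
    // Called by Data Factory for each activity run (slice).
    public IDictionary<string, string> Execute(
        IEnumerable<LinkedService> linkedServices,
        IEnumerable<Dataset> datasets,
        Activity activity,
        IActivityLogger logger)
    {
        // your code
        return new Dictionary<string, string>();
    }
}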
.NET Custom Activity on Batch
Data Factory: Slices 1 to 5, one Activity Run per slice
Batch: Tasks 1 to 5, distributed across compute nodes
Demo: Custom Activity on Batch
Agenda
Data Movement
Data Transformation
Data Factory vs Oozie
Big data pipelines
Lambda Architecture
Data Factory Concepts
Monitor and Manage Dashboard
Pausing and resuming pipelines
Creating alerts
Re-running failed pipelines
Developer tools
Azure Portal
VS Data Factory Extension
PowerShell
Azure Resource Manager Templates
C# SDK
Agenda
Data Movement
Data Transformation
Monitor pipeline health
Developer tools
Big data pipelines
Lambda Architecture
Data Factory Concepts
What is Oozie?
Oozie is a workflow scheduler system to manage
Apache Hadoop jobs
De-facto workflow scheduler of the Hadoop stack
Out-of-the-box integration with Hadoop jobs
Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and shell
Sample Oozie workflow
Data Factory vs Oozie
Sources and sinks
  Data Factory: Azure, Hadoop stack, numerous 3rd party
  Oozie: Hadoop stack, relational through Sqoop, shell, DistCp
Hybrid pipelines
  Data Factory: Data Management Gateway
  Oozie: None
Tooling
  Data Factory: Visual Studio, Portal, PowerShell
  Oozie: Hue, Eclipse plugin, 3rd party
Extensibility
  Data Factory: C# Custom Activity
  Oozie: Java Action Node
Control flow
  Data Factory: Limited
  Oozie: Fork and Join, Decision Control, Kill (on error)
Event-based
  Data Factory: No first-class support; some workarounds
  Oozie: Dataset polling
Performance
  Data Factory: Designed for big data workloads
  Oozie: Designed for big data workloads
Maturity
  Data Factory: Preview in Oct 2014, GA in August 2015
  Oozie: Developed since 2008, open sourced since 2010
Data Factory vs Oozie
Use Oozie when…
Pipelines are exclusively on the Hadoop stack
Use Data Factory when…
Pipelines use Azure services along with the Hadoop
stack
Overall Summary
Continue your Ignite learning path
Visit Channel 9 to access a wide range of Microsoft training
and event recordings: https://channel9.msdn.com/
Head to the TechNet Eval Centre to download trials of the latest
Microsoft products: http://Microsoft.com/en-us/evalcenter/
Visit Microsoft Virtual Academy for free online training:
https://www.microsoftvirtualacademy.com