Analyzing StackExchange data with Azure Data Lake
Tom Kerkhove
Azure Consultant
Nice to meet you
Tom Kerkhove
Azure Consultant @ Codit
Microsoft Azure MVP & Advisor
“Integration of Things” whitepaper (https://bit.ly/azure-iot)
blog.tomkerkhove.be
@TomKerkhove
tomkerkhove
Agenda
• Introduction to Azure Data Lake
• What is Azure Data Lake Store?
• What is Azure Data Lake Analytics?
NDC Minnesota - Analyzing StackExchange data with Azure Data Lake
Let’s go open-source, right?!
➔ Comes with a few challenges for C#/SQL professionals
➔ New languages to learn & maintain
➔ Rapidly evolving ecosystem
➔ Cluster management
➔ Typically Linux machines
Analyzing Big Data in Azure
Azure Data Lake Store
➔ WebHDFS compatible
➔ Any size
➔ Any format as-is
➔ Write-once-read-many
➔ Enterprise-grade security
➔ The big data store in Azure
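WebHDFS compatibility means a file-system operation maps to a plain REST URL. The sketch below only builds such a URL. It assumes the account-specific host pattern `<account>.azuredatalakestore.net` and the standard WebHDFS `/webhdfs/v1/<path>?op=...` layout; the account name and folder are hypothetical, and a real request would also need an Azure Active Directory bearer token.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str) -> str:
    """Build a WebHDFS-style URL for a path in a Data Lake Store account.

    Assumption: the host pattern <account>.azuredatalakestore.net.
    """
    clean_path = quote(path.strip("/"))
    query = urlencode({"op": op})
    return f"https://{account}.azuredatalakestore.net/webhdfs/v1/{clean_path}?{query}"


# Hypothetical account and folder; LISTSTATUS is the WebHDFS "list directory" operation.
print(webhdfs_url("mydatalake", "/stackexchange/raw", "LISTSTATUS"))
```

Any tool that already speaks WebHDFS can be pointed at such a URL, which is why the store can be used from Hadoop-style tooling.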
Data Warehousing vs Data Lakes
Characteristics
➔ Data Warehousing
  ➔ Structured data
  ➔ Defined set of schemas
  ➔ Requires Extract-Transform-Load (ETL) before storing
  ➔ Familiar to some of us
  ➔ Exploratory analysis is hard because the data has already been transformed
➔ Data Lakes
  ➔ Raw data (unstructured/semi-structured/structured)
  ➔ “Dump” all your data in the lake
  ➔ Data scientists will interpret data from the lake
  ➔ Without metadata, turns into a data swamp pretty fast
Martin Fowler on Data Lake & Data Warehouses: https://bit.ly/martin-fowler-data-lake
Security
➔ Role-based Access Control (RBAC)
➔ Grant users/groups access to folders/files (https://bit.ly/data-lake-store-acls)
➔ Firewall (off by default)
➔ Encryption at rest
  ➔ Keys managed by Microsoft
  ➔ Bring-your-own-key with Azure Key Vault
Pricing
➔ ~$0.032/GB stored per month
➔ Transaction costs
  ➔ ~$0.043 per 1M write transactions
  ➔ ~$0.0034 per 1M read transactions
  ➔ 1 transaction is a block of up to 128 kB
➔ Regular egress fees
➔ Monthly commitment packages
  ➔ Save up to 33%
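To get a feel for these numbers, here is a rough estimator built only from the figures on the slide. It is a sketch under assumptions: "128 kB" is read as 128,000 bytes, 1 GB as 10^9 bytes, every read or write is split into full blocks, and egress fees and commitment discounts are ignored. The workload at the bottom is hypothetical.

```python
import math

# Approximate prices from the slide (USD).
STORAGE_PER_GB_MONTH = 0.032
WRITE_PER_MILLION = 0.043
READ_PER_MILLION = 0.0034
TRANSACTION_BLOCK_BYTES = 128 * 1000  # assumption: "128 kB" = 128,000 bytes
GB = 10**9


def transactions_for(num_bytes: int) -> int:
    """Number of billable transactions, one per block of up to 128 kB."""
    return math.ceil(num_bytes / TRANSACTION_BLOCK_BYTES)


def monthly_cost(stored_gb: float, bytes_written: int, bytes_read: int) -> float:
    """Storage plus transaction cost for one month, excluding egress."""
    storage = stored_gb * STORAGE_PER_GB_MONTH
    writes = transactions_for(bytes_written) / 1_000_000 * WRITE_PER_MILLION
    reads = transactions_for(bytes_read) / 1_000_000 * READ_PER_MILLION
    return storage + writes + reads


# Hypothetical workload: store 150 GB, write it once, read it ten times.
print(round(monthly_cost(150, 150 * GB, 10 * 150 * GB), 2))
```

For a data set of this size, storage dominates the bill and the transaction costs are small by comparison.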
Azure Data Lake Store vs Blob Storage
➔ No Limitations: Store whatever you want in any format
➔ Security: Built-in Azure Active Directory support
➔ Pricing: More expensive than Storage GRS
➔ Redundancy: It’s there but no control over it
➔ Built for Scale: Optimized for high-scale reads
➔ Integration: With Data Factory, Data Catalog & HDInsight
Full comparison on https://bit.ly/adls-vs-storage
Demo – Data Lake Store
Meet StackExchange
➔ Over 280 websites
➔ 150+ GB of open-source data
➔ Different kinds of data
➔ Posts
➔ Users
➔ Votes
➔ ...
➔ A big data sample data set
What Are We Going To Do?
Acquiring The Data
• Download the original data set
Moving The Data
• Upload data set to Azure
• Determine what service to use
Aggregating The Data
• Merging data from each site into one file
• Conversion from XML to CSV
Analyzing The Data
• Run business logic on it
• Attempt to gain knowledge from it
Visualizing The Data
• Visualize what we’ve learned
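The aggregation step (merge every site into one file, convert XML to CSV) can be sketched locally in a few lines. This is a Python illustration, not the approach used in the talk. It assumes each site's dump stores records as `<row ... />` elements with attributes; the site names, columns and sample rows are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumption: records are <row .../> elements whose attributes are the columns.
SAMPLE_DUMPS = {
    "cooking": '<users><row Id="1" DisplayName="Ann" Reputation="101" /></users>',
    "travel": '<users><row Id="7" DisplayName="Bob" Reputation="42" />'
              '<row Id="8" DisplayName="Cy" /></users>',
}
COLUMNS = ["Site", "Id", "DisplayName", "Reputation"]


def merge_to_csv(dumps: dict) -> str:
    """Merge the <row> elements of every site into one CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for site, xml_text in dumps.items():
        for row in ET.fromstring(xml_text).iter("row"):
            record = {"Site": site}
            # Attributes missing from a row become empty cells.
            record.update({col: row.get(col, "") for col in COLUMNS[1:]})
            writer.writerow(record)
    return buffer.getvalue()


print(merge_to_csv(SAMPLE_DUMPS))
```

Adding a `Site` column while merging keeps the origin of every row, so later queries can still group or filter per site.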
How is it set up?
Azure Data Lake Analytics
➔ Run analytics jobs on managed clusters
➔ No maintenance ~ Serverless
➔ Written in U-SQL
➔ SQL Syntax
➔ Extensibility in C#
➔ Easily scaled with Analytics Units
➔ Pay for processing time only
Data Sources
➔ Built-in partitioned tables
➔ Query data where it lives
➔ No need to prepare data
➔ One query that runs on multiple data stores
➔ Use the correct data store for the job
Writing U-SQL scripts
Extract from a data source by using built-in or custom extractors.
Transform / analyse the data using SQL syntax, in-line C# or C# method calls.
Output the result to a data source by using built-in or custom outputters.
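The three stages have the same shape in any language. The sketch below mirrors them in Python rather than U-SQL, purely to show how the pieces fit together: a pluggable reader, a set-based transformation, and a pluggable writer. The function names and sample data are hypothetical and do not correspond to U-SQL built-ins.

```python
import csv
import io
from collections import Counter


def extract_csv(text):
    """Extract: turn raw text into rows (dicts)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: count posts per site, like SELECT Site, COUNT(*) ... GROUP BY Site."""
    counts = Counter(row["Site"] for row in rows)
    return [{"Site": site, "Posts": n} for site, n in sorted(counts.items())]


def output_csv(rows, columns):
    """Output: serialize the resulting rows back to text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


RAW = "Site,PostId\ncooking,1\ntravel,2\ncooking,3\n"
result = output_csv(transform(extract_csv(RAW)), ["Site", "Posts"])
print(result)
```

Swapping `extract_csv` for another reader changes the input format without touching the transformation, which is the role custom extractors play in a script.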
Extensibility
➔ C# Expressions
➔ User-Defined Functions (UDF)
➔ User-Defined Operators (UDO)
➔ User-Defined Aggregators (UDAGG)
User-Defined Operators (UDO)
➔ User-Defined Extractors
➔ User-Defined Processors
  ➔ Take one row and produce one row
  ➔ Pass-through versus transforming
➔ User-Defined Reducers
  ➔ Take n rows and produce 1 row
➔ User-Defined Outputters
➔ User-Defined Appliers
  ➔ Take one row and produce 0 to n rows
  ➔ Used with OUTER/CROSS APPLY
➔ User-Defined Combiners
  ➔ Combines rowsets (like a user-defined join)
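The row cardinalities listed above are easier to see with a toy example. The following is a Python analogy only; real UDOs are written in C# against the U-SQL extensibility interfaces, which are not shown here. All function names and sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def processor(row):
    """Processor: one row in, one row out (here a transforming one)."""
    return {**row, "Title": row["Title"].strip().lower()}


def applier(row):
    """Applier: one row in, zero to n rows out (one per tag)."""
    for tag in row["Tags"]:
        yield {"PostId": row["PostId"], "Tag": tag}


def reducer(key, rows):
    """Reducer: n rows sharing a key in, one row out."""
    return {"Tag": key, "Posts": sum(1 for _ in rows)}


posts = [
    {"PostId": 1, "Title": " Hello ", "Tags": ["azure", "u-sql"]},
    {"PostId": 2, "Title": "World", "Tags": []},
    {"PostId": 3, "Title": "Lakes", "Tags": ["azure"]},
]

processed = [processor(p) for p in posts]
applied = [out for p in processed for out in applier(p)]
applied.sort(key=itemgetter("Tag"))  # groupby needs rows sorted by the key
reduced = [reducer(tag, group) for tag, group in groupby(applied, key=itemgetter("Tag"))]
print(reduced)
```

Post 2 has no tags, so the applier emits nothing for it; that is the "0 to n rows" case, while the reducer collapses each tag group into a single row.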
Metadata Model
U-SQL Batch Job Execution Lifetime
Michael Rys on “Tuning & Optimizing U-SQL”: https://bit.ly/tuning-optimizing-u-sql
Job States
Security
➔ Role-based Access Control (RBAC)
➔ Firewall (off by default)
➔ Access control on service catalog
➔ Access control on a per-database level
Resource Management
➔ Account-level limitations
  ➔ Maximum number of AUs
  ➔ Maximum number of concurrent jobs
  ➔ Days to retain queries
➔ Job-level limitations
  ➔ Maximum number of AUs
  ➔ Maximum priority
  ➔ Granted per user and/or group
Demo – Data Lake Analytics
Azure Data Lake tools for Visual Studio
➔ Store Explorer
➔ Browse store
➔ Download complete / subset of file
➔ Preview
➔ Only in Visual Studio
➔ Job Visualizer
➔ Determine bottlenecks by using heatmaps
➔ Play back jobs based on telemetry
➔ Query optimization
➔ Job Profiler
Azure Data Lake tools for Visual Studio (Code)
➔ Integration with source control
➔ Unit testing extensibility
➔ Local execution
➔ Simulate Data Lake Store
➔ Run & debug jobs
Pricing
➔ Billed for processing time, not per job
➔ Billed per second
➔ $1.687 per hour per Analytics Unit
  ➔ ~$0.028 per minute
➔ Monthly commitment packages
  ➔ Save up to 74%
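Since billing is per second and per Analytics Unit, the cost of a job is simple arithmetic. A minimal sketch, assuming the hourly rate from the slide, that every allocated AU is billed for the full duration of the job, and no commitment discount; the job at the bottom is hypothetical.

```python
PRICE_PER_AU_HOUR = 1.687  # USD, approximate figure from the slide


def job_cost(analytics_units: int, seconds: float) -> float:
    """Cost of a job billed per second for every allocated Analytics Unit."""
    return analytics_units * seconds * PRICE_PER_AU_HOUR / 3600


# Hypothetical job: 10 AUs allocated for 5 minutes.
print(round(job_cost(10, 5 * 60), 4))
```

Doubling the AUs only pays off if the job actually finishes in roughly half the time; otherwise the extra units are billed while idle.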
Operations
Data Lake Store
➔ Available graphs: Storage Utilization, Read & Write, Ingress & Egress
➔ Available metrics: Data Read & Write, Read & Write Requests, Total Storage
➔ Support for alerts: Yes
➔ Support for audit logs: Yes
➔ Support for request logs: Yes
Data Lake Analytics
➔ Available graphs: Job status, Used # of AU time
➔ Available metrics: Job status, Used # of AU time
➔ Support for alerts: Built-in & custom Log Analytics queries (requires audit logs)
➔ Support for audit logs: Yes
➔ Support for request logs: Yes
Integration with Azure Services
➔ Integrate with your data pipelines in Azure Data Factory
  ➔ Move data from Azure Data Lake Store to other stores
  ➔ Move data to Azure Data Lake Store
  ➔ Run U-SQL jobs within a pipeline
➔ Integration with Azure Data Catalog
  ➔ Register your Azure Data Lake Store assets
Learn more!
➔ Azure Data Architecture Guide (https://docs.microsoft.com/en-us/azure/architecture/data-guide/)
➔ “Mastering Azure Analytics” by Zoiner Tejada (https://bit.ly/mastering-azure-analytics)
➔ MVA “Introducing Azure Data Lake” (https://bit.ly/intro-to-azure-data-lake)
➔ Azure Data Lake GitHub Repo (https://azure.github.io/AzureDataLake/)
➔ U-SQL Documentation (https://usql.io)
Summary
➔ Big Data is not just hype, so get ready
➔ Azure Data Lake Store
  ➔ Analyse today & explore tomorrow
  ➔ Beware of the data swamps
➔ Data Lake Analytics
  ➔ Serverless
  ➔ Re-use existing skills
  ➔ Pay for what we use
➔ Big Data in Azure? Use Azure Data Lake!

Editor's Notes

  • #8: HDI – managed cluster service, open-source technology, runs on Windows or Linux. Store – unlimited storage, WebHDFS. Analytics – managed job service, U-SQL batch processing. Based on MSFT Cosmos (Cortana, Bing, Xbox Live, etc.).
  • #12: Analogy with fishing – go fishing in the lake, put it in your warehouse. When the lake becomes a swamp, the fish die.
  • #15: No Limitations – Store is unlimited; Storage is limited to 100 accounts in a subscription, 500 TB each. Security – AAD vs SAS or name/key auth. Pricing – ADLS is more expensive. Redundancy – no control over redundancy. Built for Scale – optimized for high reads and analytics, scales with the reads, high volume of small writes, real-time analytics.