Designing performant and scalable data
lakes using Azure Data Lake Storage
Rukmani Gopalan
@RukmaniGopalan
Agenda
• Data Lake Concepts and Patterns
• Designing your data lake
  • Set up
  • Organize data
  • Secure data
  • Manage cost
• Optimizing your data lake
  • Achieve the best performance and scale
Traditional on-prem analytics pipeline
[Diagram: operational databases and business/custom apps feed an enterprise data warehouse through ETL; further ETL steps load data marts, which serve reporting, analytics, and data mining.]
Modern data warehouse
[Diagram]
Sources: Logs (structured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
Stages: Ingest, Store, Prep & train, Model & serve
Services: Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, Power BI
Advanced Analytics
[Diagram]
Sources: Logs (structured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
Stages: Ingest, Store, Prep & train, Model & serve
Services: Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, Cosmos DB, Power BI, Apps
Realtime Analytics
[Diagram]
Sources: Logs (structured), Media (unstructured), Files (unstructured), Business/custom apps (structured), Sensors and IoT (unstructured)
Stages: Ingest, Store, Prep & train, Model & serve
Services: Message Broker, Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, Cosmos DB, Power BI, Apps
Azure Data Lake Storage
A “no-compromises” Data Lake: secure, performant, massively-scalable Data Lake storage that brings the cost and scale profile of object storage together with the performance and analytics feature set of data lake storage
SCALABLE
• No limits on data store size
• Global footprint (50 regions)
SECURE
• Support for fine-grained ACLs, protecting data at the file and folder level
• Multi-layered protection via at-rest Storage Service encryption and Azure Active Directory integration
FAST
• Atomic directory operations mean jobs complete faster
MANAGEABLE
• Automated Lifecycle Policy Management
• Object Level tiering
COST EFFECTIVE
• Object store pricing levels
• File system operations minimize transactions required for job completion
INTEGRATION READY
• Optimized for Spark and Hadoop Analytic Engines
• Tightly integrated with Azure end to end analytics solutions
Azure Data Lake Storage
Cloud Storage platform with first class file/folder semantics and support for multiple protocols and cost/performance tiers. Built on Object Storage.
Blob API (Object Data)
• Server Backups, Archive Storage, Semi-structured Data
ADLS API (Analytics Data)
• Hadoop File System, File and Folder Hierarchy, Granular ACLs, Atomic File Transactions
NFS v3 (preview) (File Data)
• HPC Data, Applications using NFS v3 against large sequentially read data sets
Common Blob Storage Foundation
• Object Tiering and Lifecycle Policy Management
• AAD Integration, RBAC, Storage Account Security
• HA/DR support through ZRS and RA-GRS
Data Lake Architecture - Summary
• Store large volumes of multi-structured data in its native format
• Defer the work to ‘schematize’ until value & requirements are known (schema-on-read)
• Extract high-value insights from the multi-structured data
• Build intelligent business scenarios based on the insights
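A minimal sketch of schema-on-read, assuming raw events are kept as JSON lines. The field names, the sample records, and the `read_with_schema` helper are hypothetical and only illustrate that the schema is applied by the reader, not at ingestion time.

```python
import json

# Raw events are stored exactly as they arrived (native format, no upfront schema).
RAW_LINES = [
    '{"sensor_id": "s1", "temp": "21.5", "ts": "2020-01-01T00:00:00"}',
    '{"sensor_id": "s2", "temp": "19.0", "extra": "ignored"}',
]

# Hypothetical schema chosen later by a consumer: field name -> converter.
SCHEMA = {"sensor_id": str, "temp": float}


def read_with_schema(lines, schema):
    """Apply a schema while reading; the stored data is never rewritten."""
    rows = []
    for line in lines:
        record = json.loads(line)
        # Keep only the fields this consumer cares about, cast to the wanted type.
        rows.append({name: cast(record[name]) for name, cast in schema.items()})
    return rows


rows = read_with_schema(RAW_LINES, SCHEMA)
```

A different consumer could read the same raw lines with a different schema, which is the point of deferring the schematization work.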
Designing Your Data Lake
• How do I set up my data lake?
• How do I organize my data?
• How do I secure my data?
• How do I manage cost?
How do I set up my data lake?
• Centralized vs Federated implementation
  • Data management and administration – done by a central team vs business units/domains
• Blueprint approach to federated data lakes with centralized governance
• Flexible – single or multiple storage accounts
Recommendations
• Isolate development vs pre-production and production data lakes
• Identify logical datasets, resources and management needs – this drives the centralized vs federated approach
  • Business unit boundaries
  • Regional boundaries
• Promote sharing data/insights across business units – beware of data silos
How do I organize my data?
• Azure Data Lake Storage hierarchy
  • Storage account: Azure resource that contains data objects
  • Container: organizes data within a storage account; contains a set of files/folders
  • Folder/directory: organizes data within a container; contains a set of files/folders; Hadoop file system friendly
  • File: holds data that can be read or written
Recommendations
• Organize data based on semantic structure as well as desired access control
• Separate the different zones into different accounts, containers or folders depending on business need
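One way to keep a folder layout consistent is to generate paths from a single helper. The sketch below is illustrative only: the zone names, the account and container names, and the date-based folder scheme are assumptions, not something the deck prescribes. The URI shape follows the `abfss://<container>@<account>.dfs.core.windows.net/<path>` form used by the ABFS driver.

```python
from datetime import date

# Assumed zone names; the deck only says data is separated into zones.
ZONES = ("raw", "enriched", "curated")


def dataset_path(account, container, zone, dataset, day):
    """Build an abfss:// URI for one day's folder of a dataset."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    # Zone first, then dataset, then date, so access control can be set per zone.
    folder = f"{zone}/{dataset}/{day:%Y/%m/%d}"
    return f"abfss://{container}@{account}.dfs.core.windows.net/{folder}/"


# Hypothetical account, container, and dataset names.
path = dataset_path("contosolake", "sales", "raw", "orders", date(2020, 3, 5))
```

Putting the zone at the top of the path is a design choice that lets a single ACL or role assignment cover a whole zone; a layout with one container or account per zone would work equally well.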
How do I secure my data?
PERIMETER/NETWORK
• Service Endpoints
• Private Endpoints
AUTHENTICATION
• Azure Active Directory (recommended)
• Shared Keys
• SAS tokens
AUTHORIZATION
• RBACs (coarse grained)
• POSIX ACLs (fine grained)
• Shared Key
DATA PROTECTION
• Encryption on-the-wire with HTTPS
• Encryption at Rest – Service and Customer Managed Keys
• Diagnostic Logs
A Little More on Authorization
• RBACs and ACLs integrated with AAD
  • RBACs – Storage account and container
  • ACLs – File and folders
• Other access mechanisms (not recommended)
  • Shared Keys – Disable if not needed (preview)
  • SAS Tokens – short lived access
Recommendations
• Use Service or Private endpoints for network security
• Use Azure Active Directory authentication to manage access
• Use RBACs for coarse grained access (at storage account or container level) and ACLs for fine grained access control (at file or folder level)
• Use AAD groups to simplify access management
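To make the fine-grained ACL idea concrete, here is a simplified sketch of evaluating a POSIX-style ACL string of the `scope:qualifier:permissions` form. It is an illustration, not the service's evaluation algorithm: owner entries, the mask, and `default:` entries are deliberately ignored, and the principal and group names are hypothetical.

```python
def parse_acl(acl_text):
    """Parse 'scope:qualifier:perms' entries, e.g. 'user::rwx,group:g1:r-x'."""
    entries = []
    for item in acl_text.split(","):
        scope, qualifier, perms = item.strip().split(":")
        entries.append((scope, qualifier, perms))
    return entries


def is_allowed(acl_text, principal_id, group_ids, wanted):
    """Simplified check: named user first, then named groups, then 'other'.

    Assumption: owner entries and the mask are ignored to keep the sketch short.
    """
    entries = parse_acl(acl_text)
    for scope, qualifier, perms in entries:
        if scope == "user" and qualifier == principal_id:
            return wanted in perms
    group_perms = [p for s, q, p in entries if s == "group" and q in group_ids]
    if group_perms:
        return any(wanted in p for p in group_perms)
    for scope, qualifier, perms in entries:
        if scope == "other":
            return wanted in perms
    return False


# Hypothetical ACL on a folder: one named user and one named group.
ACL = "user::rwx,group::r-x,group:analysts:r-x,user:alice:rw-,other::---"

alice_can_write = is_allowed(ACL, "alice", set(), "w")
analyst_can_write = is_allowed(ACL, "bob", {"analysts"}, "w")
stranger_can_read = is_allowed(ACL, "carol", set(), "r")
```

Granting access through the `analysts` group entry rather than per-user entries is what keeps the ACL short as people join or leave.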
How do I manage cost?
• Choose the right set of features for your business – cost vs benefit
• E.g. Redundancy option – criticality of geo-redundancy for production vs dev environments
Single region: LRS, ZRS
Dual region: GRS, RA-GRS, GZRS
How do I manage cost? (Continued…)
• Control data growth – minimize risk of data swamp
• Workspace data management
• Leverage lifecycle management policies
  • Tiering
  • Retention
Recommendations
• Choose the features of data lake storage based on business need
• Pre-prod and development environment needs might vary from production environment needs
• Leverage lifecycle management policies for better data management
• Move data to a cooler tier if not actively used – be aware of higher transaction costs and minimum retention policies
• Use retention policies to delete data that is not needed
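The tiering and retention recommendations can be sketched as a lifecycle management policy document. The field names below follow the Azure Storage lifecycle policy JSON format as I understand it and should be checked against the current documentation; the rule name, prefix, and day thresholds are assumptions chosen for illustration.

```python
import json


def lifecycle_rule(name, prefix, cool_after_days, delete_after_days):
    """Build one rule: tier to cool after N days, delete after M days."""
    if cool_after_days >= delete_after_days:
        raise ValueError("data must be tiered before it is deleted")
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            # Only block blobs under the given container/folder prefix are affected.
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
            "actions": {
                "baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                    "delete": {"daysAfterModificationGreaterThan": delete_after_days},
                }
            },
        },
    }


# Hypothetical rule: raw logs cool down after 30 days and are removed after a year.
policy = {"rules": [lifecycle_rule("age-out-raw-logs", "raw/logs/", 30, 365)]}
policy_json = json.dumps(policy, indent=2)
```

The thresholds matter for cost: cooler tiers carry higher transaction costs and minimum retention periods, so data that is still read frequently should not be tiered early.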
How do I optimize my data lake?
Goal
• Optimize for performance AND scale as the data and applications continue to grow on the data lake
The basic considerations are…
• Optimize for high throughput
  • Target getting at least a few MBs (higher the better) per transaction
• Optimize data access patterns
  • Reduce unnecessary scanning of files
  • Read only the data you need to read
  • Write efficiently so downstream applications that read the data benefit
File size and format
• Too many small files adversely impact performance
• Choosing the right format – better performance AND lower cost
  • Parquet – integrated optimizations with Azure Synapse Analytics and Azure Databricks
• Recommendations
  • Modify source to ingest larger files into the data lake
  • Coalesce and convert to right format (e.g. Parquet) in curation phase of your analytics pipelines
  • Realtime analytics pipelines (e.g. sensor data in IoT application) – microbatch for larger writes
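The micro-batching recommendation can be sketched as a small buffer that collects records and emits them as fewer, larger writes. The `MicroBatcher` class, the 4 MiB target, and the list used as a sink are all hypothetical; a real pipeline would replace the sink with an upload call and would usually also flush on a timer.

```python
class MicroBatcher:
    """Buffer small records and emit them as fewer, larger writes."""

    def __init__(self, target_bytes, sink):
        self.target_bytes = target_bytes
        self.sink = sink  # callable that receives one bytes payload per write
        self._buffer = []
        self._size = 0

    def add(self, record: bytes):
        self._buffer.append(record)
        self._size += len(record)
        # Write only once enough data has accumulated.
        if self._size >= self.target_bytes:
            self.flush()

    def flush(self):
        if self._buffer:
            self.sink(b"".join(self._buffer))
            self._buffer, self._size = [], 0


writes = []
batcher = MicroBatcher(target_bytes=4 * 1024 * 1024, sink=writes.append)
for _ in range(10_000):
    batcher.add(b"x" * 1024)  # 10,000 sensor-sized records of 1 KiB each
batcher.flush()  # write whatever is left at the end
```

Here 10,000 one-kilobyte records turn into three writes instead of 10,000, which is the "few MBs per transaction" target in practice.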
Partition your data for optimized access
Partition based on consumption patterns for optimized performance
[Example table columns: Sensor ID, Year, Temperature, Humidity, Pressure]
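A minimal sketch of partitioning by consumption pattern, assuming queries usually filter on sensor ID and year. The `key=value` folder naming is a common convention and an assumption here, as are the helper names; the point is that a reader can skip whole folders without opening any file.

```python
def partition_path(sensor_id, year):
    """Folder for one partition, using key=value folder names."""
    return f"sensor_id={sensor_id}/year={year}/"


def prune(paths, sensor_id=None, year=None):
    """Keep only partition folders that can match the query filters."""
    wanted = []
    for path in paths:
        parts = dict(seg.split("=", 1) for seg in path.strip("/").split("/"))
        if sensor_id is not None and parts["sensor_id"] != str(sensor_id):
            continue
        if year is not None and parts["year"] != str(year):
            continue
        wanted.append(path)
    return wanted


# Hypothetical lake with three sensors and three years of data.
ALL = [partition_path(s, y) for s in ("s1", "s2", "s3") for y in (2018, 2019, 2020)]
to_scan = prune(ALL, sensor_id="s2", year=2019)
```

With this layout a query for one sensor and one year touches one folder out of nine. Partitioning on a column that queries never filter on would add folders without reducing any scanning.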
Query Acceleration (Preview)
• Optimize access to structured data by filtering data directly in the storage service
• Single file predicate evaluation and column projection to optimize analytics engines
• E.g.: SELECT _1, _3 FROM BlobStorage WHERE _14 < 250 AND _16 > '2019-07-01'
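To show what predicate evaluation and column projection mean, the sketch below emulates the idea locally on a small CSV. It does not call the storage service; the `select` helper, the sample rows, and the four-column layout are assumptions. Columns are addressed by 1-based position, mirroring the `_1`, `_3` style of the example query.

```python
import csv
import io

# Hypothetical CSV: sensor id, reading count, temperature, date.
CSV_TEXT = """s1,120,21.5,2019-08-01
s2,300,19.0,2019-09-15
s3,200,22.1,2019-06-30
s4,90,20.4,2019-12-24
"""


def select(csv_text, columns, predicate):
    """Return only the requested columns of the rows that pass the predicate."""
    out = []
    for row in csv.reader(io.StringIO(csv_text)):
        if predicate(row):
            out.append([row[i - 1] for i in columns])  # 1-based, like _1 and _3
    return out


# Local equivalent of: SELECT _1, _3 FROM data WHERE _2 < 250 AND _4 > '2019-07-01'
result = select(
    CSV_TEXT,
    columns=(1, 3),
    predicate=lambda r: int(r[1]) < 250 and r[3] > "2019-07-01",
)
```

When the filter runs in the storage service instead of the client, only the matching rows and columns cross the network, which is where the saving comes from.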
Guidance from experts
Microsoft Docs
Explore overviews, tutorials,
code samples, and more.
Azure Data Lake Storage: https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-introduction
Azure Data Lake Storage Guidance Document: https://aka.ms/adls/guidancedoc
Azure Synapse Analytics: https://docs.microsoft.com/azure/synapse-analytics
© Copyright Microsoft Corporation. All rights reserved.

Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data Lake Storage


Editor's Notes

  • #16: An Azure Virtual Network (VNet) is a representation of your own network in the cloud. It is a logical isolation of the Azure cloud dedicated to your subscription. ... When you create a VNet, your services and VMs within your VNet can communicate directly and securely with each other in the cloud.
  • #22: Symptom: job latencies. Investigation: storage request throttling. Root cause: too many read operations to storage; a large number of row groups in the Databricks Delta Parquet files resulted in many read operations. Solution: adjusted the parquet.block.size config value to reduce the number of row groups per Parquet file. Result: job runtimes reduced by 3x.
  • #23: Symptom: job timeouts. Investigation: transaction and throughput peaks with a bursty load pattern; storage request throttling. Root cause: data cleanup during SLA job execution; a large number of partitions (tens of thousands). Solution: reduced the number of partitions to 250 and reduced the number of delete operations while the SLA job is running. Best practice: the partitioning strategy must align with your query pattern.
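The effect described in note #22 can be approximated with simple arithmetic. The sketch below assumes each row group fills roughly one block of `parquet.block.size` bytes; real row group sizes depend on encoding and compression, and the file and block sizes used here are invented for illustration, not taken from the case.

```python
import math

MIB = 1024 * 1024


def row_groups(file_bytes, block_bytes):
    """Approximate row groups per file if each group fills one block."""
    return math.ceil(file_bytes / block_bytes)


FILE_SIZE = 1024 * MIB  # hypothetical 1 GiB Parquet file

small_blocks = row_groups(FILE_SIZE, 8 * MIB)    # many small row groups
large_blocks = row_groups(FILE_SIZE, 128 * MIB)  # fewer, larger row groups
```

Fewer row groups per file means fewer separate read operations against storage for the same data, which is why raising the block size relieved the throttling in that case.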