Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data Lake Storage

Designing performant and scalable data lakes using Azure Data Lake Storage
Rukmani Gopalan
@RukmaniGopalan

Agenda
• Data Lake Concepts and Patterns
• Designing your data lake
  • Set up
  • Organize data
  • Secure data
  • Manage cost
• Optimizing your data lake
  • Achieve the best performance and scale

Traditional on-prem analytics pipeline
[Diagram: operational databases and business/custom apps feed, through ETL, an enterprise data warehouse; further ETL populates data marts used for reporting, analytics, and data mining.]

Modern data warehouse
[Diagram: pipeline stages Ingest, Store, Prep & train, and Model & serve. Sources: logs (structured), media (unstructured), files (unstructured), and one unlabeled structured source. Services: Azure Data Factory, Azure Data Lake Storage, Azure Databricks, Azure Synapse Analytics, Power BI.]

Advanced Analytics
[Diagram: sources are logs (structured) and one unlabeled structured source, landing in a Store stage. Services: Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Cosmos DB, Power BI, Apps.]

Realtime Analytics
[Diagram: sources are logs (structured), one unlabeled structured source, and sensors and IoT (unstructured), with a Message Broker and a Store stage. Services: Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Cosmos DB, Power BI, Apps.]

A “no-compromises” Data Lake: secure, performant, massively-scalable Data Lake storage that brings the cost and
scale profile of object storage together with the performance and analytics feature set of data lake storage
Azure Data Lake Storage
SCALABLE
• No limits on data store size
• Global footprint (50 regions)
SECURE
• Support for fine-grained ACLs, protecting data at the file and folder level
• Multi-layered protection via at-rest Storage Service encryption and Azure Active Directory integration
FAST
• Atomic directory operations mean jobs complete faster
COST EFFECTIVE
• Object store pricing levels
• File system operations minimize transactions required for job completion
MANAGEABLE
• Automated Lifecycle Policy Management
• Object Level tiering
INTEGRATION READY
• Optimized for Spark and Hadoop Analytic Engines
• Tightly integrated with Azure end to end analytics solutions

Cloud Storage platform with first class file/folder semantics and support for multiple
protocols and cost/performance tiers. Built on Object Storage.
• Blob API – Object Data: Server Backups, Archive Storage, Semi-structured Data
• ADLS API – Analytics Data: Hadoop File System, File and Folder Hierarchy, Granular ACLs, Atomic File Transactions
• NFS v3 (preview) – File Data: HPC Data, Applications using NFS v3 against large sequentially read data sets
Common Blob Storage Foundation
• Object Tiering and Lifecycle Policy Management
• AAD Integration, RBAC, Storage Account Security
• HA/DR support through ZRS and RA-GRS

Data Lake Architecture - Summary
Store large volume of multi-structured data in its native format
Defer work to ‘schematize’ after value & requirements are known
(Schema-on-read)
Extract high value insights from the multi-structured data
Build intelligent business scenarios based on the insights

Designing Your Data Lake
• How do I set up my data lake?
• How do I organize my data?
• How do I secure my data?
• How do I manage cost?

How do I set up my data lake?
• Centralized vs Federated implementation
• Data management and administration – done by a central team vs business units/domains
• Blueprint approach to federated data lakes with centralized governance
• Flexible – single or multiple storage accounts

Recommendations
• Isolate development, pre-production and production data lakes from one another
• Identify logical datasets, resources and management needs – this drives the centralized vs federated approach
  • Business unit boundaries
  • Regional boundaries
• Promote sharing data/insights across business units – beware of data silos

How do I organize my data?
• Azure Data Lake Storage hierarchy
  • Storage account: Azure resource that contains data objects
  • Container: organizes data within a storage account; contains a set of files/folders
  • Folder/directory: organizes data within a container; contains a set of files/folders; Hadoop file system friendly
  • File: holds data that can be read or written

Recommendations
• Organize data based on semantic structure as well as desired access control
• Separate the different zones into different accounts, containers or folders depending on business need
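As a minimal sketch of the hierarchy above, the following Python builds an ABFS URI (`abfss://container@account.dfs.core.windows.net/path`) for a file inside a zone folder. The account, container, and zone names are hypothetical, and one top-level folder per zone is only one of the layouts the recommendation allows; zones could equally be separate containers or accounts.

```python
from posixpath import join

# Hypothetical zone names; the deck does not prescribe specific zones.
ZONES = ("raw", "enriched", "curated")


def adls_uri(account: str, container: str, zone: str, *parts: str) -> str:
    """Build an abfss:// URI: storage account > container > zone folder > path."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    path = join(zone, *parts)
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


# Hypothetical account and container names, for illustration only.
print(adls_uri("contosolake", "analytics", "raw", "sales", "orders.parquet"))
```

Keeping the zone as the first path segment means access control can be granted once per zone folder rather than per dataset.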

How do I secure my data?
PERIMETER/NETWORK
• Service Endpoints
• Private Endpoints
AUTHENTICATION
• Azure Active Directory (recommended)
• Shared Keys
• SAS tokens
AUTHORIZATION
• RBACs (coarse grained)
• POSIX ACLs (fine grained)
• Shared Key
DATA PROTECTION
• Encryption on-the-wire with HTTPS
• Encryption at Rest
  • Service and Customer Managed Keys
• Diagnostic Logs

A Little More on Authorization
• RBACs and ACLs integrated with AAD
  • RBACs – Storage account and container
  • ACLs – File and folders
• Other access mechanisms (not recommended)
  • Shared Keys – Disable if not needed (preview)
  • SAS Tokens – short lived access

Recommendations
• Use Service or Private endpoints for network security
• Use Azure Active Directory authentication to manage access
• Use RBACs for coarse grained access (at storage account or container level) and ACLs for fine grained access control (at file or folder level)
• Use AAD groups to simplify access management
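The fine-grained side of this recommendation can be illustrated with the short-form POSIX ACL text that ADLS uses (comma-separated `type:id:permissions` entries). The sketch below only builds and validates such a string; it does not call any Azure API. The group object ID is a placeholder, and granting read and execute to a single AAD group is an assumed policy, not one stated in the deck.

```python
def acl_entry(kind: str, perms: str, principal_id: str = "", default: bool = False) -> str:
    """Format one ACL entry, e.g. 'group:<object-id>:r-x'."""
    if kind not in ("user", "group", "other", "mask"):
        raise ValueError(f"unsupported entry type: {kind!r}")
    # Permissions must be exactly three characters: r or -, w or -, x or -.
    if len(perms) != 3 or any(c not in ok for c, ok in zip(perms, ("r-", "w-", "x-"))):
        raise ValueError(f"invalid permission string: {perms!r}")
    scope = "default:" if default else ""
    return f"{scope}{kind}:{principal_id}:{perms}"


def folder_acl(reader_group_id: str) -> str:
    """Owner gets full access, one AAD group gets read/execute, others get nothing."""
    return ",".join([
        acl_entry("user", "rwx"),
        acl_entry("group", "r-x"),
        acl_entry("other", "---"),
        acl_entry("group", "r-x", reader_group_id),
    ])


# Placeholder object ID of a hypothetical AAD group.
print(folder_acl("00000000-0000-0000-0000-000000000000"))
```

Granting the entry to a group rather than to individual users means membership changes do not require rewriting ACLs on every file and folder.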

How do I manage cost?
• Choose the right set of features for your business – cost vs benefit
• E.g. Redundancy option – criticality of geo-redundancy for production vs dev environments
  • Single region: LRS, ZRS
  • Dual region: (RA-)GRS, GZRS

How do I manage cost? (Continued…)
• Control data growth – minimize risk of data swamp
  • Workspace data management
• Leverage lifecycle management policies
  • Tiering
  • Retention

Recommendations
• Choose the features of data lake storage based on business need
• Pre-prod and development environment needs might vary from production environment needs
• Leverage lifecycle management policies for better data management
• Move data to a cooler tier if not actively used – be aware of higher transaction costs and minimum retention policies
• Use retention policies to delete data that is not needed
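The tiering and retention recommendations map onto a lifecycle management policy, which Azure Storage accepts as a JSON document. The sketch below assembles one rule as a Python dictionary and prints the JSON. The rule name, path prefix, and the 30-day and 365-day thresholds are illustrative assumptions, not values from the deck.

```python
import json


def lifecycle_rule(name: str, prefix: str, cool_after_days: int, delete_after_days: int) -> dict:
    """One rule: move block blobs under `prefix` to the cool tier, then delete them."""
    if cool_after_days >= delete_after_days:
        raise ValueError("data must be tiered to cool before it is deleted")
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
            "actions": {
                "baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                    "delete": {"daysAfterModificationGreaterThan": delete_after_days},
                }
            },
        },
    }


# Hypothetical container/folder prefix and thresholds.
policy = {"rules": [lifecycle_rule("age-out-raw-logs", "analytics/raw/logs/", 30, 365)]}
print(json.dumps(policy, indent=2))
```

Scoping each rule with a prefix lets different zones of the lake age out on different schedules.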

How do I optimize my data lake?
Goal
• Optimize for performance AND scale as the data and applications continue to grow on the data lake
The basic considerations are…
• Optimize for high throughput
  • Target getting at least a few MBs (the higher the better) per transaction
• Optimize data access patterns
  • Reduce unnecessary scanning of files
  • Read only the data you need to read
  • Write efficiently so downstream applications that read the data benefit
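To make the "few MBs per transaction" target concrete, the arithmetic below counts how many read transactions a fixed volume of data requires at different transfer sizes. The 100 GiB volume and the three transfer sizes are assumptions chosen for illustration only.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def transactions_needed(total_bytes: int, bytes_per_transaction: int) -> int:
    """Number of transactions to move total_bytes, rounding up (integer ceiling)."""
    return -(-total_bytes // bytes_per_transaction)


total = 100 * GIB  # assumed data volume
for size in (64 * KIB, 4 * MIB, 16 * MIB):
    n = transactions_needed(total, size)
    print(f"{size // KIB:>6} KiB per transaction -> {n:,} transactions")
```

Moving from 64 KiB to 4 MiB per transaction cuts the transaction count by a factor of 64 for the same data, which is why larger reads and writes help both throughput and transaction cost.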

File size and format
• Too many small files adversely impact performance
• Choosing the right format – better performance AND lower cost
  • Parquet – integrated optimizations with Azure Synapse Analytics and Azure Databricks
• Recommendations
  • Modify source to ingest larger files into the data lake
  • Coalesce and convert to right format (e.g. Parquet) in curation phase of your analytics pipelines
  • Realtime analytics pipelines (e.g. sensor data in IoT application) – microbatch for larger writes
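The coalescing step in the curation phase can be sketched as a planning problem: group many small files into batches that approach a target output size. The greedy grouping below is one simple approach, not the method of any particular engine, and the file names and sizes are made up.

```python
def plan_compaction(file_sizes: dict, target_bytes: int) -> list:
    """Greedily group files (in name order) into batches of roughly target_bytes."""
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical small files with sizes in MB, and a 100 MB target per output file.
small_files = {
    "part-0001.json": 40,
    "part-0002.json": 70,
    "part-0003.json": 30,
    "part-0004.json": 120,
    "part-0005.json": 10,
}
for batch in plan_compaction(small_files, target_bytes=100):
    print(batch)
```

Each batch would then be read and rewritten as a single larger file, for example in Parquet. A file already above the target ends up in a batch of its own.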

Partition your data for optimized access
Partition based on consumption patterns for optimized performance
[Diagram: example sensor dataset with fields Sensor ID, Year, Temperature, Humidity, Pressure]
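A minimal sketch of partitioning by consumption pattern, assuming queries filter on sensor ID and year: data is laid out in `key=value` folders so a reader can skip folders that cannot match. The Hive-style folder naming and the sample sensor IDs are assumptions; the slide only names the fields.

```python
def partition_path(sensor_id: str, year: int) -> str:
    """Folder path for one partition, using key=value segments."""
    return f"sensor_id={sensor_id}/year={year}"


def prune(paths, sensor_id=None, year=None):
    """Keep only the partition folders that can contain matching rows."""
    wanted = {}
    if sensor_id is not None:
        wanted["sensor_id"] = str(sensor_id)
    if year is not None:
        wanted["year"] = str(year)
    selected = []
    for path in paths:
        # Parse 'key=value' segments back into a dictionary.
        keys = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if all(keys.get(k) == v for k, v in wanted.items()):
            selected.append(path)
    return selected


# Hypothetical sensors and years.
all_partitions = [partition_path(s, y) for s in ("s1", "s2") for y in (2019, 2020)]
print(prune(all_partitions, sensor_id="s1"))
print(prune(all_partitions, year=2020))
```

A query for one sensor touches only that sensor's folders, so the remaining partitions are never listed or scanned.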

Query Acceleration (Preview)
• Optimize access to structured data by filtering data directly in the storage service
• Single file predicate evaluation and column projection to optimize analytics engines
• E.g.:
  SELECT _1, _3 FROM BlobStorage WHERE _14 < 250 AND _16 > '2019-07-01'
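To show what the example query asks the storage service to do, the sketch below emulates it locally over CSV text: `_N` refers to the N-th column (1-based), the `WHERE` clause is the predicate, and the `SELECT` list is the column projection. This is a local illustration only, not the service or its API, and the sample rows are invented.

```python
import csv
import io


def emulate_query(csv_text: str) -> list:
    """Local stand-in for: SELECT _1, _3 FROM BlobStorage WHERE _14 < 250 AND _16 > '2019-07-01'"""
    results = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Predicate: column 14 is numeric, column 16 is an ISO date string,
        # so plain string comparison orders the dates correctly.
        if int(row[13]) < 250 and row[15] > "2019-07-01":
            # Projection: return only columns 1 and 3.
            results.append((row[0], row[2]))
    return results


def make_row(c1: str, c3: str, c14: int, c16: str) -> str:
    """Build a 16-column CSV line with filler in the columns the query ignores."""
    row = ["x"] * 16
    row[0], row[2], row[13], row[15] = c1, c3, str(c14), c16
    return ",".join(row)


sample = "\n".join([
    make_row("a", "A", 100, "2019-08-15"),  # matches both conditions
    make_row("b", "B", 300, "2019-08-15"),  # fails _14 < 250
    make_row("c", "C", 100, "2019-06-30"),  # fails _16 > '2019-07-01'
])
print(emulate_query(sample))
```

With query acceleration the filtering and projection happen inside the storage service, so only the matching columns of matching rows travel to the analytics engine.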

Guidance from experts
Microsoft Docs
Explore overviews, tutorials, code samples, and more.
Azure Data Lake Storage: https://docs.microsoft.com/azure/storage/blobs/data-lake-storage-introduction
Azure Data Lake Storage Guidance Document: https://aka.ms/adls/guidancedoc
Azure Synapse Analytics: https://docs.microsoft.com/azure/synapse-analytics
