2. Design Presentation: Data Loading & Data Lake Organization
Day 1
Welcome, Objectives
Break
Break
PoC Challenge 1
Long Break
Design Presentation: DW Optimization
PoC Challenge 2
Please refer to the descriptions below.
Types of activities:
Attendees will participate in two different types of activities:
Location of activities:
Each activity will take place in the following location:
A look into the Day 1 agenda:
Keynote
Presentations
Challenges
All-up Session: Teams bridge in calendar invite
Independent working time: No meeting, working in Spektra labs
Design Presentation: Data Transformations
Continue PoC Challenge 1
Break
Continue PoC Challenge 2
Break
3. Ciprian Jichici
Chief Data Scientist
[email protected]
Ciprian Jichici is the Chief Data Scientist of
Solliance, one of the top worldwide Microsoft AI
partners.
He is recognized internationally as a Microsoft
Regional Director and a Microsoft Most Valuable
Professional for Artificial Intelligence and Quantum
Computing.
Cloud Computing, Artificial Intelligence, and
Machine Learning are among the key areas of his
expertise, built over 20+ years in IT.
Ciprian is also very passionate about quantum
physics and, consequently, about quantum
computing.
linkedin.com/in/ciprianjichici/
6. Azure Synapse Analytics
Limitless analytics service with unmatched time to insight
Synapse Analytics
Platform
Azure
Data Lake Storage
Common Data Model
Enterprise Security
Optimized for Analytics
Data lake integrated and
Common Data Model aware
METASTORE
SECURITY
MANAGEMENT
MONITORING
Integrated platform services
for management, security,
monitoring, and the metastore
DATA INTEGRATION
SQL
Analytics Runtimes
Integrated analytics runtimes
available in dedicated and serverless form factors
Synapse SQL offering T-SQL for
batch, streaming and interactive
processing
Apache Spark for big data
processing with Python, Scala
and .NET
DEDICATED SERVERLESS
Form Factors
SQL
Languages
Python .NET Java Scala
Multiple languages suited to
different analytics workloads
Experience Synapse Analytics Studio
SaaS developer experiences for
code free and code first
Artificial Intelligence / Machine Learning / Internet of
Things
Intelligent Apps / Business Intelligence
Designed for analytics workloads
at any scale
METASTORE
SECURITY
MANAGEMENT
MONITORING
9. Azure Services
Command and
Control
L E G E N D
Data
Components of Orchestration
Trigger
On demand
Schedule
Data Window
Event
Pipeline
Activity
foreach (…)
Activity
Activity Activity
Activity
Self-hosted
Integration Runtime
On-prem
Apps & Data
Azure
Integration Runtime
Linked
Service
Synapse Pipelines share a codebase with Azure Data Factory
10. Pipelines
Create pipelines to ingest, transform, and load data with 90+ built-in connectors.
A wide range of activities is available for pipelines to perform.
11. Pipelines
Overview
• Provide the ability to load data from a storage
account to the desired linked service.
• Load data by manual execution of a
pipeline or by orchestration.
Benefits
• Supports common loading patterns.
• Fully parallel loading into data lake or
SQL tables.
• Graphical development experience.
12. Integration runtimes
Overview
Integration runtimes are the compute infrastructure
used by Pipelines to provide the data integration
capabilities across different network environments. An
integration runtime provides the bridge between the
activity and linked services.
Benefits
• Offers Azure Integration Runtime or Self-Hosted
Integration Runtime
• Azure Integration Runtime – provides fully managed,
serverless compute in Azure
• Self-Hosted Integration Runtime – uses compute
resources on an on-premises machine or a VM inside a
private network
13. Linked services
Overview
Linked services define the connection information
needed to connect to external resources.
Benefits
• Offers 90+ pre-built connectors
• Easy cross platform data migration
• Represents data store or compute resources
14. Develop Hub - Data Flows
Data flows are a visual way of specifying how to transform data.
Provides a code-free experience.
15. Data Flow Capabilities
Handle upserts, updates, and deletes on SQL sinks
Add new partition methods
Add schema drift support
Add file handling (move files after read, write files to file names described in rows, etc.)
New inventory of functions (e.g., hash functions for row comparison)
Commonly used ETL patterns (sequence generator / lookup transformation / SCD…)
Data lineage – capturing sink column lineage & impact analysis (invaluable for enterprise deployments)
Implement commonly used ETL patterns as templates (SCD Type 1, Type 2, Data Vault)
16. Triggers
Overview
Triggers represent a unit of processing that
determines when a pipeline execution needs to
be kicked off.
Data Integration offers three trigger types:
1. Schedule – fires on a schedule defined by a start date, recurrence, and optional end date
2. Event – fires on a specified storage event
3. Tumbling window – fires at a periodic time interval from a specified start date, while retaining state
It also provides the ability to monitor pipeline runs and control trigger execution.
17. Datasets
Orchestration datasets describe data that is persisted.
Once a dataset is defined, it can be used in pipelines as a source of data or as a sink of data.
18. Connectors by category
Azure (15): Blob storage, Cosmos DB - SQL API, Cosmos DB - MongoDB API, Data Explorer, Data Lake Storage Gen1, Data Lake Storage Gen2, Database for MariaDB, Database for MySQL, Database for PostgreSQL, File Storage, SQL Database, SQL Database MI, SQL Data Warehouse, Search index, Table storage
Database & DW (26): Amazon Redshift, DB2, Drill, Google BigQuery, Greenplum, HBase, Hive, Apache Impala, Informix, MariaDB, Microsoft Access, MySQL, Netezza, Oracle, Phoenix, PostgreSQL, Presto, SAP BW Open Hub, SAP BW via MDX, SAP HANA, SAP table, Spark, SQL Server, Sybase, Teradata, Vertica
File Storage (6): Amazon S3, File system, FTP, Google Cloud Storage, HDFS, SFTP
File Formats (6): AVRO, Binary, Delimited Text, JSON, ORC, Parquet
NoSQL (3): Cassandra, Couchbase, MongoDB
Services and App (28): Amazon MWS, CDS for Apps, Concur, Dynamics 365, Dynamics AX, Dynamics CRM, Google AdWords, HubSpot, Jira, Magento, Marketo, Office 365, Oracle Eloqua, Oracle Responsys, Oracle Service Cloud, PayPal, QuickBooks, Salesforce, SF Service Cloud, SF Marketing Cloud, SAP C4C, SAP ECC, ServiceNow, Shopify, Square, Web table, Xero, Zoho
Generic (4): Generic HTTP, Generic OData, Generic ODBC, Generic REST
90+ Connectors out of the box
19. Pop Quiz
Which one of these is NOT a
component of a Synapse pipeline?
A)
I.R.
C)
Table
D)
Activity
B)
Linked
Service
20. Pop Quiz
Which one of these is NOT a
component of a Synapse pipeline?
A)
I.R.
C)
Table
D)
Activity
B)
Linked
Service
22. Overview
Copies data from source to destination
Benefits
• Retrieves data from all files in the folder and all its subfolders.
• Supports multiple locations from the same storage account, separated by commas.
• Supports Azure Data Lake Storage (ADLS) Gen 2 and
Azure Blob Storage.
• Supports CSV, PARQUET, ORC file formats
COPY command
COPY INTO test_1
FROM 'https://XYZ.blob.core.windows.net/customerdatasets/test_1.txt'
WITH (
FILE_TYPE = 'CSV',
CREDENTIAL=(IDENTITY= 'Shared Access Signature',
SECRET='<Your_SAS_Token>'),
FIELDQUOTE = '"',
FIELDTERMINATOR=';',
ROWTERMINATOR='0X0A',
ENCODING = 'UTF8',
DATEFORMAT = 'ymd',
MAXERRORS = 10,
ERRORFILE = '/errorsfolder/', -- path starting from the storage container
IDENTITY_INSERT = 'OFF'
)
COPY INTO test_parquet
FROM 'https://XYZ.blob.core.windows.net/customerdatasets/test.parquet'
WITH (
FILE_FORMAT = myFileFormat,
CREDENTIAL=(IDENTITY= 'Shared Access Signature',
SECRET='<Your_SAS_Token>')
)
23. Create External Table As Select (Polybase)
Overview
Creates an external table and then exports, in parallel, the results of
the SELECT statement to the external location. Querying an external
table imports data into the database only for the duration of the query.
Steps:
1. Create Master Key
2. Create Credentials
3. Create External Data Source
4. Create External Data Format
5. Create External Table
-- Create a database master key if one does not already exist
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me!nfo'
;
-- Create a database scoped credential with Azure storage account key as the secret.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
IDENTITY = '<my_account>'
, SECRET = '<azure_storage_account_key>'
;
-- Create an external data source with CREDENTIAL option.
CREATE EXTERNAL DATA SOURCE MyAzureStorage
WITH
( LOCATION = 'wasbs://[email protected]/'
, CREDENTIAL = AzureStorageCredential
, TYPE = HADOOP
)
-- Create an external file format
CREATE EXTERNAL FILE FORMAT MyAzureCSVFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS(
FIELD_TERMINATOR = ',',
FIRST_ROW = 2)
);
--Create an external table
CREATE EXTERNAL TABLE dbo.FactInternetSalesNew
WITH(
LOCATION = '/files/Customer',
DATA_SOURCE = MyAzureStorage,
FILE_FORMAT = MyAzureCSVFormat
)
AS SELECT T1.* FROM dbo.FactInternetSales T1 JOIN dbo.DimCustomer T2
ON ( T1.CustomerKey = T2.CustomerKey )
OPTION ( HASH JOIN );
24. Polybase vs Copy
Polybase
• GA, stable
• Needs CONTROL permission
• Fastest (at present)
• Enables querying via external tables
• Challenges: row width, delimiters in text, fixed line delimiter, code complexity
COPY
• Currently in Preview
• Relaxed permissions
• Slightly slower, but improving
• No row width limit
• Supports delimiters in text
• Supports custom column and row delimiters
26. Ingest Flat files to tables
Ingest flat file data into Azure Storage (Azure Data Lake Store Gen2)
• When your data sources are on-premises, you need to move the
data to Azure Storage before ingestion.
• Data in other cloud platforms needs to be moved to Azure Storage
before ingestion.
Load from flat files as relational tables within the data warehouse
27. ADLS Gen 2 Filesystem
Ingest - Structuring ADLS Gen2
• Separate storage accounts for each environment: dev, test, &
production.
• Use a common folder structure to organize data by degree of
refinement.
Raw Data: /bronze
Query Ready: /silver
Report Ready: /gold
28. Ingest from on-premises data sources
The fastest approach is batch loading:
• Extract from the data source to multiple CSV/Parquet files
• Use AzCopy to upload to ADLS
The alternative is query-insert:
• Set up a self-hosted integration runtime on-premises
• Use a Synapse Pipeline to extract/copy
• Use a Synapse Pipeline to execute the load procedure
Large Migrations:
• Use Azure Data Box where available
29. Ingest from Cloud Data Sources
Options:
• Extract using Synapse Pipelines
• Write to ADLS as Parquet files
• AzCopy is a fast way to move files from S3 to ADLS
30. Ingest File Data Sources
Look out for these file format challenges…
Invalid file format
• Multiple row types
• Ragged columns
Row size > 1 MB
Datetime formats (e.g., use of nanosecond datetimes)
NULL value literals
Free form text
Parquet partitions
XML data
Use of non-standard line delimiters (e.g., CR)
…and try these Solutions
• Use Spark to pre-process and fix
data errors
• Flatten and parse XML in Spark
• Use COPY to ingest complex CSV
instead of Polybase
31. Ingest and Store – Formats
For batch flat files, Azure Synapse Analytics supports
CSV, Parquet, ORC, and JSON formats.
Ingest streaming data messages/events via Event Hub or IoT Hub.
The Parquet format is recommended for storing ingested data at the various
levels of refinement.
32. Ingest - When to BCP / Bulk Copy
Green fields: Never
• Network unreliability, no retries
• Needs VM in cloud, performance dependent on VM configuration
• Doesn’t support ADLS
• Reduces concurrency
• Control-gated performance limitation, cannot scale with DWU
Migrations:
• Use Synapse Pipeline or AzCopy
• Bulk Copy will work, but it will be slower than other methods
33. Ingest – Synapse Pipelines
• Un-check USE TYPE DEFAULT; it is not a best practice.
• Land data in ADLS Gen2, then ingest using Polybase / COPY.
• This means you can re-ingest the same data set without having to repeat extracts, and better
demonstrate ingestion performance.
34. Ingest and Store – Loading staging tables
Indexing
Use heap tables
Speed load performance by staging data in heap tables and temporary tables prior to running transformations (see the sketch below).
Only load to a CCI table if the test requires a load to a single table, followed by complex end-user queries against that table.
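A minimal sketch of the heap-staging pattern above; the table names, schema, and file URL are hypothetical, and a CREDENTIAL clause would be added as on the COPY slide if the storage account is not publicly accessible:
-- Round-robin heap staging table (hypothetical schema)
CREATE TABLE stg.Orders_Heap
(
    OrderId INT NOT NULL,
    Date DATE NOT NULL,
    Name VARCHAR(50),
    Country VARCHAR(5)
)
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN);

-- Fast, fully parallel load into the heap staging table; transformations run afterwards
COPY INTO stg.Orders_Heap
FROM 'https://XYZ.blob.core.windows.net/customerdatasets/orders/*.parquet'
WITH (FILE_TYPE = 'PARQUET');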
35. Ingest and Store – Loading staging tables
Distribution
Use Round Robin Distribution for:
Potentially useful tables created from raw input.
Temporary staging tables used in data preparation.
Other distribution considerations:
Never load to a REPLICATED table
Load to a ROUND_ROBIN table if the test is ONLY raw ingestion performance, or
if the table is very small
Load to a HASH table if the test is a pipeline with subsequent transformations
using the loaded table
36. Ingest – Scaling to shorten duration
Ingestion duration correlates with the number of DWUs allocated to the SQL pool.
For every doubling of the DWUs, the ingestion time is roughly halved:
t(2d) ≈ t(d) / 2, where d = DWUs and t(d) = ingestion time at d DWUs
This only applies from DW500c to DW30000c (see the scale-operation sketch below).
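A minimal sketch of the corresponding scale operation; the pool name is hypothetical and the statement runs against the master database of the logical server:
-- Double the DWUs to roughly halve ingestion time (hypothetical pool name)
ALTER DATABASE SQLPool01 MODIFY (SERVICE_OBJECTIVE = 'DW1000c');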
37. Pop Quiz
True or False: Both COPY command AND
Polybase require CONTROL permission
TRUE FALSE
38. Pop Quiz
True or False: Both COPY command AND
Polybase require CONTROL permission
TRUE FALSE
43. Agenda
1 Transform with Pipelines
Understanding and exploring the data.
2 Serverless transforms
Use Azure Synapse SQL Serverless to transform data with SQL scripts.
3 Transform with Spark
Transform data with Apache Spark in Synapse.
4 Best practices
Best practices for data transformation.
44. Typical Data Transformations
• Create persistent staging area / data vault
• Standardize data from different sources
• Remove duplicate rows
• Impute missing values
• Calculate derived values
• Prepare data for facts and dimensions
46. Code based transformations
Familiar gesture to generate T-SQL scripts from SQL
metadata objects such as tables.
Starting from a table, auto-generate a single line of
PySpark code that makes it easy to load a SQL table into a
Spark DataFrame and author transforms in a notebook.
48. No Code Transform with Mapping Data Flows
Overview
Mapping data flows offer data cleansing,
transformation, aggregation,
conversion, etc.
Benefits
• Cloud scale via Spark
execution
• Guided experience to
easily build resilient data
flows
• Flexibility to transform
data per user’s comfort
• Monitor and manage
dataflows from a single
pane of glass
50. Pop Quiz
What’s the largest scale TPC-H workload
SQL Serverless has successfully run?
A)
100TB
B)
1PB
C)
10PB
51. Pop Quiz
What’s the largest scale TPC-H workload
SQL Serverless has successfully run?
A)
100TB
B)
1PB
C)
10PB
52. Serverless SQL Pool
Overview
An interactive query service that
provides T-SQL queries over high
scale data in Azure Storage.
Benefits
• Pay-per-query with serverless model
• Query data in-place on the data lake
with T-SQL (no ETL)
• Supports various file formats
(Parquet, CSV, JSON)
• Integrates with Databricks,
HDInsight, Power BI, and the shared
Synapse metastore
[Diagram: serverless SQL (SQL on-demand) reads and writes data files in Azure Storage, curates and transforms data, syncs table definitions with the shared metastore, and serves queries to tools such as Power BI, Azure Data Studio, and SSMS.]
54. Serverless SQL – Querying CSV File
Overview
Uses OPENROWSET function to access data
Benefits
Ability to read CSV File with
- no header row, Windows style new line
- no header row, Unix-style new line
- header row, Unix-style new line
- header row, Unix-style new line, quoted
- header row, Unix-style new line, escape
- header row, Unix-style new line, tab-delimited
- without specifying all columns
SELECT *
FROM OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/csv/population/population.csv',
FORMAT = 'CSV',
FIELDTERMINATOR =',',
ROWTERMINATOR = '\n'
)
WITH (
[country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
[country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
[year] smallint,
[population] bigint
) AS [r]
WHERE
country_name = 'Luxembourg'
AND year = 2017
55. Serverless SQL – Querying folders
Overview
Uses OPENROWSET function to access data from
multiple files or folders
Benefits
• Offers reading multiple files/folders through usage
of wildcards
• Offers reading specific file/folder
• Supports use of multiple wildcards
SELECT YEAR(pickup_datetime) as [year], SUM(passenger_count) AS passengers_total,
COUNT(*) AS [rides_total]
FROM OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/csv/taxi/*.*',
FORMAT = 'CSV'
, FIRSTROW = 2 )
WITH (
vendor_id VARCHAR(100) COLLATE Latin1_General_BIN2,
pickup_datetime DATETIME2,
dropoff_datetime DATETIME2,
passenger_count INT,
trip_distance FLOAT,
rate_code INT,
store_and_fwd_flag VARCHAR(100) COLLATE Latin1_General_BIN2,
pickup_location_id INT,
dropoff_location_id INT,
payment_type INT,
fare_amount FLOAT,
extra FLOAT, mta_tax FLOAT,
tip_amount FLOAT,
tolls_amount FLOAT,
improvement_surcharge FLOAT,
total_amount FLOAT
) AS nyc
GROUP BY YEAR(pickup_datetime)
ORDER BY YEAR(pickup_datetime)
56. Serverless SQL – Querying specific files
Overview
filename() – returns the name of the file from which the
row originates
filepath() – returns the full path when called with no
parameter, or the fragment of the path that matched the
corresponding wildcard when a parameter is passed
Benefits
Provides source name/path of file/folder for
row result set
SELECT
r.filename() AS [filename]
,COUNT_BIG(*) AS [rows]
FROM OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-1*.csv',
FORMAT = 'CSV',
FIRSTROW = 2
)
WITH (
vendor_id INT,
pickup_datetime DATETIME2,
dropoff_datetime DATETIME2,
passenger_count SMALLINT,
trip_distance FLOAT,
<…columns>
) AS [r]
GROUP BY r.filename()
ORDER BY [filename]
Example of filename function
57. Serverless SQL – Querying specific files
SELECT
r.filepath() AS filepath
,r.filepath(1) AS [year]
,r.filepath(2) AS [month]
,COUNT_BIG(*) AS [rows]
FROM OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_*-*.csv',
FORMAT = 'CSV',
FIRSTROW = 2 )
WITH (
vendor_id INT,
pickup_datetime DATETIME2,
dropoff_datetime DATETIME2,
passenger_count SMALLINT,
trip_distance FLOAT,
<… columns>
) AS [r]
WHERE r.filepath(1) IN ('2017')
AND r.filepath(2) IN ('10', '11', '12')
GROUP BY r.filepath() ,r.filepath(1) ,r.filepath(2)
ORDER BY filepath
filepath year month rows
https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-10.csv 2017 10 9768815
https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-11.csv 2017 11 9284803
https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-12.csv 2017 12 9508276
Example of filepath function
58. Serverless SQL – Querying Parquet files
Overview
Uses OPENROWSET function to access data
Benefits
Ability to specify column names of interest
Offers auto reading of column names and data types
Provides target specific partitions using filepath function
SELECT
YEAR(pickup_datetime),
passenger_count,
COUNT(*) AS cnt
FROM
OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/parquet/taxi/*/*/*',
FORMAT='PARQUET'
) WITH (
pickup_datetime DATETIME2,
passenger_count INT
) AS nyc
GROUP BY
passenger_count,
YEAR(pickup_datetime)
ORDER BY
YEAR(pickup_datetime),
passenger_count
59. Serverless SQL – Querying JSON files
SELECT *
FROM
OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/json/books/book1.json',
FORMAT='CSV',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
)
WITH (
jsonContent varchar(8000)
) AS [r]
Overview
Reads JSON files and provides data in tabular
format
Benefits
Supports OPENJSON, JSON_VALUE and
JSON_QUERY functions
60. Serverless SQL – Querying JSON files
SELECT
JSON_QUERY(jsonContent, '$.authors') AS authors,
jsonContent
FROM
OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/json/books/*.json',
FORMAT='CSV',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
)
WITH (
jsonContent varchar(8000)
) AS [r]
WHERE
JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
SELECT
JSON_VALUE(jsonContent, '$.title') AS title,
JSON_VALUE(jsonContent, '$.publisher') as publisher,
jsonContent
FROM
OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/json/books/*.json',
FORMAT='CSV',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
)
WITH (
jsonContent varchar(8000)
) AS [r]
WHERE
JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
Example of JSON_QUERY function
Example of JSON_VALUE function
62. Transforming with Spark – Querying SQL Pools
import java.util.Properties

val jdbcUsername = "<SQL DB ADMIN USER>"
val jdbcPwd = "<SQL DB ADMIN PWD>"
val jdbcHostname = "servername.database.windows.net"
val jdbcPort = 1433
val jdbcDatabase = "<AZURE SQL DB NAME>"
val jdbc_url = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=60;"
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPwd}")
val sqlTableDf = spark.read.jdbc(jdbc_url, "dbo.Tbl1", connectionProperties)
// Construct a Spark DataFrame from a SQL Pool table
var df = spark.read.sqlanalytics("sql1.dbo.Tbl1")
// Write the Spark DataFrame into a SQL Pool table
df.write.sqlanalytics("sql1.dbo.Tbl2")
Existing Approach
New Approach Using
Scala
%%spark
val df = spark.read.sqlanalytics("sql1.dbo.Tbl1")
df.createOrReplaceTempView("tbl1")
%%pyspark
sample = spark.sql("SELECT * FROM tbl1")
sample.createOrReplaceTempView("tblnew")
%%spark
val df = spark.sql("SELECT * FROM tblnew")
df.write.sqlanalytics("sql1.dbo.tbl2",
Constants.INTERNAL)
Using Python
67. CCI vs Heap
• Transformations using heap tables are generally faster than CCI, because rows
must be assembled from columnstore segments when reading, and columnar
compression must be applied when writing to CCI targets.
• The wider the table and the more text fields it contains, the larger the advantage
of heap over CCI.
• Use Heap tables at transformation layer, use CCI tables where appropriate
at presentation layer
68. CCI Best Practice
• MAX data types not supported
• Table should hold at least 1 million rows × 60 distributions × number of partitions
• Load at least 100k rows per batch, up to 1 million
• Load using at least the LARGERC or STATICRC60 resource class
• Create a dedicated loading user (see the sketch below)
• Minimize UPDATE and DELETE (or REBUILD frequently)
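A minimal sketch of a dedicated loading user assigned to a larger resource class; the login and user names are hypothetical:
-- Run on the master database
CREATE LOGIN LoaderLogin WITH PASSWORD = '<Strong_Password_Here>';
-- Run on the dedicated SQL pool database
CREATE USER LoaderUser FOR LOGIN LoaderLogin;
GRANT ADMINISTER DATABASE BULK OPERATIONS TO LoaderUser; -- needed for COPY; also grant INSERT on the target tables
EXEC sp_addrolemember 'staticrc60', 'LoaderUser';        -- assign a static resource class for predictable load memory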
69. Automatic statistics management – Dedicated
SQL
Overview
Statistics are automatically created and maintained for dedicated
SQL pool. Incoming queries are analyzed, and individual column
statistics are generated on the columns that improve cardinality
estimates to enhance query performance.
Statistics are automatically updated as data modifications occur in
underlying tables. By default, these updates are synchronous but
can be configured to be asynchronous.
Statistics are considered out of date when:
• There was a data change on an empty table
• The number of rows in the table at time of statistics creation
was 500 or less, and more than 500 rows have been updated
• The number of rows in the table at time of statistics creation
was more than 500, and more than 500 + 20% of rows have
been updated
-- Turn on/off auto-create statistics settings
ALTER DATABASE {database_name}
SET AUTO_CREATE_STATISTICS { ON | OFF }
-- Turn on/off auto-update statistics settings
ALTER DATABASE {database_name}
SET AUTO_UPDATE_STATISTICS { ON | OFF }
-- Configure synchronous/asynchronous update
ALTER DATABASE {database_name}
SET AUTO_UPDATE_STATISTICS_ASYNC { ON | OFF }
-- Check statistics settings for a database
SELECT is_auto_create_stats_on,
is_auto_update_stats_on,
is_auto_update_stats_async_on
FROM sys.databases
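If needed, statistics can also be created or refreshed manually; a minimal sketch with hypothetical table and column names:
-- Create single-column statistics with a full scan, then refresh all statistics on the table
CREATE STATISTICS stats_OrderDate ON dbo.FactSales (OrderDate) WITH FULLSCAN;
UPDATE STATISTICS dbo.FactSales;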
70. Statistics (serverless SQL)
Automatic creation of statistics is available for Parquet and for CSV
The same applies to recreation of statistics
Only single-column statistics are currently supported
CSV sampling is not yet supported (only FULLSCAN)
Statistics can also be created manually; see the sketch below
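A minimal sketch of manual statistics creation in serverless SQL, assuming the sys.sp_create_openrowset_statistics procedure and reusing the earlier population.csv example (URL and column are illustrative):
EXEC sys.sp_create_openrowset_statistics N'
SELECT country_code
FROM OPENROWSET(
    BULK ''https://XXX.blob.core.windows.net/csv/population/population.csv'',
    FORMAT = ''CSV''
) WITH (
    [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2
) AS [r]';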
71. CTAS vs Insert / Update / Delete / Merge
• Prefer CTAS when you update or delete more than 10% of rows
• Prefer CTAS when you are updating or deleting a clustered columnstore index and do not have time for an offline rebuild (see the sketch below)
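A minimal sketch of the CTAS-instead-of-UPDATE pattern; the table, columns, and predicate are hypothetical:
-- Rewrite the table with the new values instead of updating in place
CREATE TABLE dbo.FactSales_New
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(CustomerKey))
AS
SELECT CustomerKey,
       OrderDate,
       CASE WHEN [Status] = 'Open' THEN 'Closed' ELSE [Status] END AS [Status],
       Amount
FROM dbo.FactSales;

-- Swap the rewritten table in
RENAME OBJECT dbo.FactSales TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;
DROP TABLE dbo.FactSales_Old;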
72. UPDATE FROM and DELETE FROM
• Azure Synapse Analytics does not currently support (*) joins in UPDATE FROM and DELETE FROM queries.
• Implement the join as a temporary / transient table, then UPDATE / DELETE from that table (see the sketch below)
(*) Coming soon
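A minimal sketch of the workaround with hypothetical tables; the join result is staged in a temporary table, then the DELETE uses a simple subquery:
-- Materialize the join result first
CREATE TABLE #rows_to_delete
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT f.OrderId
FROM dbo.FactSales f
JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey
WHERE c.Country = 'UK';

-- Then delete without a join in the DELETE statement
DELETE FROM dbo.FactSales
WHERE OrderId IN (SELECT OrderId FROM #rows_to_delete);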
73. Simple is better than clever
• Persist standard columns early, to avoid calculations and functions in the WHERE clause
• Unroll CTEs and JOIN sub-selects into transient / temporary tables to manage distribution (see the sketch below)
• Simple queries are easier to tune and debug
74. Pop Quiz #2
What is the optimal size for a
rowgroup in columnstore format in a
Synapse SQL Pool?
A)
99,999
B)
60,000,000
C)
1,048,576
75. Pop Quiz #2
What is the optimal size for a
rowgroup in columnstore format in a
Synapse SQL Pool?
A)
99,999
B)
60,000,000
C)
1,048,576
76. Azure Synapse Analytics end-to-end flow
[Architecture diagram spanning INGEST, STORE, PREPARE, TRANSFORM & ENRICH, SERVE, and VISUALIZE: Data Sources are ingested with Synapse Pipelines into a Data Lake on an ADLS Gen2 Storage Account; data is prepared, transformed, and enriched with Synapse Pipelines, Synapse SQL (Serverless or Provisioned), or Synapse Spark; served with Synapse SQL (Provisioned); and visualized with Power BI, all within Azure Synapse Analytics.]
78. Break
Please take this time for a short 15-minute break
If at any time you require assistance, please send a message to the “Need help – ask here” channel in
the Microsoft Teams site for this event
Relax and come back refreshed for our next activity
80. Agenda
1 Performance Patterns
Result set caching and materialized
views.
2 Table design
Distributions, partitions, and table
types.
3 Index design
Clustered columnstore index and
ordered variant, clustered index, heap
and non-clustered index.
81. Azure Synapse Analytics
Limitless analytics service with unmatched time to insight
Synapse Analytics
Platform
Azure
Data Lake Storage
Common Data Model
Enterprise Security
Optimized for Analytics
Data lake integrated and
Common Data Model aware
METASTORE
SECURITY
MANAGEMENT
MONITORING
Integrated platform services
for management, security,
monitoring, and the metastore
DATA INTEGRATION
SQL
Analytics Runtimes
Integrated analytics runtimes
available in dedicated and serverless form factors
Synapse SQL offering T-SQL for
batch, streaming and interactive
processing
Apache Spark for big data
processing with Python, Scala
and .NET
DEDICATED SERVERLESS
Form Factors
SQL
Languages
Python .NET Java Scala
Multiple languages suited to
different analytics workloads
Experience Synapse Analytics Studio
SaaS developer experiences for
code free and code first
Artificial Intelligence / Machine Learning / Internet of
Things
Intelligent Apps / Business Intelligence
Designed for analytics workloads
at any scale
METASTORE
SECURITY
MANAGEMENT
MONITORING
85. Use result-set caching to improve query performance when the same
queries are executed repeatedly against mainly static data.
The result-set cache is invalidated and refreshed when the underlying table data
or the query code changes.
The result-set cache persists when the SQL pool is paused and resumed.
Result-set caching motivation
86. Overview
Cache the results of a query in SQL pool storage. This enables
interactive response times for repetitive queries against tables
with infrequent data changes.
The result-set cache persists even if SQL pool is paused and
resumed later.
Query cache is invalidated and refreshed when underlying table
data or query code changes.
Result cache is evicted regularly based on a time-aware least
recently used algorithm (TLRU).
Benefits
• Enhances performance when same result is requested
repetitively
• Reduced load on server for repeated queries
• Offers monitoring of query execution with a result cache hit or
miss
Result-set caching
-- Turn on/off result-set caching for a database
-- Must be run on the MASTER database
ALTER DATABASE {database_name}
SET RESULT_SET_CACHING { ON | OFF }
-- Turn on/off result-set caching for a client session
-- Run on target Azure Synapse Analytics
SET RESULT_SET_CACHING {ON | OFF}
-- Check result-set caching setting for a database
-- Run on target Azure Synapse Analytics
SELECT is_result_set_caching_on
FROM sys.databases
WHERE name = {database_name}
-- Return all query requests with cache hits
-- Run on target data warehouse
SELECT *
FROM sys.dm_pdw_request_steps
WHERE command like '%DWResultCacheDb%'
AND step_index = 0
87. Result-set caching flow
Client sends query to
SQL pool
1 Query is processed using compute nodes
which pull data from remote storage,
process query and output back to client
app
2 Query results are cached in remote
storage so subsequent requests can
be served immediately
0101010001
0100101010
01010100010
100101010
Subsequent executions for the same
query bypass compute nodes and can
be fetched instantly from persistent
cache in remote storage
3
01010100010
100101010
Remote storage cache is evicted regularly
based on time, cache usage, and any
modifications to underlying table data.
4 Cache will need to be
regenerated if query results
have been evicted from cache
5
88. Overview
A materialized view pre-computes, stores, and maintains its
data like a table.
Materialized views are automatically updated when data in
underlying tables are changed. This is a synchronous operation
that occurs as soon as the data is changed.
The auto caching functionality allows Azure Synapse Analytics
Query Optimizer to consider using indexed view even if the
view is not referenced in the query.
Supported aggregations: MAX, MIN, AVG, COUNT, COUNT_BIG,
SUM, VAR, STDEV
Benefits
• Automatic and synchronous data refresh with data changes
in base tables. No user action is required.
• High availability and resiliency as regular tables
Materialized views
-- Create indexed view
CREATE MATERIALIZED VIEW Sales.vw_Orders
WITH
(
DISTRIBUTION = ROUND_ROBIN |
HASH(ProductID)
)
AS
SELECT SUM(UnitPrice*OrderQty) AS Revenue,
OrderDate,
ProductID,
COUNT_BIG(*) AS OrderCount
FROM Sales.SalesOrderDetail
GROUP BY OrderDate, ProductID;
GO
-- Disable index view and put it in suspended mode
ALTER INDEX ALL ON Sales.vw_Orders DISABLE;
-- Re-enable index view by rebuilding it
ALTER INDEX ALL ON Sales.vw_Orders REBUILD;
89. In this example, a query to get the year total sales per customer is shown to have a lot of data
shuffles and joins that contribute to slow performance:
Materialized views - example
-- Get year total sales per customer
WITH year_total AS
(
SELECT customer_id,
first_name,
last_name,
birth_country,
login,
email_address,
d_year,
SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
FROM customer cust
JOIN catalog_sales sales ON cust.sk = sales.sk
JOIN date_dim ON sales.sold_date = date_dim.date
GROUP BY customer_id, first_name, last_name, birth_country, login, email_address, d_year
)
SELECT TOP 100 …
FROM year_total …
WHERE …
ORDER BY …
Execution time: 103 seconds
Lots of data shuffles and joins needed to complete query
No relevant indexed views created on the data
warehouse
90. Now, we add an indexed view to the data warehouse to increase the performance of the previous
query. This view can be leveraged by the query even though it is not directly referenced.
Indexed Materialized views - example
-- Create materialized (indexed) view for query
CREATE MATERIALIZED VIEW nbViewCS WITH (DISTRIBUTION = HASH(customer_id)) AS
SELECT customer_id,
first_name,
last_name,
birth_country,
login,
email_address,
d_year,
SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
FROM customer cust
JOIN catalog_sales sales ON cust.sk = sales.sk
JOIN date_dim ON sales.sold_date = date_dim.date
GROUP BY customer_id, first_name, last_name, birth_country, login, email_address, d_year
Create indexed view with hash distribution on customer_id column
-- Get year total sales per customer
WITH year_total AS
(
SELECT customer_id,
first_name,
last_name,
birth_country,
login,
email_address,
d_year,
SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
FROM customer cust
JOIN catalog_sales sales ON cust.sk = sales.sk
JOIN date_dim ON sales.sold_date = date_dim.date
GROUP BY customer_id, first_name, last_name, birth_country, login, email_address, d_year
)
SELECT TOP 100 …
FROM year_total …
WHERE …
ORDER BY …
Original query – get year total sales per customer
91. SQL pool query optimizer automatically leverages the indexed view to speed up the same query. Notice
that the query does not need to reference the view directly
Indexed (materialized) views - example
-- Get year total sales per customer
WITH year_total AS
(
SELECT customer_id,
first_name,
last_name,
birth_country,
login,
email_address,
d_year,
SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
FROM customer cust
JOIN catalog_sales sales ON cust.sk = sales.sk
JOIN date_dim ON sales.sold_date = date_dim.date
GROUP BY customer_id, first_name, last_name, birth_country, login, email_address, d_year
)
SELECT TOP 100 …
FROM year_total …
WHERE …
ORDER BY …
Original query – no changes have been made to query
Execution time: 6 seconds
Optimizer leverages materialized view to reduce data shuffles and joins needed
92. EXPLAIN - provides query plan for SQL statement
without running the statement; view estimated cost
of the query operations.
EXPLAIN WITH_RECOMMENDATIONS - provides
query plan with recommendations to optimize the
SQL statement performance.
Materialized views- Recommendations
EXPLAIN WITH_RECOMMENDATIONS
select count(*)
from (
(select distinct c_last_name, c_first_name, d_date
from store_sales, date_dim, customer
where store_sales.ss_sold_date_sk = date_dim.d_date_sk
and store_sales.ss_customer_sk = customer.c_customer_sk
and d_month_seq between 1194 and 1194+11)
except
(select distinct c_last_name, c_first_name, d_date
from catalog_sales, date_dim, customer
where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
and d_month_seq between 1194 and 1194+11)
) top_customers
93. Indexed Materialized Views
• Indexed views cache the schema and data for a view in DW remote storage.
They are useful for improving the performance of ‘SELECT’ statement queries
that include aggregations
• Indexed views are automatically updated when data in underlying tables are
changed. This is a synchronous operation that occurs as soon as the data is
changed.
• The auto caching functionality allows Synapse Query Optimizer to consider using
indexed view even if the view is not referenced in the query
• Supported aggregations: MAX, MIN, AVG, COUNT, COUNT_BIG, SUM, VAR, STDEV
95. CREATE TABLE dbo.OrderTable
(
OrderId INT NOT NULL,
Date DATE NOT NULL,
Name VARCHAR(2),
Country VARCHAR(2)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH([OrderId]) |
ROUND_ROBIN |
REPLICATED
);
Round-robin distributed
Distributes table rows evenly across all
distributions at random.
Hash distributed
Distributes table rows across the Compute nodes
by using a deterministic hash function to assign
each row to one distribution.
Replicated
Full copy of table accessible on each Compute
node.
Tables – Distributions
96. CREATE TABLE partitionedOrderTable
(
OrderId INT NOT NULL,
Date DATE NOT NULL,
Name VARCHAR(2),
Country VARCHAR(2)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH([OrderId]),
PARTITION (
[Date] RANGE RIGHT FOR VALUES (
'2000-01-01', '2001-01-01', '2002-01-01',
'2003-01-01', '2004-01-01', '2005-01-01'
)
)
);
Overview
Table partitions divide data into smaller groups
In most cases, partitions are created on a date
column
Supported on all table types
RANGE RIGHT – Used for time partitions
RANGE LEFT – Used for number partitions
Benefits
• Improves efficiency and performance of
loading and querying by limiting the scope to
subset of data.
• Offers significant query performance
enhancements where filtering on the partition
key can eliminate unnecessary scans and
eliminate IO.
Tables – Partitions
97. OrderId Date Name Country
85016 11-2-2018 V UK
85018 11-2-2018 Q SP
85216 11-2-2018 Q DE
85395 11-2-2018 V NL
82147 11-2-2018 Q FR
86881 11-2-2018 D UK
93080 11-3-2018 R UK
94156 11-3-2018 S FR
96250 11-3-2018 Q NL
98799 11-3-2018 R NL
98015 11-3-2018 T UK
98310 11-3-2018 D DE
98979 11-3-2018 Z DE
98137 11-3-2018 T FR
… … … …
Logical table structure
Tables – Distributions & Partitions
Physical data distribution
( Hash distribution (OrderId), Date partitions )
OrderId Date Name Country
85016 11-2-2018 V UK
85018 11-2-2018 Q SP
85216 11-2-2018 Q DE
85395 11-2-2018 V NL
82147 11-2-2018 Q FR
86881 11-2-2018 D UK
… … … …
OrderId Date Name Country
93080 11-3-2018 R UK
94156 11-3-2018 S FR
96250 11-3-2018 Q NL
98799 11-3-2018 R NL
98015 11-3-2018 T UK
98310 11-3-2018 D DE
98979 11-3-2018 Z DE
98137 11-3-2018 T FR
… … … …
11-2-2018 partition
11-3-2018 partition
x 60 distributions (shards)
Distribution1
(OrderId 80,000 – 100,000)
…
• Each shard is partitioned with the same
date partitions
• A minimum of 1 million rows per
distribution and partition is needed for
optimal compression and performance
of clustered Columnstore tables
98. Common table distribution methods
Table Category Recommended Distribution Option
Fact
Use hash-distribution with clustered columnstore index. Performance improves because
hashing enables the platform to localize certain operations within the node itself during query
execution.
Operations that benefit:
COUNT(DISTINCT( <hashed_key> ))
OVER PARTITION BY <hashed_key>
most JOIN <table_name> ON <hashed_key>
GROUP BY <hashed_key>
Dimension
Use replicated for smaller tables. If tables are too large to store on each Compute node, use
hash-distributed.
Staging
Use round-robin for the staging table. The load with CTAS is faster. Once the data is in the staging table, use CTAS to move it to a production table with the appropriate distribution.
99. Question…
Why is it a best practice to load into a staging table, and then CTAS into the production table?
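One way to picture the answer, as a minimal sketch with hypothetical names (assuming an external table ext.FactSales as the source): the load lands quickly in a round-robin heap staging table, and the expensive redistribution and columnstore compression happen once, in a single fully parallel CTAS into the production table.
-- Fast load into a round-robin heap staging table
CREATE TABLE stg.FactSales
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.FactSales;

-- One parallel CTAS redistributes and compresses into the production table
CREATE TABLE dbo.FactSales
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(CustomerKey))
AS SELECT * FROM stg.FactSales;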
101. Hash Distribution:
Large fact tables exceeding several GBs with frequent inserts should use a hash
distribution.
Round Robin Distribution:
Potentially useful tables created from raw input.
Temporary staging tables used in data preparation.
Replicated Tables:
Lookup tables that range in size from hundreds of MBs to 1.5 GB should be replicated.
Works best when table size is less than 2 GB compressed.
Distributed table design recommendations
102. Automatic statistics management – Dedicated
SQL
Overview
Statistics are automatically created and maintained for dedicated
SQL pool. Incoming queries are analyzed, and individual column
statistics are generated on the columns that improve cardinality
estimates to enhance query performance.
Statistics are automatically updated as data modifications occur in
underlying tables. By default, these updates are synchronous but
can be configured to be asynchronous.
Statistics are considered out of date when:
• There was a data change on an empty table
• The number of rows in the table at time of statistics creation
was 500 or less, and more than 500 rows have been updated
• The number of rows in the table at time of statistics creation
was more than 500, and more than 500 + 20% of rows have
been updated
-- Turn on/off auto-create statistics settings
ALTER DATABASE {database_name}
SET AUTO_CREATE_STATISTICS { ON | OFF }
-- Turn on/off auto-update statistics settings
ALTER DATABASE {database_name}
SET AUTO_UPDATE_STATISTICS { ON | OFF }
-- Configure synchronous/asynchronous update
ALTER DATABASE {database_name}
SET AUTO_UPDATE_STATISTICS_ASYNC { ON | OFF }
-- Check statistics settings for a database
SELECT is_auto_create_stats_on,
is_auto_update_stats_on,
is_auto_update_stats_async_on
FROM sys.databases
104. Too many partitions
• Partitions can be useful when maintaining current rows in very large fact tables. Partition switching is a good alternative to a full CTAS (see the sketch below).
• Partitioning CCIs is only useful when the row count is greater than 60 million × #partitions
• In general, avoid partitions, particularly in POCs
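A minimal sketch of partition switching with hypothetical tables; both tables must share the same schema, distribution, and partition boundaries:
-- Swap a freshly loaded staging partition into the fact table as a metadata-only operation
ALTER TABLE dbo.FactSales_Staging SWITCH PARTITION 2 TO dbo.FactSales PARTITION 2;
-- Add WITH (TRUNCATE_TARGET = ON) to overwrite a non-empty target partition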
105. Views on Views
• Views on Views will not support performance optimization using
Materialized Views (more later)
• Views cannot be distributed
108. -- Create table with index
CREATE TABLE orderTable
(
OrderId INT NOT NULL,
Date DATE NOT NULL,
Name VARCHAR(2),
Country VARCHAR(2)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX |
HEAP |
CLUSTERED INDEX (OrderId)
);
-- Add non-clustered index to table
CREATE INDEX NameIndex ON orderTable (Name);
Clustered Columnstore index (Default Primary)
Highest level of data compression
Best overall query performance
Clustered index (Primary)
Performant for looking up a single to few rows
Heap (Primary)
Faster loading and landing temporary data
Best for small lookup tables
Nonclustered indexes (Secondary)
Enable ordering of multiple columns in a table
Allows multiple nonclustered on a single table
Can be created on any of the above primary indexes
More performant lookup queries
Tables – Indexes
109. OrderId Date Name Country
98137 11-3-2018 T FR
98310 11-3-2018 D DE
98799 11-3-2018 R NL
OrderId Date Name Country
82147 11-2-2018 Q FR
85016 11-2-2018 V UK
85018 11-2-2018 Q SP
OrderId Date Name Country
85016 11-2-2018 V UK
85018 11-2-2018 Q SP
85216 11-2-2018 Q DE
85395 11-2-2018 V NL
82147 11-2-2018 Q FR
86881 11-2-2018 D UK
93080 11-3-2018 R UK
94156 11-3-2018 S FR
96250 11-3-2018 Q NL
98799 11-3-2018 R NL
98015 11-3-2018 T UK
98310 11-3-2018 D DE
98979 11-3-2018 Z DE
98137 11-3-2018 T FR
… … … …
Logical table structure
OrderId
82147
85016
85018
85216
85395
Date
11-2-2018
Country
FR
UK
SP
DE
NL
Name
Q
V
Rowgroup1
Min (OrderId): 82147 | Max (OrderId): 85395
OrderId Date Name Country
98137 11-3-2018 T FR
98310 11-3-2018 D DE
98799 11-3-2018 R NL
98979 11-3-2018 Z DE
Delta Rowstore
SQL Analytics Columnstore Tables
Clustered columnstore index
(OrderId)
…
• Data stored in compressed columnstore segments after
being sliced into groups of rows (rowgroups/micro-partitions)
for maximum compression
• Rows are stored in the delta rowstore until the number of
rows is large enough to be compressed into a
columnstore
Clustered/Non-clustered rowstore index
(OrderId)
• Data is stored in a B-tree index structure for performant
lookup queries for particular rows.
• Clustered rowstore index: The leaf nodes in the structure
store the data values in a row (as pictured above)
• Non-clustered (secondary) rowstore index: The leaf nodes
store pointers to the data values, not the values
themselves
+
OrderId PageId
82147 1001
98137 1002
OrderId PageId
82147 1005
85395 1006
OrderId PageId
98137 1007
98979 1008
OrderId Date Name Country
82147 11-2-2018 Q FR
85016 11-2-2018 V UK
85018 11-2-2018 Q SP
OrderId Date Name Country
98137 11-3-2018 T FR
98310 11-3-2018 D DE
98799 11-3-2018 R NL
… …
110. Overview
Queries against tables with ordered columnstore segments can take advantage of improved
segment elimination to drastically reduce the time needed to service a query.
Ordered Clustered Columnstore Indexes
-- Insert data into table with ordered columnstore index
INSERT INTO sortedOrderTable
VALUES (1, '01-01-2019', 'Dave', 'UK')
-- Create Table with Ordered Columnstore Index
CREATE TABLE sortedOrderTable
(
OrderId INT NOT NULL,
Date DATE NOT NULL,
Name VARCHAR(2),
Country VARCHAR(2)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX ORDER (OrderId)
)
-- Create Clustered Columnstore Index on existing table
CREATE CLUSTERED COLUMNSTORE INDEX cciOrderId
ON dbo.OrderTable ORDER (OrderId)
111. Ordered CCI
• Queries against tables with ordered columnstore segments can take
advantage of improved segment elimination to drastically reduce the
time needed to service a query.
• Columnstore Segments are automatically updated as data is inserted,
updated, or deleted in data warehouse tables.
112. Clustered Columnstore indexes (CCI) are best for fact tables.
CCI offer the highest level of data compression and best query performance for tables
with over 100 million rows.
Heap tables are best for small lookup tables and recommended for tables with less
than 100 million rows.
Clustered Indexes may outperform CCI when very few rows need to be retrieved
quickly.
Add non-clustered indexes to improve performance for less selective queries.
Each additional index added to a table increases storage space required and processing time during
data loads.
Speed load performance by staging data in heap tables and temporary tables prior to
running transformations.
Choosing the right index
114. Too many indexes
• Start without indexes. The overhead of maintaining them can be
greater than their value.
• A primary-key non-clustered index may improve performance of joins
when fact tables are joined to very large (billion+) dimensions
115. Pop Quiz
Match the tables with their
recommended index! dbo.LineItem
30B rows
Primary fact table
dbo.Sales
150M rows
Single sales lookups
stg.stagingLineItem
10k rows
Staging table for loads
dbo.dimProduct
1.5k rows
Product information
Heap
CCI
CI
116. Pop Quiz
Match the tables with their
recommended index! dbo.LineItem
30B rows
Primary fact table
dbo.Sales
150M rows
Single sales lookups
stg.stagingLineItem
10k rows
Staging table for loads
dbo.dimProduct
1.5k rows
Product information
Heap
CCI
CI
117. Azure Synapse Analytics end-to-end flow
[Architecture diagram spanning INGEST, STORE, PREPARE, TRANSFORM & ENRICH, SERVE, and VISUALIZE: Data Sources are ingested with Synapse Pipelines into a Data Lake on an ADLS Gen2 Storage Account; data is prepared, transformed, and enriched with Synapse Pipelines, Synapse SQL (Serverless or Provisioned), or Synapse Spark; served with Synapse SQL (Provisioned); and visualized with Power BI, all within Azure Synapse Analytics.]
119. Break
Please take this time for a short 15-minute break
If at any time you require assistance, please send a message to the “Need help – ask here” channel in
the Microsoft Teams site for this event
Relax and come back refreshed for our next activity
120. Your challenge should you
choose to accept it:
Wide World Importers
needs your help!
Work as a team and
prove to them you have
what it takes.
It won’t be easy alone,
but as ONE Microsoft,
you can do this!
Complete as many of the
challenges as you can,
but don’t worry about
getting them all done.
This Photo by Unknown Author is licensed under CC BY
121. Step 1: Go to your
table group channel
Step 2: You will use your
cloud lab login for Azure
Step 3: Log in to Azure. Engage your table group via a Microsoft Teams “meet now” call to
collaborate with each other (meet now button in upper right corner of application)
POC Challenges 1 & 2: group-based lab exercises
GUIDE:
PoC Challenge
https://github.com/solliancenet/azure-synapse-analytics-workshop-300-2-day
122. Day 1 Wrap Up
What was covered today:
• Design, optimize, and secure a file system within ADLS Gen2
• Decide which Azure Synapse Analytics component to use for specific data engineering scenarios
• Implement optimization strategies for the data warehouse using SQL-based approaches in Azure Synapse Analytics.
What we will learn tomorrow:
• Address scenarios to monitor and manage Azure solutions
• Apply security concepts to a customer scenario
Thank you for your participation in the Day 1 Technical Boot
Camp
If at any time you require assistance, please send a message to the “Need help – ask here” channel in
the Microsoft Teams site for this event
123. We’d love to hear from you!
Day 1
1. Scan the QR code below using your smartphone or access the link https://aka.ms/BC_EMEA_Day1
2. You’ll also receive a link to this survey by email. You only need to complete it once.
3. Answer the survey questions to provide your feedback on Day 1 of Boot Camp. Be honest. Every bit of
feedback helps us improve.
#26: What specific approach would you say is the most efficient way for moving flat file data from the ingest storage locations to the data lake?
Follow the pattern of landing data in the data lake first: create pipelines that extract the source data and store it in Azure Data Lake Store Gen2 as Parquet files, then ingest from those flat files into relational tables within the data warehouse.
What storage service would you recommend to use?
They should use Azure Data Lake Store (ADLS) Gen2 (Azure Storage with hierarchical file systems).
#27: How would you recommend to structure the folder to manage the data at the various levels of refinement?
They should use Azure Data Lake Store (ADLS) Gen2 (Azure Storage with hierarchical file systems).
In ADLS, it is a best practice to have a dedicated Storage Account for production, and a separate Storage Account for dev and test workloads. This will ensure that dev or test workloads never interfere with production.
One common folder structure is to organize the data in separate folders by degree of refinement. For example a bronze folder contains the raw data, silver contains the cleaned, prepared and integrated data and gold contains data ready to support analytics, which might include final refinements such as pre-computed aggregates.
#31: When it comes to ingesting raw data in batch from new data sources, what data formats are supported by Synapse?
CSV, Parquet, ORC, JSON
How do you ingest streaming data?
Collect messages in Event Hub or IoT Hub and process them with Stream Analytics.
Azure offers purpose-built stream ingestion services such as Kafka on Azure HDInsight and Azure Event Hubs that are robust, proven, and performant.
(Preview) Azure Synapse will also support native stream ingestion through integration with Azure Stream Analytics.
When it comes to storing refined versions of the data for possible querying, what data format would you recommend they use? Why?
Parquet. There is industry alignment around the Parquet format for sharing data at the storage layer (e.g., across Hadoop, Databricks, and SQL engine scenarios). Parquet is a high-performance, column oriented format optimized for big data scenarios.
#33: https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql-data-warehouse#polybase-troubleshooting <- a good resource to share out as well
Lino 68
#34: What should you use for the fastest loading of staging tables?
A heap table. If you are loading data only to stage it before running more transformations, loading the data into a heap table is much faster than loading it into a clustered columnstore table.
A temporary table. Loading data to a temporary table loads faster than loading a table to permanent storage.
#35: How should you configure table distribution for tables created from the raw input that might be useful, or tables used for staging?
Consider round-robin distribution.
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.
As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. This extra step can slow down your queries. For example, joining a round-robin table usually requires reshuffling the rows, which is a performance hit.
Consider using the round-robin distribution for your table in the following scenarios:
When getting started as a simple starting point since it is the default
If there is no obvious joining key
If there is no good candidate column for hash distributing the table
If the table does not share a common join key with other tables
If the join is less significant than other joins in the query
When the table is a temporary staging table
#38: FALSE – only Polybase requires CONTROL permission
#62: Important Points:
No need to pre-create the table in the SQL Pool to write to, it will be created if it does not exist.
sqlanalytics only works in Scala language cells.
To read with other languages like Python, use Spark to register a temporary view and then query the view using Spark.SQL(“select * from viewname”)
To write with other languages, create a view for table you want write, then query from that view in Scala. Then write.
#69: By default, auto-update statistics are synchronous but can be configured to be asynchronous operations.
Statistics are recalculated when there are changes of more than 500 rows or more than 20% of rows are updated.
Statistics are updated opportunistically when queries are run.
#70: CSV support added [Cost management for serverless SQL pool - Azure Synapse Analytics | Microsoft Docs]
When statistics are created for a Parquet column, only the relevant column is read from files. When statistics are created for a CSV column, whole files are read and parsed.
#75: 1,048,576 – this is why we recommend at least 60 million rows for a CCI table. Azure Synapse automatically distributes data into 60 distributions, and each distribution needs at least 1 million rows for good rowgroup compression (or each partition, if your table is partitioned).
#83: Answer: PERFORMANCE for both! A well designed and optimized Azure Synapse can blow away the competition but falling prey to common performance pitfalls can hurt our chances.
#85: Their downstream reports are used by many users, which often means the same query is being executed repeatedly against data that does not change that often. What can WWI do to improve the performance of these types of queries? How does this approach work when the underlying data changes?
They should consider result-set caching.
Cache the results of a query in provisioned Azure Synapse SQL Pool storage. This enables interactive response times for repetitive queries against tables with infrequent data changes.
The result-set cache persists even if SQL pool is paused and resumed later.
Query cache is invalidated and refreshed when underlying table data or query code changes.
Result cache is evicted regularly based on a time-aware least recently used algorithm (TLRU).
#86: Result-set caching
The maximum size of the result-set cache is 1TB
Query results are persisted for a maximum of 48 hours but can be evicted earlier to save space based on the least recently used result
Disabled on DW by default unless turned on at a session level or the entire database level
Additional storage costs are incurred by caching query result sets
Check the is_result_set_caching column in the sys.databases DMV to show the result-set caching setting for a database
Users can tell if a query was executed with a result cache hit or miss by querying sys.dm_pdw_request_steps for commands where the value is like '%DWResultCacheDb%'
#87: Result-set caching
The maximum size of the result-set cache is 1TB
Query results are persisted for a maximum of 48 hours but can be evicted earlier to save space based on the least recently used result
Disabled on DW by default unless turned on at a session level or the entire database level
Additional storage costs are incurred by caching query result sets
Check the is_result_set_caching column in the sys.databases DMV to show the result-set caching setting for a database
Users can tell if a query was executed with a result cache hit or miss by querying sys.dm_pdw_request_steps for commands where the value is like '%DWResultCacheDb%'
#88: Current Limitations:
If MIN/MAX aggregates are used in the SELECT list, the indexed view will automatically be disabled when UPDATE and DELETE occur in the referenced base tables. Run ALTER INDEX with REBUILD to re-enable the indexed view
Only INNER JOIN is supported
Only HASH and ROUND_ROBIN distributions are supported
Only CLUSTERED COLUMNSTORE INDEX is supported
ALTER VIEW is not supported
#95: A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm.
Hash distributed
A hash-distributed table distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.
Since identical values always hash to the same distribution, the data warehouse has built-in knowledge of the row locations. SQL Data Warehouse uses this knowledge to minimize data movement during queries, which improves query performance.
Hash-distributed tables work well for large fact tables in a star schema. They can have very large numbers of rows and still achieve high performance. There are, of course, some design considerations that help you to get the performance the distributed system is designed to provide. Choosing a good distribution column is one such consideration that is described in this article.
Consider using a hash-distributed table when:
The table size on disk is more than 2 GB.
The table has frequent insert, update, and delete operations.
Round-robin distributed
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.
As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. This extra step can slow down your queries. For example, joining a round-robin table usually requires reshuffling the rows, which is a performance hit.
Consider using the round-robin distribution for your table in the following scenarios:
When getting started, since round robin is the default and a simple starting point
If there is no obvious joining key
If there is no good candidate column for hash-distributing the table
If the table does not share a common join key with other tables
If the join is less significant than other joins in the query
When the table is a temporary staging table
Replicated Tables
A replicated table has a full copy of the table accessible on each Compute node. Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation. Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed.
Replicated tables work well for small dimension tables in a star schema. Dimension tables are usually of a size that makes it feasible to store and maintain multiple copies. Dimensions store descriptive data that changes slowly, such as customer name and address, and product details. The slowly changing nature of the data leads to fewer rebuilds of the replicated table.
Consider using a replicated table when:
The table size on disk is less than 2 GB, regardless of the number of rows. To find the size of a table, you can use the DBCC PDW_SHOWSPACEUSED command: DBCC PDW_SHOWSPACEUSED('ReplTableCandidate').
The table is used in joins that would otherwise require data movement. When joining tables that are not distributed on the same column, such as a hash-distributed table to a round-robin table, data movement is required to complete the query. If one of the tables is small, consider a replicated table. We recommend using replicated tables instead of round-robin tables in most cases. To view data movement operations in query plans, use sys.dm_pdw_request_steps. The BroadcastMoveOperation is the typical data movement operation that can be eliminated by using a replicated table.
Source: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/design-guidance-for-replicated-tables
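As a minimal DDL sketch of the three distribution options described above (table and column names are illustrative, not from the workshop schema):

-- Large fact table: hash-distributed on a commonly joined key.
CREATE TABLE dbo.FactSale
(   SaleKey BIGINT NOT NULL, CustomerKey INT NOT NULL, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX );

-- Staging table with no obvious join key: round robin.
CREATE TABLE dbo.Staging_Sale
(   SaleKey BIGINT, CustomerKey INT, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- Small dimension (under 2 GB compressed): replicated to every Compute node.
CREATE TABLE dbo.DimCity
(   CityKey INT NOT NULL, CityName NVARCHAR(100) )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );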
#96: What are table partitions?
Table partitions enable you to divide your data into smaller groups of data. In most cases, table partitions are created on a date column. Partitioning is supported on all SQL Data Warehouse table types, including clustered columnstore, clustered index, and heap. Partitioning is also supported on all distribution types, including both hash and round-robin distributed tables.
Partitioning can benefit data maintenance and query performance. Whether it benefits both or just one is dependent on how data is loaded and whether the same column can be used for both purposes, since partitioning can only be done on one column.
Benefits to loads
The primary benefit of partitioning in SQL Data Warehouse is to improve the efficiency and performance of loading data by using partition deletion, switching, and merging. In most cases data is partitioned on a date column that is closely tied to the order in which the data is loaded into the database. One of the greatest benefits of using partitions to maintain data is the avoidance of transaction logging. While simply inserting, updating, or deleting data can be the most straightforward approach, with a little thought and effort, using partitioning during your load process can substantially improve performance.
Partition switching can be used to quickly remove or replace a section of a table. For example, a sales fact table might contain just the data for the past 36 months. At the end of every month, the oldest month of sales data is deleted from the table. This data could be removed with a DELETE statement for the oldest month. However, deleting a large amount of data row by row can take too much time, as well as create the risk of large transactions that take a long time to roll back if something goes wrong. A more optimal approach is to drop the oldest partition of data. Where deleting the individual rows could take hours, deleting an entire partition could take seconds.
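A minimal sketch of the switch-out pattern, assuming a monthly partitioned fact table (all names and boundary values are illustrative):

-- Partitioned, hash-distributed fact table.
CREATE TABLE dbo.FactSale
(   SaleKey BIGINT NOT NULL, CustomerKey INT NOT NULL,
    SaleDate DATE NOT NULL, Amount DECIMAL(18,2) NOT NULL )
WITH ( DISTRIBUTION = HASH(CustomerKey),
       CLUSTERED COLUMNSTORE INDEX,
       PARTITION ( SaleDate RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01') ) );

-- Archive table with the same structure and partition scheme.
CREATE TABLE dbo.FactSale_Archive
(   SaleKey BIGINT NOT NULL, CustomerKey INT NOT NULL,
    SaleDate DATE NOT NULL, Amount DECIMAL(18,2) NOT NULL )
WITH ( DISTRIBUTION = HASH(CustomerKey),
       CLUSTERED COLUMNSTORE INDEX,
       PARTITION ( SaleDate RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01') ) );

-- Switch an old month (here partition 2 = January 2023) out in seconds
-- instead of deleting it row by row.
ALTER TABLE dbo.FactSale SWITCH PARTITION 2 TO dbo.FactSale_Archive PARTITION 2;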
Benefits to queries
Partitioning can also be used to improve query performance. A query that applies a filter to partitioned data can limit the scan to only the qualifying partitions. This method of filtering can avoid a full table scan and only scan a smaller subset of data. With the introduction of clustered columnstore indexes, the performance benefit of predicate elimination is less pronounced, but in some cases it can still help queries. For example, if the sales fact table is partitioned into 36 months using the sales date field, then queries that filter on the sale date can skip searching in partitions that don't match the filter.
Sizing partitions
While partitioning can be used to improve performance in some scenarios, creating a table with too many partitions can hurt performance under some circumstances. These concerns are especially true for clustered columnstore tables. For partitioning to be helpful, it is important to understand when to use partitioning and the number of partitions to create. There is no hard and fast rule as to how many partitions are too many; it depends on your data and how many partitions you are loading simultaneously. A successful partitioning scheme usually has tens to hundreds of partitions, not thousands.
When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, SQL Data Warehouse already divides each table into 60 distributed databases. Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given that SQL Data Warehouse has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition. For more information, see the Indexing article, which includes queries that can assess the quality of cluster columnstore indexes.
Source: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition
#99: Answer: Because the riskiest part of the load is moving data from external storage into the SQL pool, loading first into a round-robin heap staging table ensures that the riskiest part of the load happens the fastest!
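A minimal sketch of that staging pattern (the storage URL, table, and column names are placeholders):

-- Round-robin heap staging table: the fastest target for the raw load.
CREATE TABLE dbo.Staging_Sale
(   SaleKey BIGINT, CustomerKey INT, SaleDate DATE, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- Parallel load from the data lake into the staging table.
COPY INTO dbo.Staging_Sale
FROM 'https://wwistorage.dfs.core.windows.net/raw/sales/*.parquet'
WITH ( FILE_TYPE = 'PARQUET', CREDENTIAL = (IDENTITY = 'Managed Identity') );

-- Reshape into the final hash-distributed columnstore table with CTAS.
CREATE TABLE dbo.FactSale_Loaded
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM dbo.Staging_Sale;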
#101: What are the typical issues they should look out for with regards to distributed table design for the following scenarios?
Their smallest fact table exceeds several gigabytes and, by its nature, experiences frequent inserts.
They should use a hash distribution.
A hash-distributed table distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.
Since identical values always hash to the same distribution, the data warehouse has built-in knowledge of the row locations. SQL Data Warehouse uses this knowledge to minimize data movement during queries, which improves query performance.
Hash-distributed tables work well for large fact tables in a star schema. They can have very large numbers of rows and still achieve high performance.
Consider using a hash-distributed table when:
The table size on disk is more than 2 GB.
The table has frequent insert, update, and delete operations.
As they develop the data warehouse, the WWI data team identified some tables created from the raw input that might be useful, but they don’t currently join to other tables and they are not sure of the best columns they should use for distributing the data.
They should consider round-robin distribution.
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.
As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. This extra step can slow down your queries. For example, joining a round-robin table usually requires reshuffling the rows, which is a performance hit.
Consider using the round-robin distribution for your table in the following scenarios:
When getting started, since round robin is the default and a simple starting point
If there is no obvious joining key
If there is no good candidate column for hash-distributing the table
If the table does not share a common join key with other tables
If the join is less significant than other joins in the query
When the table is a temporary staging table
Their data engineers sometimes use temporary staging tables in their data preparation.
They should use a round-robin distributed table.
They have lookup tables that range from several hundred MB to 1.5 GB.
They should consider using replicated tables.
A replicated table has a full copy of the table accessible on each Compute node. Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation. Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed.
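As a hedged sketch of how an existing lookup table could be rebuilt as a replicated table using CTAS and a name swap (all object names are illustrative):

-- Rebuild the lookup table with REPLICATE distribution.
CREATE TABLE dbo.DimStore_Replicated
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM dbo.DimStore;

-- Swap names so downstream queries keep working.
RENAME OBJECT dbo.DimStore TO DimStore_Old;
RENAME OBJECT dbo.DimStore_Replicated TO DimStore;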
#102: The auto-update statistics option can be changed from synchronous to asynchronous
Statistics are recalculated when more than 500 rows have changed or 20% of the rows are updated
By default, auto-update statistics runs synchronously but can be configured to run asynchronously (see the sketch below)
Statistics are updated opportunistically when queries are run.
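A minimal sketch, assuming the database supports the SQL Server-style asynchronous option referenced above (the database, table, and statistics names are illustrative):

-- Create and refresh statistics on a column used in joins and filters.
CREATE STATISTICS stats_FactSale_CustomerKey ON dbo.FactSale (CustomerKey);
UPDATE STATISTICS dbo.FactSale;

-- Assumption: switch auto-update of statistics to asynchronous at the database level.
ALTER DATABASE [WWI_DW] SET AUTO_UPDATE_STATISTICS_ASYNC ON;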
#106: Clustered Columnstore, Clustered index (and non-clustered index), Heap
#108: Clustered columnstore indexes
By default, SQL Data Warehouse creates a clustered columnstore index when no index options are specified on a table. Clustered columnstore tables offer both the highest level of data compression as well as the best overall query performance. Clustered columnstore tables will generally outperform clustered index or heap tables and are usually the best choice for large tables. For these reasons, clustered columnstore is the best place to start when you are unsure of how to index your table.
There are a few scenarios where clustered columnstore may not be a good option:
Columnstore tables do not support varchar(max), nvarchar(max) and varbinary(max). Consider heap or clustered index instead.
Columnstore tables may be less efficient for transient data. Consider heap and perhaps even temporary tables.
Small tables with less than 100 million rows. Consider heap tables.
Clustered and nonclustered indexes
Clustered indexes may outperform clustered columnstore tables when a single row needs to be quickly retrieved. For queries where a single row or very few rows must be looked up with extreme speed, consider a clustered index or a nonclustered secondary index. The disadvantage of using a clustered index is that the only queries that benefit are the ones that use a highly selective filter on the clustered index column. To improve filtering on other columns, a nonclustered index can be added to those columns. However, each index added to a table adds both space and processing time to loads.
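A minimal sketch of that point-lookup pattern (table, column, and index names are illustrative):

-- Clustered index on the highly selective lookup key.
CREATE TABLE dbo.DimCustomerLookup
(   CustomerKey INT NOT NULL, CustomerCode NVARCHAR(20) NOT NULL, CustomerName NVARCHAR(200) )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED INDEX (CustomerKey) );

-- Nonclustered secondary index so filters on another column also stay fast.
CREATE INDEX ix_DimCustomerLookup_Code ON dbo.DimCustomerLookup (CustomerCode);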
Heap tables
When you are temporarily landing data in SQL Data Warehouse, you may find that using a heap table makes the overall process faster. This is because loads to heaps are faster than loads to indexed tables, and in some cases the subsequent read can be done from cache. If you are loading data only to stage it before running more transformations, loading into a heap table is much faster than loading into a clustered columnstore table. In addition, loading data into a temporary table is faster than loading into permanent storage.
For small lookup tables with less than 100 million rows, heap tables often make sense. Clustered columnstore tables begin to achieve optimal compression once there are more than 100 million rows.
Source: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-index
#112: Their sales transaction dataset exceeds a billion rows. For their downstream reporting queries, they need to be able to join, project, and filter these rows in no more than tens of seconds. WWI is concerned their data is just too big to do this.
What specific indexing techniques should they use to reach this kind of performance for their fact tables? Why?
Clustered Columnstore Indexes. As they offer the highest level of data compression and best overall query performance, columnstore indexes are usually the best choice for large tables such as fact tables.
Would you recommend the same approach for tables they have with less than 100 million rows?
No. For "small" tables with less than 100 million rows, they should consider Heap tables.
How should they configure indexes on their smaller lookup tables (e.g., those that contain store names and addresses)?
They should consider using heap tables. For small lookup tables with less than 100 million rows, heap tables often make sense. Clustered columnstore tables begin to achieve optimal compression once there are more than 100 million rows.
What would you suggest for their larger lookup tables that are used just for point lookups that retrieve only a single row? How could they make these more flexible so that queries filtering against different sets of columns would still yield efficient lookups?
Use clustered indexes. Clustered indexes may outperform clustered columnstore tables when a single row needs to be quickly retrieved. For queries where a single row or a very small number of rows must be retrieved with extreme speed, consider a clustered index or a non-clustered secondary index.
The disadvantage to using a clustered index is that the only queries that benefit are the ones that use a highly selective filter on the clustered index column. To improve filter performance on other columns, a non-clustered index can be added to other columns.
However, be aware that each index which is added to a table adds both space and processing time to data loads.
What should they use for the fastest loading of staging tables?
A heap table. If you are loading data only to stage it before running more transformations, loading into a heap table is much faster than loading into a clustered columnstore table.
A temporary table. Loading data into a temporary table is faster than loading into permanent storage (a minimal sketch follows).
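A minimal sketch of a session-scoped temporary staging heap, under the assumptions above (names are illustrative):

-- Session-scoped temporary heap staging table created with CTAS;
-- WHERE 1 = 0 copies only the column structure, not the data.
CREATE TABLE #Staging_Sale
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP )
AS SELECT SaleKey, CustomerKey, Amount FROM dbo.FactSale WHERE 1 = 0;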