2. Design Presentation: Data Loading & Data Lake Organization
Day 1
Welcome, Objectives
Break
Break
PoC Challenge 1
Long Break
Design Presentation: DW Optimization
PoC Challenge 2
Please refer to the descriptions below.
Types of activities:
Attendees will participate in two different types of activities:
Location of activities:
Each activity will take place in the following location:
A look into the Day 1 agenda:
Keynote
Presentations
Challenges
All-up Session: Teams bridge in calendar invite
Independent working time: No meeting, working in Spektra labs
Design Presentation: Data Transformations
Continue PoC Challenge 1
Break
Continue PoC Challenge 2
Break
3. Ciprian Jichici
Chief Data Scientist
[email protected]
Ciprian Jichici is the Chief Data Scientist of
Solliance, one of the top worldwide Microsoft AI
partners.
He is recognized internationally as a Microsoft
Regional Director and a Microsoft Most Valuable
Professional for Artificial Intelligence and Quantum
Computing.
Cloud Computing, Artificial Intelligence, and
Machine Learning are among the key areas of his
expertise, built over 20+ years in IT.
Ciprian is also very passionate about quantum
physics and, consequently, about quantum
computing.
linkedin.com/in/ciprianjichici/
6. Azure Synapse Analytics
Limitless analytics service with unmatched time to insight
Synapse Analytics
Platform
Azure
Data Lake Storage
Common Data Model
Enterprise Security
Optimized for Analytics
Data lake integrated and
Common Data Model aware
METASTORE
SECURITY
MANAGEMENT
MONITORING
Integrated platform services
for management, security,
monitoring, and the metastore
DATA INTEGRATION
SQL
Analytics Runtimes
Integrated analytics runtimes
available in dedicated and serverless form factors
Synapse SQL offering T-SQL for
batch, streaming and interactive
processing
Apache Spark for big data
processing with Python, Scala
and .NET
DEDICATED SERVERLESS
Form Factors
SQL
Languages
Python .NET Java Scala
Multiple languages suited to
different analytics workloads
Experience Synapse Analytics Studio
SaaS developer experiences for
code free and code first
Artificial Intelligence / Machine Learning / Internet of
Things
Intelligent Apps / Business Intelligence
Designed for analytics workloads
at any scale
METASTORE
SECURITY
MANAGEMENT
MONITORING
9. Azure Services
Command and
Control
L E G E N D
Data
Components of Orchestration
Trigger
On demand
Schedule
Data Window
Event
Pipeline
Activity
foreach (…)
Activity
Activity Activity
Activity
Self-hosted
Integration Runtime
On-prem
Apps & Data
Azure
Integration Runtime
Linked
Service
Synapse Pipelines share a codebase with Azure Data Factory
10. Pipelines
Create pipelines to ingest, transform, and load data with 90+ built-in connectors.
A wide range of activities is available for pipelines to perform.
11. Pipelines
Overview
• Provide the ability to load data from a storage
account to the desired linked service.
• Load data by manual execution of a
pipeline or by orchestration.
Benefits
• Supports common loading patterns.
• Fully parallel loading into data lake or
SQL tables.
• Graphical development experience.
12. Integration runtimes
Overview
Integration runtimes are the compute infrastructure
used by Pipelines to provide the data integration
capabilities across different network environments. An
integration runtime provides the bridge between the
activity and linked services.
Benefits
• Offers Azure Integration Runtime or Self-Hosted
Integration Runtime
• Azure Integration Runtime – provides fully managed,
serverless compute in Azure
• Self-Hosted Integration Runtime – uses compute
resources on an on-premises machine or a VM inside a
private network
13. Linked services
Overview
Linked services define the connection information
needed to connect to external resources.
Benefits
• Offers 90+ pre-built connectors
• Easy cross platform data migration
• Represents data store or compute resources
14. Develop Hub - Data Flows
Data flows are a visual way of specifying how to transform data.
Provides a code-free experience.
15. Data Flow Capabilities
Handle upserts, updates, and deletes on SQL sinks
Add new partition methods
Add schema drift support
Add file handling (move files after read, write files to file names described in rows, etc.)
New inventory of functions (e.g., hash functions for row comparison)
Commonly used ETL patterns (sequence generator / lookup transformation / SCD…)
Data lineage – capturing sink column lineage & impact analysis (invaluable for enterprise deployments)
Implement commonly used ETL patterns as templates (SCD Type 1, Type 2, Data Vault)
16. Triggers
Overview
Triggers represent a unit of processing that
determines when a pipeline execution needs to
be kicked off.
Data Integration offers three trigger types:
1. Schedule – fires on a schedule defined by a start date, recurrence, and optional end date
2. Event – fires on a specified storage event
3. Tumbling window – fires at a periodic time interval from a specified start date, while retaining state
It also provides the ability to monitor pipeline runs and control trigger execution.
17. Datasets
Orchestration datasets describe data that is persisted.
Once a dataset is defined, it can be used in pipelines as a source of data or as a sink of data.
18. Connectors by category
Azure (15): Blob storage, Cosmos DB - SQL API, Cosmos DB - MongoDB API, Data Explorer, Data Lake Storage Gen1, Data Lake Storage Gen2, Database for MariaDB, Database for MySQL, Database for PostgreSQL, File Storage, SQL Database, SQL Database MI, SQL Data Warehouse, Search index, Table storage
Database & DW (26): Amazon Redshift, DB2, Drill, Google BigQuery, Greenplum, HBase, Hive, Apache Impala, Informix, MariaDB, Microsoft Access, MySQL, Netezza, Oracle, Phoenix, PostgreSQL, Presto, SAP BW Open Hub, SAP BW via MDX, SAP HANA, SAP table, Spark, SQL Server, Sybase, Teradata, Vertica
File Storage (6): Amazon S3, File system, FTP, Google Cloud Storage, HDFS, SFTP
File Formats (6): AVRO, Binary, Delimited Text, JSON, ORC, Parquet
NoSQL (3): Cassandra, Couchbase, MongoDB
Services and App (28): Amazon MWS, CDS for Apps, Concur, Dynamics 365, Dynamics AX, Dynamics CRM, Google AdWords, HubSpot, Jira, Magento, Marketo, Office 365, Oracle Eloqua, Oracle Responsys, Oracle Service Cloud, PayPal, QuickBooks, Salesforce, SF Service Cloud, SF Marketing Cloud, SAP C4C, SAP ECC, ServiceNow, Shopify, Square, Web table, Xero, Zoho
Generic (4): Generic HTTP, Generic OData, Generic ODBC, Generic REST
90+ Connectors out of the box
19. Pop Quiz
Which one of these is NOT a
component of a Synapse pipeline?
A)
I.R.
C)
Table
D)
Activity
B)
Linked
Service
20. Pop Quiz
Which one of these is NOT a
component of a Synapse pipeline?
A)
I.R.
C)
Table
D)
Activity
B)
Linked
Service
22. Overview
Copies data from source to destination
Benefits
• Retrieves data from all files in the folder and all its subfolders.
• Supports multiple locations from the same storage account, separated by commas.
• Supports Azure Data Lake Storage (ADLS) Gen 2 and
Azure Blob Storage.
• Supports CSV, PARQUET, ORC file formats
COPY command
COPY INTO test_1
FROM 'https://XYZ.blob.core.windows.net/customerdatasets/test_1.txt'
WITH (
FILE_TYPE = 'CSV',
CREDENTIAL=(IDENTITY= 'Shared Access Signature',
SECRET='<Your_SAS_Token>'),
FIELDQUOTE = '"',
FIELDTERMINATOR=';',
ROWTERMINATOR='0X0A',
ENCODING = 'UTF8',
DATEFORMAT = 'ymd',
MAXERRORS = 10,
ERRORFILE = '/errorsfolder/', -- path starting from the storage container
IDENTITY_INSERT = 'OFF'
)
COPY INTO test_parquet
FROM 'https://XYZ.blob.core.windows.net/customerdatasets/test.parquet'
WITH (
FILE_FORMAT = myFileFormat,
CREDENTIAL=(IDENTITY= 'Shared Access Signature',
SECRET='<Your_SAS_Token>')
)
23. Create External Table As Select (Polybase)
Overview
Creates an external table and then exports, in parallel, the results of
the SELECT statement to the external location. Querying an external
table imports data into the database only for the duration of the query.
Steps:
1. Create Master Key
2. Create Credentials
3. Create External Data Source
4. Create External Data Format
5. Create External Table
-- Create a database master key if one does not already exist
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me!nfo'
;
-- Create a database scoped credential with Azure storage account key as the secret.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
IDENTITY = '<my_account>'
, SECRET = '<azure_storage_account_key>'
;
-- Create an external data source with CREDENTIAL option.
CREATE EXTERNAL DATA SOURCE MyAzureStorage
WITH
( LOCATION = 'wasbs://[email protected]/'
, CREDENTIAL = AzureStorageCredential
, TYPE = HADOOP
)
-- Create an external file format
CREATE EXTERNAL FILE FORMAT MyAzureCSVFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS(
FIELD_TERMINATOR = ',',
FIRST_ROW = 2)
);
--Create an external table
CREATE EXTERNAL TABLE dbo.FactInternetSalesNew
WITH(
LOCATION = '/files/Customer',
DATA_SOURCE = MyAzureStorage,
FILE_FORMAT = MyAzureCSVFormat
)
AS SELECT T1.* FROM dbo.FactInternetSales T1 JOIN dbo.DimCustomer T2
ON ( T1.CustomerKey = T2.CustomerKey )
OPTION ( HASH JOIN );
24. Polybase vs Copy
Polybase
• GA, stable
• Needs CONTROL permission
• Fastest (at present)
• Enables querying via external tables
• Challenges: row width, delimiters in text, fixed line delimiter, code complexity
COPY
• Currently in Preview
• Relaxed permissions
• Slightly slower, but improving
• No row width limit
• Supports delimiters in text
• Supports custom column and row delimiters
26. Ingest Flat files to tables
Ingest flat file data into Azure Storage (Azure Data Lake Store Gen2)
• When your data sources are on-premises, you need to move the
data to Azure Storage before ingestion.
• Data in other cloud platforms needs to be moved to Azure Storage
before ingestion.
Load from flat files as relational tables within the data warehouse
27. ADLS Gen 2 Filesystem
Ingest - Structuring ADLS Gen2
• Separate storage accounts for each environment: dev, test, &
production.
• Use a common folder structure to organize data by degree of
refinement.
Raw Data: /bronze
Query Ready: /silver
Report Ready: /gold
28. Ingest from on-premises data sources
The fastest approach is batch loading:
• Extract from the data source to multiple CSV/Parquet files
• Use AzCopy to upload to ADLS
The alternative is query-insert:
• Set up a self-hosted integration runtime on-premises
• Use a Synapse Pipeline to extract/copy
• Use a Synapse Pipeline to execute the load procedure
Large Migrations:
• Use Azure Data Box where available
29. Ingest from Cloud Data Sources
Options:
• Extract using Synapse Pipelines
• Write to ADLS as Parquet files
• AzCopy is a fast way to move files from S3 to ADLS
30. Ingest File Data Sources
Look out for these file format challenges…
Invalid file format
• Multiple row types
• Ragged columns
Row size > 1 MB
Datetime formats (e.g., use of nanosecond datetimes)
NULL value literals
Free form text
Parquet partitions
XML data
Use of non-standard line delimiters (e.g., CR)
…and try these Solutions
• Use Spark to pre-process and fix
data errors
• Flatten and parse XML in Spark
• Use COPY to ingest complex CSV
instead of Polybase
31. Ingest and Store – Formats
For batch flat files, Azure Synapse Analytics supports
CSV, Parquet, ORC, and JSON formats.
Ingest streaming data messages/events via Event Hub or IoT Hub.
The Parquet format is recommended for storing ingested data at the various
levels of refinement.
32. Ingest - When to BCP / Bulk Copy
Green fields: Never
• Network unreliability, no retries
• Needs VM in cloud, performance dependent on VM configuration
• Doesn’t support ADLS
• Reduces concurrency
• Control-gated performance limitation, cannot scale with DWU
Migrations:
• Use Synapse Pipeline or AzCopy
• Bulk Copy will work, but it will be slower than other methods
33. Ingest – Synapse Pipelines
• Un-check USE TYPE DEFAULT; it is not a best practice.
• Land data in ADLS Gen2, then ingest using Polybase / COPY.
• This means you can re-ingest the same data set without having to repeat extracts, and better
demonstrate ingestion performance.
34. Ingest and Store – Loading staging tables
Indexing
Use heap tables
Speed load performance by staging data in heap tables and temporary tables prior to running transformations (see the sketch below).
Only load to a CCI table if the test requires a load to a single table, followed by complex end-user queries against that table.
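A minimal sketch of the heap-staging pattern above; the table names, schema, and file URL are hypothetical, and a CREDENTIAL clause would be added as on the COPY slide if the storage account is not publicly accessible:
-- Round-robin heap staging table (hypothetical schema)
CREATE TABLE stg.Orders_Heap
(
    OrderId INT NOT NULL,
    Date DATE NOT NULL,
    Name VARCHAR(50),
    Country VARCHAR(5)
)
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN);

-- Fast, fully parallel load into the heap staging table; transformations run afterwards
COPY INTO stg.Orders_Heap
FROM 'https://XYZ.blob.core.windows.net/customerdatasets/orders/*.parquet'
WITH (FILE_TYPE = 'PARQUET');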
35. Ingest and Store – Loading staging tables
Distribution
Use Round Robin Distribution for:
Potentially useful tables created from raw input.
Temporary staging tables used in data preparation.
Other distribution considerations:
Never load to a REPLICATED table
Load to a ROUND_ROBIN table if the test is ONLY raw ingestion performance, or
if the table is very small
Load to a HASH table if the test is a pipeline with subsequent transformations
using the loaded table
36. Ingest – Scaling to shorten duration
Ingestion duration correlates with the number of DWUs allocated to the SQL pool.
For every doubling of the DWUs, the ingestion time is roughly halved:
t(2d) ≈ t(d) / 2, where d = DWUs and t(d) = ingestion time at d DWUs
This only applies from DW500c to DW30000c (see the scale-operation sketch below).
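A minimal sketch of the corresponding scale operation; the pool name is hypothetical and the statement runs against the master database of the logical server:
-- Double the DWUs to roughly halve ingestion time (hypothetical pool name)
ALTER DATABASE SQLPool01 MODIFY (SERVICE_OBJECTIVE = 'DW1000c');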
37. Pop Quiz
True or False: Both COPY command AND
Polybase require CONTROL permission
TRUE FALSE
38. Pop Quiz
True or False: Both COPY command AND
Polybase require CONTROL permission
TRUE FALSE
43. Agenda
1 Transform with Pipelines
Understanding and exploring the data.
2 Serverless transforms
Use Azure Synapse SQL Serverless to transform data with SQL scripts.
3 Transform with Spark
Transform data with Apache Spark in Synapse.
4 Best practices
Best practices for data transformation.
44. Typical Data Transformations
• Create persistent staging area / data vault
• Standardize data from different sources
• Remove duplicate rows
• Impute missing values
• Calculate derived values
• Prepare data for facts and dimensions
46. Code based transformations
Familiar gesture to generate T-SQL scripts from SQL
metadata objects such as tables.
Starting from a table, auto-generate a single line of
PySpark code that makes it easy to load a SQL table into a
Spark DataFrame and author transforms in a notebook.
48. No Code Transform with Mapping Data Flows
Overview
Mapping data flows offer data cleansing,
transformation, aggregation,
conversion, etc.
Benefits
• Cloud scale via Spark
execution
• Guided experience to
easily build resilient data
flows
• Flexibility to transform
data per user’s comfort
• Monitor and manage
dataflows from a single
pane of glass
50. Pop Quiz
What’s the largest scale TPC-H workload
SQL Serverless has successfully run?
A)
100TB
B)
1PB
C)
10PB
51. Pop Quiz
What’s the largest scale TPC-H workload
SQL Serverless has successfully run?
A)
100TB
B)
1PB
C)
10PB
52. Serverless SQL Pool
Overview
An interactive query service that
provides T-SQL queries over high
scale data in Azure Storage.
Benefits
• Pay-per-query with serverless model
• Query data in-place on the data lake
with T-SQL (no ETL)
• Supports various file formats
(Parquet, CSV, JSON)
• Integrates with Databricks,
HDInsight, Power BI, and the shared
Synapse metastore
[Diagram: serverless SQL (SQL on-demand) reads and writes data files in Azure Storage, curates and transforms data, syncs table definitions with the shared metastore, and serves queries to tools such as Power BI, Azure Data Studio, and SSMS.]
54. Serverless SQL – Querying CSV File
Overview
Uses OPENROWSET function to access data
Benefits
Ability to read CSV File with
- no header row, Windows style new line
- no header row, Unix-style new line
- header row, Unix-style new line
- header row, Unix-style new line, quoted
- header row, Unix-style new line, escape
- header row, Unix-style new line, tab-delimited
- without specifying all columns
SELECT *
FROM OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/csv/population/population.csv',
FORMAT = 'CSV',
FIELDTERMINATOR =',',
ROWTERMINATOR = '\n'
)
WITH (
[country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
[country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
[year] smallint,
[population] bigint
) AS [r]
WHERE
country_name = 'Luxembourg'
AND year = 2017
55. Serverless SQL – Querying folders
Overview
Uses OPENROWSET function to access data from
multiple files or folders
Benefits
• Offers reading multiple files/folders through usage
of wildcards
• Offers reading specific file/folder
• Supports use of multiple wildcards
SELECT YEAR(pickup_datetime) as [year], SUM(passenger_count) AS passengers_total,
COUNT(*) AS [rides_total]
FROM OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/csv/taxi/*.*',
FORMAT = 'CSV'
, FIRSTROW = 2 )
WITH (
vendor_id VARCHAR(100) COLLATE Latin1_General_BIN2,
pickup_datetime DATETIME2,
dropoff_datetime DATETIME2,
passenger_count INT,
trip_distance FLOAT,
rate_code INT,
store_and_fwd_flag VARCHAR(100) COLLATE Latin1_General_BIN2,
pickup_location_id INT,
dropoff_location_id INT,
payment_type INT,
fare_amount FLOAT,
extra FLOAT, mta_tax FLOAT,
tip_amount FLOAT,
tolls_amount FLOAT,
improvement_surcharge FLOAT,
total_amount FLOAT
) AS nyc
GROUP BY YEAR(pickup_datetime)
ORDER BY YEAR(pickup_datetime)
56. Serverless SQL – Querying specific files
Overview
filename() – returns the name of the file from which the
row originates
filepath() – returns the full path when called with no
parameter, or the fragment of the path that matched the
corresponding wildcard when a parameter is passed
Benefits
Provides source name/path of file/folder for
row result set
SELECT
r.filename() AS [filename]
,COUNT_BIG(*) AS [rows]
FROM OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-1*.csv',
FORMAT = 'CSV',
FIRSTROW = 2
)
WITH (
vendor_id INT,
pickup_datetime DATETIME2,
dropoff_datetime DATETIME2,
passenger_count SMALLINT,
trip_distance FLOAT,
<…columns>
) AS [r]
GROUP BY r.filename()
ORDER BY [filename]
Example of filename function
57. Serverless SQL – Querying specific files
SELECT
r.filepath() AS filepath
,r.filepath(1) AS [year]
,r.filepath(2) AS [month]
,COUNT_BIG(*) AS [rows]
FROM OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_*-*.csv',
FORMAT = 'CSV',
FIRSTROW = 2 )
WITH (
vendor_id INT,
pickup_datetime DATETIME2,
dropoff_datetime DATETIME2,
passenger_count SMALLINT,
trip_distance FLOAT,
<… columns>
) AS [r]
WHERE r.filepath(1) IN ('2017')
AND r.filepath(2) IN ('10', '11', '12')
GROUP BY r.filepath() ,r.filepath(1) ,r.filepath(2)
ORDER BY filepath
filepath year month rows
https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-10.csv 2017 10 9768815
https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-11.csv 2017 11 9284803
https://XXX.blob.core.windows.net/csv/taxi/yellow_tripdata_2017-12.csv 2017 12 9508276
Example of filepath function
58. Serverless SQL – Querying Parquet files
Overview
Uses OPENROWSET function to access data
Benefits
Ability to specify column names of interest
Offers auto reading of column names and data types
Provides target specific partitions using filepath function
SELECT
YEAR(pickup_datetime),
passenger_count,
COUNT(*) AS cnt
FROM
OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/parquet/taxi/*/*/*',
FORMAT='PARQUET'
) WITH (
pickup_datetime DATETIME2,
passenger_count INT
) AS nyc
GROUP BY
passenger_count,
YEAR(pickup_datetime)
ORDER BY
YEAR(pickup_datetime),
passenger_count
59. Serverless SQL – Querying JSON files
SELECT *
FROM
OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/json/books/book1.json',
FORMAT='CSV',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
)
WITH (
jsonContent varchar(8000)
) AS [r]
Overview
Reads JSON files and provides data in tabular
format
Benefits
Supports OPENJSON, JSON_VALUE and
JSON_QUERY functions
60. Serverless SQL – Querying JSON files
SELECT
JSON_QUERY(jsonContent, '$.authors') AS authors,
jsonContent
FROM
OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/json/books/*.json',
FORMAT='CSV',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
)
WITH (
jsonContent varchar(8000)
) AS [r]
WHERE
JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
SELECT
JSON_VALUE(jsonContent, '$.title') AS title,
JSON_VALUE(jsonContent, '$.publisher') as publisher,
jsonContent
FROM
OPENROWSET(
BULK 'https://XXX.blob.core.windows.net/json/books/*.json',
FORMAT='CSV',
FIELDTERMINATOR ='0x0b',
FIELDQUOTE = '0x0b',
ROWTERMINATOR = '0x0b'
)
WITH (
jsonContent varchar(8000)
) AS [r]
WHERE
JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
Example of JSON_QUERY function
Example of JSON_VALUE function
62. Transforming with Spark – Querying SQL Pools
import java.util.Properties

val jdbcUsername = "<SQL DB ADMIN USER>"
val jdbcPwd = "<SQL DB ADMIN PWD>"
val jdbcHostname = "servername.database.windows.net"
val jdbcPort = 1433
val jdbcDatabase = "<AZURE SQL DB NAME>"
val jdbc_url = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=60;"
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPwd}")
val sqlTableDf = spark.read.jdbc(jdbc_url, "dbo.Tbl1", connectionProperties)
// Construct a Spark DataFrame from a SQL Pool table
var df = spark.read.sqlanalytics("sql1.dbo.Tbl1")
// Write the Spark DataFrame into a SQL Pool table
df.write.sqlanalytics("sql1.dbo.Tbl2")
Existing Approach
New Approach Using
Scala
%%spark
val df = spark.read.sqlanalytics("sql1.dbo.Tbl1")
df.createOrReplaceTempView("tbl1")
%%pyspark
sample = spark.sql("SELECT * FROM tbl1")
sample.createOrReplaceTempView("tblnew")
%%spark
val df = spark.sql("SELECT * FROM tblnew")
df.write.sqlanalytics("sql1.dbo.tbl2",
Constants.INTERNAL)
Using Python
67. CCI vs Heap
• Transformations using heap tables are generally faster than CCI, because rows
must be assembled from columnstore segments when reading, and columnar
compression must be applied when writing to CCI targets.
• The wider the table and the more text fields it contains, the larger the advantage
of heap over CCI.
• Use Heap tables at transformation layer, use CCI tables where appropriate
at presentation layer
68. CCI Best Practice
• MAX data types not supported
• Table should hold at least 1 million rows × 60 distributions × number of partitions
• Load at least 100k rows per batch, up to 1 million
• Load using at least the LARGERC or STATICRC60 resource class
• Create a dedicated loading user (see the sketch below)
• Minimize UPDATE and DELETE (or REBUILD frequently)
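A minimal sketch of a dedicated loading user assigned to a larger resource class; the login and user names are hypothetical:
-- Run on the master database
CREATE LOGIN LoaderLogin WITH PASSWORD = '<Strong_Password_Here>';
-- Run on the dedicated SQL pool database
CREATE USER LoaderUser FOR LOGIN LoaderLogin;
GRANT ADMINISTER DATABASE BULK OPERATIONS TO LoaderUser; -- needed for COPY; also grant INSERT on the target tables
EXEC sp_addrolemember 'staticrc60', 'LoaderUser';        -- assign a static resource class for predictable load memory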
69. Automatic statistics management – Dedicated
SQL
Overview
Statistics are automatically created and maintained for dedicated
SQL pool. Incoming queries are analyzed, and individual column
statistics are generated on the columns that improve cardinality
estimates to enhance query performance.
Statistics are automatically updated as data modifications occur in
underlying tables. By default, these updates are synchronous but
can be configured to be asynchronous.
Statistics are considered out of date when:
• There was a data change on an empty table
• The number of rows in the table at time of statistics creation
was 500 or less, and more than 500 rows have been updated
• The number of rows in the table at time of statistics creation
was more than 500, and more than 500 + 20% of rows have
been updated
-- Turn on/off auto-create statistics settings
ALTER DATABASE {database_name}
SET AUTO_CREATE_STATISTICS { ON | OFF }
-- Turn on/off auto-update statistics settings
ALTER DATABASE {database_name}
SET AUTO_UPDATE_STATISTICS { ON | OFF }
-- Configure synchronous/asynchronous update
ALTER DATABASE {database_name}
SET AUTO_UPDATE_STATISTICS_ASYNC { ON | OFF }
-- Check statistics settings for a database
SELECT is_auto_create_stats_on,
is_auto_update_stats_on,
is_auto_update_stats_async_on
FROM sys.databases
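If needed, statistics can also be created or refreshed manually; a minimal sketch with hypothetical table and column names:
-- Create single-column statistics with a full scan, then refresh all statistics on the table
CREATE STATISTICS stats_OrderDate ON dbo.FactSales (OrderDate) WITH FULLSCAN;
UPDATE STATISTICS dbo.FactSales;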
70. Statistics (serverless SQL)
Automatic creation of statistics is available for Parquet and for CSV
The same applies to recreation of statistics
Only single-column statistics are currently supported
CSV sampling is not yet supported (only FULLSCAN)
Statistics can also be created manually; see the sketch below
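A minimal sketch of manual statistics creation in serverless SQL, assuming the sys.sp_create_openrowset_statistics procedure and reusing the earlier population.csv example (URL and column are illustrative):
EXEC sys.sp_create_openrowset_statistics N'
SELECT country_code
FROM OPENROWSET(
    BULK ''https://XXX.blob.core.windows.net/csv/population/population.csv'',
    FORMAT = ''CSV''
) WITH (
    [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2
) AS [r]';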
71. CTAS vs Insert / Update / Delete / Merge
• Prefer CTAS when you update or delete more than 10% of rows
• Prefer CTAS when you are updating or deleting a clustered columnstore index and do not have time for an offline rebuild (see the sketch below)
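A minimal sketch of the CTAS-instead-of-UPDATE pattern; the table, columns, and predicate are hypothetical:
-- Rewrite the table with the new values instead of updating in place
CREATE TABLE dbo.FactSales_New
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(CustomerKey))
AS
SELECT CustomerKey,
       OrderDate,
       CASE WHEN [Status] = 'Open' THEN 'Closed' ELSE [Status] END AS [Status],
       Amount
FROM dbo.FactSales;

-- Swap the rewritten table in
RENAME OBJECT dbo.FactSales TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;
DROP TABLE dbo.FactSales_Old;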
72. UPDATE FROM and DELETE FROM
• Azure Synapse Analytics does not currently support (*) joins in UPDATE FROM and DELETE FROM queries.
• Implement the join as a temporary / transient table, then UPDATE / DELETE from that table (see the sketch below)
(*) Coming soon
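A minimal sketch of the workaround with hypothetical tables; the join result is staged in a temporary table, then the DELETE uses a simple subquery:
-- Materialize the join result first
CREATE TABLE #rows_to_delete
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT f.OrderId
FROM dbo.FactSales f
JOIN dbo.DimCustomer c ON f.CustomerKey = c.CustomerKey
WHERE c.Country = 'UK';

-- Then delete without a join in the DELETE statement
DELETE FROM dbo.FactSales
WHERE OrderId IN (SELECT OrderId FROM #rows_to_delete);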
73. Simple is better than clever
• Persist standard columns early, to avoid calculations and functions in the WHERE clause
• Unroll CTEs and JOIN sub-selects into transient / temporary tables to manage distribution (see the sketch below)
• Simple queries are easier to tune and debug
74. Pop Quiz #2
What is the optimal size for a
rowgroup in columnstore format in a
Synapse SQL Pool?
A)
99,999
B)
60,000,000
C)
1,048,576
75. Pop Quiz #2
What is the optimal size for a
rowgroup in columnstore format in a
Synapse SQL Pool?
A)
99,999
B)
60,000,000
C)
1,048,576
76. Azure Synapse Analytics end-to-end flow
[Architecture diagram spanning INGEST, STORE, PREPARE, TRANSFORM & ENRICH, SERVE, and VISUALIZE: Data Sources are ingested with Synapse Pipelines into a Data Lake on an ADLS Gen2 Storage Account; data is prepared, transformed, and enriched with Synapse Pipelines, Synapse SQL (Serverless or Provisioned), or Synapse Spark; served with Synapse SQL (Provisioned); and visualized with Power BI, all within Azure Synapse Analytics.]
78. Break
Please take this time for a short 15-minute break
If at any time you require assistance, please send a message to the “Need help – ask here” channel in
the Microsoft Teams site for this event
Relax and come back refreshed for our next activity
80. Agenda
1 Performance Patterns
Result set caching and materialized
views.
2 Table design
Distributions, partitions, and table
types.
3 Index design
Clustered columnstore index and
ordered variant, clustered index, heap
and non-clustered index.
81. Azure Synapse Analytics
Limitless analytics service with unmatched time to insight
Synapse Analytics
Platform
Azure
Data Lake Storage
Common Data Model
Enterprise Security
Optimized for Analytics
Data lake integrated and
Common Data Model aware
METASTORE
SECURITY
MANAGEMENT
MONITORING
Integrated platform services
for management, security,
monitoring, and the metastore
DATA INTEGRATION
SQL
Analytics Runtimes
Integrated analytics runtimes
available in dedicated and serverless form factors
Synapse SQL offering T-SQL for
batch, streaming and interactive
processing
Apache Spark for big data
processing with Python, Scala
and .NET
DEDICATED SERVERLESS
Form Factors
SQL
Languages
Python .NET Java Scala
Multiple languages suited to
different analytics workloads
Experience Synapse Analytics Studio
SaaS developer experiences for
code free and code first
Artificial Intelligence / Machine Learning / Internet of
Things
Intelligent Apps / Business Intelligence
Designed for analytics workloads
at any scale
METASTORE
SECURITY
MANAGEMENT
MONITORING
85. Use result-set caching to improve query performance when the same
queries are executed repeatedly against mainly static data.
The result-set cache is invalidated and refreshed when the underlying table data
or the query code changes.
The result-set cache persists when the SQL pool is paused and resumed.
Result-set caching motivation
86. Overview
Cache the results of a query in SQL pool storage. This enables
interactive response times for repetitive queries against tables
with infrequent data changes.
The result-set cache persists even if SQL pool is paused and
resumed later.
Query cache is invalidated and refreshed when underlying table
data or query code changes.
Result cache is evicted regularly based on a time-aware least
recently used algorithm (TLRU).
Benefits
• Enhances performance when same result is requested
repetitively
• Reduced load on server for repeated queries
• Offers monitoring of query execution with a result cache hit or
miss
Result-set caching
-- Turn on/off result-set caching for a database
-- Must be run on the MASTER database
ALTER DATABASE {database_name}
SET RESULT_SET_CACHING { ON | OFF }
-- Turn on/off result-set caching for a client session
-- Run on target Azure Synapse Analytics
SET RESULT_SET_CACHING {ON | OFF}
-- Check result-set caching setting for a database
-- Run on target Azure Synapse Analytics
SELECT is_result_set_caching_on
FROM sys.databases
WHERE name = {database_name}
-- Return all query requests with cache hits
-- Run on target data warehouse
SELECT *
FROM sys.dm_pdw_request_steps
WHERE command like '%DWResultCacheDb%'
AND step_index = 0
87. Result-set caching flow
Client sends query to
SQL pool
1 Query is processed using compute nodes
which pull data from remote storage,
process query and output back to client
app
2 Query results are cached in remote
storage so subsequent requests can
be served immediately
0101010001
0100101010
01010100010
100101010
Subsequent executions for the same
query bypass compute nodes and can
be fetched instantly from persistent
cache in remote storage
3
01010100010
100101010
Remote storage cache is evicted regularly
based on time, cache usage, and any
modifications to underlying table data.
4 Cache will need to be
regenerated if query results
have been evicted from cache
5
88. Overview
A materialized view pre-computes, stores, and maintains its
data like a table.
Materialized views are automatically updated when data in
underlying tables are changed. This is a synchronous operation
that occurs as soon as the data is changed.
The auto caching functionality allows Azure Synapse Analytics
Query Optimizer to consider using indexed view even if the
view is not referenced in the query.
Supported aggregations: MAX, MIN, AVG, COUNT, COUNT_BIG,
SUM, VAR, STDEV
Benefits
• Automatic and synchronous data refresh with data changes
in base tables. No user action is required.
• High availability and resiliency as regular tables
Materialized views
-- Create indexed view
CREATE MATERIALIZED VIEW Sales.vw_Orders
WITH
(
DISTRIBUTION = ROUND_ROBIN |
HASH(ProductID)
)
AS
SELECT SUM(UnitPrice*OrderQty) AS Revenue,
OrderDate,
ProductID,
COUNT_BIG(*) AS OrderCount
FROM Sales.SalesOrderDetail
GROUP BY OrderDate, ProductID;
GO
-- Disable index view and put it in suspended mode
ALTER INDEX ALL ON Sales.vw_Orders DISABLE;
-- Re-enable index view by rebuilding it
ALTER INDEX ALL ON Sales.vw_Orders REBUILD;
89. In this example, a query to get the year total sales per customer is shown to have a lot of data
shuffles and joins that contribute to slow performance:
Materialized views - example
-- Get year total sales per customer
WITH year_total AS
(
SELECT customer_id,
first_name,
last_name,
birth_country,
login,
email_address,
d_year,
SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
FROM customer cust
JOIN catalog_sales sales ON cust.sk = sales.sk
JOIN date_dim ON sales.sold_date = date_dim.date
GROUP BY customer_id, first_name, last_name, birth_country, login, email_address, d_year
)
SELECT TOP 100 …
FROM year_total …
WHERE …
ORDER BY …
Execution time: 103 seconds
Lots of data shuffles and joins needed to complete query
No relevant indexed views created on the data
warehouse
90. Now, we add an indexed view to the data warehouse to increase the performance of the previous
query. This view can be leveraged by the query even though it is not directly referenced.
Indexed Materialized views - example
-- Create materialized (indexed) view for query
CREATE MATERIALIZED VIEW nbViewCS WITH (DISTRIBUTION = HASH(customer_id)) AS
SELECT customer_id,
first_name,
last_name,
birth_country,
login,
email_address,
d_year,
SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
FROM customer cust
JOIN catalog_sales sales ON cust.sk = sales.sk
JOIN date_dim ON sales.sold_date = date_dim.date
GROUP BY customer_id, first_name, last_name, birth_country, login, email_address, d_year
Create indexed view with hash distribution on customer_id column
-- Get year total sales per customer
WITH year_total AS
(
SELECT customer_id,
first_name,
last_name,
birth_country,
login,
email_address,
d_year,
SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
FROM customer cust
JOIN catalog_sales sales ON cust.sk = sales.sk
JOIN date_dim ON sales.sold_date = date_dim.date
GROUP BY customer_id, first_name, last_name, birth_country, login, email_address, d_year
)
SELECT TOP 100 …
FROM year_total …
WHERE …
ORDER BY …
Original query – get year total sales per customer
91. SQL pool query optimizer automatically leverages the indexed view to speed up the same query. Notice
that the query does not need to reference the view directly
Indexed (materialized) views - example
-- Get year total sales per customer
WITH year_total AS
(
SELECT customer_id,
first_name,
last_name,
birth_country,
login,
email_address,
d_year,
SUM(ISNULL(list_price - wholesale_cost - discount_amt + sales_price, 0)/2) AS year_total
FROM customer cust
JOIN catalog_sales sales ON cust.sk = sales.sk
JOIN date_dim ON sales.sold_date = date_dim.date
GROUP BY customer_id, first_name, last_name, birth_country, login, email_address, d_year
)
SELECT TOP 100 …
FROM year_total …
WHERE …
ORDER BY …
Original query – no changes have been made to query
Execution time: 6 seconds
Optimizer leverages materialized view to reduce data shuffles and joins needed
92. EXPLAIN - provides query plan for SQL statement
without running the statement; view estimated cost
of the query operations.
EXPLAIN WITH_RECOMMENDATIONS - provides
query plan with recommendations to optimize the
SQL statement performance.
Materialized views- Recommendations
EXPLAIN WITH_RECOMMENDATIONS
select count(*)
from (
(select distinct c_last_name, c_first_name, d_date
from store_sales, date_dim, customer
where store_sales.ss_sold_date_sk = date_dim.d_date_sk
and store_sales.ss_customer_sk = customer.c_customer_sk
and d_month_seq between 1194 and 1194+11)
except
(select distinct c_last_name, c_first_name, d_date
from catalog_sales, date_dim, customer
where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
and d_month_seq between 1194 and 1194+11)
) top_customers
93. Indexed Materialized Views
• Indexed views cache the schema and data for a view in DW remote storage.
They are useful for improving the performance of ‘SELECT’ statement queries
that include aggregations
• Indexed views are automatically updated when data in underlying tables are
changed. This is a synchronous operation that occurs as soon as the data is
changed.
• The auto caching functionality allows Synapse Query Optimizer to consider using
indexed view even if the view is not referenced in the query
• Supported aggregations: MAX, MIN, AVG, COUNT, COUNT_BIG, SUM, VAR, STDEV
95. CREATE TABLE dbo.OrderTable
(
OrderId INT NOT NULL,
Date DATE NOT NULL,
Name VARCHAR(2),
Country VARCHAR(2)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH([OrderId]) |
ROUND_ROBIN |
REPLICATED
);
Round-robin distributed
Distributes table rows evenly across all
distributions at random.
Hash distributed
Distributes table rows across the Compute nodes
by using a deterministic hash function to assign
each row to one distribution.
Replicated
Full copy of table accessible on each Compute
node.
Tables – Distributions
96. CREATE TABLE partitionedOrderTable
(
OrderId INT NOT NULL,
Date DATE NOT NULL,
Name VARCHAR(2),
Country VARCHAR(2)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH([OrderId]),
PARTITION (
[Date] RANGE RIGHT FOR VALUES (
'2000-01-01', '2001-01-01', '2002-01-01',
'2003-01-01', '2004-01-01', '2005-01-01'
)
)
);
Overview
Table partitions divide data into smaller groups
In most cases, partitions are created on a date
column
Supported on all table types
RANGE RIGHT – Used for time partitions
RANGE LEFT – Used for number partitions
Benefits
• Improves efficiency and performance of
loading and querying by limiting the scope to
subset of data.
• Offers significant query performance
enhancements where filtering on the partition
key can eliminate unnecessary scans and
eliminate IO.
Tables – Partitions
97. OrderId Date Name Country
85016 11-2-2018 V UK
85018 11-2-2018 Q SP
85216 11-2-2018 Q DE
85395 11-2-2018 V NL
82147 11-2-2018 Q FR
86881 11-2-2018 D UK
93080 11-3-2018 R UK
94156 11-3-2018 S FR
96250 11-3-2018 Q NL
98799 11-3-2018 R NL
98015 11-3-2018 T UK
98310 11-3-2018 D DE
98979 11-3-2018 Z DE
98137 11-3-2018 T FR
… … … …
Logical table structure
Tables – Distributions & Partitions
Physical data distribution
( Hash distribution (OrderId), Date partitions )
OrderId Date Name Country
85016 11-2-2018 V UK
85018 11-2-2018 Q SP
85216 11-2-2018 Q DE
85395 11-2-2018 V NL
82147 11-2-2018 Q FR
86881 11-2-2018 D UK
… … … …
OrderId Date Name Country
93080 11-3-2018 R UK
94156 11-3-2018 S FR
96250 11-3-2018 Q NL
98799 11-3-2018 R NL
98015 11-3-2018 T UK
98310 11-3-2018 D DE
98979 11-3-2018 Z DE
98137 11-3-2018 T FR
… … … …
11-2-2018 partition
11-3-2018 partition
x 60 distributions (shards)
Distribution1
(OrderId 80,000 – 100,000)
…
• Each shard is partitioned with the same
date partitions
• A minimum of 1 million rows per
distribution and partition is needed for
optimal compression and performance
of clustered Columnstore tables
98. Common table distribution methods
Table Category Recommended Distribution Option
Fact
Use hash-distribution with clustered columnstore index. Performance improves because
hashing enables the platform to localize certain operations within the node itself during query
execution.
Operations that benefit:
COUNT(DISTINCT( <hashed_key> ))
OVER PARTITION BY <hashed_key>
most JOIN <table_name> ON <hashed_key>
GROUP BY <hashed_key>
Dimension
Use replicated for smaller tables. If tables are too large to store on each Compute node, use
hash-distributed.
Staging
Use round-robin for the staging table. The load with CTAS is faster. Once the data is in the staging table, use CTAS to move it to a production table with the appropriate distribution.
99. Question…
Why is it a best practice to load into a staging table, and then CTAS into the production table?
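One way to picture the answer, as a minimal sketch with hypothetical names (assuming an external table ext.FactSales as the source): the load lands quickly in a round-robin heap staging table, and the expensive redistribution and columnstore compression happen once, in a single fully parallel CTAS into the production table.
-- Fast load into a round-robin heap staging table
CREATE TABLE stg.FactSales
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.FactSales;

-- One parallel CTAS redistributes and compresses into the production table
CREATE TABLE dbo.FactSales
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(CustomerKey))
AS SELECT * FROM stg.FactSales;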
101. Hash Distribution:
Large fact tables exceeding several GBs with frequent inserts should use a hash
distribution.
Round Robin Distribution:
Potentially useful tables created from raw input.
Temporary staging tables used in data preparation.
Replicated Tables:
Lookup tables that range in size from hundreds of MBs to 1.5 GB should be replicated.
Works best when table size is less than 2 GB compressed.
Distributed table design recommendations
102. Automatic statistics management – Dedicated
SQL
Overview
Statistics are automatically created and maintained for dedicated
SQL pool. Incoming queries are analyzed, and individual column
statistics are generated on the columns that improve cardinality
estimates to enhance query performance.
Statistics are automatically updated as data modifications occur in
underlying tables. By default, these updates are synchronous but
can be configured to be asynchronous.
Statistics are considered out of date when:
• There was a data change on an empty table
• The number of rows in the table at time of statistics creation
was 500 or less, and more than 500 rows have been updated
• The number of rows in the table at time of statistics creation
was more than 500, and more than 500 + 20% of rows have
been updated
-- Turn on/off auto-create statistics settings
ALTER DATABASE {database_name}
SET AUTO_CREATE_STATISTICS { ON | OFF }
-- Turn on/off auto-update statistics settings
ALTER DATABASE {database_name}
SET AUTO_UPDATE_STATISTICS { ON | OFF }
-- Configure synchronous/asynchronous update
ALTER DATABASE {database_name}
SET AUTO_UPDATE_STATISTICS_ASYNC { ON | OFF }
-- Check statistics settings for a database
SELECT is_auto_create_stats_on,
is_auto_update_stats_on,
is_auto_update_stats_async_on
FROM sys.databases
104. Too many partitions
• Partitions can be useful when maintaining current rows in very large fact tables. Partition switching is a good alternative to a full CTAS (see the sketch below).
• Partitioning CCIs is only useful when the row count is greater than 60 million × #partitions
• In general, avoid partitions, particularly in POCs
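A minimal sketch of partition switching with hypothetical tables; both tables must share the same schema, distribution, and partition boundaries:
-- Swap a freshly loaded staging partition into the fact table as a metadata-only operation
ALTER TABLE dbo.FactSales_Staging SWITCH PARTITION 2 TO dbo.FactSales PARTITION 2;
-- Add WITH (TRUNCATE_TARGET = ON) to overwrite a non-empty target partition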
105. Views on Views
• Views on Views will not support performance optimization using
Materialized Views (more later)
• Views cannot be distributed
108. -- Create table with index
CREATE TABLE orderTable
(
OrderId INT NOT NULL,
Date DATE NOT NULL,
Name VARCHAR(2),
Country VARCHAR(2)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX |
HEAP |
CLUSTERED INDEX (OrderId)
);
-- Add non-clustered index to table
CREATE INDEX NameIndex ON orderTable (Name);
Clustered Columnstore index (Default Primary)
Highest level of data compression
Best overall query performance
Clustered index (Primary)
Performant for looking up a single to few rows
Heap (Primary)
Faster loading and landing temporary data
Best for small lookup tables
Nonclustered indexes (Secondary)
Enable ordering of multiple columns in a table
Allows multiple nonclustered on a single table
Can be created on any of the above primary indexes
More performant lookup queries
Tables – Indexes
109. OrderId Date Name Country
98137 11-3-2018 T FR
98310 11-3-2018 D DE
98799 11-3-2018 R NL
OrderId Date Name Country
82147 11-2-2018 Q FR
85016 11-2-2018 V UK
85018 11-2-2018 Q SP
OrderId Date Name Country
85016 11-2-2018 V UK
85018 11-2-2018 Q SP
85216 11-2-2018 Q DE
85395 11-2-2018 V NL
82147 11-2-2018 Q FR
86881 11-2-2018 D UK
93080 11-3-2018 R UK
94156 11-3-2018 S FR
96250 11-3-2018 Q NL
98799 11-3-2018 R NL
98015 11-3-2018 T UK
98310 11-3-2018 D DE
98979 11-3-2018 Z DE
98137 11-3-2018 T FR
… … … …
Logical table structure
OrderId
82147
85016
85018
85216
85395
Date
11-2-2018
Country
FR
UK
SP
DE
NL
Name
Q
V
Rowgroup1
Min (OrderId): 82147 | Max (OrderId): 85395
OrderId Date Name Country
98137 11-3-2018 T FR
98310 11-3-2018 D DE
98799 11-3-2018 R NL
98979 11-3-2018 Z DE
Delta Rowstore
SQL Analytics Columnstore Tables
Clustered columnstore index
(OrderId)
…
• Data stored in compressed columnstore segments after
being sliced into groups of rows (rowgroups/micro-partitions)
for maximum compression
• Rows are stored in the delta rowstore until the number of
rows is large enough to be compressed into a
columnstore
Clustered/Non-clustered rowstore index
(OrderId)
• Data is stored in a B-tree index structure for performant
lookup queries for particular rows.
• Clustered rowstore index: The leaf nodes in the structure
store the data values in a row (as pictured above)
• Non-clustered (secondary) rowstore index: The leaf nodes
store pointers to the data values, not the values
themselves
+
OrderId PageId
82147 1001
98137 1002
OrderId PageId
82147 1005
85395 1006
OrderId PageId
98137 1007
98979 1008
OrderId Date Name Country
82147 11-2-2018 Q FR
85016 11-2-2018 V UK
85018 11-2-2018 Q SP
OrderId Date Name Country
98137 11-3-2018 T FR
98310 11-3-2018 D DE
98799 11-3-2018 R NL
… …
110. Overview
Queries against tables with ordered columnstore segments can take advantage of improved
segment elimination to drastically reduce the time needed to service a query.
Ordered Clustered Columnstore Indexes
-- Insert data into table with ordered columnstore index
INSERT INTO sortedOrderTable
VALUES (1, '01-01-2019', 'Dave', 'UK')
-- Create Table with Ordered Columnstore Index
CREATE TABLE sortedOrderTable
(
OrderId INT NOT NULL,
Date DATE NOT NULL,
Name VARCHAR(2),
Country VARCHAR(2)
)
WITH
(
CLUSTERED COLUMNSTORE INDEX ORDER (OrderId)
)
-- Create Clustered Columnstore Index on existing table
CREATE CLUSTERED COLUMNSTORE INDEX cciOrderId
ON dbo.OrderTable ORDER (OrderId)
111. Ordered CCI
• Queries against tables with ordered columnstore segments can take
advantage of improved segment elimination to drastically reduce the
time needed to service a query.
• Columnstore Segments are automatically updated as data is inserted,
updated, or deleted in data warehouse tables.
112. Clustered Columnstore indexes (CCI) are best for fact tables.
CCI offer the highest level of data compression and best query performance for tables
with over 100 million rows.
Heap tables are best for small lookup tables and recommended for tables with less
than 100 million rows.
Clustered Indexes may outperform CCI when very few rows need to be retrieved
quickly.
Add non-clustered indexes to improve performance for less selective queries.
Each additional index added to a table increases storage space required and processing time during
data loads.
Speed load performance by staging data in heap tables and temporary tables prior to
running transformations.
Choosing the right index
114. Too many indexes
• Start without indexes. The overhead of maintaining them can be
greater than their value.
• A primary-key non-clustered index may improve performance of joins
when fact tables are joined to very large (billion+) dimensions
115. Pop Quiz
Match the tables with their
recommended index! dbo.LineItem
30B rows
Primary fact table
dbo.Sales
150M rows
Single sales lookups
stg.stagingLineItem
10k rows
Staging table for loads
dbo.dimProduct
1.5k rows
Product information
Heap
CCI
CI
116. Pop Quiz
Match the tables with their
recommended index! dbo.LineItem
30B rows
Primary fact table
dbo.Sales
150M rows
Single sales lookups
stg.stagingLineItem
10k rows
Staging table for loads
dbo.dimProduct
1.5k rows
Product information
Heap
CCI
CI
117. Azure Synapse Analytics end-to-end flow
[Architecture diagram spanning INGEST, STORE, PREPARE, TRANSFORM & ENRICH, SERVE, and VISUALIZE: Data Sources are ingested with Synapse Pipelines into a Data Lake on an ADLS Gen2 Storage Account; data is prepared, transformed, and enriched with Synapse Pipelines, Synapse SQL (Serverless or Provisioned), or Synapse Spark; served with Synapse SQL (Provisioned); and visualized with Power BI, all within Azure Synapse Analytics.]
119. Break
Please take this time for a short 15-minute break
If at any time you require assistance, please send a message to the “Need help – ask here” channel in
the Microsoft Teams site for this event
Relax and come back refreshed for our next activity
120. Your challenge should you
choose to accept it:
Wide World Importers
needs your help!
Work as a team and
prove to them you have
what it takes.
It won’t be easy alone,
but as ONE Microsoft,
you can do this!
Complete as many of the
challenges as you can,
but don’t worry about
getting them all done.
This Photo by Unknown Author is licensed under CC BY
121. Step 1: Go to your
table group channel
Step 2: You will use your
cloud lab login for Azure
Step 3: Log in to Azure. Engage your table group via a Microsoft Teams “meet now” call to
collaborate with each other (meet now button in upper right corner of application)
POC Challenges 1 & 2: group-based lab exercises
GUIDE:
PoC Challenge
https://github.com/solliancenet/azure-synapse-analytics-workshop-300-2-day
122. Day 1 Wrap Up
What was covered today:
• Design, optimize, and secure a file system within ADLS Gen2
• Decide which Azure Synapse Analytics component to use for specific data engineering scenarios
• Implement optimization strategies for the data warehouse using SQL-based approaches in Azure Synapse Analytics.
What we will learn tomorrow:
• Address scenarios to monitor and manage Azure solutions
• Apply security concepts to a customer scenario
Thank you for your participation in the Day 1 Technical Boot
Camp
If at any time you require assistance, please send a message to the “Need help – ask here” channel in
the Microsoft Teams site for this event
123. We’d love to hear from you!
Day 1
1. Scan the QR code below using your smartphone or access the link https://aka.ms/BC_EMEA_Day1
2. You’ll also receive a link to this survey by email. You only need to complete it once.
3. Answer the survey questions to provide your feedback on Day 1 of Boot Camp. Be honest. Every bit of
feedback helps us improve.
#26: What specific approach would you say is the most efficient way for moving flat file data from the ingest storage locations to the data lake?
Follow the pattern of landing data in the data lake first: create pipelines that extract the source data and store it in Azure Data Lake Store Gen2 as Parquet files, then ingest from those flat files into relational tables within the data warehouse.
What storage service would you recommend to use?
They should use Azure Data Lake Store (ADLS) Gen2 (Azure Storage with hierarchical file systems).
#27: How would you recommend to structure the folder to manage the data at the various levels of refinement?
They should use Azure Data Lake Store (ADLS) Gen2 (Azure Storage with hierarchical file systems).
In ADLS, it is a best practice to have a dedicated Storage Account for production, and a separate Storage Account for dev and test workloads. This will ensure that dev or test workloads never interfere with production.
One common folder structure is to organize the data in separate folders by degree of refinement. For example a bronze folder contains the raw data, silver contains the cleaned, prepared and integrated data and gold contains data ready to support analytics, which might include final refinements such as pre-computed aggregates.
#31: When it comes to ingesting raw data in batch from new data sources, what data formats are supported by Synapse?
CSV, Parquet, ORC, JSON
How do you ingest streaming data?
Collect messages in Event Hub or IoT Hub and process them with Stream Analytics.
Azure offers purpose-built stream ingestion services such as Kafka on Azure HDInsight and Azure Event Hubs that are robust, proven, and performant.
(Preview) Azure Synapse will also support native stream ingestion through integration with Azure Stream Analytics.
When it comes to storing refined versions of the data for possible querying, what data format would you recommend they use? Why?
Parquet. There is industry alignment around the Parquet format for sharing data at the storage layer (e.g., across Hadoop, Databricks, and SQL engine scenarios). Parquet is a high-performance, column oriented format optimized for big data scenarios.
#33: https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql-data-warehouse#polybase-troubleshooting <- a good resource to share out as well
Lino 68
#34: What should you use for the fastest loading of staging tables?
A heap table. If you are loading data only to stage it before running more transformations, loading the data into a heap table is much faster than loading it into a clustered columnstore table.
A temporary table. Loading data to a temporary table loads faster than loading a table to permanent storage.
#35: How should you configure table distribution for tables created from the raw input that might be useful, or tables used for staging?
Consider round-robin distribution.
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.
As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. This extra step can slow down your queries. For example, joining a round-robin table usually requires reshuffling the rows, which is a performance hit.
Consider using the round-robin distribution for your table in the following scenarios:
When getting started as a simple starting point since it is the default
If there is no obvious joining key
If there is no good candidate column for hash distributing the table
If the table does not share a common join key with other tables
If the join is less significant than other joins in the query
When the table is a temporary staging table
#38: FALSE – only Polybase requires CONTROL permission
#62: Important Points:
No need to pre-create the table in the SQL Pool to write to, it will be created if it does not exist.
sqlanalytics only works in Scala language cells.
To read with other languages like Python, use Spark to register a temporary view and then query the view using Spark.SQL(“select * from viewname”)
To write with other languages, create a view for table you want write, then query from that view in Scala. Then write.
#69: By default, auto-update statistics are synchronous but can be configured to be asynchronous operations.
Statistics are recalculated when there are changes of more than 500 rows or more than 20% of rows are updated.
Statistics are updated opportunistically when queries are run.
#70: CSV support added [Cost management for serverless SQL pool - Azure Synapse Analytics | Microsoft Docs]
When statistics are created for a Parquet column, only the relevant column is read from files. When statistics are created for a CSV column, whole files are read and parsed.
#75: 1,048,576 – this is why we recommend at least 60 million rows for a CCI table. Azure Synapse automatically distributes data into 60 distributions, and each distribution needs at least 1 million rows for good rowgroup compression (or each partition, if your table is partitioned).
#83: Answer: PERFORMANCE for both! A well designed and optimized Azure Synapse can blow away the competition but falling prey to common performance pitfalls can hurt our chances.
#85: Their downstream reports are used by many users, which often means the same query is being executed repeatedly against data that does not change that often. What can WWI do to improve the performance of these types of queries? How does this approach work when the underlying data changes?
They should consider result-set caching.
Cache the results of a query in provisioned Azure Synapse SQL Pool storage. This enables interactive response times for repetitive queries against tables with infrequent data changes.
The result-set cache persists even if SQL pool is paused and resumed later.
Query cache is invalidated and refreshed when underlying table data or query code changes.
Result cache is evicted regularly based on a time-aware least recently used algorithm (TLRU).
#86: Result-set caching
The maximum size of the result-set cache is 1TB
Query results are persisted for a maximum of 48 hours but can be evicted earlier to save space based on the least recently used result
Disabled on DW by default unless turned on at a session level or the entire database level
Additional storage costs are incurred by caching query result sets
Check the is_result_set_caching column in the sys.databases DMV to show the result-set caching setting for a database
Users can tell if a query was executed with a result cache hit or miss by querying sys.dm_pdw_request_steps for commands where the value is like '%DWResultCacheDb%'
#87: Result-set caching
The maximum size of the result-set cache is 1TB
Query results are persisted for a maximum of 48 hours but can be evicted earlier to save space based on the least recently used result
Disabled on DW by default unless turned on at a session level or the entire database level
Additional storage costs are incurred by caching query result sets
Check the is_result_set_caching column in the sys.databases DMV to show the result-set caching setting for a database
Users can tell if a query was executed with a result cache hit or miss by querying sys.dm_pdw_request_steps for commands where the value is like '%DWResultCacheDb%'
#88: Current Limitations:
If MIN/MAX aggregates are used in the SELECT list, the indexed view will automatically be disabled when UPDATE and DELETE occur in the referenced base tables. Run ALTER INDEX with REBUILD to re-enable the indexed view
Only INNER JOIN is supported
Only HASH and ROUND_ROBIN distributions are supported
Only CLUSTERED COLUMNSTORE INDEX is supported
ALTER VIEW is not supported
#95: A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm.
Hash distributed
A hash-distributed table distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.
Since identical values always hash to the same distribution, the data warehouse has built-in knowledge of the row locations. SQL Data Warehouse uses this knowledge to minimize data movement during queries, which improves query performance.
Hash-distributed tables work well for large fact tables in a star schema. They can have very large numbers of rows and still achieve high performance. There are, of course, some design considerations that help you to get the performance the distributed system is designed to provide. Choosing a good distribution column is one such consideration that is described in this article.
Consider using a hash-distributed table when:
The table size on disk is more than 2 GB.
The table has frequent insert, update, and delete operations.
Round-robin distributed
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.
As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. This extra step can slow down your queries. For example, joining a round-robin table usually requires reshuffling the rows, which is a performance hit.
Consider using the round-robin distribution for your table in the following scenarios:
When getting started, since round robin is the default and a simple starting point
If there is no obvious joining key
If there is no good candidate column for hash-distributing the table
If the table does not share a common join key with other tables
If the join is less significant than other joins in the query
When the table is a temporary staging table
Replicated Tables
A replicated table has a full copy of the table accessible on each Compute node. Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation. Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed.
Replicated tables work well for small dimension tables in a star schema. Dimension tables are usually of a size that makes it feasible to store and maintain multiple copies. Dimensions store descriptive data that changes slowly, such as customer name and address, and product details. The slowly changing nature of the data leads to fewer rebuilds of the replicated table.
Consider using a replicated table when:
The table size on disk is less than 2 GB, regardless of the number of rows. To find the size of a table, you can use the DBCC PDW_SHOWSPACEUSED command: DBCC PDW_SHOWSPACEUSED('ReplTableCandidate').
The table is used in joins that would otherwise require data movement. When joining tables that are not distributed on the same column, such as a hash-distributed table to a round-robin table, data movement is required to complete the query. If one of the tables is small, consider a replicated table. We recommend using replicated tables instead of round-robin tables in most cases. To view data movement operations in query plans, use sys.dm_pdw_request_steps. The BroadcastMoveOperation is the typical data movement operation that can be eliminated by using a replicated table.
Source: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/design-guidance-for-replicated-tables
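As a minimal DDL sketch of the three distribution options described above (table and column names are illustrative, not from the workshop schema):

-- Large fact table: hash-distributed on a commonly joined key.
CREATE TABLE dbo.FactSale
(   SaleKey BIGINT NOT NULL, CustomerKey INT NOT NULL, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX );

-- Staging table with no obvious join key: round robin.
CREATE TABLE dbo.Staging_Sale
(   SaleKey BIGINT, CustomerKey INT, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- Small dimension (under 2 GB compressed): replicated to every Compute node.
CREATE TABLE dbo.DimCity
(   CityKey INT NOT NULL, CityName NVARCHAR(100) )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );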
#96: What are table partitions?
Table partitions enable you to divide your data into smaller groups of data. In most cases, table partitions are created on a date column. Partitioning is supported on all SQL Data Warehouse table types, including clustered columnstore, clustered index, and heap. Partitioning is also supported on all distribution types, including both hash and round-robin distributed tables.
Partitioning can benefit data maintenance and query performance. Whether it benefits both or just one is dependent on how data is loaded and whether the same column can be used for both purposes, since partitioning can only be done on one column.
Benefits to loads
The primary benefit of partitioning in SQL Data Warehouse is to improve the efficiency and performance of loading data by using partition deletion, switching, and merging. In most cases data is partitioned on a date column that is closely tied to the order in which the data is loaded into the database. One of the greatest benefits of using partitions to maintain data is the avoidance of transaction logging. While simply inserting, updating, or deleting data can be the most straightforward approach, with a little thought and effort, using partitioning during your load process can substantially improve performance.
Partition switching can be used to quickly remove or replace a section of a table. For example, a sales fact table might contain just the data for the past 36 months. At the end of every month, the oldest month of sales data is deleted from the table. This data could be removed with a DELETE statement for the oldest month. However, deleting a large amount of data row by row can take too much time, as well as create the risk of large transactions that take a long time to roll back if something goes wrong. A more optimal approach is to drop the oldest partition of data. Where deleting the individual rows could take hours, deleting an entire partition could take seconds.
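A minimal sketch of the switch-out pattern, assuming a monthly partitioned fact table (all names and boundary values are illustrative):

-- Partitioned, hash-distributed fact table.
CREATE TABLE dbo.FactSale
(   SaleKey BIGINT NOT NULL, CustomerKey INT NOT NULL,
    SaleDate DATE NOT NULL, Amount DECIMAL(18,2) NOT NULL )
WITH ( DISTRIBUTION = HASH(CustomerKey),
       CLUSTERED COLUMNSTORE INDEX,
       PARTITION ( SaleDate RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01') ) );

-- Archive table with the same structure and partition scheme.
CREATE TABLE dbo.FactSale_Archive
(   SaleKey BIGINT NOT NULL, CustomerKey INT NOT NULL,
    SaleDate DATE NOT NULL, Amount DECIMAL(18,2) NOT NULL )
WITH ( DISTRIBUTION = HASH(CustomerKey),
       CLUSTERED COLUMNSTORE INDEX,
       PARTITION ( SaleDate RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01') ) );

-- Switch an old month (here partition 2 = January 2023) out in seconds
-- instead of deleting it row by row.
ALTER TABLE dbo.FactSale SWITCH PARTITION 2 TO dbo.FactSale_Archive PARTITION 2;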
Benefits to queries
Partitioning can also be used to improve query performance. A query that applies a filter to partitioned data can limit the scan to only the qualifying partitions. This method of filtering can avoid a full table scan and only scan a smaller subset of data. With the introduction of clustered columnstore indexes, the performance benefit of predicate elimination is less pronounced, but in some cases it can still help queries. For example, if the sales fact table is partitioned into 36 months using the sales date field, then queries that filter on the sale date can skip searching in partitions that don't match the filter.
Sizing partitions
While partitioning can be used to improve performance in some scenarios, creating a table with too many partitions can hurt performance under some circumstances. These concerns are especially true for clustered columnstore tables. For partitioning to be helpful, it is important to understand when to use partitioning and the number of partitions to create. There is no hard and fast rule as to how many partitions are too many; it depends on your data and how many partitions you are loading simultaneously. A successful partitioning scheme usually has tens to hundreds of partitions, not thousands.
When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. Before partitions are created, SQL Data Warehouse already divides each table into 60 distributed databases. Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact table contained 36 monthly partitions, and given that SQL Data Warehouse has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition. For more information, see the Indexing article, which includes queries that can assess the quality of cluster columnstore indexes.
Source: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition
#99: Answer: Because the riskiest part of the load is moving data from external storage into the SQL pool, loading first into a round-robin heap staging table ensures that the riskiest part of the load happens the fastest!
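A minimal sketch of that staging pattern (the storage URL, table, and column names are placeholders):

-- Round-robin heap staging table: the fastest target for the raw load.
CREATE TABLE dbo.Staging_Sale
(   SaleKey BIGINT, CustomerKey INT, SaleDate DATE, Amount DECIMAL(18,2) )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- Parallel load from the data lake into the staging table.
COPY INTO dbo.Staging_Sale
FROM 'https://wwistorage.dfs.core.windows.net/raw/sales/*.parquet'
WITH ( FILE_TYPE = 'PARQUET', CREDENTIAL = (IDENTITY = 'Managed Identity') );

-- Reshape into the final hash-distributed columnstore table with CTAS.
CREATE TABLE dbo.FactSale_Loaded
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM dbo.Staging_Sale;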
#101: What are the typical issues they should look out for with regards to distributed table design for the following scenarios?
Their smallest fact table exceeds several gigabytes and, by its nature, experiences frequent inserts.
They should use a hash distribution.
A hash-distributed table distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.
Since identical values always hash to the same distribution, the data warehouse has built-in knowledge of the row locations. SQL Data Warehouse uses this knowledge to minimize data movement during queries, which improves query performance.
Hash-distributed tables work well for large fact tables in a star schema. They can have very large numbers of rows and still achieve high performance.
Consider using a hash-distributed table when:
The table size on disk is more than 2 GB.
The table has frequent insert, update, and delete operations.
As they develop the data warehouse, the WWI data team identified some tables created from the raw input that might be useful, but they don’t currently join to other tables and they are not sure of the best columns they should use for distributing the data.
They should consider round-robin distribution.
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.
As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. This extra step can slow down your queries. For example, joining a round-robin table usually requires reshuffling the rows, which is a performance hit.
Consider using the round-robin distribution for your table in the following scenarios:
When getting started, since round robin is the default and a simple starting point
If there is no obvious joining key
If there is no good candidate column for hash-distributing the table
If the table does not share a common join key with other tables
If the join is less significant than other joins in the query
When the table is a temporary staging table
Their data engineers sometimes use temporary staging tables in their data preparation.
They should use a round-robin distributed table.
They have lookup tables that range from several hundred MB to 1.5 GB.
They should consider using replicated tables.
A replicated table has a full copy of the table accessible on each Compute node. Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation. Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed.
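As a hedged sketch of how an existing lookup table could be rebuilt as a replicated table using CTAS and a name swap (all object names are illustrative):

-- Rebuild the lookup table with REPLICATE distribution.
CREATE TABLE dbo.DimStore_Replicated
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM dbo.DimStore;

-- Swap names so downstream queries keep working.
RENAME OBJECT dbo.DimStore TO DimStore_Old;
RENAME OBJECT dbo.DimStore_Replicated TO DimStore;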
#102: The auto-update statistics option can be changed from synchronous to asynchronous
Statistics are recalculated when more than 500 rows have changed or 20% of the rows are updated
By default, auto-update statistics runs synchronously but can be configured to run asynchronously (see the sketch below)
Statistics are updated opportunistically when queries are run.
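A minimal sketch, assuming the database supports the SQL Server-style asynchronous option referenced above (the database, table, and statistics names are illustrative):

-- Create and refresh statistics on a column used in joins and filters.
CREATE STATISTICS stats_FactSale_CustomerKey ON dbo.FactSale (CustomerKey);
UPDATE STATISTICS dbo.FactSale;

-- Assumption: switch auto-update of statistics to asynchronous at the database level.
ALTER DATABASE [WWI_DW] SET AUTO_UPDATE_STATISTICS_ASYNC ON;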
#106: Clustered Columnstore, Clustered index (and non-clustered index), Heap
#108: Clustered columnstore indexes
By default, SQL Data Warehouse creates a clustered columnstore index when no index options are specified on a table. Clustered columnstore tables offer both the highest level of data compression as well as the best overall query performance. Clustered columnstore tables will generally outperform clustered index or heap tables and are usually the best choice for large tables. For these reasons, clustered columnstore is the best place to start when you are unsure of how to index your table.
There are a few scenarios where clustered columnstore may not be a good option:
Columnstore tables do not support varchar(max), nvarchar(max) and varbinary(max). Consider heap or clustered index instead.
Columnstore tables may be less efficient for transient data. Consider heap and perhaps even temporary tables.
Small tables with less than 100 million rows. Consider heap tables.
Clustered and nonclustered indexes
Clustered indexes may outperform clustered columnstore tables when a single row needs to be quickly retrieved. For queries where a single row or very few rows must be looked up with extreme speed, consider a clustered index or a nonclustered secondary index. The disadvantage of using a clustered index is that the only queries that benefit are the ones that use a highly selective filter on the clustered index column. To improve filtering on other columns, a nonclustered index can be added to those columns. However, each index added to a table adds both space and processing time to loads.
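A minimal sketch of that point-lookup pattern (table, column, and index names are illustrative):

-- Clustered index on the highly selective lookup key.
CREATE TABLE dbo.DimCustomerLookup
(   CustomerKey INT NOT NULL, CustomerCode NVARCHAR(20) NOT NULL, CustomerName NVARCHAR(200) )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED INDEX (CustomerKey) );

-- Nonclustered secondary index so filters on another column also stay fast.
CREATE INDEX ix_DimCustomerLookup_Code ON dbo.DimCustomerLookup (CustomerCode);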
Heap tables
When you are temporarily landing data in SQL Data Warehouse, you may find that using a heap table makes the overall process faster. This is because loads to heaps are faster than loads to indexed tables, and in some cases the subsequent read can be done from cache. If you are loading data only to stage it before running more transformations, loading into a heap table is much faster than loading into a clustered columnstore table. In addition, loading data into a temporary table is faster than loading into permanent storage.
For small lookup tables with less than 100 million rows, heap tables often make sense. Clustered columnstore tables begin to achieve optimal compression once there are more than 100 million rows.
Source: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-index
#112: Their sales transaction dataset exceeds a billion rows. For their downstream reporting queries, they need to be able to join, project, and filter these rows in no more than tens of seconds. WWI is concerned their data is just too big to do this.
What specific indexing techniques should they use to reach this kind of performance for their fact tables? Why?
Clustered Columnstore Indexes. As they offer the highest level of data compression and best overall query performance, columnstore indexes are usually the best choice for large tables such as fact tables.
Would you recommend the same approach for tables they have with less than 100 million rows?
No. For "small" tables with less than 100 million rows, they should consider Heap tables.
How should they configure indexes on their smaller lookup tables (e.g., those that contain store names and addresses)?
They should consider using heap tables. For small lookup tables with less than 100 million rows, heap tables often make sense. Clustered columnstore tables begin to achieve optimal compression once there are more than 100 million rows.
What would you suggest for their larger lookup tables that are used just for point lookups that retrieve only a single row? How could they make these more flexible so that queries filtering against different sets of columns would still yield efficient lookups?
Use clustered indexes. Clustered indexes may outperform clustered columnstore tables when a single row needs to be quickly retrieved. For queries where a single row or a very small number of rows must be retrieved with extreme speed, consider a clustered index or a non-clustered secondary index.
The disadvantage to using a clustered index is that the only queries that benefit are the ones that use a highly selective filter on the clustered index column. To improve filter performance on other columns, a non-clustered index can be added to other columns.
However, be aware that each index which is added to a table adds both space and processing time to data loads.
What should they use for the fastest loading of staging tables?
A heap table. If you are loading data only to stage it before running more transformations, loading into a heap table is much faster than loading into a clustered columnstore table.
A temporary table. Loading data into a temporary table is faster than loading into permanent storage (a minimal sketch follows).
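A minimal sketch of a session-scoped temporary staging heap, under the assumptions above (names are illustrative):

-- Session-scoped temporary heap staging table created with CTAS;
-- WHERE 1 = 0 copies only the column structure, not the data.
CREATE TABLE #Staging_Sale
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP )
AS SELECT SaleKey, CustomerKey, Amount FROM dbo.FactSale WHERE 1 = 0;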