Azure Data Fundamentals DP 900 Full Course

DP-900: Azure Data Fundamentals

Agenda
The following topics will be covered:
• Core data concepts
• Relational data workloads
• Data analytics and processing
• NoSQL data workloads

Core Data Concepts (15-20%)
What is Data?
Data is a collection of facts, such as numbers, descriptions, and observations, used in decision making.
Structured data is typically tabular data that is represented by rows and columns in a database.
Databases that hold tables in this form are called relational databases.
Semi-structured data is information that doesn't reside in a relational database but still has
some structure to it. Examples include documents held in JavaScript Object Notation (JSON) format.
Not all data is structured or even semi-structured. For example, audio and video files, and binary
data files might not have a specific structure. They're referred to as unstructured data.
Structured Semi-Structured Unstructured
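
The three categories above can be illustrated with a small Python sketch. The sample values (including the phone number) and the `describe` helper are assumptions made up for this illustration; the classification rule is a toy heuristic, not a formal definition.

```python
import csv
import io
import json

# Structured: tabular rows that all share the same columns.
csv_text = "EmployeeID,Name,Department\n1,Piyush,IT\n2,John,HR\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON documents; the fields may differ per document.
documents = [
    json.loads('{"customerID": "101", "name": {"first": "Piyush"}}'),
    json.loads('{"customerID": "102", "phone": "555-0100"}'),
]

# Unstructured: raw bytes with no schema (a stand-in for an image or video).
blob = bytes([0x89, 0x50, 0x4E, 0x47])


def describe(value):
    """Classify a value using this toy example's own conventions."""
    if isinstance(value, (bytes, bytearray)):
        return "unstructured"
    # A non-empty list of records that all expose the same field names.
    if isinstance(value, list) and value and len({tuple(sorted(r)) for r in value}) == 1:
        return "structured"
    return "semi-structured"


print(describe(rows), describe(documents), describe(blob))
```

The point of the sketch is only that structured records share one fixed set of columns, semi-structured records carry their own field names, and unstructured data has no fields at all.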

Data processing
Data processing is simply the conversion of raw data to meaningful information through a process.
Depending on how the data is ingested into your system, you could process each data item as it arrives,
or buffer the raw data and process it in groups.
Processing data as it arrives is called streaming.
Buffering and processing the data in groups is called batch processing.
Streaming example: when you play a video on YouTube or Netflix, the service streams the data to your browser in real time.
Batch processing example: counting votes in an election, where the data is collected and counted in batches.
Streaming Batch Processing
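
The difference between the two models can be sketched in a few lines of Python. The function names and the `votes` sample are hypothetical; this is a minimal illustration, not how any Azure service implements either model.

```python
from typing import Iterable, Iterator, List


def process_stream(items: Iterable[int]) -> Iterator[int]:
    """Streaming: emit an updated running total as each item arrives."""
    total = 0
    for item in items:
        total += item
        yield total


def process_batch(items: Iterable[int], batch_size: int) -> List[int]:
    """Batch: buffer items and process each group at once."""
    totals, buffer = [], []
    for item in items:
        buffer.append(item)
        if len(buffer) == batch_size:
            totals.append(sum(buffer))
            buffer = []
    if buffer:  # process the final, partially filled batch
        totals.append(sum(buffer))
    return totals


votes = [1, 0, 1, 1, 0, 1, 1]
stream_result = list(process_stream(votes))
batch_result = process_batch(votes, batch_size=3)
print(stream_result)
print(batch_result)
```

The streaming version has a result available after every item, while the batch version produces nothing until a group is complete.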

RDBMS
A collection of related data entries is called a table.
Data is represented in the form of rows and columns.
Employee ID   Name     Department
1             Piyush   IT
2             John     HR
3             David    Management
Each row is a record; each column holds one attribute (Employee ID, Name, Department).
A collection of multiple tables and other database objects is a relational database.

• Stores and organizes relational data in the most efficient manner
• Improves data integrity
• Creates relationships between database tables
• Enforces constraints and a fixed schema
• Normalization

SQL Commands
DDL (Data Definition Language)
Helps define the structure of a database or schema.
Defines how the data is stored in a database.
• Create: creates a database and its objects (tables, indexes, views, stored procedures, functions, and triggers)
• Alter: alters the structure of an existing database
• Drop: deletes objects from the database (tables, indexes, views)
• Truncate: removes all records from a table
• Comment: adds comments to the data dictionary
• Rename: renames a database object
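
A minimal sketch of the DDL commands, run against an in-memory SQLite database through Python's built-in `sqlite3` module. SQLite is used here only because it needs no server; it is an assumption of this example, and its dialect differs from SQL Server (for instance, it has no TRUNCATE or COMMENT statements, and RENAME is spelled `ALTER TABLE ... RENAME TO`). The table and helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def table_names(connection):
    """List the user tables currently defined in the database."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


# CREATE: define a new table.
conn.execute("CREATE TABLE employee (EmployeeID INTEGER, Name TEXT)")

# ALTER: change the structure of an existing table.
conn.execute("ALTER TABLE employee ADD COLUMN Department TEXT")
columns = [row[1] for row in conn.execute("PRAGMA table_info(employee)")]

# RENAME: give the table a new name.
conn.execute("ALTER TABLE employee RENAME TO staff")
after_rename = table_names(conn)

# DROP: remove the table together with its data.
conn.execute("DROP TABLE staff")
after_drop = table_names(conn)

print(columns, after_rename, after_drop)
```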

DML (Data Manipulation Language)
Used to store, modify, retrieve, delete and update data in a database.
• Select: retrieves data from a database
• Insert: inserts data into a table
• Update: updates existing data within a table
• Delete: deletes records from a database table
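
The four DML commands can be tried out with the same kind of in-memory SQLite database. This is a sketch under the assumption that SQLite stands in for any relational engine; the "Finance" department value is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (EmployeeID INTEGER, Name TEXT, Department TEXT)")

# INSERT: add rows (parameters avoid building SQL strings by hand).
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Piyush", "IT"), (2, "John", "HR"), (3, "David", "Management")],
)

# UPDATE: change existing rows.
conn.execute("UPDATE employee SET Department = ? WHERE EmployeeID = ?", ("Finance", 2))

# DELETE: remove rows.
conn.execute("DELETE FROM employee WHERE EmployeeID = ?", (3,))

# SELECT: read back what is left.
remaining = conn.execute(
    "SELECT EmployeeID, Name, Department FROM employee ORDER BY EmployeeID"
).fetchall()
print(remaining)
```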

Database Objects
Most of the major database engines offer the same set of major database object types: tables, indexes, and views.
Table: stores data in rows and columns.
Students
Student ID   Name     Age
121          Piyush   32
123          David    30
124          John     28
Grades
ID    Name     Grade   StudentID
101   Piyush   B       121
201   David    A       123
301   John     C       124
Index: helps improve data retrieval speed.
    CREATE INDEX index_name ON table_name (column_name);
View: a virtual table. The fields in a view are fields from one or more real tables in the database.
    CREATE VIEW view_name AS
    SELECT column1, column2, ...
    FROM table_name
    WHERE condition;
Example: a view that joins the Students and Grades tables.
    CREATE VIEW student_details AS
    SELECT s.Name, s.Age, g.Grade
    FROM Students s, Grades g
    WHERE s.StudentID = g.StudentID;
    SELECT * FROM student_details;
Name     Age   Grade
Piyush   32    B
David    30    A
John     28    C
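
The table, index, and view objects can be exercised end to end with SQLite. This is a sketch, assuming SQLite as a stand-in engine; the index name `idx_grades_student` is hypothetical, and the join is written with an explicit JOIN clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Students (StudentID INTEGER, Name TEXT, Age INTEGER);
    CREATE TABLE Grades (ID INTEGER, Name TEXT, Grade TEXT, StudentID INTEGER);
    INSERT INTO Students VALUES (121, 'Piyush', 32), (123, 'David', 30), (124, 'John', 28);
    INSERT INTO Grades VALUES (101, 'Piyush', 'B', 121), (201, 'David', 'A', 123),
                              (301, 'John', 'C', 124);
    -- An index names the column(s) it covers.
    CREATE INDEX idx_grades_student ON Grades (StudentID);
    -- The view stores the query, not a copy of the data.
    CREATE VIEW student_details AS
        SELECT s.Name, s.Age, g.Grade
        FROM Students s JOIN Grades g ON s.StudentID = g.StudentID;
    """
)

details = conn.execute("SELECT * FROM student_details ORDER BY Age DESC").fetchall()

# Because a view is virtual, a change to the base table shows up immediately.
conn.execute("UPDATE Grades SET Grade = 'A' WHERE StudentID = 124")
updated = conn.execute("SELECT Grade FROM student_details WHERE Name = 'John'").fetchone()
print(details, updated)
```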

SQL CONSTRAINTS
Constraints are rules enforced on the data columns of a table.
They limit the type of data that can go into a table.
This ensures the accuracy and reliability of the data in the database.
General syntax:
    CREATE TABLE table_name (
        column1 datatype constraint,
        ....
    );
NOT NULL Constraint − Ensures that a column cannot have a NULL value.
    CREATE TABLE students (
        StudentID int NOT NULL,
        Name varchar(255) NOT NULL,
        FirstName varchar(255) NOT NULL,
        lastName varchar(255)
    );

DEFAULT Constraint − Provides a default value for a column when none is specified.
    CREATE TABLE students (
        ....
        Address varchar(255) DEFAULT 'India'
    );
UNIQUE Constraint − Ensures that all the values in a column are different.
    CREATE TABLE students (
        StudentID int NOT NULL UNIQUE,
        lastName varchar(255)
    );

PRIMARY Key − Uniquely identifies each row/record in a database table.
FOREIGN Key − A column in one table that refers to the primary key of another table, linking the rows of the two tables.
    CREATE TABLE students (
        StudentID int PRIMARY KEY,
        ....
        Address varchar(255) DEFAULT 'India'
    );
UNIQUE + NOT NULL = PRIMARY KEY
Students (primary key: Student ID)
Student ID   Name     Age
121          Piyush   32
123          David    30
124          John     28
Grades (foreign key: StudentID, referring to Students)
ID    Name     Grade   StudentID
101   Piyush   B       121
201   David    A       123
301   John     C       124

CHECK Constraint − Ensures that all values in a column satisfy certain conditions.
    CREATE TABLE students (
        ....
        Age int CHECK (Age>=18)
    );
INDEX − Used to retrieve data from the database very quickly.
    CREATE INDEX index_name ON table_name (column_name);
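
To see the constraints above rejecting bad data, here is a small sketch using SQLite through Python. The `Email` column and the `rejected` helper are assumptions added for the illustration; the other columns follow the slide examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE students (
        StudentID INTEGER PRIMARY KEY,
        Name      TEXT NOT NULL,
        Email     TEXT UNIQUE,
        Age       INTEGER CHECK (Age >= 18),
        Address   TEXT DEFAULT 'India'
    )
    """
)
conn.execute(
    "INSERT INTO students (StudentID, Name, Email, Age) "
    "VALUES (121, 'Piyush', 'p@example.com', 32)"
)


def rejected(sql):
    """Return True when the database refuses the statement."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


results = {
    "NOT NULL": rejected(
        "INSERT INTO students (StudentID, Name, Age) VALUES (122, NULL, 30)"
    ),
    "UNIQUE": rejected(
        "INSERT INTO students (StudentID, Name, Email, Age) "
        "VALUES (123, 'David', 'p@example.com', 30)"
    ),
    "PRIMARY KEY": rejected(
        "INSERT INTO students (StudentID, Name, Age) VALUES (121, 'John', 28)"
    ),
    "CHECK": rejected(
        "INSERT INTO students (StudentID, Name, Age) VALUES (124, 'John', 17)"
    ),
}

# DEFAULT: the first row never supplied an address.
default_address = conn.execute(
    "SELECT Address FROM students WHERE StudentID = 121"
).fetchone()[0]
print(results, default_address)
```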

Data Integrity
• Entity Integrity − There are no duplicate rows in a table.
• Domain Integrity − Enforces valid entries for a given column by restricting the type,
the format, or the range of values.
• Referential integrity − Rows that are referenced by other records cannot be deleted.
• User-Defined Integrity − Enforces some specific business rules that do not fall into entity,
domain or referential integrity.
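
Referential integrity can be demonstrated with a foreign key. The sketch below assumes SQLite, which only enforces foreign keys after `PRAGMA foreign_keys = ON`; the `try_sql` helper and the reduced table layouts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript(
    """
    CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Grades (
        ID        INTEGER PRIMARY KEY,
        Grade     TEXT,
        StudentID INTEGER NOT NULL REFERENCES Students (StudentID)
    );
    INSERT INTO Students VALUES (121, 'Piyush'), (123, 'David');
    INSERT INTO Grades VALUES (101, 'B', 121);
    """
)


def try_sql(sql):
    """Run a statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


# Student 121 is referenced by a grade, so the row cannot be deleted.
delete_referenced = try_sql("DELETE FROM Students WHERE StudentID = 121")
# A grade cannot point at a student that does not exist.
orphan_insert = try_sql("INSERT INTO Grades VALUES (102, 'A', 999)")
# Student 123 has no grades, so deleting that row is allowed.
delete_unreferenced = try_sql("DELETE FROM Students WHERE StudentID = 123")
print(delete_referenced, orphan_insert, delete_unreferenced)
```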

OLTP vs OLAP
OLTP (Online Transaction Processing)
• Management of transactional data using computer systems.
• OLTP systems record business interactions as they occur in the day-to-day operation of the organization.
• Choose OLTP when you need to efficiently process and store business transactions and immediately make them available to client applications in a consistent way.
• Examples: business transactions related to payments, orders, inventories, etc.
OLAP (Online Analytical Processing)
• Complex business analysis on large business databases.
• It can be used to perform complex analytical queries without negatively affecting day-to-day business operations.
• Choose OLAP when you need to execute complex analytical and ad hoc queries without impacting your OLTP systems.
• Examples: reporting and forecasting, trend reports, market sentiments, recommendations and suggestions, etc.
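
The two workload shapes can be contrasted in a short sketch. It uses a single SQLite table for both, which is a simplification made for this example: in practice the analytical query would run on a separate store so that it does not compete with the transactional system. The table and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (OrderID INTEGER PRIMARY KEY, Region TEXT, Amount REAL)")


def place_order(order_id, region, amount):
    """OLTP-style work: one small write, committed as a unit."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, region, amount))


for order in [(1, "North", 120.0), (2, "South", 80.0), (3, "North", 50.0)]:
    place_order(*order)

# OLAP-style work: scan and aggregate many rows to answer a business question.
sales_by_region = conn.execute(
    "SELECT Region, SUM(Amount) FROM orders GROUP BY Region ORDER BY Region"
).fetchall()
print(sales_by_region)
```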

IaaS PaaS SaaS
IaaS (Infrastructure as a Service)
• Gives full control over infrastructure resources such as virtual machines, storage, etc.
• You must take care of all the admin tasks such as patching, upgrades, and backups.
• Examples: Azure VM, VNET, AWS EC2 servers
• Pricing: pay-per-use
PaaS (Platform as a Service)
• Gives a runtime environment/platform and development tools to deploy applications.
• Azure takes care of all the admin tasks, including automated backups.
• Examples: Azure DevOps, Azure Web App, OpenShift
• Pricing: pay-per-service model
SaaS (Software as a Service)
• Gives end users access to ready-to-use software.
• The provider takes care of all the admin tasks.
• Examples: Dropbox, Office 365, Teams
• Pricing: pay-per-subscription model

How to work with Relational Data on Azure (25-30%)

Azure Data Services for RDBMS
Azure Data Services fall into the PaaS category.
These services are a series of DBMSs managed by Microsoft in the cloud:
• Azure SQL Database
• Azure Database for MySQL
• Azure Database for MariaDB
• Azure Database for PostgreSQL
You have no direct control over the platform on which the services run.
Microsoft takes care of all your administrative tasks, including server patching, backups, and updates.
By default, your database is protected by a server-level firewall.

Azure SQL Database (PaaS)
Deployment options: Single Database, Elastic Pool, Managed Instance
Single Database
• This option enables you to quickly set up and run a single SQL Server database. It is the cheapest option.
• By default, resources are pre-allocated, and you're charged per hour for the resources you've requested.
• You can also specify a serverless configuration, in which your database automatically scales and resources are allocated or deallocated as required.
Elastic Pool
• This option is similar to Single Database, except that by default multiple databases can share the same resources, such as memory, data storage space, and processing power.
• The resources are referred to as a pool. You create the pool, and only your databases can use the pool.
• You are charged per pool.

Azure SQL Database (PaaS): Managed Instance
• Managed Instance effectively runs a fully controllable instance of SQL Server in the cloud.
• You can install multiple databases on the same instance. You have complete control over this instance, much as you would for an on-premises server.
• The Managed Instance service automates backups, software patching, database monitoring, and other general tasks, but you have full control over security and resource allocation for your databases.
• Managed Instance has near 100% compatibility with SQL Server Enterprise Edition running on-premises.
• Consider Azure SQL Database Managed Instance if you want to lift-and-shift an on-premises SQL Server instance and all its databases to the cloud, without incurring the management overhead of running SQL Server on a virtual machine. You can bring your own license (BYOL).

SQL Server in a Virtual Machine (IaaS)
• SQL Server on Virtual Machines enables you to use full versions of SQL Server in the cloud without having to manage any on-premises hardware.
• You can easily move your on-premises SQL Server databases to an Azure VM (Windows/Linux).
• You remain responsible for maintaining the SQL Server software and performing the various administrative tasks to keep the database running from day to day.
• This approach is suitable for migrations and applications requiring access to operating system features that might be unsupported at the PaaS level.
• SQL virtual machines are lift-and-shift ready for existing applications that require fast migration to the cloud with minimal changes.
• You get all the cloud benefits, such as scalability, elasticity, and high performance, with no limitations on the DBMS.

IaaS PaaS SaaS
IaaS
• SQL Server in a Virtual Machine
PaaS
• Azure SQL Database (Single Database, Elastic Pool, Managed Instance)
• Azure Database for MySQL
• Azure Database for MariaDB
• Azure Database for PostgreSQL

How to work with Non-Relational Data on Azure (25-30%)

Non-Relational DB (NoSQL)
NoSQL stands for "Not Only SQL" or "Not SQL".
A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. A NoSQL database system instead encompasses a wide range of database technologies that can store structured, semi-structured, and unstructured data.
• Doesn't follow a fixed schema structure
• Doesn't support all the features of a relational database
Types of NoSQL Data Stores
• Documents: high volume of JSON data
• Graphs: relationships between nodes and edges in a graph
• Key-Value: multiple key-value pairs
• Column based: columns are divided into column families, which hold related data
• Object based: unstructured/semi-structured data storage for binary large objects such as images, videos, and VM disk images

Azure Cosmos DB
• Azure Cosmos DB is a multi-model NoSQL database management system.
• Cosmos DB manages data as a partitioned set of documents.
• A document is a collection of fields, identified by a key.
• The fields in each document can vary, and a field can contain child documents.
• It uses partition keys for high performance and query optimization.
• Example:
## Document 1 ##
{
  "customerID": "101",
  "name":
  {
    "first": "Piyush",
    "last": "Sachdeva"
  }
}
## Document 2 ##
{
  "customerID": "102",
  "name":
  {
    "title": "Mr",
    "firstname": "Piyush",
    "lastname": "Sachdeva"
  }
}
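
The idea of a partitioned set of documents with varying fields can be sketched in plain Python. This is a conceptual illustration only: it does not use the Cosmos DB SDK, and `PARTITION_COUNT`, `partition_for`, and `find_customer` are hypothetical names. The hashing scheme is an assumption chosen for the example, not the one the service uses.

```python
import hashlib
from collections import defaultdict

# Two documents in the same collection; their "name" fields differ in shape.
documents = [
    {"customerID": "101", "name": {"first": "Piyush", "last": "Sachdeva"}},
    {"customerID": "102", "name": {"title": "Mr", "firstname": "Piyush", "lastname": "Sachdeva"}},
]

PARTITION_COUNT = 4  # hypothetical; a managed service decides this for you


def partition_for(key: str) -> int:
    """Map a partition key value to a partition with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % PARTITION_COUNT


partitions = defaultdict(list)
for doc in documents:
    partitions[partition_for(doc["customerID"])].append(doc)


def find_customer(customer_id: str):
    """Look only in the one partition that can hold this key."""
    for doc in partitions[partition_for(customer_id)]:
        if doc["customerID"] == customer_id:
            return doc
    return None


print(find_customer("102"))
```

A lookup that includes the partition key touches a single partition, which is why choosing a good partition key matters for query performance.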

Cosmos DB APIs
• SQL API: enables you to run SQL queries over JSON data.
• Table API: enables you to use the Azure Table Storage API to store and retrieve documents.
• MongoDB API: many organizations run MongoDB (a document-based database) on-premises. You can use the MongoDB API for Cosmos DB to enable a MongoDB application to run unchanged against a Cosmos DB database, or you can migrate MongoDB to Cosmos DB in the cloud.
• Cassandra API: Cassandra is a column-based DBMS. The primary purpose of the Cassandra API is to enable you to quickly migrate Cassandra databases and applications to Cosmos DB.
• Gremlin API: implements a graph database interface to Cosmos DB. A graph is a collection of data objects (nodes) and directed relationships (edges). Data is still held as a set of documents in Cosmos DB, but the Gremlin API enables you to perform graph queries over the data.

Azure Table Storage
Azure Table Storage implements the NoSQL key-value model.
In this model, the data for an item is stored as a set of fields, and the item is identified by a unique key.
Items are referred to as rows, and fields are known as columns.
Unlike an RDBMS, it does not enforce a fixed schema, so rows in the same table can have different columns.
It is simple to scale and allows up to 5 PB of data.
Reads and writes are fast compared to a relational database; use the partition key to increase performance.
Row insertion and data retrieval are fast.
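
The key-value model described above can be sketched with a Python dictionary. Azure Table Storage identifies each row by a partition key and a row key; everything else in this example (the helper names, the sample rows, the email address) is hypothetical and does not use the Azure SDK.

```python
from typing import Dict, Optional, Tuple

# Each item is addressed by (partition key, row key). The remaining fields
# are free-form, so rows need not share the same columns.
Table = Dict[Tuple[str, str], dict]


def upsert(table: Table, partition_key: str, row_key: str, **fields) -> None:
    """Insert a row, or replace it if the key already exists."""
    table[(partition_key, row_key)] = dict(fields)


def get(table: Table, partition_key: str, row_key: str) -> Optional[dict]:
    """Point lookup by the full key: a single dictionary access."""
    return table.get((partition_key, row_key))


def query_partition(table: Table, partition_key: str) -> list:
    """Return every row in one partition, ordered by row key."""
    ordered = sorted(table.items(), key=lambda item: item[0])
    return [row for (pk, _rk), row in ordered if pk == partition_key]


customers: Table = {}
upsert(customers, "IN", "123", Name="David")
upsert(customers, "IN", "121", Name="Piyush", Age=32)
upsert(customers, "US", "124", Name="John", Email="john@example.com")

print(get(customers, "IN", "121"))
print(query_partition(customers, "IN"))
```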

Azure Blob Storage
Many applications need to store large, binary data objects, such as images, video, virtual machine images, and so on. These are called blobs.
Azure Blob storage is a service that enables you to store massive amounts of unstructured data, or blobs, in the cloud.
Inside an Azure storage account, you create blobs inside containers (similar to folders). You can group similar blobs together in a container.
Block Blobs
• A set of blocks
• Each block can vary in size, up to 100 MB
Page Blobs
• A collection of fixed-size pages of 512 bytes each
• Supports random read/write
Append Blobs
• Optimized to support append operations
• You can only add blocks to the end of an append blob
• Updating or deleting existing blocks is not supported

Azure Blob Storage: Access Tiers
Hot tier
• The default tier.
• Used for frequently accessed data.
• Provides the highest performance.
• The costliest of the three.
Cool tier
• Used for infrequently accessed data.
• Cheaper than the hot tier.
• Lower performance than the hot tier.
• You can migrate your storage from the hot tier to the cool tier to save storage cost.
Archive tier
• Used for archival storage.
• The cheapest of the three.
• The highest latency; data retrieval can take hours.
• To retrieve a blob from the Archive tier, you must change the access tier to Hot or Cool. The blob will then be rehydrated. You can read the blob only when the rehydration process is complete.

Azure File Storage
Azure File Storage enables you to create file shares in the cloud and access these file shares from anywhere with an internet connection.
Azure File Storage exposes file shares using the Server Message Block (SMB) 3.0 protocol.
Once you've created a storage account, you can upload files to Azure File Storage using the Azure portal, or tools such as the AzCopy utility.
Azure File Storage offers two performance tiers:
• The Standard tier uses hard disk-based hardware in a datacenter.
• The Premium tier uses solid-state disks. It offers greater throughput, but is charged at a higher rate.

NOSQL DB Suitable for?
• Object based: store unstructured data or blobs. Service: Azure Blob Storage
• Column based: when you need low latency for time-series, session details, telemetry data, or analytics. Service: Cosmos DB Cassandra API
• Graph based: when you need to define relationships in the form of graphs. Service: Cosmos DB Gremlin API
• Key-Value: data is accessed using a single key; used for caching, user profile management, and session management. Services: Azure Table Storage, Cosmos DB Table API
• Document: JSON documents for content/inventory management and product catalogs. Service: Cosmos DB SQL API
• File share in the cloud using the SMB 3.0 protocol. Service: Azure File Share

Analytics workload on Azure (25-30%)

Data Analytics Core Concepts
Data analytics is concerned with examining, transforming, and arranging data so that you can study it and extract useful information.
Data analytics stages:
• Ingestion: taking the data from multiple sources into your processing system.
• Processing: transformation of data into a more meaningful form.
• Visualization: graphical representation of processed data in the form of graphs, diagrams, charts, maps, etc., for reporting and business intelligence purposes.
Ingestion -> Processing -> Visualization

ETL vs ELT
ETL (Extract, Transform and Load)
Extract -> Transform -> Load
• Extract: data ingestion
• Transform: filtering, sorting, aggregating, joining, cleaning, de-duplication, validation
ELT (Extract, Load and Transform)
Extract -> Load -> Transform
• The target data store is a data warehouse, using either a Hadoop cluster or Azure Synapse Analytics.
• The target data store should be powerful enough to transform the data.
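
The difference in ordering between ETL and ELT can be sketched in Python. The sample rows, the `transform` rules, and the in-memory "target" structures are all assumptions made for this illustration; a real pipeline would read from and write to actual data stores.

```python
raw_rows = [
    {"id": 1, "amount": "120.0"},
    {"id": 2, "amount": "bad"},    # fails validation
    {"id": 1, "amount": "120.0"},  # duplicate of the first row
    {"id": 3, "amount": "50.5"},
]


def transform(rows):
    """Validate, clean, and de-duplicate rows."""
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # validation: drop rows that cannot be parsed
        if row["id"] in seen:
            continue  # de-duplication: keep the first row per id
        seen.add(row["id"])
        clean.append({"id": row["id"], "amount": amount})
    return clean


def etl(source):
    """ETL: transform in flight, so the target only ever holds clean rows."""
    target = []
    target.extend(transform(source))
    return target


def elt(source):
    """ELT: load the raw rows first, then transform inside the target store."""
    target = {"raw": list(source), "curated": []}
    target["curated"] = transform(target["raw"])
    return target


warehouse_etl = etl(raw_rows)
warehouse_elt = elt(raw_rows)
print(warehouse_etl)
print(len(warehouse_elt["raw"]), warehouse_elt["curated"])
```

Both paths end with the same curated rows; the ELT target additionally keeps the raw data, which is why it needs enough capacity and compute to do the transformation itself.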

Data Analytics Techniques
1. Descriptive: what has happened, based on historical data.
   Examples: sales reports, profit and loss statements, quarterly earnings reports.
2. Diagnostic: why things happened.
   Examples: comparison reports, drill-down reports.
3. Prescriptive: what actions we should take to achieve a target.
   Examples: recommendations, suggestions, advice on the best approach.
4. Predictive: what will happen in the future, based on past trends.
   Examples: forecasting reports.
5. Cognitive: what might happen if circumstances change, using AI/ML.
   Examples: self-driving cars, video-to-audio conversion, audio transcription.

Azure Tools for Data Analytics
ARM template: automates Azure resource provisioning (infrastructure as code, IaC).
Azure CLI: command-line tool to interact with Azure resources.
Azure Data Studio: execute queries on SQL Server or big data clusters, restore a database, execute admin tasks via sqlcmd/PowerShell, and create and run SQL notebooks.
SSMS (SQL Server Management Studio): complex admin tasks, platform configuration, security management, user management, vulnerability assessment, performance tuning, and querying Synapse Analytics.
sqlcmd: command-line SQL utility.

Data Warehousing
- Central Repository of data collected from one or more sources.
- Current and historical data used for reporting and analysis
- Can rename or reformat columns to make it easier for users to create reports
- Users can run reports without affecting the day-to-day business
When to use data warehousing
When queries are long running and affect day to day operations
When data needs further processing (ETL or ELT)
When you want to archive data (remove historical data from day-to-day system)
When you need to integrate data from multiple sources

Data Warehousing Flow
Data sources: CosmosDB, Table Storage, on-prem DB
-> Ingestion: Azure Data Factory (orchestration pipeline)
-> Storage and pre-processing: Azure Data Lake, Azure Synapse Analytics
-> Analysis: Azure Analysis Services
-> Visualization: Power BI

Azure Data Services for Data Warehousing
Azure Data Factory is a data integration service. It is responsible for the collection, transformation, and storage of data collected from multiple sources.
Pipeline
• A logical grouping of activities that performs a task.
• A data factory can contain multiple pipelines.
• Activities can run sequentially or in parallel.
Pipeline triggers
• Scheduled
• Tumbling window (runs on a schedule and can also process historical data)
• Event-based
• Manual
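
A tumbling window divides time into fixed-size, contiguous, non-overlapping intervals, which is what lets a trigger of this kind cover past periods as well as future ones. The sketch below only illustrates the window arithmetic; the function name and dates are hypothetical, and it is not the Data Factory API.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, end, size):
    """Split [start, end) into contiguous, non-overlapping windows of one size."""
    windows = []
    window_start = start
    while window_start + size <= end:
        windows.append((window_start, window_start + size))
        window_start += size
    return windows


# Three hourly windows over a period that already lies in the past.
windows = tumbling_windows(
    start=datetime(2024, 1, 1, 0, 0),
    end=datetime(2024, 1, 1, 3, 0),
    size=timedelta(hours=1),
)
for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each window ends exactly where the next one begins, so every point in time belongs to exactly one run.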

Azure Data Lake Storage
You can think of a data lake as a staging point for your ingested data, before it's transported and converted into a format suitable for performing analytics.
A data lake is a repository for large quantities of raw data.
It is compatible with HDFS (Hadoop Distributed File System), which is used to examine huge datasets.
You can apply role-based access control (RBAC) to your data, and POSIX-style access control lists at the file and directory level.
Data Lake Storage organizes your files into directories and subdirectories for improved file organization (hierarchical namespace).
To implement Azure Data Lake Storage, you need a storage account.
It can store raw data in any format; Parquet is a common choice for analytics.
Data sources (CosmosDB, Table Storage, on-prem DB) -> Data ingestion (Azure Data Factory) -> Storage (Azure Data Lake)

Azure Databricks
Azure Databricks is an Apache Spark environment running on Azure to provide big data
processing, streaming, and machine learning.
It can consume and process large amounts of data very quickly.
Azure Databricks also supports structured stream processing.
In this model, Databricks performs your computations incrementally, and continuously updates
the result as streaming data arrives.
Azure Databricks provides a graphical user interface where you can define and test your
processing step by step, before submitting it as a set of batch tasks.

Azure Synapse Analytics
You can ingest data from external sources, such as flat files, Azure Data Lake, or other database management systems, and then transform and aggregate this data into a format suitable for analytics processing.
You can perform complex queries over this data and generate reports, graphs, and charts.
It stores and processes the data locally for faster processing.
This approach enables you to repeatedly query the same data without the overhead of fetching and converting it each time.
You can also use this data as input to further analytical processing, using Azure Analysis Services.
Azure Synapse Analytics leverages a massively parallel processing (MPP) architecture.
This architecture includes a control node and a pool of compute nodes.
You can pause Azure Synapse Analytics to reduce cost.

Azure Synapse Analytics flow
It includes a control node and a pool of compute nodes.
The control node receives processing requests from applications and distributes the work evenly to the compute nodes for parallel processing.
Results from each node are then sent back to the control node, where they are combined into an overall result.
It supports two computational models: SQL pools and Spark pools.
SQL pools
• Each compute node uses an Azure SQL Database and Azure Storage to handle a portion of the data.
• To receive data from multiple sources, it uses a technology called PolyBase.
• It is a disk-based processing engine, so it relies on storage, and it supports manual node scaling.
Spark pools
• Optimized for in-memory processing.
• You can enable autoscaling of nodes.
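
The control-node and compute-node flow can be sketched with a thread pool. This is a toy model of the massively parallel processing idea, with threads standing in for compute nodes; the function names and the round-robin distribution are assumptions of this example, not a description of how Synapse distributes data.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(rows, node_count):
    """Control node, step 1: hand rows out to the compute nodes round-robin."""
    return [rows[i::node_count] for i in range(node_count)]


def compute_node(partition):
    """Compute node: produce a partial aggregate for its share of the data."""
    return {"count": len(partition), "total": sum(partition)}


def control_node(rows, node_count=4):
    """Control node: fan the work out, then combine the partial results."""
    partitions = split_evenly(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(compute_node, partitions))
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    return {"count": count, "total": total, "average": total / count if count else None}


result = control_node(list(range(1, 101)))
print(result)
```

Note that the average is derived from combined counts and totals rather than by averaging the per-node averages, which would be wrong when nodes hold different numbers of rows.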

Azure Analysis Services
Azure Analysis Services enables you to build tabular models to support OLAP queries.
You can combine data from multiple sources, including Azure SQL Database, Azure Synapse Analytics, Azure
Data Lake store, Azure Cosmos DB, and many others.
You use these data sources to build models
A model is essentially a set of queries and expressions that retrieve data from the various data sources and
generate results.
Analysis Services includes a graphical designer to help you connect data sources together and define queries
that combine, filter, and aggregate data
Recommended usage: if you have large amounts of ingested data that require preprocessing, you can use Synapse Analytics to process the data and reduce it into smaller datasets, which can then be analyzed further by Azure Analysis Services.

Azure HDInsight
HDInsight implements a clustered model that distributes processing across a set of computers.
Azure HDInsight is a big data processing service that provides the platform for technologies such as Spark in an Azure environment.
This model is similar to that used by Synapse Analytics, except that the nodes run the Spark processing engine rather than Azure SQL Database.
Break down of data and distribute for processing
Data Processing
Create, load and query the data similar to
PolyBase

Data Ingestion using Data factory
Azure Data Factory is a data ingestion and transformation service that allows you to load raw data from many different sources, both on-premises and in the cloud.
Data Factory can clean, transform, and restructure the data before loading it into a repository such as a data warehouse.
Once the data is in the data warehouse, you can analyze it.
Azure Data Factory uses several different resources: linked services, datasets, and pipelines.
Data sources (CosmosDB, Table Storage, on-prem DB) -> Azure Data Factory -> Storage (Azure Data Lake) -> Azure Data Factory -> Analysis (Azure Analysis Services)

Linked Services
Data Factory moves data from a data source to a destination.
A linked service provides the information needed for Data Factory to connect to a source or destination.
Datasets
A dataset in Azure Data Factory represents the data that you want to ingest (input) or store (output).
If your data has a structure, a dataset specifies how the data is structured.
For example, if you are using blob storage as input, the dataset would specify which blob to ingest and the format of the information in the blob (binary data, JSON, delimited text, and so on).

Control flow: orchestrates the activities in a pipeline.
Integration runtime: the compute environment for a pipeline.
Trigger: initiates the pipeline.
Mapping data flow: data flows allow data engineers to develop data transformation logic without writing code.

Power BI
- Data visualization service which lets you generate dashboards, graphs and reports.
- Can consume data from various data sources to create interactive visualizations
Parts of Power BI: Create, Share, Consume
Building blocks of Power BI: Visualizations, Datasets, Reports, Dashboards, Tiles

Reports in PowerBI
Paginated
• Static report
• Printed and shared
• Formatted
• Contains data on multiple pages
• Use Power BI Report Builder to create the paginated report
• Share the report through the Power BI service
Interactive
• Viewed on screen
• Customized as per your requirements
• More visuals
• Makes use of hover
• Users can change the layout or design
• Use Power BI server to serve the interactive reports (Premium)

Power BI content workflow
1. Connect: connect to the data source that has the data.
2. Pull: pull what you need into the data model.
3. Edit: edit and transform the data as you need.
4. Build: build reports using Power BI Desktop.
5. Share: share the report.
