DWH DWH DWH DWH DWH DWH DWH DWH- QP.pptx

Internal - General Use BITS Pilani
Maintenance Strategies
a. An emergency encounter would be measured by patient, diagnosis, intervention, attending physician, emergency department bed location, date, time,
and any other element used to define the encounter. Identifying the attributes that define the measure also identifies the dimensions for the star schema.
Each attribute is important and may form the basis of a dimension or be an attribute of a dimension. Design a star schema using the four dimensional
modelling steps.
b. Discuss solutions to the limitations of a Kimball data warehouse.
a) Designing a Star Schema for an Emergency Encounter
To design a star schema, follow the four dimensional modeling steps:
1. Identify the Business Process: The process is Emergency Encounters in the Emergency Department (ED). The grain is one row per emergency encounter.
2. Identify the Measures (Facts): The fact table stores the numeric measures for analysis, such as:
* Number of encounters
* Duration of stay in the ED
* Cost of treatment
o Fact Table Name: EmergencyEncounterFact
3. Identify the Dimensions: Dimensions describe the context of the fact measures. Based on the attributes provided:
* Patient Dimension: Contains patient details (e.g., patient ID, age, gender).
* Diagnosis Dimension: Stores diagnosis-related information (e.g., diagnosis code, description).
* Intervention Dimension: Captures intervention data (e.g., procedure code, intervention type).
* Time Dimension: Tracks the date and time of the encounter.
* Physician Dimension: Stores details of the attending physician (e.g., physician ID, specialty).
* Location Dimension: Describes the ED bed location (e.g., bed number, ward).

4. Design the Star Schema
The schema has a central fact table linked to each dimension table through foreign keys:
Fact Table: EmergencyEncounterFact
| Encounter ID (PK) | Patient ID (FK) | Diagnosis ID (FK) | Intervention ID (FK) | Physician ID (FK) | Time ID (FK) | Location ID (FK) | Duration | Cost |
Dimension Tables:
o Patient Dimension: | Patient ID (PK) | Name | Age | Gender |
o Diagnosis Dimension: | Diagnosis ID (PK) | Diagnosis Code | Description |
o Intervention Dimension: | Intervention ID (PK) | Procedure Code | Type |
o Physician Dimension: | Physician ID (PK) | Name | Specialty |
o Time Dimension: | Time ID (PK) | Date | Time | Day | Month | Year |
o Location Dimension: | Location ID (PK) | Bed Number | Ward |
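As a rough illustration, the schema above could be created and queried as follows. This is only a sketch using Python's built-in sqlite3 module: the column names follow the tables above with spaces removed, while the SQL types, the unit of Duration, the function names, and the sample rows are all assumptions made for the example.

```python
import sqlite3

# Column names mirror the schema above with spaces removed. The SQL types,
# the unit of Duration, and every sample row are assumptions for illustration.
DDL = """
CREATE TABLE PatientDimension      (PatientID INTEGER PRIMARY KEY, Name TEXT, Age INTEGER, Gender TEXT);
CREATE TABLE DiagnosisDimension    (DiagnosisID INTEGER PRIMARY KEY, DiagnosisCode TEXT, Description TEXT);
CREATE TABLE InterventionDimension (InterventionID INTEGER PRIMARY KEY, ProcedureCode TEXT, Type TEXT);
CREATE TABLE PhysicianDimension    (PhysicianID INTEGER PRIMARY KEY, Name TEXT, Specialty TEXT);
CREATE TABLE TimeDimension         (TimeID INTEGER PRIMARY KEY, Date TEXT, Time TEXT,
                                    Day INTEGER, Month INTEGER, Year INTEGER);
CREATE TABLE LocationDimension     (LocationID INTEGER PRIMARY KEY, BedNumber TEXT, Ward TEXT);
CREATE TABLE EmergencyEncounterFact (
    EncounterID    INTEGER PRIMARY KEY,
    PatientID      INTEGER REFERENCES PatientDimension(PatientID),
    DiagnosisID    INTEGER REFERENCES DiagnosisDimension(DiagnosisID),
    InterventionID INTEGER REFERENCES InterventionDimension(InterventionID),
    PhysicianID    INTEGER REFERENCES PhysicianDimension(PhysicianID),
    TimeID         INTEGER REFERENCES TimeDimension(TimeID),
    LocationID     INTEGER REFERENCES LocationDimension(LocationID),
    Duration       REAL,  -- assumed unit: minutes
    Cost           REAL
);
"""

SAMPLE = """
INSERT INTO PatientDimension VALUES (1, 'Patient A', 54, 'F');
INSERT INTO DiagnosisDimension VALUES (1, 'DX-001', 'Chest pain');
INSERT INTO InterventionDimension VALUES (1, 'PR-001', 'ECG');
INSERT INTO PhysicianDimension VALUES (1, 'Dr. X', 'Emergency Medicine'), (2, 'Dr. Y', 'Trauma Surgery');
INSERT INTO TimeDimension VALUES (1, '2024-01-15', '08:30', 15, 1, 2024);
INSERT INTO LocationDimension VALUES (1, 'B12', 'ED-North');
INSERT INTO EmergencyEncounterFact VALUES
    (1, 1, 1, 1, 1, 1, 1, 45.0, 100.0),
    (2, 1, 1, 1, 1, 1, 1, 120.0, 250.0),
    (3, 1, 1, 1, 2, 1, 1, 300.0, 900.0);
"""


def build_warehouse():
    """Create the star schema in an in-memory database and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL + SAMPLE)
    return conn


def encounters_by_specialty(conn):
    """Join the fact table to one dimension and aggregate two measures."""
    return conn.execute(
        """
        SELECT p.Specialty, COUNT(*) AS Encounters, SUM(f.Cost) AS TotalCost
        FROM EmergencyEncounterFact f
        JOIN PhysicianDimension p ON f.PhysicianID = p.PhysicianID
        GROUP BY p.Specialty
        ORDER BY p.Specialty
        """
    ).fetchall()
```

Every analytical query follows the same shape: start from the fact table, join only the dimensions needed for filtering or grouping, and aggregate the measures.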
b) Solution to the Limitations in a Kimball Data Warehouse
Kimball’s Data Warehouse Architecture often faces the following limitations:
1. Complex ETL Process: Solution: Implement an automated ETL pipeline using modern data integration tools (e.g., Apache NiFi, Talend) to
streamline data extraction, transformation, and loading.
2. Redundancy and Storage Costs: Solution: Use data compression and partitioning techniques to reduce storage costs. Additionally,
managed cloud data warehouses such as Amazon Redshift or Google BigQuery can reduce the need to provision physical storage.
3. Difficulty Handling Semi-structured Data: Solution: Integrate modern tools capable of handling semi-structured data (e.g., JSON, XML)
into the data warehouse architecture, using technologies like Snowflake or Delta Lake, which allow flexible schema integration.
4. Limited Real-Time Processing: Solution: Incorporate real-time data processing frameworks such as Apache Kafka or streaming capabilities
of modern warehouses (e.g., Snowflake’s streaming ingestion) to support real-time analytics.

In the familiar retail sales case study, the Sales fact table measures the quantity, price, and total sales amount for a retail company. If a
business reported product sales and product returns using two different product tables, it would be impossible to associate the resulting information
between sales and product returns.
i. Draw the product returns star schema with all measures and describe the business process.
ii. Identify the approach to conformed dimensions between retail sales and product returns.
iii. Write an SQL query for: Returns by product and month
i) Product Returns Star Schema
Business Process
The business process is Product Returns: each return records details such as quantity, return reason, and refund amount. It
allows the business to analyze return patterns and improve product quality or customer satisfaction.
Fact Table: ProductReturnFact
| Return ID (PK) | Product ID (FK) | Customer ID (FK) | Time ID (FK) | Store ID (FK) | Quantity Returned | Refund Amount | Return Reason |
Dimension Tables:
* Product Dimension: Describes the returned product. | Product ID (PK) | Product Name | Category | Brand |
* Customer Dimension: Describes the customer who returned the product. | Customer ID (PK) | Name | Email | Location |
* Time Dimension: Captures the return date and time. | Time ID (PK) | Date | Month | Year |
* Store Dimension: Details of the store where the return was made. | Store ID (PK) | Store Name | Location |
ii) Approach of Conformed Dimensions
Conformed dimensions are shared dimensions that ensure consistency across multiple fact tables (e.g., SalesFact and ProductReturnFact).
Approach:
Shared Dimensions: Both SalesFact and ProductReturnFact use the same Product Dimension, Customer Dimension, Time Dimension, and Store Dimension.
Benefits:
* Ensures consistency across reports.
* Allows combining sales and return data for a holistic view of performance.
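To make the benefit concrete, here is a small sketch of combining sales and returns through a shared Product Dimension, using Python's built-in sqlite3 module. The SalesFact columns, the helper names, and the sample rows are assumptions for illustration. Each fact table is aggregated separately before the join so that rows from one fact table do not multiply rows from the other.

```python
import sqlite3

# SalesFact's columns and all sample rows are assumptions for illustration;
# ProductDimension is the single conformed dimension shared by both facts.
SETUP = """
CREATE TABLE ProductDimension  (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE SalesFact         (SaleID INTEGER PRIMARY KEY, ProductID INTEGER, Quantity INTEGER);
CREATE TABLE ProductReturnFact (ReturnID INTEGER PRIMARY KEY, ProductID INTEGER, QuantityReturned INTEGER);
INSERT INTO ProductDimension VALUES (1, 'Kettle'), (2, 'Toaster');
INSERT INTO SalesFact VALUES (1, 1, 10), (2, 1, 5), (3, 2, 8);
INSERT INTO ProductReturnFact VALUES (1, 1, 3);
"""


def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    return conn


def sold_and_returned(conn):
    """Aggregate each fact table on its own, then join on the conformed key."""
    return conn.execute(
        """
        WITH sales_totals AS (
            SELECT ProductID, SUM(Quantity) AS Sold
            FROM SalesFact GROUP BY ProductID
        ),
        return_totals AS (
            SELECT ProductID, SUM(QuantityReturned) AS Returned
            FROM ProductReturnFact GROUP BY ProductID
        )
        SELECT p.ProductName, s.Sold, COALESCE(r.Returned, 0) AS Returned
        FROM sales_totals s
        JOIN ProductDimension p ON p.ProductID = s.ProductID
        LEFT JOIN return_totals r ON r.ProductID = s.ProductID
        ORDER BY p.ProductName
        """
    ).fetchall()
```

With two separate product tables there would be no common ProductID to join on, which is exactly the problem the question describes.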

iii) SQL Query: Returns by Product and Month
-- Group by year as well as month so the same month in different years is not merged.
SELECT
    P.ProductName,
    T.Year,
    T.Month,
    SUM(F.QuantityReturned) AS TotalReturns,
    SUM(F.RefundAmount) AS TotalRefunds
FROM ProductReturnFact F
JOIN ProductDimension P ON F.ProductID = P.ProductID
JOIN TimeDimension T ON F.TimeID = T.TimeID
GROUP BY P.ProductName, T.Year, T.Month
ORDER BY T.Year, T.Month, P.ProductName;
Discuss the following:
I. Periodic Snapshot Fact Tables
II. Consolidated Fact Tables
III. Role-Playing Dimensions
IV. Slowly Changing Dimension Techniques up to SCD-4
Discussion on Data Warehousing Concepts
I. Periodic Snapshot Fact Tables
* Definition: Record the state of a process at regular, predictable intervals (e.g., daily, monthly).
* Use Case: Track inventory levels at the end of each day.
* Advantages: Provides historical trends and performance over time.
* Limitation: Larger storage requirements, because a row is written for every period even when nothing has changed.
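The inventory use case can be sketched in a few lines of Python. The function name, its parameters, and the sample figures are assumptions for illustration; the point is that a snapshot row is produced for every day in the range, including days with no stock movement.

```python
from datetime import date, timedelta


def daily_inventory_snapshots(opening_stock, movements, start, end):
    """Return one (day, closing_level) row per day from start to end inclusive.

    movements maps a date to the net stock change on that date (an assumed
    input format); days missing from the mapping have no movement.
    """
    rows = []
    level = opening_stock
    day = start
    while day <= end:
        level += movements.get(day, 0)
        rows.append((day, level))  # a row is written even if nothing changed
        day += timedelta(days=1)
    return rows


movements = {date(2024, 1, 1): -10, date(2024, 1, 3): 25}
snapshots = daily_inventory_snapshots(100, movements, date(2024, 1, 1), date(2024, 1, 3))
```

Here 2 January has no movement but still gets its own row, which is what makes trend queries simple and storage larger.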
II. Consolidated Fact Tables
* Definition: Combine facts from multiple business processes, expressed at the same grain, into a single fact table to provide a unified view.
* Use Case: Merging sales and returns facts for overall revenue insights.
* Advantages: Simplifies querying and reporting.
* Limitation: Increases ETL complexity and the size of the fact table.
III. Role-Playing Dimensions
* Definition: A single dimension table that plays different roles in different contexts.
* Example: A Time Dimension can be used as Order Date, Ship Date, and Return Date in the same schema.
* Advantages: Reduces redundancy by reusing the same dimension.
* Challenge: Clear documentation is required to avoid confusion.
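A minimal sketch of a role-playing dimension, using Python's built-in sqlite3 module: one physical TimeDimension table is joined twice under different aliases, once as the order date and once as the ship date. The OrderFact table, its column names, and the sample rows are assumptions for illustration.

```python
import sqlite3

# OrderFact and the sample rows are hypothetical; TimeDimension is stored once.
SETUP = """
CREATE TABLE TimeDimension (TimeID INTEGER PRIMARY KEY, Date TEXT);
CREATE TABLE OrderFact (
    OrderID     INTEGER PRIMARY KEY,
    OrderDateID INTEGER REFERENCES TimeDimension(TimeID),
    ShipDateID  INTEGER REFERENCES TimeDimension(TimeID)
);
INSERT INTO TimeDimension VALUES (1, '2024-03-01'), (2, '2024-03-04');
INSERT INTO OrderFact VALUES (100, 1, 2), (101, 2, 2);
"""


def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    return conn


def orders_with_dates(conn):
    """Join the same dimension table twice, once per role."""
    return conn.execute(
        """
        SELECT f.OrderID, od.Date AS OrderDate, sd.Date AS ShipDate
        FROM OrderFact f
        JOIN TimeDimension od ON f.OrderDateID = od.TimeID  -- role: order date
        JOIN TimeDimension sd ON f.ShipDateID  = sd.TimeID  -- role: ship date
        ORDER BY f.OrderID
        """
    ).fetchall()
```

The aliases (or equivalent views) are where the documentation challenge shows up: each role needs a clear name so that report writers pick the right join.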
IV. Slowly Changing Dimension Techniques (SCD-1 to SCD-4)
SCD-1 (Overwrite): Updates dimension data, overwriting old values.
o Simple but loses historical data.
SCD-2 (Versioning): Keeps historical data by adding new rows with version or date stamps.
o Preserves full history.
SCD-3 (Partial History): Maintains limited history using additional columns (e.g., a "previous value" column).
o Balances history tracking and table size.
SCD-4 (History Table): Stores current data in one table and historical data in a separate history table.
o Useful for detailed historical analysis without slowing queries on current data.
o Kimball's own Type 4 instead moves rapidly changing attributes into a separate mini-dimension; hybrids that combine SCD-1, SCD-2, and SCD-3 are usually labeled SCD-6.
These techniques ensure the accurate tracking of changes in dimension data.
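The difference between overwriting (SCD-1) and versioning (SCD-2) can be sketched in plain Python. The row layout, field names, and helper functions below are assumptions for illustration, not a prescribed design; a real warehouse would apply the same logic in its ETL layer.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRow:
    surrogate_key: int
    customer_id: str          # natural (business) key
    city: str                 # the tracked attribute
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd1(rows: List[CustomerRow], customer_id: str, new_city: str) -> None:
    """SCD-1: overwrite in place, so the old value is lost."""
    for row in rows:
        if row.customer_id == customer_id:
            row.city = new_city


def apply_scd2(rows: List[CustomerRow], customer_id: str, new_city: str,
               change_date: date) -> None:
    """SCD-2: close the current row and append a new version."""
    current = next(r for r in rows
                   if r.customer_id == customer_id and r.valid_to is None)
    if current.city == new_city:
        return  # nothing changed, so no new version is needed
    current.valid_to = change_date
    next_key = max(r.surrogate_key for r in rows) + 1
    rows.append(CustomerRow(next_key, customer_id, new_city, change_date))


history = [CustomerRow(1, "C1", "Pune", date(2023, 1, 1))]
apply_scd2(history, "C1", "Delhi", date(2024, 6, 1))

overwritten = [CustomerRow(1, "C1", "Pune", date(2023, 1, 1))]
apply_scd1(overwritten, "C1", "Delhi")
```

After these calls, `history` holds two rows (the closed Pune version and the current Delhi version), while `overwritten` still holds one row whose city is now Delhi.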

To improve response time, a data warehouse administrator commonly uses indexing techniques. An index should be able to operate with other indexes to
filter out records before the original data is accessed. Indexing has many advantages, such as:
* Faster key-based access to table data, * Reduced storage requirements and * Efficient retrieval
Assume you are a data warehouse administrator. Discuss the indexing techniques listed below one by one, with advantages and disadvantages, and give
a conclusion for each indexing technique.
* Bit-mapped indexing techniques, * Cluster indexing techniques, * Hash-based index and * B-Tree index
I. Bit-Mapped Indexing Techniques
Description:
* Uses bitmaps (0s and 1s) to represent the presence or absence of values in a column.
* Particularly useful for columns with low cardinality (few unique values).
Advantages:
Efficient for Low Cardinality: Excellent for columns with a small number of distinct values (e.g., gender, status).
Fast Query Performance: Performs well in complex queries involving AND, OR, and NOT operations, because bitmaps from several columns can be combined with bitwise operations before any table rows are read.
Space Efficient: Consumes less storage compared to other indexing methods for low-cardinality data.
Disadvantages:
Not Suitable for High Cardinality: One bitmap is needed per distinct value, so index size and maintenance cost grow as the number of unique values increases.
Slow for Updates: Changing a row means modifying the affected bitmaps, which is costly and in some systems requires rebuilding the index.
Conclusion:
Bit-mapped indexing is highly effective for analytical queries on low-cardinality columns but is less suitable for transactional data or frequently updated
tables.
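The idea can be illustrated with a toy bitmap index in Python, where each distinct value maps to an integer whose bit i is set when row i holds that value. The function names and sample columns are assumptions for illustration; production systems also compress the bitmaps.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int bitmap; bit i is set if row i has that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Decode a bitmap back into the list of row ids whose bit is set."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Two low-cardinality columns over the same five rows (sample data).
gender = ["F", "M", "F", "F", "M"]
status = ["admitted", "discharged", "discharged", "admitted", "admitted"]

gender_index = build_bitmap_index(gender)
status_index = build_bitmap_index(status)

# WHERE gender = 'F' AND status = 'admitted': one bitwise AND, no row access.
female_and_admitted = matching_rows(gender_index["F"] & status_index["admitted"])

# WHERE gender = 'M' OR status = 'discharged': one bitwise OR.
male_or_discharged = matching_rows(gender_index["M"] | status_index["discharged"])
```

Only the row ids that survive the bitwise combination need to be fetched from the table, which is how indexes cooperate to filter records before the original data is accessed.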

II. Cluster Indexing Techniques
Description:
* Data is stored in the table according to the order of one or more columns (cluster key).
* The index points directly to data in the order it is clustered.
Advantages:
Improved Range Query Performance: Faster retrieval of rows
within a range of values.
Reduced I/O: Data physically stored in order reduces the number of
disk accesses.
Efficient Data Access: Ideal for queries that involve sorting or
grouping.
Disadvantages:
Expensive Maintenance: Clustered indexes are costly to maintain
during insert, update, or delete operations.
Limited to One per Table: Only one clustered index can be created
per table.
Conclusion:
Cluster indexing is excellent for range queries and ordered data but
can be resource-intensive for write-heavy workloads.
III. Hash-Based Index
Description:
* Uses a hash function to map keys to a fixed location in the index.
* Suitable for equality searches (e.g., finding a specific key).
Advantages:
Fast Equality Searches: Provides average-case constant-time retrieval of exact matches.
Efficient in Disk Access: Minimal I/O operations as data is accessed directly.
Disadvantages:
Inefficient for Range Queries: Not suitable for queries requiring sorted data
or ranges.
Collision Handling Overhead: Hash collisions require additional handling
mechanisms (e.g., chaining or open addressing).
Not Optimal for Data Warehousing: Since data warehousing often involves
range or aggregation queries, hash indexing is less effective.
Conclusion:
Hash-based indexing is effective for exact lookups but is not ideal for
analytical queries in data warehouses that involve range scans or sorting.

IV. B-Tree Index
Description:
* A balanced tree structure where all leaf nodes are at the same depth.
* Allows efficient searching, insertion, and deletion.
Advantages:
Balanced Performance: Works well for both equality and range queries.
Efficient Updates: B-trees adjust dynamically with minimal
rebalancing.
Multi-Level Indexing: Suitable for large datasets as it reduces disk I/O.
Disadvantages:
Higher Storage Overhead: Requires more storage compared to simpler
indexing methods.
Slower for High Update Rates: Frequent updates can slow performance
due to rebalancing.
Conclusion:
B-Tree indexing is versatile and effective for most data warehouse
workloads, making it a common default choice for both transactional and
analytical queries.
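The contrast between hash-based and ordered (B-Tree style) access can be sketched in Python. A dict stands in for the hash index, and a sorted list searched with bisect stands in for the ordered leaf level of a B-Tree. This is a simplification, not a real B-Tree, and the class, function, and sample data are assumptions for illustration.

```python
import bisect


def build_hash_index(pairs):
    """Hash index stand-in: key -> list of row ids (good for equality only)."""
    index = {}
    for key, row_id in pairs:
        index.setdefault(key, []).append(row_id)
    return index


class SortedIndex:
    """Ordered index stand-in: keys kept sorted, searched by binary search."""

    def __init__(self, pairs):
        self._entries = sorted(pairs)
        self._keys = [key for key, _ in self._entries]

    def lookup(self, key):
        """Equality search: binary search, then read the matching run of keys."""
        lo = bisect.bisect_left(self._keys, key)
        hi = bisect.bisect_right(self._keys, key)
        return [row_id for _, row_id in self._entries[lo:hi]]

    def range(self, low, high):
        """Range search: find both ends, then read the contiguous slice."""
        lo = bisect.bisect_left(self._keys, low)
        hi = bisect.bisect_right(self._keys, high)
        return [row_id for _, row_id in self._entries[lo:hi]]


# (cost, row id) pairs: sample data for illustration.
pairs = [(250, 0), (100, 1), (900, 2), (400, 3), (100, 4)]

hash_index = build_hash_index(pairs)
sorted_index = SortedIndex(pairs)

exact_hash = hash_index[100]                    # direct lookup by key
exact_sorted = sorted_index.lookup(100)         # binary search
in_range = sorted_index.range(100, 400)         # contiguous slice of sorted keys

# The hash index has no key order, so a range query must test every key.
in_range_via_hash = sorted(
    row_id for key, rows in hash_index.items() if 100 <= key <= 400 for row_id in rows
)
```

Both structures answer the equality search, but only the ordered one can answer the range query by reading a contiguous run of entries, which is why ordered indexes suit range and sort-heavy warehouse queries better.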
Summary Conclusion
Each indexing technique has its strengths and weaknesses, and their
applicability depends on the workload:
* Bit-Mapped Indexing: Best for low-cardinality data and read-heavy workloads.
* Cluster Indexing: Ideal for range queries and ordered data.
* Hash-Based Indexing: Suited for equality searches but less useful for analytical workloads.
* B-Tree Indexing: A balanced and flexible option for various types of queries.
In a data warehouse environment, B-Tree and Bit-Mapped indexes are often
the most effective, depending on the query patterns and cardinality of the
data.
a. Describe the components of Data Warehouse Architecture.
b. Discuss the differences between the three main types of data warehouse usage: Information processing, analytical processing, and data mining.

a. Components of Data Warehouse Architecture
A data warehouse architecture typically consists of the following key
components:
1. Data Sources
o Supply data from operational databases, external sources, and
transactional systems.
o Sources may include relational databases, flat files, APIs, or
third-party data streams.
2. ETL (Extract, Transform, Load) Layer
o Extract: Gathers data from various sources.
o Transform: Cleanses, transforms, and integrates data into a unified
format.
o Load: Loads the processed data into the data warehouse.
o Tools like Talend, Informatica, or Apache NiFi are commonly used
for this process.
3. Data Storage Layer
o Data Warehouse: Centralized repository for storing historical data
in an optimized schema (often star or snowflake schema).
o Includes detailed, summary, and metadata storage.
o Some implementations use distributed or cloud-based storage
solutions like Amazon Redshift or Snowflake.
4. Data Access Layer
o Provides users and applications access to the stored data
for reporting, querying, and analysis.
o This layer may include OLAP (Online Analytical
Processing) cubes for faster multi-dimensional analysis.
5. Data Presentation Layer
o Includes tools and interfaces for data visualization and
reporting.
o Common tools: Power BI, Tableau, or Looker.
6. Metadata Management
o Stores information about data (data definitions, data
lineage, transformations).
o Helps users and administrators understand and manage the
data effectively.
7. Data Governance and Security
o Ensures data quality, compliance, and secure access.
o Includes role-based access control and auditing
mechanisms.

b. Differences Between Types of Data Warehouse Usage
1. Information Processing
o Purpose: Supports querying and reporting of historical data for operational insights.
o Examples: Standard reports, dashboards, and routine queries (e.g., monthly sales reports).
o Tools: SQL queries, BI tools like Power BI, Tableau.
o Characteristics:
* Focus on pre-defined queries.
* Quick and straightforward analysis.
o Usage: Primarily for decision-makers needing routine insights.
2. Analytical Processing (OLAP)
o Purpose: Enables complex analytical queries involving aggregation, slicing, dicing, and drill-down.
o Examples: Multi-dimensional analysis of sales data across regions, time, and products.
o Tools: OLAP tools like Microsoft Analysis Services, Apache Kylin.
o Characteristics:
* Supports multi-dimensional analysis.
* Focus on trends, comparisons, and performance metrics.
o Usage: Used by analysts to derive insights for strategic planning.
3. Data Mining
o Purpose: Discovers hidden patterns, correlations, and anomalies in large datasets.
o Examples: Customer segmentation, fraud detection, and predictive modeling.
o Tools: Machine learning platforms like KNIME, RapidMiner, or Python libraries (Scikit-learn).
o Characteristics:
* Uses algorithms to find patterns.
* Focus on predictive and prescriptive analytics.
o Usage: Often employed by data scientists to forecast trends and uncover insights.
Summary of Differences
| Feature | Information Processing | Analytical Processing (OLAP) | Data Mining |
| Purpose | Query & reporting | Multi-dimensional analysis | Pattern discovery & prediction |
| Tools | SQL, BI tools | OLAP tools | Machine learning platforms |
| Output | Reports | Trends, comparisons | Predictive models |
| Usage | Operational decisions | Strategic planning | Advanced analytics |
