Internal - General Use BITS Pilani
Maintenance Strategies
 a. An emergency encounter would be measured by patient, diagnosis, intervention, attending physician, emergency department bed location, date, time,
and any other element used to define the encounter. Identifying the attributes that define the measure also identifies the dimensions for the star schema.
Each attribute is important and may form the basis of a dimension or be an attribute of a dimension. Design a star schema using the four steps of
dimensional modelling.
b. Discuss solutions to the limitations of a Kimball data warehouse.
 a) Designing a Star Schema for an Emergency Encounter
To design the star schema, follow the four steps of dimensional modeling:
1. Identify the Business Process: The process is Emergency Encounters in the Emergency Department (ED). The grain is one fact row per emergency encounter.
2. Identify the Measures (Facts): The fact table will store the numeric measures for analysis, such as:
 Number of encounters (derived by counting fact rows)
 Duration of stay in the ED
 Cost of treatment
o Fact Table Name: EmergencyEncounterFact
3. Identify the Dimensions: Dimensions describe the context of the fact measures. Based on the attributes provided:
 Patient Dimension: Contains patient details (e.g., patient ID, age, gender).
 Diagnosis Dimension: Stores diagnosis-related information (e.g., diagnosis code, description).
 Intervention Dimension: Captures intervention data (e.g., procedure code, intervention type).
 Time Dimension: Tracks the date and time of the encounter.
 Physician Dimension: Stores details of the attending physician (e.g., physician ID, specialty).
 Location Dimension: Describes the ED bed location (e.g., bed number, ward).
4. Design the Star Schema
The schema has a central fact table linked to each dimension table through foreign keys:
Fact Table: EmergencyEncounterFact
| Encounter ID (PK) | Patient ID (FK) | Diagnosis ID (FK) | Intervention ID (FK) | Physician ID (FK) | Time ID (FK) | Location ID (FK) | Duration | Cost |
Dimension Tables:
o Patient Dimension: | Patient ID (PK) | Name | Age | Gender |
o Diagnosis Dimension: | Diagnosis ID (PK) | Diagnosis Code | Description |
o Intervention Dimension: | Intervention ID (PK) | Procedure Code | Type |
o Physician Dimension: | Physician ID (PK) | Name | Specialty |
o Time Dimension: | Time ID (PK) | Date | Time | Day | Month | Year |
o Location Dimension: | Location ID (PK) | Bed Number | Ward |
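The schema above can be sketched as runnable DDL. This is a minimal illustration using Python's built-in sqlite3 module, not a production design: the column spellings (for example CalendarDate, TimeOfDay, InterventionType), the data types, and the choice of minutes for Duration are assumptions made for the example.

```python
import sqlite3

# Assumed column names and types; the slides only list the attributes.
DDL = """
CREATE TABLE PatientDimension (PatientID INTEGER PRIMARY KEY, Name TEXT, Age INTEGER, Gender TEXT);
CREATE TABLE DiagnosisDimension (DiagnosisID INTEGER PRIMARY KEY, DiagnosisCode TEXT, Description TEXT);
CREATE TABLE InterventionDimension (InterventionID INTEGER PRIMARY KEY, ProcedureCode TEXT, InterventionType TEXT);
CREATE TABLE PhysicianDimension (PhysicianID INTEGER PRIMARY KEY, Name TEXT, Specialty TEXT);
CREATE TABLE TimeDimension (TimeID INTEGER PRIMARY KEY, CalendarDate TEXT, TimeOfDay TEXT,
                            Day INTEGER, Month INTEGER, Year INTEGER);
CREATE TABLE LocationDimension (LocationID INTEGER PRIMARY KEY, BedNumber TEXT, Ward TEXT);

-- Central fact table: one row per emergency encounter.
CREATE TABLE EmergencyEncounterFact (
    EncounterID    INTEGER PRIMARY KEY,
    PatientID      INTEGER NOT NULL REFERENCES PatientDimension(PatientID),
    DiagnosisID    INTEGER NOT NULL REFERENCES DiagnosisDimension(DiagnosisID),
    InterventionID INTEGER NOT NULL REFERENCES InterventionDimension(InterventionID),
    PhysicianID    INTEGER NOT NULL REFERENCES PhysicianDimension(PhysicianID),
    TimeID         INTEGER NOT NULL REFERENCES TimeDimension(TimeID),
    LocationID     INTEGER NOT NULL REFERENCES LocationDimension(LocationID),
    Duration       REAL,  -- length of stay in minutes (assumed unit)
    Cost           REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript(DDL)

# List the tables that now make up the star.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

With foreign keys enabled, a fact row that references a missing dimension row is rejected, which is the integrity property the star schema relies on.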
b) Solution to the Limitations in a Kimball Data Warehouse
Kimball’s Data Warehouse Architecture often faces the following limitations:
1. Complex ETL Process: Solution: Implement an automated ETL pipeline using modern data integration tools (e.g., Apache NiFi, Talend) to
streamline data extraction, transformation, and loading.
2. Redundancy and Storage Costs: Solution: Use data compression and partitioning techniques to reduce storage costs. Additionally,
cloud data warehouses such as Amazon Redshift or Google BigQuery offer managed, elastic storage, which reduces physical storage concerns.
3. Difficulty Handling Semi-structured Data: Solution: Integrate modern tools capable of handling semi-structured data (e.g., JSON, XML)
into the data warehouse architecture, using technologies like Snowflake or Delta Lake, which allow flexible schema integration.
4. Limited Real-Time Processing: Solution: Incorporate real-time data processing frameworks such as Apache Kafka or streaming capabilities
of modern warehouses (e.g., Snowflake’s streaming ingestion) to support real-time analytics.
 Recall the retail sales case study: the Sales fact table measures the quantity, price, and total sales amount for a retail company. If a
business reported product sales and product returns using two different product tables, it would be impossible to associate the resulting information
between sales and product returns.
i. Draw the product returns star schema with all measures and describe the business process.
ii. Identify the approach of conformed dimensions between retail sales and product returns.
iii. Write an SQL query for: Returns by Product and Month
 i) Product Returns Star Schema
Business Process
The business process involves tracking Product Returns, which captures returned products' details such as quantity, return reason, and refund amount. It
allows the business to analyze return patterns and improve product quality or customer satisfaction.
Fact Table: ProductReturnFact
| Return ID (PK) | Product ID (FK) | Customer ID (FK) | Time ID (FK) | Store ID (FK) | Quantity Returned | Refund Amount | Return Reason |
Dimension Tables:
 Product Dimension: Describes the returned product. | Product ID (PK) | Product Name | Category | Brand |
 Customer Dimension: Describes the customer who returned the product. | Customer ID (PK) | Name | Email | Location |
 Time Dimension: Captures the return date and time. | Time ID (PK) | Date | Month | Year |
 Store Dimension: Details of the store where the return was made. | Store ID (PK) | Store Name | Location |
ii) Approach of Conformed Dimensions
Conformed dimensions are shared dimensions that ensure consistency across multiple fact tables (e.g., SalesFact and ProductReturnFact).
Approach:
Shared Dimensions: Both SalesFact and ProductReturnFact use the same Product Dimension, Customer Dimension, Time Dimension, and Store
Dimension.
Benefits: * Ensures consistency across reports.
* Allows combining sales and return data for a holistic view of performance.
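A minimal sketch of how a conformed Product Dimension lets sales and returns be combined in one report. It uses Python's built-in sqlite3 module; the sample rows, product names, and the Quantity column on SalesFact are hypothetical. Each fact table is aggregated separately and then joined through the shared dimension, so rows from one fact table do not multiply rows from the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProductDimension (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE SalesFact (ProductID INTEGER, Quantity INTEGER);
CREATE TABLE ProductReturnFact (ProductID INTEGER, QuantityReturned INTEGER);

-- Hypothetical sample data
INSERT INTO ProductDimension VALUES (1, 'Kettle'), (2, 'Toaster');
INSERT INTO SalesFact VALUES (1, 10), (1, 5), (2, 8);
INSERT INTO ProductReturnFact VALUES (1, 3);
""")

# Aggregate each fact table on its own, then join both results
# to the single shared (conformed) product dimension.
DRILL_ACROSS = """
WITH sales_by_product AS (
    SELECT ProductID, SUM(Quantity) AS Sold
    FROM SalesFact GROUP BY ProductID
),
returns_by_product AS (
    SELECT ProductID, SUM(QuantityReturned) AS Returned
    FROM ProductReturnFact GROUP BY ProductID
)
SELECT p.ProductName,
       COALESCE(s.Sold, 0)     AS Sold,
       COALESCE(r.Returned, 0) AS Returned
FROM ProductDimension p
LEFT JOIN sales_by_product   s ON s.ProductID = p.ProductID
LEFT JOIN returns_by_product r ON r.ProductID = p.ProductID
ORDER BY p.ProductName
"""

rows = conn.execute(DRILL_ACROSS).fetchall()
for name, sold, returned in rows:
    print(name, sold, returned)
```

If sales and returns each had their own product table with different keys, the two joins on ProductID would have no common column to match on.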
 iii) SQL Query: Returns by Product and Month
SELECT
    P.ProductName,
    T.Year,
    T.Month,
    SUM(F.QuantityReturned) AS TotalReturns,
    SUM(F.RefundAmount) AS TotalRefunds
FROM
    ProductReturnFact F
JOIN
    ProductDimension P ON F.ProductID = P.ProductID
JOIN
    TimeDimension T ON F.TimeID = T.TimeID
-- Group by year as well, so the same month in different years is not merged
GROUP BY
    P.ProductName, T.Year, T.Month
ORDER BY
    T.Year, T.Month, P.ProductName;
 Discuss the following:
I. Periodic Snapshot Fact Tables
II. Consolidated Fact Tables
III. Role-Playing Dimensions
IV. Slowly Changing Dimension Techniques up to SCD-4
 Discussion on Data Warehousing Concepts
I. Periodic Snapshot Fact Tables
 Definition: Capture data at regular time intervals (e.g., daily, monthly).
 Use Case: Track inventory levels at the end of each day.
 Advantages: Provides historical trends and performance over time.
 Limitation: Larger storage requirements due to frequent snapshots.
II. Consolidated Fact Tables
 Definition: Combines facts from multiple business processes, expressed at the same grain, into one fact table to provide a unified view.
 Use Case: Merging sales and returns facts for overall revenue insights.
 Advantages: Simplifies querying and reporting.
 Limitation: Can increase complexity and size of the fact table.
III. Role-Playing Dimensions
 Definition: A single dimension table that plays different roles in different contexts.
 Example: A Time Dimension can be used as Order Date, Ship Date, and Return Date in
the same schema.
 Advantages: Reduces redundancy by reusing the same dimension.
 Challenge: Clear documentation is required to avoid confusion.
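A role-playing dimension is usually queried by joining the same physical table once per role under a different alias. The sketch below uses Python's built-in sqlite3 module; the OrderFact table, its column names, and the sample dates are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TimeDimension (TimeID INTEGER PRIMARY KEY, CalendarDate TEXT);

-- The fact table holds one foreign key per role played by the time dimension.
CREATE TABLE OrderFact (
    OrderID     INTEGER PRIMARY KEY,
    OrderDateID INTEGER REFERENCES TimeDimension(TimeID),
    ShipDateID  INTEGER REFERENCES TimeDimension(TimeID)
);

-- Hypothetical sample data
INSERT INTO TimeDimension VALUES (1, '2024-01-05'), (2, '2024-01-08');
INSERT INTO OrderFact VALUES (100, 1, 2);
""")

# One physical dimension, two aliases: od plays "order date", sd plays "ship date".
rows = conn.execute("""
SELECT f.OrderID,
       od.CalendarDate AS OrderDate,
       sd.CalendarDate AS ShipDate
FROM OrderFact f
JOIN TimeDimension od ON f.OrderDateID = od.TimeID
JOIN TimeDimension sd ON f.ShipDateID  = sd.TimeID
""").fetchall()
print(rows)
```

In practice each role is often exposed as a named view over the dimension, which also addresses the documentation challenge noted above.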
IV. Slowly Changing Dimension Techniques (SCD-1 to SCD-4)
SCD-1 (Overwrite): Updates dimension data, overwriting old values.
o Simple but loses historical data.
SCD-2 (Versioning): Keeps historical data by adding new rows with version or date stamps.
o Preserves history.
SCD-3 (Partial History): Maintains limited history using additional columns.
o Balances history tracking and table size.
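SCD-4 (History Table): Keeps current values in the main dimension table and stores historical
values in a separate history table. (In Kimball's own numbering, Type 4 instead moves rapidly changing attributes into a mini-dimension.)
o Useful for detailed historical analysis without slowing queries against current data.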
These techniques ensure the accurate tracking of changes in dimension data.
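As an illustration of SCD-2, the sketch below expires the current row and inserts a new version whenever a tracked attribute changes. It uses Python's built-in sqlite3 module; the CustomerDimension layout (CustomerKey, ValidFrom, ValidTo, IsCurrent), the helper name apply_scd2, and the sample values are assumptions for the example, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE CustomerDimension (
    CustomerKey INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, one per version
    CustomerID  INTEGER,                            -- natural (business) key
    City        TEXT,                               -- tracked attribute
    ValidFrom   TEXT,
    ValidTo     TEXT,                               -- NULL while the row is current
    IsCurrent   INTEGER
)""")

def apply_scd2(conn, customer_id, city, change_date):
    """Hypothetical helper: record a new version only if City changed."""
    current = conn.execute(
        "SELECT CustomerKey, City FROM CustomerDimension "
        "WHERE CustomerID = ? AND IsCurrent = 1", (customer_id,)).fetchone()
    if current and current[1] == city:
        return  # no change, nothing to version
    if current:
        # Expire the old version instead of overwriting it (that would be SCD-1).
        conn.execute(
            "UPDATE CustomerDimension SET ValidTo = ?, IsCurrent = 0 "
            "WHERE CustomerKey = ?", (change_date, current[0]))
    conn.execute(
        "INSERT INTO CustomerDimension (CustomerID, City, ValidFrom, ValidTo, IsCurrent) "
        "VALUES (?, ?, ?, NULL, 1)", (customer_id, city, change_date))

apply_scd2(conn, 42, "Pune", "2023-01-01")
apply_scd2(conn, 42, "Delhi", "2024-06-01")

history = conn.execute(
    "SELECT City, ValidFrom, ValidTo, IsCurrent FROM CustomerDimension "
    "WHERE CustomerID = 42 ORDER BY CustomerKey").fetchall()
print(history)
```

Fact rows would reference CustomerKey rather than CustomerID, so each fact stays attached to the version of the customer that was valid when it occurred.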
 To improve response time, a data warehouse administrator usually uses indexing techniques. An index should be able to operate with other indexes to
filter out records before the original data is accessed. Indexing has many advantages, such as:
* Faster key-based access to table data, * Reduced storage requirements and * Efficient retrieval
Assume you are a Data Warehouse administrator. Discuss the indexing techniques listed below one by one, with advantages and disadvantages, and give
a conclusion for each indexing technique.
* Bit-Mapped indexing techniques, * Cluster indexing techniques, * Hash-based index and * B-Tree index
 I. Bit-Mapped Indexing Techniques
Description:
 Uses bitmaps (0s and 1s) to represent the presence or absence of values in a column.
 Particularly useful for columns with low cardinality (few unique values).
Advantages:
Efficient for Low Cardinality: Excellent for columns with a small number of distinct values (e.g., gender, status).
Fast Query Performance: Performs well in complex queries involving AND, OR, and NOT operations.
Space Efficient: Consumes less storage compared to other indexing methods for low-cardinality data.
Disadvantages:
Not Suitable for High Cardinality: Index size grows and performance decreases as the number of unique values increases.
Slow for Updates: Changes to data are expensive to apply, so bitmap indexes are often dropped and rebuilt around bulk loads.
Conclusion:
Bit-mapped indexing is highly effective for analytical queries on low-cardinality columns but is less suitable for transactional data or frequently updated
tables.
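The idea can be sketched in a few lines of Python, using one integer per distinct value as the bitmap (bit i is set when row i holds that value). This is a simplified illustration, not how any particular database stores its bitmaps; the sample columns and the helper names build_bitmap_index and matching_rows are hypothetical.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def matching_rows(bitmap):
    """Return the row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Two low-cardinality columns over the same five rows (hypothetical data).
gender = ["F", "M", "F", "F", "M"]
status = ["active", "active", "closed", "active", "closed"]

gender_idx = build_bitmap_index(gender)
status_idx = build_bitmap_index(status)

# WHERE gender = 'F' AND status = 'active' becomes a single bitwise AND.
female_and_active = gender_idx["F"] & status_idx["active"]
# WHERE gender = 'F' OR status = 'closed' becomes a bitwise OR.
female_or_closed = gender_idx["F"] | status_idx["closed"]

print(matching_rows(female_and_active))
print(matching_rows(female_or_closed))
```

The predicates are combined on the bitmaps alone; the table rows are touched only for the row numbers that survive, which is why bitmap indexes work well together on multi-condition queries.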
II. Cluster Indexing Techniques
Description:
 Data is stored in the table according to the order of one or more
columns (cluster key).
 The index points directly to data in the order it is clustered.
Advantages:
Improved Range Query Performance: Faster retrieval of rows
within a range of values.
Reduced I/O: Data physically stored in order reduces the number of
disk accesses.
Efficient Data Access: Ideal for queries that involve sorting or
grouping.
Disadvantages:
Expensive Maintenance: Clustered indexes are costly to maintain
during insert, update, or delete operations.
Limited to One per Table: Only one clustered index can be created
per table.
Conclusion:
Cluster indexing is excellent for range queries and ordered data but
can be resource-intensive for write-heavy workloads.
III. Hash-Based Index
Description:
 Uses a hash function to map keys to a fixed location in the index.
 Suitable for equality searches (e.g., finding a specific key).
Advantages:
Fast Equality Searches: Provides average-case constant-time retrieval of
exact matches.
Efficient in Disk Access: Minimal I/O operations as data is accessed directly.
Disadvantages:
Inefficient for Range Queries: Not suitable for queries requiring sorted data
or ranges.
Collision Handling Overhead: Hash collisions require additional handling
mechanisms (e.g., chaining or open addressing).
Not Optimal for Data Warehousing: Since data warehousing often involves
range or aggregation queries, hash indexing is less effective.
Conclusion:
Hash-based indexing is effective for exact lookups but is not ideal for
analytical queries in data warehouses that involve range scans or sorting.
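The trade-off can be illustrated in Python, where a dict stands in for a hash index and a sorted key list searched with bisect stands in for an ordered index. This is a sketch under those assumptions, with hypothetical sample rows; real database indexes live on disk pages rather than in memory.

```python
import bisect

# Table rows stored in arrival order: (key, payload). Hypothetical data.
rows = [(105, "E"), (101, "A"), (103, "C"), (102, "B"), (104, "D")]

# Hash index: key -> row position. Equality lookups take one probe on average.
hash_index = {key: pos for pos, (key, _) in enumerate(rows)}

def hash_lookup(key):
    pos = hash_index.get(key)
    return rows[pos] if pos is not None else None

def hash_range_scan(low, high):
    # Hashing destroys key order, so a range query must inspect every key.
    return sorted(rows[pos] for key, pos in hash_index.items() if low <= key <= high)

# Ordered index: sorted keys allow a binary search to the start of the range.
sorted_keys = sorted(hash_index)

def ordered_range_lookup(low, high):
    lo = bisect.bisect_left(sorted_keys, low)
    hi = bisect.bisect_right(sorted_keys, high)
    return [rows[hash_index[k]] for k in sorted_keys[lo:hi]]

print(hash_lookup(103))
print(hash_range_scan(102, 104))
print(ordered_range_lookup(102, 104))
```

Both range functions return the same rows, but the hash version examines every key while the ordered version reads only the slice between the two search positions.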
IV. B-Tree Index
Description:
 A balanced tree structure where all leaf nodes are at the same depth.
 Allows efficient searching, insertion, and deletion.
Advantages:
Balanced Performance: Works well for both equality and range queries.
Efficient Updates: B-trees stay balanced automatically, and most inserts
and deletes touch only a single leaf node.
Multi-Level Indexing: Suitable for large datasets as it reduces disk I/O.
Disadvantages:
Higher Storage Overhead: Stores every key value plus row pointers, so it
needs more space than a bitmap index on a low-cardinality column.
Slower for High Update Rates: Frequent inserts and deletes trigger node
splits and merges, which can slow performance.
Conclusion:
B-Tree indexing is versatile and effective for most data warehouse
workloads, making it the go-to choice for both transactional and
analytical queries on high-cardinality columns.
Summary Conclusion
Each indexing technique has its strengths and weaknesses, and their
applicability depends on the workload:
 Bit-Mapped Indexing: Best for low-cardinality data and read-heavy
workloads.
 Cluster Indexing: Ideal for range queries and ordered data.
 Hash-Based Indexing: Suited for equality searches but less useful for
analytical workloads.
 B-Tree Indexing: A balanced and flexible option for various types of
queries.
In a data warehouse environment, B-Tree and Bit-Mapped indexes are often
the most effective, depending on the query patterns and cardinality of the
data.
 a. Describe the components of Data Warehouse Architecture.
b. Discuss the differences between the three main types of data warehouse usage: Information processing, analytical processing, and data mining.
 a. Components of Data Warehouse Architecture
A data warehouse architecture typically consists of the following key
components:
1. Data Sources
o Data originates in operational databases, transactional systems,
and external sources.
o Sources may include relational databases, flat files, APIs, or
third-party data streams.
2. ETL (Extract, Transform, Load) Layer
o Extract: Gathers data from various sources.
o Transform: Cleanses, transforms, and integrates data into a unified
format.
o Load: Loads the processed data into the data warehouse.
o Tools like Talend, Informatica, or Apache NiFi are commonly used
for this process.
3. Data Storage Layer
o Data Warehouse: Centralized repository for storing historical data
in an optimized schema (often star or snowflake schema).
o Includes detailed, summary, and metadata storage.
o Some implementations use distributed or cloud-based storage
solutions like Amazon Redshift or Snowflake.
4. Data Access Layer
o Provides users and applications access to the stored data
for reporting, querying, and analysis.
o This layer may include OLAP (Online Analytical
Processing) cubes for faster multi-dimensional analysis.
5. Data Presentation Layer
o Includes tools and interfaces for data visualization and
reporting.
o Common tools: Power BI, Tableau, or Looker.
6. Metadata Management
o Stores information about data (data definitions, data
lineage, transformations).
o Helps users and administrators understand and manage the
data effectively.
7. Data Governance and Security
o Ensures data quality, compliance, and secure access.
o Includes role-based access control and auditing
mechanisms.
 b. Differences Between Types of Data Warehouse Usage
1. Information Processing
o Purpose: Supports querying and reporting of historical data for operational insights.
o Examples: Standard reports, dashboards, and routine queries (e.g., monthly sales reports).
o Tools: SQL queries, BI tools like Power BI, Tableau.
o Characteristics:
 Focus on pre-defined queries.
 Quick and straightforward analysis.
o Usage: Primarily for decision-makers needing routine insights.
2. Analytical Processing (OLAP)
o Purpose: Enables complex analytical queries involving aggregation, slicing, dicing, and drill-down.
o Examples: Multi-dimensional analysis of sales data across regions, time, and products.
o Tools: OLAP tools like Microsoft Analysis Services, Apache Kylin.
o Characteristics:
 Supports multi-dimensional analysis.
 Focus on trends, comparisons, and performance metrics.
o Usage: Used by analysts to derive insights for strategic planning.
3. Data Mining
o Purpose: Discovers hidden patterns, correlations, and anomalies in large datasets.
o Examples: Customer segmentation, fraud detection, and predictive modeling.
o Tools: Machine learning platforms like KNIME, RapidMiner, or Python libraries (Scikit-learn).
o Characteristics:
 Uses algorithms to find patterns.
 Focus on predictive and prescriptive analytics.
o Usage: Often employed by data scientists to forecast trends and uncover insights.
Summary of Differences
| Feature | Information Processing | Analytical Processing (OLAP) | Data Mining |
| Purpose | Query & reporting | Multi-dimensional analysis | Pattern discovery & prediction |
| Tools | SQL, BI tools | OLAP tools | Machine learning platforms |
| Output | Reports | Trends, comparisons | Predictive models |
| Usage | Operational decisions | Strategic planning | Advanced analytics |

More Related Content

PPTX
PPT SESIÓN 2.pptx
PDF
guia de tramites de las sub alcaldias 2015
DOC
Carta compromiso
DOC
LIQUIDACIÓN DE OBRA
DOCX
INF. INFORME TECNICO PARA SUSTENTO DE CONTRATCION DE PERSONAL SGOSLTt - copia...
PDF
Implantacion De Establecimientos Penitenciarios
DOCX
Informe N° 47 conformidad adicional N° 02.docx
DOCX
Especificaciones tecnicas
PPT SESIÓN 2.pptx
guia de tramites de las sub alcaldias 2015
Carta compromiso
LIQUIDACIÓN DE OBRA
INF. INFORME TECNICO PARA SUSTENTO DE CONTRATCION DE PERSONAL SGOSLTt - copia...
Implantacion De Establecimientos Penitenciarios
Informe N° 47 conformidad adicional N° 02.docx
Especificaciones tecnicas

Similar to DWH DWH DWH DWH DWH DWH DWH DWH- QP.pptx (20)

PPTX
19CS3052R-CO1-7-S7 ECE
PDF
(Lecture 4)Slowly Changing Dimensions.pdf
PPT
My2dw
PPTX
DATA VISUALIZATION USING TABLEAU PROGRAME
PPTX
Dataware house multidimensionalmodelling
PPTX
Data Warehousing for students educationpptx
PPT
Data Warehousing and Data Mining
PDF
Statistics for Business Decision Making and Analysis 3rd Edition Stine Test Bank
PPTX
Lecture 3:Introduction to Dimensional Modelling.pptx
PPT
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
PPT
introduction to datawarehouse
PDF
Multi dimensional modeling
PPTX
Artificial intelligence BCA 6th Sem Unit 3 notes
PDF
Excel Pivot Tables and Graphing for Auditors
PPTX
Rodney Matejek Portfolio
PDF
HANA Performance Efficient Speed and Scale-out for Real-time BI
PDF
Technical Research Document - Anurag
PPT
OLAP Cubes in Datawarehousing
PDF
Exploring Neo4j Graph Database as a Fast Data Access Layer
19CS3052R-CO1-7-S7 ECE
(Lecture 4)Slowly Changing Dimensions.pdf
My2dw
DATA VISUALIZATION USING TABLEAU PROGRAME
Dataware house multidimensionalmodelling
Data Warehousing for students educationpptx
Data Warehousing and Data Mining
Statistics for Business Decision Making and Analysis 3rd Edition Stine Test Bank
Lecture 3:Introduction to Dimensional Modelling.pptx
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
introduction to datawarehouse
Multi dimensional modeling
Artificial intelligence BCA 6th Sem Unit 3 notes
Excel Pivot Tables and Graphing for Auditors
Rodney Matejek Portfolio
HANA Performance Efficient Speed and Scale-out for Real-time BI
Technical Research Document - Anurag
OLAP Cubes in Datawarehousing
Exploring Neo4j Graph Database as a Fast Data Access Layer
Ad

Recently uploaded (20)

PDF
faiz-khans about Radiotherapy Physics-02.pdf
PDF
Health aspects of bilberry: A review on its general benefits
PDF
WHAT NURSES SAY_ COMMUNICATION BEHAVIORS ASSOCIATED WITH THE COMP.pdf
PDF
Review of Related Literature & Studies.pdf
PPTX
operating_systems_presentations_delhi_nc
PPTX
Q2 Week 1.pptx Lesson on Kahalagahan ng Pamilya sa Edukasyon
PPTX
BSCE 2 NIGHT (CHAPTER 2) just cases.pptx
PDF
CHALLENGES FACED BY TEACHERS WHEN TEACHING LEARNERS WITH DEVELOPMENTAL DISABI...
PPTX
PLASMA AND ITS CONSTITUENTS 123.pptx
PDF
0520_Scheme_of_Work_(for_examination_from_2021).pdf
PDF
Chevening Scholarship Application and Interview Preparation Guide
PDF
FYJC - Chemistry textbook - standard 11.
PDF
Laparoscopic Imaging Systems at World Laparoscopy Hospital
PDF
Compact First Student's Book Cambridge Official
PDF
fundamentals-of-heat-and-mass-transfer-6th-edition_incropera.pdf
PPTX
Neurological complocations of systemic disease
PDF
Disorder of Endocrine system (1).pdfyyhyyyy
PPTX
Theoretical for class.pptxgshdhddhdhdhgd
PPTX
Designing Adaptive Learning Paths in Virtual Learning Environments
PPT
hemostasis and its significance, physiology
faiz-khans about Radiotherapy Physics-02.pdf
Health aspects of bilberry: A review on its general benefits
WHAT NURSES SAY_ COMMUNICATION BEHAVIORS ASSOCIATED WITH THE COMP.pdf
Review of Related Literature & Studies.pdf
operating_systems_presentations_delhi_nc
Q2 Week 1.pptx Lesson on Kahalagahan ng Pamilya sa Edukasyon
BSCE 2 NIGHT (CHAPTER 2) just cases.pptx
CHALLENGES FACED BY TEACHERS WHEN TEACHING LEARNERS WITH DEVELOPMENTAL DISABI...
PLASMA AND ITS CONSTITUENTS 123.pptx
0520_Scheme_of_Work_(for_examination_from_2021).pdf
Chevening Scholarship Application and Interview Preparation Guide
FYJC - Chemistry textbook - standard 11.
Laparoscopic Imaging Systems at World Laparoscopy Hospital
Compact First Student's Book Cambridge Official
fundamentals-of-heat-and-mass-transfer-6th-edition_incropera.pdf
Neurological complocations of systemic disease
Disorder of Endocrine system (1).pdfyyhyyyy
Theoretical for class.pptxgshdhddhdhdhgd
Designing Adaptive Learning Paths in Virtual Learning Environments
hemostasis and its significance, physiology
Ad

DWH DWH DWH DWH DWH DWH DWH DWH- QP.pptx

  • 1. Internal - General Use BITS Pilani Maintenance Strategies  a. An emergency encounter would be measured by patient, diagnosis, intervention, attending physician, emergency department bed location, date, time, and any other element used to define the encounter. Identifying the attributes that define the measure also identifies the dimensions for the star schema. Each attribute is important and may form the basis of a dimension or be an attribute of a dimension. Design a star schema with four dimension modelling steps. b. Discuss a Solution to the Limitations in a Kimball data warehouse.  a) Designing a Star Schema for an Emergency Encounter To design a star schema, follow the four-dimensional modeling steps: 1. Identify the Business Process: The process is Emergency Encounters in the Emergency Department (ED). 2. Identify the Measures (Facts): The fact table will store the numeric measures for analysis, such as:  Number of encounters  Duration of stay in the ED  Cost of treatment o Fact Table Name: EmergencyEncounterFact 3. Identify the Dimensions: Dimensions describe the context of the fact measures. Based on the attributes provided:  Patient Dimension: Contains patient details (e.g., patient ID, age, gender).  Diagnosis Dimension: Stores diagnosis-related information (e.g., diagnosis code, description).  Intervention Dimension: Captures intervention data (e.g., procedure code, intervention type).  Time Dimension: Tracks the date and time of the encounter.  Physician Dimension: Stores details of the attending physician (e.g., physician ID, specialty).  Location Dimension: Describes the ED bed location (e.g., bed number, ward).
  • 2. Internal - General Use BITS Pilani 2 4. Design the Star Schema The schema has a central fact table linked to each dimension table through foreign keys: Fact Table: EmergencyEncounterFact | Encounter ID (PK) | Patient ID (FK) | Diagnosis ID (FK) | Intervention ID (FK) | Physician ID (FK) | Time ID (FK) | Location ID (FK) | Duration | Cost | Dimension Tables: o Patient Dimension: | Patient ID (PK) | Name | Age | Gender | o Diagnosis Dimension: | Diagnosis ID (PK) | Diagnosis Code | Description | o Intervention Dimension: | Intervention ID (PK) | Procedure Code | Type | o Physician Dimension: | Physician ID (PK) | Name | Specialty | o Time Dimension: | Time ID (PK) | Date | Time | Day | Month | Year | o Location Dimension: | Location ID (PK) | Bed Number | Ward | b) Solution to the Limitations in a Kimball Data Warehouse Kimball’s Data Warehouse Architecture often faces the following limitations: 1. Complex ETL Process: Solution: Implement an automated ETL pipeline using modern data integration tools (e.g., Apache NiFi, Talend) to streamline data extraction, transformation, and loading. 2. Redundancy and Storage Costs: Solution: Use data compression and partitioning techniques to reduce storage costs. Additionally, leveraging cloud-based storage solutions like Amazon Redshift or Google BigQuery can minimize physical storage concerns. 3. Difficulty Handling Semi-structured Data: Solution: Integrate modern tools capable of handling semi-structured data (e.g., JSON, XML) into the data warehouse architecture, using technologies like Snowflake or Delta Lake, which allow flexible schema integration. 4. Limited Real-Time Processing: Solution: Incorporate real-time data processing frameworks such as Apache Kafka or streaming capabilities of modern warehouses (e.g., Snowflake’s streaming ingestion) to support real-time analytics. Maintenance Strategies
  • 3. Internal - General Use BITS Pilani 3 Maintenance Strategies  As all of you know the case study of retail sales, The Sales fact table measures the quantity, price, and total sales amount for a retail company. If a business reported product sales and product returns using two different product tables it would be impossible to associate the resulting information between sales and product returns. i. Draw the product returns star schema with all measures and describe the business process. ii. Identify the approach of conformed dimensions between retail sales and product return. iii. Write an SQL Query for : Returns by product and Month  i) Product Returns Star Schema Business Process The business process involves tracking Product Returns, which captures returned products' details such as quantity, return reason, and refund amount. It allows the business to analyze return patterns and improve product quality or customer satisfaction. Fact Table: ProductReturnFact | Return ID (PK) | Product ID (FK) | Customer ID (FK) | Time ID (FK) | Store ID (FK) | Quantity Returned | Refund Amount | Return Reason | Dimension Tables:  Product Dimension: Describes the returned product. | Product ID (PK) | Product Name | Category | Brand |  Customer Dimension: Describes the customer who returned the product. | Customer ID (PK) | Name | Email | Location |  Time Dimension: Captures the return date and time. | Time ID (PK) | Date | Month | Year |  Store Dimension: Details of the store where the return was made. | Store ID (PK) | Store Name | Location | ii) Approach of Conformed Dimensions Conformed dimensions are shared dimensions that ensure consistency across multiple fact tables (e.g., SalesFact and ProductReturnFact). Approach: Shared Dimensions: Both SalesFact and ProductReturnFact use the same Product Dimension, Customer Dimension, Time Dimension, and Store Dimension. Benefits: * Ensures consistency across reports. 
* Allows combining sales and return data for a holistic view of performance.
  • 4. Internal - General Use BITS Pilani 4 Maintenance Strategies  iii) SQL Query: Returns by Product and Month SELECT P.ProductName, T.Month, SUM(F.QuantityReturned) AS TotalReturns, SUM(F.RefundAmount) AS TotalRefunds FROM ProductReturnFact F JOIN ProductDimension P ON F.ProductID = P.ProductID JOIN TimeDimension T ON F.TimeID = T.TimeID GROUP BY P.ProductName, T.Month ORDER BY T.Month, P.ProductName;  Discuss the following: I. Periodic Snapshot Fact Tables II. Consolidated Fact Tables III. Role-Playing Dimensions Slowly Changing Dimension Techniques up to SCD-4  Discussion on Data Warehousing Concepts I. Periodic Snapshot Fact Tables  Definition: Capture data at regular time intervals (e.g., daily, monthly).  Use Case: Track inventory levels at the end of each day.  Advantages: Provides historical trends and performance over time.  Limitation: Larger storage requirements due to frequent snapshots. II. Consolidated Fact Tables  Definition: Combines multiple fact tables into one to provide a unified view.  Use Case: Merging sales and returns facts for overall revenue insights.  Advantages: Simplifies querying and reporting.  Limitation: Can increase complexity and size of the fact table. III. Role-Playing Dimensions  Definition: A single dimension table that plays different roles in different contexts.  Example: A Time Dimension can be used as Order Date, Ship Date, and Return Date in the same schema.  Advantages: Reduces redundancy by reusing the same dimension.  Challenge: Clear documentation is required to avoid confusion. IV. Slowly Changing Dimension Techniques (SCD-1 to SCD-4) SCD-1 (Overwrite): Updates dimension data, overwriting old values. o Simple but loses historical data. SCD-2 (Versioning): Keeps historical data by adding new rows with version or date stamps. o Preserves history. SCD-3 (Partial History): Maintains limited history using additional columns. o Balances history tracking and table size. 
SCD-4 (Hybrid): Combines SCD-1 and SCD-2, storing current data in one table and historical data in a separate table. o Useful for detailed historical analysis without affecting operational data. These techniques ensure the accurate tracking of changes in dimension data.
  • 5. Internal - General Use BITS Pilani 5 Maintenance Strategies  To improve response time, data warehouse administrator casually uses Indexing techniques. The index should able to operate with other indexes to filtering out the records before accessing original data. There are many advantages of indexing such as: * Faster key-based access to table data, *Reduced storage requirements and *Efficient retrieval Consider you are a Data Warehouse administrator, discuss below specified indexing techniques one by one with advantages and dis-advantages and give conclusion for each indexing technique. * Bit- Mapped indexing techniques, *Cluster indexing Techniques and*Hash-based index B-Tree Index  I. Bit-Mapped Indexing Techniques Description:  Uses bitmaps (0s and 1s) to represent the presence or absence of values in a column.  Particularly useful for columns with low cardinality (few unique values). Advantages: Efficient for Low Cardinality: Excellent for columns with a small number of distinct values (e.g., gender, status). Fast Query Performance: Performs well in complex queries involving AND, OR, and NOT operations. Space Efficient: Consumes less storage compared to other indexing methods for low-cardinality data. Disadvantages: Not Suitable for High Cardinality: Performance decreases with increasing unique values. Slow for Updates: Changes to data require rebuilding the bitmap index. Conclusion: Bit-mapped indexing is highly effective for analytical queries on low-cardinality columns but is less suitable for transactional data or frequently updated tables.
  • 6. Internal - General Use BITS Pilani 6 Maintenance Strategies II. Cluster Indexing Techniques Description:  Data is stored in the table according to the order of one or more columns (cluster key).  The index points directly to data in the order it is clustered. Advantages: Improved Range Query Performance: Faster retrieval of rows within a range of values. Reduced I/O: Data physically stored in order reduces the number of disk accesses. Efficient Data Access: Ideal for queries that involve sorting or grouping. Disadvantages: Expensive Maintenance: Clustered indexes are costly to maintain during insert, update, or delete operations. Limited to One per Table: Only one clustered index can be created per table. Conclusion: Cluster indexing is excellent for range queries and ordered data but can be resource-intensive for write-heavy workloads. III. Hash-Based Index Description:  Uses a hash function to map keys to a fixed location in the index.  Suitable for equality searches (e.g., finding a specific key). Advantages: Fast Equality Searches: Provides constant-time complexity for retrieving exact matches. Efficient in Disk Access: Minimal I/O operations as data is accessed directly. Disadvantages: Inefficient for Range Queries: Not suitable for queries requiring sorted data or ranges. Collision Handling Overhead: Hash collisions require additional handling mechanisms (e.g., chaining or open addressing). Not Optimal for Data Warehousing: Since data warehousing often involves range or aggregation queries, hash indexing is less effective. Conclusion: Hash-based indexing is effective for exact lookups but is not ideal for analytical queries in data warehouses that involve range scans or sorting.
  • 7. IV. B-Tree Index
    Description:
     • A balanced tree structure in which all leaf nodes are at the same depth.
     • Allows efficient searching, insertion, and deletion.
    Advantages:
     • Balanced Performance: Works well for both equality and range queries.
     • Efficient Updates: The tree adjusts dynamically, and rebalancing is usually confined to a few nodes.
     • Multi-Level Indexing: Suitable for large datasets, as the shallow, wide tree reduces disk I/O.
    Disadvantages:
     • Higher Storage Overhead: Requires more storage than simpler indexing methods.
     • Slower for High Update Rates: Very frequent updates can slow performance because of node splits and rebalancing.
    Conclusion: B-Tree indexing is versatile and effective for most data warehouse workloads, making it the go-to choice for both transactional and analytical queries.

    Summary Conclusion
    Each indexing technique has its strengths and weaknesses, and its applicability depends on the workload:
     • Bit-Mapped Indexing: Best for low-cardinality data and read-heavy workloads.
     • Cluster Indexing: Ideal for range queries and ordered data.
     • Hash-Based Indexing: Suited for equality searches but less useful for analytical workloads.
     • B-Tree Indexing: A balanced and flexible option for various types of queries.
    In a data warehouse environment, B-Tree and Bit-Mapped indexes are often the most effective, depending on the query patterns and the cardinality of the data.

    a. Describe the components of Data Warehouse Architecture.
    b. Discuss the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining.
  • 8. a. Components of Data Warehouse Architecture
    A data warehouse architecture typically consists of the following key components:
    1. Data Sources
       o Supply data from operational databases, external sources, and transactional systems.
       o Sources may include relational databases, flat files, APIs, or third-party data streams.
    2. ETL (Extract, Transform, Load) Layer
       o Extract: Gathers data from the various sources.
       o Transform: Cleanses, transforms, and integrates the data into a unified format.
       o Load: Loads the processed data into the data warehouse.
       o Tools like Talend, Informatica, or Apache NiFi are commonly used for this process.
    3. Data Storage Layer
       o Data Warehouse: Centralized repository for storing historical data in an optimized schema (often a star or snowflake schema).
       o Includes detailed, summary, and metadata storage.
       o Some implementations use distributed or cloud-based storage solutions like Amazon Redshift or Snowflake.
    4. Data Access Layer
       o Provides users and applications with access to the stored data for reporting, querying, and analysis.
       o This layer may include OLAP (Online Analytical Processing) cubes for faster multi-dimensional analysis.
    5. Data Presentation Layer
       o Includes tools and interfaces for data visualization and reporting.
       o Common tools: Power BI, Tableau, or Looker.
    6. Metadata Management
       o Stores information about the data (data definitions, data lineage, transformations).
       o Helps users and administrators understand and manage the data effectively.
    7. Data Governance and Security
       o Ensures data quality, compliance, and secure access.
       o Includes role-based access control and auditing mechanisms.
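The Extract, Transform, and Load steps of the ETL layer listed above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: an in-memory SQLite database stands in for the warehouse, an inline CSV string stands in for a source system, and the table name `encounter_fact`, its columns, and the cleansing rules are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from an operational system.
SOURCE_CSV = """encounter_id,patient_id,cost
1,P01, 120.50
2,P02,
3,p03,75
"""

def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(raw_rows):
    """Transform: trim whitespace, normalise IDs to upper case, drop rows with no cost."""
    clean = []
    for row in raw_rows:
        cost = row["cost"].strip()
        if not cost:
            continue  # assumed rule: rows without a cost are rejected
        clean.append((int(row["encounter_id"]),
                      row["patient_id"].strip().upper(),
                      float(cost)))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS encounter_fact "
        "(encounter_id INTEGER PRIMARY KEY, patient_id TEXT, cost REAL)"
    )
    conn.executemany("INSERT INTO encounter_fact VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT encounter_id, patient_id, cost FROM encounter_fact ORDER BY encounter_id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the layering in the architecture: the source format can change without touching the load step, and the cleansing rules live in one place.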
  • 9. b. Differences Between Types of Data Warehouse Usage
    1. Information Processing
       o Purpose: Supports querying and reporting of historical data for operational insights.
       o Examples: Standard reports, dashboards, and routine queries (e.g., monthly sales reports).
       o Tools: SQL queries, BI tools like Power BI, Tableau.
       o Characteristics:
          • Focus on pre-defined queries.
          • Quick and straightforward analysis.
       o Usage: Primarily for decision-makers needing routine insights.
    2. Analytical Processing (OLAP)
       o Purpose: Enables complex analytical queries involving aggregation, slicing, dicing, and drill-down.
       o Examples: Multi-dimensional analysis of sales data across regions, time, and products.
       o Tools: OLAP tools like Microsoft Analysis Services, Apache Kylin.
       o Characteristics:
          • Supports multi-dimensional analysis.
          • Focus on trends, comparisons, and performance metrics.
       o Usage: Used by analysts to derive insights for strategic planning.
    3. Data Mining
       o Purpose: Discovers hidden patterns, correlations, and anomalies in large datasets.
       o Examples: Customer segmentation, fraud detection, and predictive modeling.
       o Tools: Machine learning platforms like KNIME, RapidMiner, or Python libraries (Scikit-learn).
       o Characteristics:
          • Uses algorithms to find patterns.
          • Focus on predictive and prescriptive analytics.
       o Usage: Often employed by data scientists to forecast trends and uncover insights.

    Summary of Differences
    | Feature | Information Processing | Analytical Processing (OLAP) | Data Mining |
    | Purpose | Query & reporting | Multi-dimensional analysis | Pattern discovery & prediction |
    | Tools | SQL, BI tools | OLAP tools | Machine learning platforms |
    | Output | Reports | Trends, comparisons | Predictive models |
    | Usage | Operational decisions | Strategic planning | Advanced analytics |
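The difference between the first two usage types can be made concrete with a small Python sketch. It assumes an in-memory SQLite table named `sales` with made-up rows; the table, its columns, and the figures are hypothetical. Data mining is not shown, since it relies on learning algorithms rather than queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("North", "Q1", "Widget", 100.0),
    ("North", "Q2", "Widget", 150.0),
    ("South", "Q1", "Widget", 80.0),
    ("South", "Q1", "Gadget", 120.0),
    ("South", "Q2", "Gadget", 60.0),
])

# Information processing: a pre-defined report (total sales per region).
report = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Analytical processing, slice: fix quarter = 'Q1', then aggregate by region and product.
slice_q1 = conn.execute(
    "SELECT region, product, SUM(amount) FROM sales "
    "WHERE quarter = 'Q1' GROUP BY region, product ORDER BY region, product"
).fetchall()

# Analytical processing, drill-down: break the South total down by quarter.
drill_south = conn.execute(
    "SELECT quarter, SUM(amount) FROM sales "
    "WHERE region = 'South' GROUP BY quarter ORDER BY quarter"
).fetchall()

print(report)
print(slice_q1)
print(drill_south)
```

The report query is fixed and answers one routine question, while the slice and drill-down queries navigate the same facts along different dimensions, which is the kind of exploration an OLAP tool automates.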