3._DWH_Architecture__Components.ppt

Agenda
 Data warehouse architecture and building blocks
 ER modeling review
 Need for dimensional modeling
 Dimensional modeling and its internals
 Comparison of ER with dimensional modeling

Components
 Major components
 Source data component
 Data staging component
 Data storage component
 Information delivery component
 Metadata component
 Management and control component

1. Source Data Components
 Source data can be grouped into four categories
 Production data
 Comes from the operational systems of the enterprise
 Only selected segments are extracted
 Narrow scope, e.g. order details
 Internal data
 Private spreadsheets, documents, customer profiles, etc.
 E.g. customer profiles for a specific offering
 Special strategies are needed to bring it into the DW (e.g. text documents)
 Archived data
 Old data is archived
 The DW holds snapshots of historical data
 External data
 Executives depend on external sources
 E.g. market data on competitors; a car rental company needs data on new vehicle production by manufacturers
 Conversions into internal formats must be defined

2. Data Staging Components
 After data is extracted, it has to be prepared
 Data extracted from the sources needs to be changed, converted and made ready in a suitable format
 Three major functions make the data ready
 Extract
 Transform
 Load
 The staging area provides a place and a set of functions to
 Clean
 Change
 Combine
 Convert
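
The extract, transform and load functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the field names and the helper functions `extract`, `transform` and `load` are hypothetical, and in-memory lists stand in for the real source systems and the warehouse.

```python
from datetime import date

# Hypothetical raw rows, as if extracted from an operational order system.
orders_source = [
    {"order_id": "1001", "cust": " alice ", "amount": "19.90", "date": "2014-03-01"},
    {"order_id": "1002", "cust": "BOB", "amount": "5.00", "date": "2014-03-02"},
    {"order_id": "1002", "cust": "BOB", "amount": "5.00", "date": "2014-03-02"},  # duplicate
]
# Hypothetical second source: customer name -> sales region.
customers_source = {"alice": "North", "bob": "South"}


def extract():
    """Extract: pull raw rows from the source (here an in-memory stand-in)."""
    return list(orders_source)


def transform(rows):
    """Transform: clean, change, convert and combine the extracted rows."""
    seen, result = set(), []
    for row in rows:
        if row["order_id"] in seen:  # clean: drop duplicate records
            continue
        seen.add(row["order_id"])
        customer = row["cust"].strip().lower()  # change: standardise names
        result.append({
            "order_id": int(row["order_id"]),               # convert: text -> int
            "customer": customer,
            "amount": float(row["amount"]),                 # convert: text -> float
            "order_date": date.fromisoformat(row["date"]),  # convert: text -> date
            "region": customers_source.get(customer),       # combine: add customer data
        })
    return result


def load(rows, warehouse):
    """Load: append the prepared rows to the warehouse table."""
    warehouse.extend(rows)
    return len(rows)


warehouse_orders = []
loaded = load(transform(extract()), warehouse_orders)
print(loaded, warehouse_orders[0])
```

Keeping the three functions separate mirrors the idea of a staging area: the data is only loaded once it has been cleaned, changed, combined and converted.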

3. Data Storage Components
 Separate repository
 Data structured for efficient processing
 Redundancy is increased
 Updated at specific intervals
 Read-only

4. Information Delivery Component
 Authentication issues
 Active monitoring services
 Performance: the DBA notes which aggregates are selected and changes storage accordingly
 User performance
 Aggregate awareness
 E.g. data mining, OLAP, etc.

Background (ER Modeling)
 For ER modeling, entities are collected from the environment
 Each entity acts as a table
 Reasons for success
 Tables are normalized after ER modeling, which removes redundancy (to avoid update/delete anomalies)
 But the number of tables increases
 Useful for fast access to small amounts of data

ER Drawbacks for DW / Need for Dimensional Modeling
 ER models are hard to remember because of the large number of tables
 Queries spanning multiple tables are complex (table joins)
 Conventional RDBMSs are optimized for a small number of tables, whereas a DW might require a large number
 Ideally no calculated attributes
 The DW does not need to update data as an OLTP system does, so there is no need for normalization
 OLAP is not the only purpose of a DW; we need a model that facilitates data integration, data mining and historically consolidated data
 An efficient indexing scheme is needed to avoid scanning all the data
 De-normalization (in DW)
 Add primary key
 Direct relationships
 Re-introduce redundancy
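
A minimal sketch of de-normalization, using Python's built-in `sqlite3` module. The table and column names (`category`, `brand`, `product`, `product_dim`) and the sample rows are assumptions made for illustration; they do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (ER-style) tables: no redundancy, but three tables to join.
    CREATE TABLE category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE brand (brand_id INTEGER PRIMARY KEY, brand_name TEXT,
                        category_id INTEGER REFERENCES category(category_id));
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, description TEXT,
                          brand_id INTEGER REFERENCES brand(brand_id));
    INSERT INTO category VALUES (1, 'Beverages');
    INSERT INTO brand VALUES (10, 'Acme Cola', 1);
    INSERT INTO product VALUES (100, 'Cola 330ml', 10), (101, 'Cola 1l', 10);

    -- De-normalized dimension: one wide table with its own key;
    -- brand and category are repeated on every product row.
    CREATE TABLE product_dim AS
    SELECT p.product_id AS product_key, p.description,
           b.brand_name AS brand, c.category_name AS category
    FROM product p
    JOIN brand b ON b.brand_id = p.brand_id
    JOIN category c ON c.category_id = b.category_id;
""")

product_dim_rows = conn.execute(
    "SELECT product_key, description, brand, category "
    "FROM product_dim ORDER BY product_key"
).fetchall()
print(product_dim_rows)
```

The joins are paid once, when the dimension is built, instead of in every query; the price is the repeated brand and category values.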

Dimensional Modeling
 Dimensional modeling focuses on subject orientation and the critical factors of the business
 Critical factors are stored in facts
 Redundancy is not a problem; it is accepted to achieve efficiency
 Logical design technique for high performance
 It is the modeling technique for storage

Dimensional Modeling (cont.)
 Two important concepts
 Fact
 Numeric measurements that represent a business activity/event
 Pre-computed, redundant
 Example: profit, quantity sold
 Dimension
 Qualifying characteristics that give a perspective on a fact
 Example: date (date, month, quarter, year)
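
As an illustration of the date example, the following Python sketch derives the qualifying attributes of a date dimension row from a single calendar date. The function name `date_dimension_row` and the sample fact are hypothetical.

```python
from datetime import date


def date_dimension_row(d):
    """Derive the qualifying attributes (date, month, quarter, year) of one day."""
    return {
        "date": d.isoformat(),
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,  # months 1-3 -> Q1, 4-6 -> Q2, ...
        "year": d.year,
    }


# A fact holds the numeric measurements; the dimension row gives it a perspective.
fact = {"date": "2014-05-17", "quantity_sold": 3, "profit": 4.5}
dimension_row = date_dimension_row(date(2014, 5, 17))
print(fact, dimension_row)
```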

Dimensional Modeling (cont.)
 Facts are stored in fact table
 Dimensions are represented by dimension tables
 Dimensions are the perspectives from which facts can be judged
 Each fact table is surrounded by dimension tables
 The layout looks like a star, hence the name star schema

Example
 TIME: time_key (PK), SQL_date, day_of_week, month
 STORE: store_key (PK), store_ID, store_name, address, district, floor_type
 CLERK: clerk_key (PK), clerk_id
 PRODUCT: product_key (PK), SKU, description, brand, category
 CUSTOMER: customer_key (PK), customer_name, purchase_profile, credit_profile, address
 PROMOTION: promotion_key (PK), promotion_nam
 FACT: time_key (FK), store_key (FK), clerk_key (FK), product_key (FK), customer_key (FK), promotion_key (FK), dollars_sold, units_sold
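
A reduced version of this star schema can be tried out with Python's built-in `sqlite3` module. The sketch below is an assumption-laden illustration: it keeps only three of the six dimensions, renames the tables to `time_dim`, `store_dim`, `product_dim` and `sales_fact`, drops some columns, and uses invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one key plus descriptive attributes.
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, sql_date TEXT,
                           day_of_week TEXT, month TEXT);
    CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_id TEXT,
                            store_name TEXT, district TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, sku TEXT,
                              description TEXT, brand TEXT, category TEXT);

    -- Fact table: foreign keys to every dimension plus the numeric facts.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        store_key INTEGER REFERENCES store_dim(store_key),
        product_key INTEGER REFERENCES product_dim(product_key),
        dollars_sold REAL,
        units_sold INTEGER,
        PRIMARY KEY (time_key, store_key, product_key)
    );

    INSERT INTO time_dim VALUES (1, '2014-03-01', 'Saturday', 'March'),
                                (2, '2014-03-02', 'Sunday', 'March');
    INSERT INTO store_dim VALUES (1, 'S01', 'Downtown', 'Central');
    INSERT INTO product_dim VALUES (1, 'SKU-1', 'Cola 330ml', 'Acme', 'Beverages'),
                                   (2, 'SKU-2', 'Crisps', 'Crunchy', 'Snacks');
    INSERT INTO sales_fact VALUES (1, 1, 1, 10.0, 5), (1, 1, 2, 6.0, 3),
                                  (2, 1, 1, 4.0, 2);
""")

# A typical star query: one join from the fact table to a dimension, then aggregate.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(sales_by_category)
```

Every dimension is exactly one join away from the fact table, which is what keeps such queries short.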

Inside Dimensional Modeling
 Inside dimension table
 Key attribute of the dimension table, for identification
 Large number of columns (wide table)
 Non-calculated, textual attributes
 Attributes are not directly related to one another
 Un-normalized in star schema
 Drill-down and drill-up are two ways of exploiting dimensions
 Can have multiple hierarchies
 Relatively small number of records
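
Drill-down and drill-up amount to aggregating the same facts at different levels of a dimension hierarchy. A small Python sketch, assuming a hypothetical year > quarter > month hierarchy and invented sales rows:

```python
from collections import defaultdict

# Hypothetical fact rows already joined with their time dimension attributes.
sales = [
    {"year": 2014, "quarter": 1, "month": 1, "units_sold": 10},
    {"year": 2014, "quarter": 1, "month": 2, "units_sold": 7},
    {"year": 2014, "quarter": 2, "month": 4, "units_sold": 5},
]


def aggregate(rows, levels):
    """Sum units_sold grouped by the given hierarchy levels."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[level] for level in levels)] += row["units_sold"]
    return dict(totals)


by_year = aggregate(sales, ["year"])                      # highest level
by_quarter = aggregate(sales, ["year", "quarter"])        # drill down one level
by_month = aggregate(sales, ["year", "quarter", "month"]) # drill down again
print(by_year, by_quarter, by_month)
```

Reading the three results from last to first is the drill-up direction.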

Inside Dimensional Modeling
 Inside fact table
 Has two types of attributes
 Key attributes, for connections
 Facts
 Concatenated key
 Grain (level of detail) of the data identified
 Large number of records
 Limited attributes
 Sparse data set
 Degenerate dimensions (e.g. order number, used for measures such as average products per order)
 Fact-less fact table
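
The degenerate dimension can be illustrated with a short Python sketch. The order number lives directly in the fact rows, with no dimension table of its own, and is enough to compute the average number of products per order. The rows and column names are invented for the example.

```python
# Hypothetical fact rows at order-line grain; order_number is a degenerate dimension.
fact_rows = [
    {"order_number": "A1", "product_key": 1, "units_sold": 2},
    {"order_number": "A1", "product_key": 2, "units_sold": 1},
    {"order_number": "A2", "product_key": 1, "units_sold": 4},
]

# Group the distinct products by order number.
products_per_order = {}
for row in fact_rows:
    products_per_order.setdefault(row["order_number"], set()).add(row["product_key"])

average_products_per_order = (
    sum(len(products) for products in products_per_order.values())
    / len(products_per_order)
)
print(average_products_per_order)
```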

Star Schema Keys
 Primary keys
 Identifying attribute in a dimension table
 Relationship attributes combine to form the primary key
 Surrogate keys
 Replacement for the primary key
 System generated
 Foreign keys
 Collection of primary keys of the dimension tables
 Primary key of the fact table
 System generated, or
 Collection of the primary keys of the dimensions
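
A minimal sketch of how system-generated surrogate keys could replace the source system's own keys, and how the fact table's primary key can then be formed as a collection of dimension keys. The class `SurrogateKeyGenerator` and the sample identifiers are hypothetical.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hands out system-generated integer keys, one per natural (source) key."""

    def __init__(self):
        self._next_key = count(1)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        # Reuse the surrogate key if this natural key was seen before.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next_key)
        return self._by_natural_key[natural_key]


store_keys = SurrogateKeyGenerator()
product_keys = SurrogateKeyGenerator()

# The fact row carries the surrogate keys of its dimensions as foreign keys.
fact_row = {
    "store_key": store_keys.key_for("S01"),
    "product_key": product_keys.key_for("SKU-2"),
    "units_sold": 3,
}
# Concatenated primary key of the fact table: the collection of dimension keys.
fact_primary_key = (fact_row["store_key"], fact_row["product_key"])
print(fact_primary_key)
```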

Advantage of Star Schema
 Ease for users to understand
 Optimized for navigation (fewer joins, so faster)
 Most suitable for query processing
Karen Corral, et al. (2006) The impact of alternative
diagrams on the accuracy of recall: A comparison of
star-schema diagrams and entity-relationship diagrams,
Decision Support Systems, 42(1), 450-468.

DATA WAREHOUSES AND DATA MARTS
 Bill Inmon stated, “The single most important issue facing the IT manager this year is whether to build the data warehouse first or the data mart first.”
 This statement is true even today. Let us examine this statement and take a stand.
 Before deciding to build a data warehouse for your organization, you need to ask the following basic and fundamental questions and address the relevant issues:
 Top-down or bottom-up approach?
 Enterprise-wide or departmental?
 Which first—data warehouse or data mart?
 Build pilot or go with a full-fledged implementation?
 Dependent or independent data marts?

Top-Down Versus Bottom-Up Approach


A Practical Approach
 In order to formulate an approach for your organization, you need to examine what exactly your organization wants. Is your organization looking for long-term results or fast data marts for only a few subjects for now? Does your organization want quick, proof-of-concept, throw-away implementations? Or do you want to look into some other practical approach?

 Although the top-down and the bottom-up approaches each have their own advantages and drawbacks, a compromise approach accommodating both views appears to be practical.
 The chief proponent of this practical approach is Ralph Kimball, an eminent author and data warehouse expert. The steps in this practical approach are as follows:
1. Plan and define requirements at the overall corporate level
2. Create a surrounding architecture for a complete warehouse
3. Conform and standardize the data content
4. Implement the data warehouse as a series of supermarts, one at a time

METADATA IN THE DATA WAREHOUSE
Types of Metadata
 Metadata in a data warehouse fall into three major categories:
 Operational Metadata
 Extraction and Transformation Metadata
 End-User Metadata

Operational Metadata
 As you know, data for the data warehouse comes from several operational
systems of the enterprise. These source systems contain different data structures.
 The data elements selected for the data warehouse have various field lengths and
data types.
 In selecting data from the source systems for the data warehouse, you split
records, combine parts of records from different source files, and deal with
multiple coding schemes and field lengths.
 When you deliver information to the end-users, you must be able to tie that back
to the original source data sets.
 Operational metadata contain all of this information about the operational data
sources.

Extraction and Transformation Metadata
 Extraction and transformation metadata contain data about the extraction of data
from the source systems, namely, the extraction frequencies, extraction methods,
and business rules for the data extraction.
 Also, this category of metadata contains information about all the data
transformations that take place in the data staging area.
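
One possible way to picture an extraction and transformation metadata entry is as a small record per warehouse column. The Python sketch below is only an assumption about what such a record might hold; the class `ExtractionMetadata`, its fields and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExtractionMetadata:
    """Describes how one warehouse column is extracted and transformed."""
    source_system: str
    source_field: str
    target_column: str
    extraction_method: str
    extraction_frequency: str
    business_rule: str
    transformations: List[str] = field(default_factory=list)


def lineage(meta):
    """Render the path from the source field to the warehouse column."""
    steps = " -> ".join(meta.transformations) or "no transformation"
    return f"{meta.source_system}.{meta.source_field} => {meta.target_column} ({steps})"


amount_meta = ExtractionMetadata(
    source_system="order_entry",
    source_field="AMT",
    target_column="sales_fact.dollars_sold",
    extraction_method="incremental",
    extraction_frequency="daily",
    business_rule="only completed orders",
    transformations=["convert text to decimal", "convert currency to USD"],
)
print(lineage(amount_meta))
```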
End-User Metadata
 The end-user metadata is the navigational map of the data warehouse.
 It enables the end-users to find information from the data warehouse.
 The end-user metadata allows the end-users to use their own business
terminologies.
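
The navigational-map idea can be sketched as a lookup from business terms to warehouse columns. The dictionary contents and the function `locate` are hypothetical and serve only as an illustration.

```python
# Hypothetical end-user metadata: business term -> (table, column) in the warehouse.
end_user_metadata = {
    "revenue": ("sales_fact", "dollars_sold"),
    "units": ("sales_fact", "units_sold"),
    "product family": ("product_dim", "category"),
}


def locate(business_term):
    """Translate an end-user's business term into a table.column reference."""
    key = business_term.strip().lower()
    if key not in end_user_metadata:
        raise KeyError(f"unknown business term: {business_term!r}")
    table, column = end_user_metadata[key]
    return f"{table}.{column}"


print(locate("Revenue"))
```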

Significance
 Why is metadata especially important in a data warehouse?
 First, it acts as the glue that connects all parts of the data
warehouse.
 Next, it provides information about the contents and structures to
the developers.
 Finally, it opens the door to the end-users and makes the contents
recognizable in their own terms.

Exercise
 A data warehouse is subject-oriented. What would be the major critical
business subjects for the following companies?
 An international manufacturing company
 A local community bank
 A domestic hotel chain
 You are the data analyst on the project team building a data warehouse for
an insurance company. List the possible data sources from which you will
bring the data into your data warehouse. State your assumptions.
 For an airlines company, identify three operational applications that would
feed into the data warehouse. What would be the data load and refresh
cycles?
 Prepare a table showing all the potential users and information delivery
methods for a data warehouse supporting a large national grocery chain.
Copyright © 2014 Pearson Education, Inc. Publishing as Prentice Hall Slide 2-37
