Data Warehousing (DW)
Origins
Data warehouses are the result of
two software communities needing and
finding one another:
Database firms developed data
warehouses and were looking for
applications
EIS (executive information system) and
DSS (decision support system) developers
and vendors needed to deal with ever-
increasing databases
About 10 years ago, the two groups
started interacting, with the results
described here.
Origins
Database developers long understood
that their software was required for
both transactional and analytic
processing.
However, their principal
developments were directed to ever-
larger transactional databases. This
occurred even though
operational and analytic data are
separate, with different requirements.
Origins
Once these differences were
understood, new databases were
created specifically for analytic use.
Today, data warehouses have 3 major
applications:
On-line analytic processing for business
intelligence
Data Mining
Customer Relationship Management
Definition
A data warehouse is typically a
dedicated database system for
decision making that is separate from
the production database(s) used
operationally. It differs from
production systems in that:
it covers a much longer time horizon than
transaction systems
it includes multiple databases that have
been processed so that the warehouse's
data are defined uniformly (i.e., clean
data)
it is optimized for answering complex
queries from managers and analysts
Definition
In the last 5 years, data warehousing has become
a major industry within computing, one that has
brought together the ideas of databases and
decision support. It has also been the foundation
for efforts in data mining and in CRM.
Data mining refers to finding answers about an
organization from the information in the data
warehouse that the executive or the analyst had
not thought to ask.
Data mining is made possible by the very
presence of large databases in the data
warehouse. It provides techniques that allow
managers to obtain managerial information from
their legacy systems. Its objective is to identify
valid, novel, potentially useful, and
understandable patterns in data.
Definition
The objective of a data warehouse is
to create a single truth
Data warehousing is a major new
application area. It commands very
high salaries (up to $100,000 for
specialists, $300,000 for consultants).
Definition
A data warehouse is a:
Subject oriented
Integrated
Time-variant
Non-volatile
Collection of data in support of
management decision processes.
Note:
A data warehouse is physically
separated from operational systems
and operational databases
Data warehouses hold both
aggregated and detailed data for
management separate from the
databases used for On-Line
Transaction Processing (OLTP)
Characteristics
Subject oriented: Data are organized by how users refer to them
Integrated: Inconsistencies are removed in both nomenclature and conflicting information (i.e., data are clean)
Non-volatile: Read-only data. Data do not change over time.
Time series: Data are time series, not current status
Characteristics
Summarized: Operational data are mapped into decision-usable form
Larger: Time series implies much more data is retained
Non-normalized: Data can be redundant
Metadata: Data about data
Input: Unintegrated, operational environment (legacy systems)
Data Warehouse
Defined in many different ways, but not
rigorously.
A decision support database that is maintained
separately from the organization's operational
database
Supports information processing by providing a
solid platform of consolidated, historical data for
analysis.
"A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of
management's decision-making process."
(W. H. Inmon)
Data warehousing:
Data Warehouse
Subject Oriented
Organized around major subjects, such as
customer, product, sales.
Focuses on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing.
Provides a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process.
Data Warehouse
Integrated
Constructed by integrating multiple,
heterogeneous data sources
such as relational databases, flat files, on-line
transaction records
Data cleaning and data integration
techniques are applied.
These ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
E.g., hotel price: currency, tax, breakfast covered,
etc.
When data are moved to the warehouse, they are
converted to the common convention.
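The hotel price example can be made concrete with a minimal sketch. The record layout, the exchange rates, and the target convention (USD, tax included) are assumptions chosen for illustration, not part of the original slides.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to USD; a real system would read these
# from a maintained reference table.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10}


@dataclass
class HotelPrice:
    hotel: str
    amount: float       # price as quoted by the source
    currency: str       # currency code used by the source
    tax_included: bool  # whether the quoted amount already includes tax
    tax_rate: float     # tax rate used by the source, e.g. 0.10 for 10%


def to_warehouse_price(p: HotelPrice) -> dict:
    """Convert a source price to one convention: USD, tax included."""
    gross = p.amount if p.tax_included else p.amount * (1 + p.tax_rate)
    return {
        "hotel": p.hotel,
        "price_usd_incl_tax": round(gross * RATES_TO_USD[p.currency], 2),
    }


# Two sources that quote prices under different conventions.
source_rows = [
    HotelPrice("Alpha", 100.0, "USD", False, 0.10),  # USD, tax excluded
    HotelPrice("Beta", 200.0, "EUR", True, 0.20),    # EUR, tax included
]
converted = [to_warehouse_price(r) for r in source_rows]
```

After conversion, every row carries the same attribute with the same meaning, so prices from different sources can be compared directly.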
Data Warehouse
Time Variant
The time horizon for the data warehouse is
significantly longer than that of operational
systems.
Operational database: current value data.
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
contains an element of time, explicitly or implicitly
But the key of operational data may or may not
contain a time element.
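One way to picture the difference in keys is the sketch below, which uses SQLite from the Python standard library. The table names, column names, and the periodic-snapshot approach are assumptions for illustration only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Operational table: key has no time element, holds the current value only.
CREATE TABLE op_customer (
    customer_id INTEGER PRIMARY KEY,
    city        TEXT
);
-- Warehouse table: the key explicitly includes a time element.
CREATE TABLE dw_customer_history (
    customer_id   INTEGER,
    snapshot_date TEXT,
    city          TEXT,
    PRIMARY KEY (customer_id, snapshot_date)
);
""")


def snapshot(day: str) -> None:
    """Copy the current operational state into the warehouse, stamped with a date."""
    con.execute(
        "INSERT INTO dw_customer_history "
        "SELECT customer_id, ?, city FROM op_customer",
        (day,),
    )


con.execute("INSERT INTO op_customer VALUES (1, 'Boston')")
snapshot("2023-01-31")
con.execute("UPDATE op_customer SET city = 'Denver' WHERE customer_id = 1")
snapshot("2023-02-28")

current = con.execute("SELECT city FROM op_customer").fetchall()
history = con.execute(
    "SELECT snapshot_date, city FROM dw_customer_history ORDER BY snapshot_date"
).fetchall()
```

The operational table only knows the customer's current city, while the warehouse table retains one row per snapshot date.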
Data Warehouse
Non Volatile
A physically separate store of data transformed
from the operational environment.
Operational update of data does not occur in
the data warehouse environment.
Does not require transaction processing, recovery,
and concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data.
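The "load and access only" rule can be sketched with SQLite triggers that reject updates and deletes. This is only an illustration under assumed table and trigger names; warehouse products enforce read-only access in their own ways.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dw_sales (sale_date TEXT, product TEXT, amount REAL);

-- Reject any attempt to change or remove rows once they are loaded.
CREATE TRIGGER dw_sales_no_update BEFORE UPDATE ON dw_sales
BEGIN
    SELECT RAISE(ABORT, 'warehouse rows are read-only');
END;
CREATE TRIGGER dw_sales_no_delete BEFORE DELETE ON dw_sales
BEGIN
    SELECT RAISE(ABORT, 'warehouse rows are read-only');
END;
""")

# Operation 1: loading of data.
con.executemany(
    "INSERT INTO dw_sales VALUES (?, ?, ?)",
    [("2023-01-05", "widget", 10.0), ("2023-01-06", "widget", 15.0)],
)

# Operation 2: access of data.
total = con.execute("SELECT SUM(amount) FROM dw_sales").fetchone()[0]

# An operational-style update is refused by the trigger.
try:
    con.execute("UPDATE dw_sales SET amount = 0")
    update_rejected = False
except sqlite3.DatabaseError:
    update_rejected = True
```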
Data Warehouse VS.
Heterogeneous DBMS
Traditional heterogeneous DB integration:
Build wrappers/mediators on top of heterogeneous
databases
Query driven approach
When a query is posed to a client site, a meta-dictionary
is used to translate the query into queries
appropriate for the individual heterogeneous sites
involved, and the results are integrated into a global
answer set
Requires complex information filtering and competes for resources
Data warehouse: update-driven, high
performance
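The contrast between the two approaches can be sketched in a few lines. The two in-memory "sources", their field names, and the function names are hypothetical. The point is only where the integration work happens: at query time (query-driven) or in advance (update-driven).

```python
# Two hypothetical sources that name the same concepts differently.
SOURCE_A = [{"cust": "ann", "total": 120}, {"cust": "bob", "total": 80}]
SOURCE_B = [{"customer_name": "ann", "amt": 30}]


def query_driven(customer: str) -> int:
    """Mediator style: translate the query per source at query time, then merge."""
    from_a = sum(r["total"] for r in SOURCE_A if r["cust"] == customer)
    from_b = sum(r["amt"] for r in SOURCE_B if r["customer_name"] == customer)
    return from_a + from_b


def build_warehouse() -> dict:
    """Update-driven style: integrate all sources once, ahead of any query."""
    warehouse: dict = {}
    for r in SOURCE_A:
        warehouse[r["cust"]] = warehouse.get(r["cust"], 0) + r["total"]
    for r in SOURCE_B:
        name = r["customer_name"]
        warehouse[name] = warehouse.get(name, 0) + r["amt"]
    return warehouse


WAREHOUSE = build_warehouse()


def update_driven(customer: str) -> int:
    """Queries read the pre-integrated store and never touch the sources."""
    return WAREHOUSE.get(customer, 0)
```

Both functions return the same answer, but `update_driven` does a single lookup, which is where the performance advantage comes from. The cost is that the warehouse must be refreshed when the sources change.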
Data Warehouse VS.
Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing,
inventory, banking, manufacturing,
payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
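A small sketch of the two workloads, using SQLite and a hypothetical orders table. In practice the OLAP query would run against the warehouse rather than the operational table; one table is used here only to keep the example short.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "East", 2022, 100.0),
        (2, "East", 2023, 150.0),
        (3, "West", 2023, 200.0),
    ],
)

# OLTP-style work: a short transaction that touches one record by key.
con.execute("UPDATE orders SET amount = 175.0 WHERE order_id = 2")

# OLAP-style work: a read-only query that scans and aggregates many records.
by_region_year = con.execute(
    "SELECT region, year, SUM(amount) FROM orders "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()
```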
OLTP VS. OLAP
Why a Separate
Data Warehouse?
High performance for both systems
DBMS tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:
missing data: decision support requires historical data
which operational DBs do not typically maintain
data consolidation: decision support requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
data quality: different sources typically use
inconsistent data representations, codes and formats
which have to be reconciled
Data Warehouse
Design Process
Top-down, bottom-up approaches or a combination
of both
Top-down: Starts with overall design and planning (mature)
Bottom-up: Starts with experiments and prototypes (rapid)
From a software engineering point of view
Waterfall: structured and systematic analysis at each step
before proceeding to the next
Spiral: rapid generation of increasingly functional systems,
with short turnaround time
Typical data warehouse design process
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measures that will populate each fact table record
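The four design choices can be sketched as a small star schema in SQLite. The business process (orders), the grain (one row per order line), the two dimensions, and the two measures are all assumptions made for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimensions: the context that applies to each fact record.
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- Business process: orders. Grain: one row per order line.
-- Measures: quantity and amount.
CREATE TABLE fact_order_line (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);

INSERT INTO dim_date VALUES (20230115, 2023, 1), (20230210, 2023, 2);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO fact_order_line VALUES
    (20230115, 1, 2, 20.0),
    (20230115, 2, 1, 5.0),
    (20230210, 1, 3, 30.0);
""")

# Measures are aggregated along a dimension attribute.
sales_by_category = con.execute("""
    SELECT p.category, SUM(f.quantity), SUM(f.amount)
    FROM fact_order_line AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

Choosing a finer grain (order line rather than whole order) keeps the most detail; coarser summaries can always be derived from it, but not the reverse.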
Data Warehouse
Architecture
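[Figure: data warehouse architecture, left to right. Data Sources (operational DBs, other sources) pass through Extract, Transform, Load, and Refresh, coordinated by a Monitor & Integrator with Metadata, into Data Storage (data warehouse, data marts). An OLAP Server serves the Front-End Tools: analysis, query, reports, and data mining.]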
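The extract, transform, and load steps named in the architecture can be sketched as three small functions. The source rows, field names, and cleaning rules below are hypothetical. A real pipeline would read from the operational systems and would also handle refresh.

```python
import sqlite3


def extract() -> list:
    """Pull raw rows from the (hypothetical) operational sources."""
    return [
        {"source": "orders_db", "customer": " Ann ", "amount": "120.50"},
        {"source": "billing_files", "customer": "BOB", "amount": "80"},
    ]


def transform(rows: list) -> list:
    """Clean names and convert amounts to a uniform numeric type."""
    return [(r["customer"].strip().title(), float(r["amount"])) for r in rows]


def load(con: sqlite3.Connection, rows: list) -> None:
    """Append the cleaned rows to the warehouse table."""
    con.execute("CREATE TABLE IF NOT EXISTS dw_sales (customer TEXT, amount REAL)")
    con.executemany("INSERT INTO dw_sales VALUES (?, ?)", rows)


con = sqlite3.connect(":memory:")
load(con, transform(extract()))
loaded = con.execute(
    "SELECT customer, amount FROM dw_sales ORDER BY customer"
).fetchall()
```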
3 Data Warehouse
Models
Enterprise warehouse
collects all of the information about subjects spanning
the entire organization
Data Mart
a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to
specific, selected groups, such as a marketing data mart
Independent vs. dependent (sourced directly from the warehouse) data
marts
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may be
materialized
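A virtual warehouse and a materialized summary can be contrasted in a short SQLite sketch. SQLite has no materialized views, so a table created from the view stands in for one here. All table and view names are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE op_sales (region TEXT, amount REAL);
INSERT INTO op_sales VALUES ('East', 100.0), ('East', 50.0), ('West', 70.0);

-- Virtual: a view, recomputed from the operational table on every query.
CREATE VIEW v_sales_by_region AS
    SELECT region, SUM(amount) AS total FROM op_sales GROUP BY region;

-- Materialized stand-in: the summary is computed once and stored.
CREATE TABLE m_sales_by_region AS SELECT * FROM v_sales_by_region;

-- A later operational insert is visible through the view only.
INSERT INTO op_sales VALUES ('West', 30.0);
""")

view_rows = con.execute(
    "SELECT region, total FROM v_sales_by_region ORDER BY region"
).fetchall()
materialized_rows = con.execute(
    "SELECT region, total FROM m_sales_by_region ORDER BY region"
).fetchall()
```

The view is always current but places its query load on the operational database. The stored summary is cheap to read but goes stale until it is refreshed.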
References
W. H. Inmon, The Data Warehouse
Environment: Building the Data
Warehouse.
Efraim Turban et al., Information
Technology for Management:
Transforming Organizations in the
Digital Economy (6th Edition), John
Wiley and Sons, 2008.