Data Warehousing and Data Mining (CS601)
Unit I: Data Warehouse and OLAP
Introduction to data Warehouse:
A Data Warehouse is kept separate from the operational DBMS. It stores a huge amount of data, which is
typically collected from multiple heterogeneous sources such as files, DBMSs, etc. The goal is to
produce statistical results that may help in decision-making.
Data Warehouse is a central place where data is stored from different data sources and
applications. A Data Warehouse is always kept separate from an Operational Database.
A data warehouse can also be viewed as a database for historical data from different functions
within a company. The term Data Warehouse was coined by Bill Inmon in 1990, which he
defined in the following way: "A warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management's decision making process".
A Data Warehouse is used for reporting and analysis of information and stores both historical
and current data. The data in a DW system is used for analytical reporting, which is later used by
business analysts, sales managers, or knowledge workers for decision-making.
Data flows from multiple heterogeneous data sources into a Data Warehouse. Common data
sources for a data warehouse include:
 Operational databases
 SAP and non-SAP Applications
 Flat Files (xls, csv, txt files)
Data in the data warehouse is accessed by BI (Business Intelligence) users for analytical reporting,
data mining, and analysis. It is used for decision-making by business users, sales managers, and
analysts to define future strategy.
Difference between Operational Database System and Data Warehouse:
Operational Database | Data Warehouse
Designed to support high-volume transaction processing. | Typically designed to support high-volume analytical processing (i.e., OLAP).
Usually concerned with current data. | Usually concerned with historical data.
Data is updated regularly according to need. | Non-volatile: new data may be added regularly, but once added it is rarely changed.
Designed for day-to-day business transactions and processes. | Designed for analysis of business measures by subject area, categories, and attributes.
Optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
Optimized for validation of incoming information during transactions; uses validation data tables. | Loaded with consistent, valid information; requires no real-time validation.
Supports thousands of concurrent clients. | Supports a few concurrent clients relative to OLTP.
Widely process-oriented. | Widely subject-oriented.
Usually optimized to perform fast inserts and updates of relatively small volumes of data. | Usually optimized to perform fast retrievals of relatively high volumes of data.
Data in. | Data out.
A small amount of data is accessed per query. | A large amount of data is accessed per query.
Relational databases are created for online transaction processing (OLTP). | Data warehouses are designed for online analytical processing (OLAP).
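To make the contrast concrete, the following is a minimal sketch using Python's built-in sqlite3 module (the sales table and its columns are invented for illustration): an OLTP-style transaction adds or retrieves a single row, while an OLAP-style query scans and aggregates many rows by subject area.

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, "
    "product TEXT, sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, product, sale_date, amount) VALUES (?, ?, ?, ?)",
    [("West", "Car", "2024-01-10", 900.0),
     ("West", "Bus", "2024-02-11", 1500.0),
     ("East", "Car", "2024-01-15", 700.0)],
)

# OLTP-style work: insert or retrieve a single row per transaction.
conn.execute(
    "INSERT INTO sales (region, product, sale_date, amount) VALUES (?, ?, ?, ?)",
    ("East", "Bus", "2024-03-01", 1200.0),
)
one_row = conn.execute("SELECT * FROM sales WHERE sale_id = ?", (1,)).fetchone()

# OLAP-style work: a complex query that scans many rows and aggregates
# them by subject area (region and product).
summary = conn.execute(
    "SELECT region, product, SUM(amount), COUNT(*) "
    "FROM sales GROUP BY region, product"
).fetchall()
print(one_row)
print(summary)
```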
Data Warehouse Characteristics:
Integrated Data:
One of the key characteristics of a data warehouse is that it contains integrated data. This means
that the data is collected from various sources, such as transactional systems, and then cleaned,
transformed, and consolidated into a single, unified view. This allows for easy access and
analysis of the data, as well as the ability to track data over time.
Subject-Oriented:
A data warehouse is also subject-oriented, which means that the data is organized around
specific subjects, such as customers, products, or sales. This allows for easy access to the data
relevant to a specific subject, as well as the ability to track the data over time.
Non-Volatile:
Another characteristic of a data warehouse is that it is non-volatile. This means that the data in
the warehouse is never updated or deleted, only added to. This is important because it allows for
the preservation of historical data, making it possible to track trends and patterns over time.
Time-Variant:
A data warehouse is also time-variant, which means that the data is stored with a time dimension.
This allows for easy access to data for specific time periods, such as last quarter or last year. This
makes it possible to track trends and patterns over time.
Data Warehouse architecture and its components:
Data Warehouse architecture:
A data warehouse architecture is a method of defining the overall architecture of data
communication, processing, and presentation that exists for end-client computing within the
enterprise. Each data warehouse is different, but all are characterized by standard vital
components.
Data Warehouse applications are designed to support users' ad-hoc data requirements, an
activity commonly called online analytical processing (OLAP). These include applications such as
forecasting, profiling, summary reporting, and trend analysis.
Three common architectures are:
o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic
Operational System
In data warehousing, an operational system refers to a system that is used to process the
day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the
system must have a different name.
Meta Data
Metadata is a set of data that defines and gives information about other data. Metadata is used in a Data
Warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make it easier to find and work with
particular instances of data. For example, author, date created, date modified, and file size are
examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized
(aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized
records are updated continuously as new information is loaded into the warehouse.
End-User access Tools
The principal purpose of a data warehouse is to provide information to the business managers for
strategic decision-making. These customers interact with the warehouse using end-client access
tools.
Examples of end-user access tools include:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Data Warehouse Architecture: With Staging Area
We must clean and process operational data before putting it into the warehouse.
We can do this programmatically, although most data warehouses use a staging area (a place where
data is processed before entering the warehouse).
A staging area simplifies data cleansing and consolidation for operational data coming from
multiple source systems, especially for enterprise data warehouses where all relevant data of an
enterprise is consolidated.
The Data Warehouse staging area is a temporary location to which records from source systems are
copied.
Data Warehouse Architecture: With Staging Area and Data Marts
We may want to customize our warehouse's architecture for multiple groups within our
organization.
We can do this by adding data marts. A data mart is a segment of a data warehouse that
provides information for reporting and analysis on a section, unit, department, or operation in the
company, e.g., sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In this
example, a financial analyst wants to analyze historical data for purchases and sales or mine
historical information to make predictions about customer behavior.
Components of Data Warehouse:
Architecture is the proper arrangement of the elements. We build a data warehouse with software
and hardware components. To suit the requirements of our organization, we arrange these
building blocks, and we may strengthen a particular part with extra tools and services. All of this depends
on our circumstances.
The figure shows the essential elements of a typical warehouse. The Source Data
component is shown on the left. The Data Staging element serves as the next building block. In the
middle, we see the Data Storage component that holds the data warehouse data. This element
not only stores and manages the data; it also keeps track of the data using the metadata repository.
The Information Delivery component, shown on the right, consists of all the different ways of
making the information from the data warehouse available to the users.
Source Data Component
Source data coming into the data warehouse may be grouped into four broad categories:
Production Data: This type of data comes from the various operational systems of the
enterprise. Based on the data requirements in the data warehouse, we choose segments of the
data from the various operational systems.
Internal Data: In each organization, the client keeps their "private" spreadsheets, reports,
customer profiles, and sometimes even department databases. This is the internal data, part of
which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every
operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large
percentage of the information they use. They use statistics relating to their industry produced
by external agencies.
Data Staging Component
After we have extracted data from various operational systems and external sources, we
have to prepare the data for storing in the data warehouse. The extracted data coming from
several different sources needs to be changed, converted, and made ready in a format that is
suitable for querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
1) Data Extraction: This method has to deal with numerous data sources. We have to employ
the appropriate techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many different
sources. If data extraction for a data warehouse poses big challenges, data transformation
presents even more significant challenges. We perform several individual tasks as part of data
transformation.
First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings or may deal with providing default values for missing data elements, or elimination
of duplicates when we bring in the same data from various source systems.
Standardization of data elements forms a large part of data transformation. Data
transformation includes many forms of combining pieces of data from different sources. We
combine data from a single source record or from related data elements in many source records.
On the other hand, data transformation also includes purging source data that is not useful and
separating out source records into new combinations. Sorting and merging of data take place on a
large scale in the data staging area. When the data transformation function ends, we have a
collection of integrated data that is cleaned, standardized, and summarized.
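As a small illustration of these cleaning and standardization tasks, here is a sketch in plain Python; the record layout, the spelling fix, and the default values are assumptions made only for this example. It corrects a known misspelling, fills missing elements with defaults, and eliminates duplicates arriving from two source systems.

```python
# Hypothetical customer records extracted from two source systems.
source_a = [
    {"cust_id": 1, "city": "Puen", "segment": None},
    {"cust_id": 2, "city": "Mumbai", "segment": "Retail"},
]
source_b = [
    {"cust_id": 2, "city": "Mumbai", "segment": "Retail"},   # duplicate of a source_a record
    {"cust_id": 3, "city": "Kolhapur", "segment": None},
]

SPELLING_FIXES = {"Puen": "Pune"}     # correction of known misspellings
DEFAULTS = {"segment": "Unknown"}     # default values for missing data elements

def clean(record):
    """Return a cleaned copy of one extracted record."""
    fixed = dict(record)
    fixed["city"] = SPELLING_FIXES.get(fixed["city"], fixed["city"])
    for field, default in DEFAULTS.items():
        if fixed.get(field) is None:
            fixed[field] = default
    return fixed

# Clean both extracts, then eliminate duplicates keyed on cust_id.
merged = {}
for record in map(clean, source_a + source_b):
    merged.setdefault(record["cust_id"], record)

cleaned = sorted(merged.values(), key=lambda r: r["cust_id"])
print(cleaned)
```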
3) Data Loading: Two distinct categories of tasks form the data loading function. When we
complete the structure and construction of the data warehouse and go live for the first time, we
do the initial loading of the data into the data warehouse storage. The initial load moves
high volumes of data and consumes a substantial amount of time. After that, ongoing incremental
loads keep the warehouse storage up to date with data that arrives later.
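The two categories of loading tasks might be sketched as follows (a minimal illustration with Python's sqlite3 module; the table and column names are invented): the initial load populates the empty warehouse table in bulk, while subsequent incremental loads append only the records extracted since the last run.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)"
)

def initial_load(conn, history):
    """One-time bulk load of all historical records into the empty warehouse."""
    conn.executemany(
        "INSERT INTO sales_fact (sale_id, sale_date, amount) VALUES (?, ?, ?)", history
    )
    conn.commit()

def incremental_load(conn, new_records):
    """Recurring load that appends only records extracted since the last run."""
    conn.executemany(
        "INSERT OR IGNORE INTO sales_fact (sale_id, sale_date, amount) VALUES (?, ?, ?)",
        new_records,
    )
    conn.commit()

initial_load(warehouse, [(1, "2024-01-10", 900.0), (2, "2024-01-11", 700.0)])
incremental_load(warehouse, [(3, "2024-02-01", 1200.0)])
print(warehouse.execute("SELECT COUNT(*) FROM sales_fact").fetchone())
```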
Data Storage Components
Data storage for the data warehouse is a separate repository. The data repositories for the
operational systems generally include only the current data. Also, these data repositories hold
the data structured in highly normalized forms for fast and efficient processing.
Information Delivery Component
The information delivery element enables the process of subscribing to data warehouse
information and having it delivered to one or more destinations according to some user-specified
scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system. In the data dictionary, we keep the data about the logical data structures,
the data about the records and addresses, the information about the indexes, and so on.
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users. Its scope
is confined to particular selected subjects. Data in a data warehouse should be fairly current,
but not necessarily up to the minute, although developments in the data warehouse industry have made
frequent and incremental data loads more achievable. Data marts are smaller than data
warehouses and usually contain data for a single department or subject area. The current trend in
data warehousing is to develop a data warehouse with several smaller related data marts for
particular kinds of queries and reports.
Management and Control Component
The management and control elements coordinate the services and functions within the data
warehouse. These components control the data transformation and the data transfer into the data
warehouse storage. They also moderate the data delivery to the clients, work with
the database management systems, and ensure that data is correctly stored in the repositories. They
monitor the movement of data into the staging area and from there into the data
warehouse storage itself.
Extract-Transform-Load (ETL):
The mechanism of extracting data from source systems and bringing it into the data
warehouse is commonly called ETL, which stands for Extraction, Transformation and
Loading.
The ETL process requires active inputs from various stakeholders, including developers,
analysts, testers, and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, the data warehouse needs to change
with business changes. ETL is a recurring activity (daily, weekly, monthly) of a data warehouse
system and needs to be agile, automated, and well documented.
1. Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various source
systems, which can be in various formats such as relational databases, NoSQL stores,
XML, and flat files, into the staging area. It is important to extract the data from the various
source systems and store it in the staging area first, rather than directly in the data
warehouse, because the extracted data is in various formats and can also be corrupted.
Loading it directly into the data warehouse may damage the warehouse, and rollback would be
much more difficult. Therefore, this is one of the most important steps of the ETL process.
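A minimal extraction sketch is shown below (assumptions for illustration only: an orders.csv flat-file export, an operational SQLite database ops.db with an orders table, and a staging/ directory as the staging area). Each source is pulled with a technique appropriate to its format and parked in the staging area rather than written straight into the warehouse.

```python
import csv
import sqlite3
from pathlib import Path

STAGING_DIR = Path("staging")            # hypothetical staging area on disk
STAGING_DIR.mkdir(exist_ok=True)

def extract_csv(path):
    """Extract rows from a flat-file source (CSV)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))

def extract_operational_db(db_path):
    """Extract rows from a relational operational source."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT order_id, customer_id, amount FROM orders").fetchall()
    conn.close()
    return [dict(row) for row in rows]

def stage(records, name):
    """Write extracted records to the staging area as a CSV snapshot."""
    if not records:
        return
    target = STAGING_DIR / f"{name}.csv"
    with open(target, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

# Hypothetical usage, assuming the source files above exist:
# stage(extract_csv("exports/orders.csv"), "orders_flatfile")
# stage(extract_operational_db("ops.db"), "orders_operational")
```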
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or
functions is applied to the extracted data to convert it into a single standard format. It
may involve the following processes/tasks, illustrated in the sketch after this list:
 Filtering – loading only certain attributes into the data warehouse.
 Cleaning – filling up the NULL values with some default values, mapping U.S.A,
United States, and America into USA, etc.
 Joining – joining multiple attributes into one.
 Splitting – splitting a single attribute into multiple attributes.
 Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
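The sketch below walks through each of these tasks on a toy pandas DataFrame; the column names and the country mapping are assumptions made only for this example.

```python
import pandas as pd

raw = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "full_name": ["Asha Rao", "John Smith", "Mei Lin"],
    "country": ["U.S.A", None, "America"],
    "amount": [120.0, 80.0, 200.0],
    "internal_note": ["x", "y", "z"],        # attribute not needed in the warehouse
})

# Filtering: load only certain attributes into the warehouse.
df = raw[["cust_id", "full_name", "country", "amount"]].copy()

# Cleaning: fill NULLs with defaults and map country variants onto one value.
df["country"] = df["country"].fillna("Unknown")
df["country"] = df["country"].replace({"U.S.A": "USA", "United States": "USA", "America": "USA"})

# Splitting: split a single attribute into multiple attributes.
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Joining: combine multiple attributes into one.
df["display_name"] = df["last_name"] + ", " + df["first_name"]

# Sorting: order tuples on the key attribute.
df = df.sort_values("cust_id").reset_index(drop=True)
print(df)
```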
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is
finally loaded into the data warehouse. Sometimes the data is loaded into the data warehouse
very frequently, and sometimes it is loaded at longer but regular intervals.
The rate and period of loading depend solely on the requirements and vary from system
to system.
Data Modeling:
Data warehouse modeling is the process of designing the schemas of the detailed and
summarized data of the data warehouse. The goal of data warehouse modeling is to
develop a schema describing the reality, or at least a part of it, that the data warehouse is
needed to support.
Data Modeling Life Cycle:
In this section, we define a data modeling life cycle. It is a straightforward process of
transforming the business requirements to fulfill the goals for storing, maintaining, and accessing
the data within IT systems. The result is a logical and physical data model for an enterprise data
warehouse.
The objective of the data modeling life cycle is primarily the creation of a storage area for
business information. That area comes from the logical and physical data modeling stages, as
shown in Figure:
Logical Data Model
A logical data model defines the information in as much structure as possible, without regard to
how it will be physically implemented in the database. The primary objective of logical data
modeling is to document the business data structures, processes, rules, and relationships in a
single view - the logical data model.
Physical Data Model
A physical data model describes how the model will be realized in the database. A physical
database model shows all table structures, column names, data types, constraints, primary
keys, foreign keys, and relationships between tables. The purpose of physical data modeling is the
mapping of the logical data model to the physical structures of the RDBMS system hosting the
data warehouse. This includes defining physical RDBMS structures, such as tables and data
types to use when storing the information. It may also include the definition of new data
structures for enhancing query performance.
Logical (Multidimensional) Data Model:
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The multidimensional data model is a method used for organizing data in the
database, with good arrangement and assembly of the contents of the database.
The multidimensional data model allows users to pose analytical questions
associated with market or business trends, unlike relational databases, which allow users
to access data in the form of queries. It allows users to rapidly receive answers to their
requests by creating and examining the data comparatively fast.
OLAP (online analytical processing) and data warehousing use multidimensional databases,
which are used to show multiple dimensions of the data to users.
The model represents data in the form of data cubes. Data cubes allow the data to be modeled and
viewed from many dimensions and perspectives. A cube is defined by dimensions and facts and is
represented by a fact table. Facts are numerical measures, and fact tables contain the measures of
the related dimension tables or the names of the facts.
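A tiny illustration of a fact table referencing a dimension table is sketched below with pandas (the item dimension and the sales facts are invented for this example): each fact row holds dimension keys plus numerical measures, and resolving a key against the dimension table yields readable attributes for analysis.

```python
import pandas as pd

# Hypothetical dimension table describing items.
dim_item = pd.DataFrame({
    "item_key": [1, 2],
    "item_name": ["Car", "Bus"],
    "category": ["Vehicle", "Vehicle"],
})

# Hypothetical fact table: numeric measures keyed by dimension keys.
fact_sales = pd.DataFrame({
    "time_key": ["Q1", "Q1", "Q2"],
    "item_key": [1, 2, 1],
    "location_key": ["Delhi", "Delhi", "Kolkata"],
    "units_sold": [120, 45, 150],     # facts: numerical measures
    "revenue": [600.0, 900.0, 750.0],
})

# Resolving a dimension key gives the analyst readable dimension attributes.
report = fact_sales.merge(dim_item, on="item_key")
print(report[["time_key", "location_key", "item_name", "units_sold", "revenue"]])
```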
(Figure: Multidimensional Data Representation)
OLAP:
At its core, an OLAP cube is a data structure designed for fast analysis of data based on multiple
dimensions.
OLAP cubes support various analytical operations that enhance data
exploration. These include:
 Slicing enables the selection of specific subsets of data based on one or more
dimensions.
 Dicing allows for the selection of specific combinations of dimension values.
 Drill-down enables users to explore data at a more granular level by navigating hierarchies.
 Roll-up aggregates data to higher levels of summarization, facilitating broader analysis.
 Pivoting reorients the cube to view data from different dimensions, providing alternate
perspectives.
Let's take an example to understand how an OLAP cube works. Imagine you are managing a
chain of retail stores, and you want to analyze sales data to gain insights into your business
performance. You have data about sales revenue, products, stores, and time periods (e.g., months
or quarters).
To create an OLAP cube, you would start by identifying the dimensions of your data. In this
case, the dimensions could be:
 Time (e.g., months, quarters, years)
 Product (e.g., categories, brands, individual products)
 Store (e.g., locations, regions, individual stores)
The cube would then be structured with these dimensions forming the axes of the cube. Each
intersection point within the cube represents a specific combination of dimension values. For
example, one intersection point might represent the sales revenue for a particular product in a
specific store during a specific month.
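One rough way to emulate such a cube is sketched below with pandas (the sales rows are invented, and a real OLAP server would precompute and index these aggregates rather than build them on the fly): each (store, product, month) intersection holds the aggregated sales revenue.

```python
import pandas as pd

# Hypothetical sales records for the retail-chain example.
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "product": ["Shoes", "Shirts", "Shoes", "Shirts", "Shoes"],
    "store":   ["Pune", "Pune", "Pune", "Mumbai", "Mumbai"],
    "revenue": [1000.0, 400.0, 1200.0, 500.0, 900.0],
})

# The "cube": store and product on one axis, month on the other,
# each intersection holding total sales revenue for that combination.
cube = sales.pivot_table(
    index=["store", "product"], columns="month", values="revenue",
    aggfunc="sum", fill_value=0,
)
print(cube)
# e.g. cube.loc[("Pune", "Shoes"), "Feb"] is the revenue for that
# product, in that store, during that month.
```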
OLAP Operations:
OLAP stands for Online Analytical Processing. An OLAP server is a software technology that
allows users to analyze information from multiple database systems at the same time. It is
based on a multidimensional data model and allows the user to query multi-dimensional data
(e.g., Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes, and
these cubes are known as hypercubes.
There are five basic analytical operations that can be performed on an OLAP cube; a combined code sketch follows their descriptions below:
1. Drill down:
In the drill-down operation, less detailed data is converted into more detailed data. It can
be done by:
 Moving down in the concept hierarchy
 Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by moving
down in the concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up:
It is just the opposite of the drill-down operation. It performs aggregation on the OLAP cube.
It can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing
up in the concept hierarchy of the Location dimension (City -> Country).
3. Dice:
It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In the
cube given in the overview section, a sub-cube is selected by selecting the following
dimensions with criteria:
 Location = “Delhi” or “Kolkata”
 Time = “Q1” or “Q2”
 Item = “Car” or “Bus”
4. Slice:
It selects a single dimension from the OLAP cube, which results in the creation of a new
sub-cube. In the cube given in the overview section, the slice is performed on the dimension
Time = “Q1”.
5. Pivot:
It is also known as rotation operation as it rotates the current view to get a new view of
the representation. In the sub-cube obtained after the slice operation, performing pivot
operation gives a new view of it.
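The five operations can be imitated on a small pandas DataFrame as sketched below; the dimension values follow the Delhi/Kolkata, Q1/Q2, Car/Bus example used above, and this is only an approximation of what a real OLAP engine does internally.

```python
import pandas as pd

sales = pd.DataFrame({
    "country": ["India"] * 8,
    "city":    ["Delhi", "Delhi", "Delhi", "Delhi",
                "Kolkata", "Kolkata", "Kolkata", "Kolkata"],
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Apr", "May", "Jan", "Feb", "Apr", "May"],
    "item":    ["Car", "Bus", "Car", "Bus", "Car", "Bus", "Car", "Bus"],
    "units":   [30, 12, 25, 10, 40, 15, 35, 18],
})

# Roll-up: climb the Location hierarchy (City -> Country) by aggregating.
roll_up = sales.groupby(["country", "quarter", "item"])["units"].sum()

# Drill-down: move down the Time hierarchy (Quarter -> Month) for finer detail.
drill_down = sales.groupby(["city", "quarter", "month", "item"])["units"].sum()

# Slice: fix a single dimension value (Time = "Q1") to obtain a sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select value combinations on two or more dimensions
# (mirrors the Delhi/Kolkata, Q1/Q2, Car/Bus criteria from the text).
dice = sales[sales["city"].isin(["Delhi", "Kolkata"])
             & sales["quarter"].isin(["Q1", "Q2"])
             & sales["item"].isin(["Car", "Bus"])]

# Pivot: rotate the view, e.g. items as rows and cities as columns.
pivot = slice_q1.pivot_table(index="item", columns="city", values="units", aggfunc="sum")
print(roll_up, drill_down, pivot, sep="\n\n")
```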
OLAP Servers:
Online Analytical Processing (OLAP) refers to a set of software tools used for data analysis in
order to make business decisions. OLAP provides a platform for gaining insights from data
retrieved from multiple database systems at the same time. It is based on a
multidimensional data model, which enables users to extract and view data from various
perspectives. A multidimensional database is used to store OLAP data. Many Business
Intelligence (BI) applications rely on OLAP technology.
Type of OLAP servers:
The three major types of OLAP servers are as follows:
 ROLAP
 MOLAP
 HOLAP
Relational OLAP (ROLAP):
Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a
relational database, where both the base data and dimension tables are stored as relational
tables. ROLAP servers are used to bridge the gap between the relational back-end server and
the client’s front-end tools. ROLAP servers store and manage warehouse data using RDBMS,
and OLAP middleware fills in the gaps.
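A crude sketch of the ROLAP idea follows, using Python with sqlite3 (the star-schema tables and the tiny "middleware" helper are invented for illustration): base data and dimensions live in ordinary relational tables, and the middleware translates an analytical request into a GROUP BY query that the RDBMS executes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_item  (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE dim_city  (city_key INTEGER PRIMARY KEY, city_name TEXT);
    CREATE TABLE fact_sales (item_key INTEGER, city_key INTEGER, quarter TEXT, units INTEGER);

    INSERT INTO dim_item VALUES (1, 'Car'), (2, 'Bus');
    INSERT INTO dim_city VALUES (1, 'Delhi'), (2, 'Kolkata');
    INSERT INTO fact_sales VALUES (1, 1, 'Q1', 30), (2, 1, 'Q1', 12),
                                  (1, 2, 'Q1', 40), (1, 2, 'Q2', 35);
""")

def rolap_aggregate(dimensions):
    """Middleware-style helper: turn requested dimensions into a GROUP BY query."""
    allowed = {"item": "i.item_name", "city": "c.city_name", "quarter": "f.quarter"}
    cols = [allowed[d] for d in dimensions]
    sql = (
        f"SELECT {', '.join(cols)}, SUM(f.units) AS total_units "
        "FROM fact_sales f "
        "JOIN dim_item i ON f.item_key = i.item_key "
        "JOIN dim_city c ON f.city_key = c.city_key "
        f"GROUP BY {', '.join(cols)}"
    )
    return conn.execute(sql).fetchall()

# Aggregate over the relational tables, one query per analytical request.
print(rolap_aggregate(["city", "item"]))
```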
Multidimensional OLAP (MOLAP):
Through array-based multidimensional storage engines, Multidimensional On-Line Analytical
Processing (MOLAP) supports multidimensional views of data. Storage utilization in
multidimensional data stores may be low if the data set is sparse.
MOLAP stores data on disks in the form of a specialized multidimensional array structure. It
is used for OLAP because of the arrays’ random-access capability. Dimension
instances determine array elements, and the data or measured value associated with each cell
is typically stored in the corresponding array element. The multidimensional array is typically
stored in MOLAP in a linear allocation based on nested traversal of the axes in some
predetermined order.
However, unlike ROLAP, which stores only records with non-zero facts, all array elements
are defined in MOLAP, and as a result, the arrays tend to be sparse, with empty elements
occupying a larger portion of them. MOLAP systems typically include provisions such as
advanced indexing and hashing to locate data while performing queries for handling sparse
arrays, because both storage and retrieval costs are important when evaluating online
performance. MOLAP cubes are ideal for slicing and dicing data and can perform complex
calculations. When the cube is created, all calculations are pre-generated.
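The array-based storage idea can be sketched with NumPy as below (the dimension orderings and values are invented): dimension instances map to array indices, each cell stores the measured value, the array is laid out linearly by nested traversal of the axes, and cells without facts leave the array sparse.

```python
import numpy as np

# Dimension instances determine the array coordinates.
items    = ["Car", "Bus", "Bike"]
cities   = ["Delhi", "Kolkata"]
quarters = ["Q1", "Q2", "Q3", "Q4"]

# Every cell of the cube exists, even if no fact was recorded (hence sparsity).
cube = np.zeros((len(items), len(cities), len(quarters)))

def store(item, city, quarter, units):
    """Place a measured value into the cell addressed by its dimension instances."""
    cube[items.index(item), cities.index(city), quarters.index(quarter)] = units

store("Car", "Delhi", "Q1", 30)
store("Bus", "Delhi", "Q1", 12)
store("Car", "Kolkata", "Q2", 35)

# Direct (random-access) lookup of a cell by its dimension coordinates.
print(cube[items.index("Car"), cities.index("Kolkata"), quarters.index("Q2")])

# Linear allocation: the array is stored row-major, i.e. by nested
# traversal of the axes in a predetermined order.
print(cube.ravel()[:8])

# Sparsity: most cells are empty because only a few facts were recorded.
print(f"non-empty cells: {np.count_nonzero(cube)} of {cube.size}")
```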
Hybrid OLAP (HOLAP):
Hybrid On-Line Analytical Processing (HOLAP) combines ROLAP and MOLAP. HOLAP offers
the greater scalability of ROLAP and the faster computation of MOLAP. HOLAP servers are
capable of storing large amounts of detailed data. On the one hand, HOLAP benefits from
ROLAP’s greater scalability; on the other hand, it makes use of cube technology for faster
performance and summary-type information. Because detailed data is stored in a relational
database, the cubes are smaller than in MOLAP.