DATA WAREHOUSING and MINING
(19CS3052R)
Session No: CO1-7
Session Topic: Multi-Dimensional Modeling, Attribute
Oriented Induction and DW Implementation
Prepared & Presented by:
Dr. Vijaya Sri Kompalli
Session Objective
• Understand
• Prediction Cubes: Data Mining in Multi-Dimensional Cube Space
Data Mining in Cube Space
• Data cube greatly increases the analysis bandwidth
• Four ways to combine OLAP-style analysis and data mining
• Using cube space to define data space for mining
• Using OLAP queries to generate features and targets for mining, e.g., multi-
feature cube
• Using data-mining models as building blocks in a multi-step mining process,
e.g., prediction cube
• Using data-cube computation techniques to speed up repeated model
construction
• Cube-space data mining may require building a model for each candidate data space
• Sharing computation across model-construction for different candidates may lead to
efficient mining
Prediction Cubes
•Prediction cube: A cube structure that stores prediction models in
multidimensional data space and supports prediction in OLAP
manner
•Prediction models are used as building blocks to define the
interestingness of subsets of data, i.e., to answer which subsets of
data indicate better prediction
How to Determine the Prediction Power of an Attribute?
Ex. A customer table D:
Two dimensions Z: Time (Month, Year ) and Location (State,
Country)
Two features X: Gender and Salary
One class-label attribute Y: Valued Customer
Q: “Are there times and locations in which the value of a customer
depended greatly on the customer’s gender (i.e., Gender:
predictiveness attribute V)?”
Idea:
Compute the difference between the model built using X to predict Y
and the model built using X − V to predict Y
If the difference is large, V must play an important role in
predicting Y
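The idea above can be sketched with a toy model (a hypothetical majority-class predictor in plain Python; the customer rows, attribute names, and accuracy measure are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

def accuracy(rows, features, target):
    """Evaluate a trivial predictor: for each feature combination,
    predict the majority class seen for that combination."""
    majority = defaultdict(Counter)
    for row in rows:
        majority[tuple(row[f] for f in features)][row[target]] += 1
    correct = sum(
        majority[tuple(row[f] for f in features)].most_common(1)[0][0] == row[target]
        for row in rows
    )
    return correct / len(rows)

# Hypothetical customer subset for one (time, location) cell
rows = [
    {"gender": "M", "salary": "high", "valued": "yes"},
    {"gender": "M", "salary": "low",  "valued": "yes"},
    {"gender": "F", "salary": "high", "valued": "no"},
    {"gender": "F", "salary": "low",  "valued": "no"},
]

with_v = accuracy(rows, ["gender", "salary"], "valued")   # model on X
without_v = accuracy(rows, ["salary"], "valued")          # model on X - V
print(with_v - without_v)  # a large gap means Gender is predictive in this cell
```

A prediction cube stores such differences cell by cell, so rolling up or drilling down locates the (time, location) subsets where V matters most.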
Efficient Computation of Prediction Cubes
• Naïve method: Fully materialize the prediction cube, i.e.,
exhaustively build models and evaluate them for each cell
and for each granularity
• Better approach: Explore score function decomposition that
reduces prediction cube computation to data cube
computation
Complex Aggregation at Multiple Granularities: Multi-
Feature Cubes
• Multi-feature cubes (Ross, et al. 1998): Compute complex queries involving
multiple dependent aggregates at multiple granularities
• Ex. Grouping by all subsets of {item, region, month}, find the maximum price in
2010 for each group, and the total sales among all maximum price tuples
select item, region, month, max(price), sum(R.sales)
from purchases
where year = 2010
cube by item, region, month: R
such that R.price = max(price)
• Continuing the last example: among the max-price tuples, find the min and max
shelf life, and find the fraction of the total sales due to tuples that have min
shelf life within the set of all max-price tuples
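Standard SQL has no “cube by … such that” operator, but the query above can be emulated for a single grouping with correlated subqueries. A sketch in SQLite (table contents hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (item TEXT, region TEXT, month TEXT,
                        year INTEGER, price REAL, sales REAL);
INSERT INTO purchases VALUES
  ('tv', 'east', 'jan', 2010, 400, 10),
  ('tv', 'east', 'jan', 2010, 500, 3),
  ('tv', 'east', 'jan', 2010, 500, 4),
  ('tv', 'west', 'feb', 2010, 450, 7);
""")

# For the (item, region, month) grouping: the max price, plus total
# sales restricted to the tuples that attain that max price.
rows = conn.execute("""
SELECT item, region, month, MAX(price),
       (SELECT SUM(sales) FROM purchases p2
        WHERE p2.item = p.item AND p2.region = p.region
          AND p2.month = p.month AND p2.year = 2010
          AND p2.price = (SELECT MAX(price) FROM purchases p3
                          WHERE p3.item = p.item AND p3.region = p.region
                            AND p3.month = p.month AND p3.year = 2010))
FROM purchases p
WHERE year = 2010
GROUP BY item, region, month
""").fetchall()
print(sorted(rows))
```

The multi-feature cube operator would repeat this for every subset of {item, region, month} in one pass.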
Data Cube Technology: Summary
• Multidimensional Data Analysis in Cube Space
• Multi-feature Cubes
• Prediction Cubes
Happy Learning
Data Generalization:
Attribute-Oriented Induction
Session-6.2
Attribute-Oriented Induction
• Proposed in 1989 (KDD ‘89 workshop)
• Not confined to categorical data nor particular measures
• How is it done?
• Collect the task-relevant data (initial relation) using a relational database
query
• Perform generalization by attribute removal or attribute generalization
• Apply aggregation by merging identical, generalized tuples and accumulating
their respective counts
• Interaction with users for knowledge presentation
Attribute-Oriented Induction: An Example
Example: Describe general characteristics of graduate students in the
University database
• Step 1. Fetch relevant set of data using an SQL statement, e.g.,
Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }
• Step 2. Perform attribute-oriented induction
• Step 3. Present results in generalized relation, cross-tab, or rule forms
Class Characterization: An Example
Initial Relation:

Name           | Gender | Major   | Birth-Place           | Birth_date | Residence                | Phone #  | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
…              | …      | …       | …                     | …          | …                        | …        | …

Generalization plan: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth-Place to Country; Birth_date to Age range; Residence to City; Phone # removed; GPA to {Excl, VG, …}

Prime Generalized Relation:

Gender | Major   | Birth_region | Age_range | Residence | GPA       | Count
M      | Science | Canada       | 20-25     | Richmond  | Very-good | 16
F      | Science | Foreign      | 25-30     | Burnaby   | Excellent | 22
…      | …       | …            | …         | …         | …         | …

Cross-tab (Gender × Birth_Region):

Gender | Canada | Foreign | Total
M      | 16     | 14      | 30
F      | 10     | 22      | 32
Total  | 26     | 36      | 62
Basic Principles of Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions, and the result is the
initial relation
Attribute-removal: remove attribute A if there is a large set of distinct values
for A but (1) there is no generalization operator on A, or (2) A’s higher level
concepts are expressed in terms of other attributes
Attribute-generalization: If there is a large set of distinct values for A, and
there exists a set of generalization operators on A, then select an operator and
generalize A
Attribute-threshold control: typically 2–8, specified by the user or by default
Generalized relation threshold control: control the final relation/rule size
Attribute-Oriented Induction: Basic Algorithm
InitialRel: Query processing of task-relevant data, deriving the initial
relation.
PreGen: Based on the analysis of the number of distinct values in
each attribute, determine generalization plan for each attribute:
removal? or how high to generalize?
PrimeGen: Based on the PreGen plan, perform generalization to the
right level to derive a “prime generalized relation”, accumulating the
counts.
Presentation: User interaction: (1) adjust levels by drilling, (2)
pivoting, (3) mapping into rules, cross tabs, visualization
presentations.
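The PreGen/PrimeGen steps reduce to “generalize, then merge and count”. A minimal sketch (the concept hierarchies and tuples are hypothetical):

```python
from collections import Counter

# Hypothetical generalization operators (concept hierarchies)
to_region = {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "Foreign"}
to_area = {"CS": "Science", "Physics": "Science", "Finance": "Business"}

initial_relation = [
    ("M", "CS", "Vancouver"),
    ("M", "CS", "Montreal"),
    ("F", "Physics", "Seattle"),
]

def generalize(tup):
    """Climb each attribute one level up its concept hierarchy."""
    gender, major, birth_place = tup
    return (gender, to_area[major], to_region[birth_place])

# Merge identical generalized tuples, accumulating their counts.
prime_relation = Counter(generalize(t) for t in initial_relation)
print(prime_relation)  # ('M', 'Science', 'Canada') merges two input tuples
```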
Summary
• Data generalization: Attribute-oriented induction
Data Warehouse Implementation
Session-6.3
Objectives
• Understand need of efficient Data Cube Computations
• Perform indexing for given OLAP Data
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube with Li levels per dimension?
Materialization of data cube
Materialize every (cuboid) (full materialization), none (no materialization),
or some (partial materialization)
Selection of which cuboids to materialize
Based on size, sharing, access frequency, etc.
Total number of cuboids:
T = (L1 + 1) × (L2 + 1) × … × (Ln + 1)
where Li is the number of levels associated with dimension i (the extra 1 accounts for the virtual top level “all”)
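The count follows directly from the formula; a one-line check (the level counts are illustrative):

```python
from math import prod

def num_cuboids(levels):
    """T = product of (Li + 1): each dimension contributes its Li
    hierarchy levels plus the virtual top level 'all'."""
    return prod(L + 1 for L in levels)

# e.g. time with 4 levels, item with 3, location with 4
print(num_cuboids([4, 3, 4]))  # 100
```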
The “Compute Cube” Operator
Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
Need to compute the following group-bys:
(date, product, customer),
(date, product), (date, customer), (product, customer),
(date), (product), (customer),
()
[Lattice of cuboids for the sales cube: apex (); one-dimensional (item), (city), (year); two-dimensional (city, item), (city, year), (item, year); base cuboid (city, item, year)]
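Without hierarchies, CUBE BY on n dimensions generates 2^n group-bys, which can be enumerated directly:

```python
from itertools import combinations

def all_groupbys(dims):
    """Every subset of the dimensions, from the base cuboid
    (all dims) down to the apex ()."""
    return [combo for r in range(len(dims), -1, -1)
            for combo in combinations(dims, r)]

groupbys = all_groupbys(["city", "item", "year"])
print(len(groupbys))  # 8 cuboids for 3 dimensions
print(groupbys)
```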
Indexing OLAP Data: Bitmap Index
• Index on a particular column
• Each value in the column has a bit vector: bit-op is fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• Not suitable for high-cardinality domains
• A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it
work for high cardinality domain as well [Wu, et al. TODS’06]
Base table Index on Region Index on Type
Cust Region Type
C1 Asia Retail
C2 Europe Dealer
C3 Asia Dealer
C4 America Retail
C5 Europe Dealer
RecID Asia Europe America
1 1 0 0
2 0 1 0
3 1 0 0
4 0 0 1
5 0 1 0
RecID Retail Dealer
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
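The bitmap tables above can be built and queried with bit operations; a sketch in Python (lists of bits stand in for packed words such as WAH):

```python
def bitmap_index(rows, column):
    """One bit vector per distinct column value; bit i is set when
    row i holds that value in the indexed column."""
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[column], [0] * len(rows))[i] = 1
    return index

base = [
    {"cust": "C1", "region": "Asia", "type": "Retail"},
    {"cust": "C2", "region": "Europe", "type": "Dealer"},
    {"cust": "C3", "region": "Asia", "type": "Dealer"},
    {"cust": "C4", "region": "America", "type": "Retail"},
    {"cust": "C5", "region": "Europe", "type": "Dealer"},
]

region = bitmap_index(base, "region")
rtype = bitmap_index(base, "type")
print(region["Asia"])  # [1, 0, 1, 0, 0]
# Bitwise AND answers "Asian dealers" without scanning the base table:
print([a & b for a, b in zip(region["Asia"], rtype["Dealer"])])  # [0, 0, 1, 0, 0]
```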
Concept Mapping ALM- Create Base Table and Bitmap
Index Tables
Problem Statement: Consider the query below and represent bitmap indexing by creating
base and index tables based on the output of the query
Example Solution – Create for all columns
Indexing OLAP Data: Join Indices
• Join index: JI(R-id, S-id) where R(R-id, …) ⋈ S(S-id, …)
• Traditional indices map the values to a list of record ids
• It materializes relational join in JI file and speeds up
relational join
• In data warehouses, a join index relates the values of the
dimensions of a star schema to rows in the fact table.
• E.g. fact table: Sales and two dimensions city and
product
• A join index on city maintains for each distinct city a
list of R-IDs of the tuples recording the Sales in the
city
• Join indices can span multiple dimensions
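A join index is just a materialized mapping from dimension values to fact-table R-IDs; a sketch with a hypothetical Sales fact table:

```python
from collections import defaultdict

# Hypothetical star schema: Sales fact tuples with their R-IDs
sales_fact = [
    {"rid": "R1", "city": "Vancouver", "product": "tv", "amount": 300},
    {"rid": "R2", "city": "Toronto", "product": "tv", "amount": 250},
    {"rid": "R3", "city": "Vancouver", "product": "dvd", "amount": 100},
]

def join_index(fact, dim_attr):
    """Map each distinct dimension value to the R-IDs of the fact
    tuples it joins with, materializing the join."""
    ji = defaultdict(list)
    for row in fact:
        ji[row[dim_attr]].append(row["rid"])
    return dict(ji)

print(join_index(sales_fact, "city"))  # {'Vancouver': ['R1', 'R3'], 'Toronto': ['R2']}
```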
19CS3052R-CO1-7-S7 ECE
Bitmap Join Index for Snowflake Schema
You can create a bitmap join index on more than one table, in which
the indexed column is joined to the indexed table by using another
table.
For example, you can build an index on countries.country_name, even
though the countries table is not joined directly to the sales table.
Instead, the countries table is joined to the customers table, which is
joined to the sales table.
This type of schema is commonly called a snowflake schema
CREATE BITMAP INDEX sales_co_country_name
ON sales(countries.country_name)
FROM sales, customers, countries
WHERE sales.cust_id = customers.cust_id
AND customers.country_id = countries.country_id
LOCAL NOLOGGING COMPUTE STATISTICS;
Efficient Processing OLAP Queries
• Determine which operations should be performed on the available cuboids
• Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection +
projection
• Determine which materialized cuboid(s) should be selected for OLAP op.
• Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and
there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
• Explore indexing structures and compressed vs. dense array structures in MOLAP
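Cuboid selection can be framed as a containment check. The sketch below deliberately ignores concept-hierarchy roll-ups, so it finds only cuboid 3; with roll-ups (item_name to brand, city to province_or_state), cuboids 1 and 4 could also serve, and the cheapest candidate would win:

```python
def can_answer(cuboid, query_dims, query_conds):
    """A materialized cuboid can answer the query directly only if it
    contains every dimension the query groups or selects on
    (hierarchy roll-ups ignored in this sketch)."""
    return set(query_dims) | set(query_conds) <= set(cuboid)

cuboids = {
    1: {"year", "item_name", "city"},
    2: {"year", "brand", "country"},
    3: {"year", "brand", "province_or_state"},
    4: {"item_name", "province_or_state"},  # already sliced to year = 2004
}
query_dims, query_conds = {"brand", "province_or_state"}, {"year"}
usable = [k for k, c in cuboids.items() if can_answer(c, query_dims, query_conds)]
print(usable)  # [3]
```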
Parallel Executions
• Finding and presenting the right information in a timely fashion can
be a challenge.
• Parallel execution is the capability that addresses this challenge
• With parallel execution (also called parallelism), terabytes
of data can be processed in minutes, not hours or days.
• Parallelism is the idea of breaking down a task so that, instead of one
process doing all of the work in a query, many processes do part of
the work at the same time.
• An example of this is when four processes combine to calculate the
total sales for a year: each process handles one quarter of the year
instead of a single process handling all four quarters by itself.
• The improvement in performance can be quite significant.
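The quarters example maps directly onto a worker pool; a minimal sketch (the sales rows are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (quarter, amount) sales rows
sales = [("Q1", 10), ("Q2", 20), ("Q1", 5), ("Q3", 30), ("Q4", 15), ("Q2", 5)]

def quarter_total(q):
    """The share of work one process takes: a single quarter."""
    return sum(amount for quarter, amount in sales if quarter == q)

# Four workers, one quarter each, combined into the yearly total.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(quarter_total, ["Q1", "Q2", "Q3", "Q4"]))
print(total)  # 85
```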
• Parallel execution improves processing for:
• Queries requiring large table scans, joins, or partitioned index scans
• Creations of large indexes
• Creation of large tables (including materialized views)
• Bulk inserts, updates, merges, and deletes
• You can also use parallel execution to access object types within an
Oracle database.
• For example, you can use parallel execution to access large objects
(LOBs).
• Large data warehouses should always use parallel execution to
achieve good performance.
Why Parallel Executions
• If you allocate twice the number of resources and achieve a
processing time that is half of what it was with the original amount of
resources, then the operation scales linearly.
• Scaling linearly is the ultimate goal of parallel processing in delivering
answers from a database query.
Automatic Degree of Parallelism and Statement Queuing
MATERIALIZED VIEWS
• Typically, data flows from one or more online transaction processing
(OLTP) database into a data warehouse on a monthly, weekly, or daily
basis.
• The data is normally processed in a staging file before being added to
the data warehouse.
• Data warehouses commonly range in size from tens of gigabytes to a
few terabytes.
• Usually, the vast majority of the data is stored in a few very large fact
tables.
• One technique employed in data warehouses to improve
performance is the creation of summaries.
• Summaries are special types of aggregate views
• The summaries or aggregates are created in Oracle Database
using a schema object called a materialized view.
• A materialized view is a pre-computed table comprising aggregated or
joined data from a fact table and possibly dimension tables.
• Also known as a summary or aggregate table
• A materialized view eliminates the overhead associated with
expensive joins and aggregations for a large or important class of
queries.
Need for Materialized Views
• Queries to large databases often involve joins between tables,
aggregations such as SUM, or both.
• These operations are expensive in terms of time and processing
power.
• The type of materialized view you create determines how the
materialized view is refreshed and used by query rewrite.
• The query optimizer automatically recognizes when an existing
materialized view can and should be used to satisfy a request.
• It then transparently rewrites the request to use the materialized
view.
• Queries go directly to the materialized view and not to the underlying
detail tables.
• In general, rewriting queries to use materialized views rather than
detail tables improves response time
Types of Materialized Views
• The types of materialized views are:
• Materialized Views with Aggregates
• Materialized Views Containing Only Joins
• Nested Materialized Views
Materialized Views with Aggregates
• In data warehouses, materialized views normally contain aggregates
as shown in query.
• For fast refresh to be possible, the SELECT list must contain all of
the GROUP BY columns (if present), and there must be
a COUNT(*) and a COUNT(column) on any aggregated columns.
• Also, materialized view logs must be present on all tables referenced
in the query that defines the materialized view.
• The valid aggregate functions are:
• SUM, COUNT(x), COUNT(*), AVG, VARIANCE, STDDEV, MIN, and MAX,
and the expression to be aggregated can be any SQL value expression.
CREATE A MATERIALIZED VIEW
CREATE MATERIALIZED VIEW LOG ON products WITH SEQUENCE,
ROWID
(prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_desc,
prod_category, prod_category_desc, prod_weight_class,
prod_unit_of_measure,
prod_pack_size, supplier_id, prod_status, prod_list_price,
prod_min_price)
INCLUDING NEW VALUES;
CREATE MATERIALIZED VIEW LOG ON sales
WITH SEQUENCE, ROWID
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold,
amount_sold)
INCLUDING NEW VALUES;
CREATE MATERIALIZED VIEW product_sales_mv
PCTFREE 0 TABLESPACE demo
STORAGE (INITIAL 8M)
BUILD IMMEDIATE
REFRESH FAST
ENABLE QUERY REWRITE
AS SELECT p.prod_name, SUM(s.amount_sold) AS dollar_sales,
COUNT(*) AS cnt, COUNT(s.amount_sold) AS cnt_amt
FROM sales s, products p
WHERE s.prod_id = p.prod_id GROUP BY p.prod_name;
• This example creates a materialized view product_sales_mv that
computes total number and value of sales for a product.
• It is derived by joining the tables sales and products on the
column prod_id.
• The materialized view is populated with data immediately because
the build method is immediate and it is available for use by query
rewrite.
• In this example, the default refresh method is FAST, which is allowed
because the appropriate materialized view logs have been created on
tables products and sales.
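SQLite has no materialized views, but the effect of product_sales_mv can be sketched with a summary table built once and queried in place of the detail tables (schema trimmed to the columns the view uses):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (prod_id INTEGER, prod_name TEXT);
CREATE TABLE sales (prod_id INTEGER, amount_sold REAL);
INSERT INTO products VALUES (1, 'tv'), (2, 'dvd');
INSERT INTO sales VALUES (1, 300), (1, 200), (2, 100);

-- The summary table plays the role of the materialized view.
CREATE TABLE product_sales_mv AS
  SELECT p.prod_name, SUM(s.amount_sold) AS dollar_sales,
         COUNT(*) AS cnt, COUNT(s.amount_sold) AS cnt_amt
  FROM sales s, products p
  WHERE s.prod_id = p.prod_id
  GROUP BY p.prod_name;
""")

# Query rewrite, done by hand: read the summary, not the detail tables.
dollar_sales = conn.execute(
    "SELECT dollar_sales FROM product_sales_mv WHERE prod_name = 'tv'"
).fetchone()[0]
print(dollar_sales)  # 500.0
```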
Summary
Implementation: Efficient computation of data cubes
Partial vs. full vs. no materialization
Indexing OLAP data: Bitmap index and join index
OLAP query processing
Summary
Implementation: Efficient computation of data cubes
Partial vs. full vs. no materialization
Indexing OALP data: Bitmap index and join index
OLAP query processing
19CS3052R-CO1-7-S7 ECE

More Related Content

PDF
print mod 2.pdf
lathass5
 
PPTX
data mining and data warehousing PPT module 2
premajain3
 
PPTX
data generalization and summarization
janani thirupathi
 
PDF
Data Warehouse Implementation
omayva
 
PPT
Data ware housing- Introduction to olap .
Vibrant Technologies & Computers
 
PPT
Data preprocessing
Manikandan Tamilselvan
 
PPT
Data preprocessing in Data Mining
DHIVYADEVAKI
 
PPT
Cssu dw dm
sumit621
 
print mod 2.pdf
lathass5
 
data mining and data warehousing PPT module 2
premajain3
 
data generalization and summarization
janani thirupathi
 
Data Warehouse Implementation
omayva
 
Data ware housing- Introduction to olap .
Vibrant Technologies & Computers
 
Data preprocessing
Manikandan Tamilselvan
 
Data preprocessing in Data Mining
DHIVYADEVAKI
 
Cssu dw dm
sumit621
 

Similar to 19CS3052R-CO1-7-S7 ECE (20)

PPT
Data preprocessing
Manikandan Tamilselvan
 
PPT
Preprocessing.ppt
chatbot9
 
PPT
Preprocessing.ppt
Roshan575917
 
PPT
Preprocessing.ppt
Arumugam Prakash
 
PPT
Preprocessing.ppt
waseemchaudhry13
 
PPTX
Data warehouse
sudhir Pawar
 
PPTX
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
Saikiran Panjala
 
PDF
2 olap operaciones
Claudia Gomez
 
PDF
mod 2.pdf
ShivaprasadGouda3
 
PPTX
OLAP Basics and Fundamentals by Bharat Kalia
Bharat Kalia
 
PPT
data clean.ppt
chatbot9
 
PPT
Characterization and Comparison
Benjamin Franklin
 
PDF
On multi dimensional cubes of census data: designing and querying
Jaspreet Issaj
 
PDF
Characterization
Aiswaryadevi Jaganmohan
 
PPT
DWO -Pertemuan 1
Abrianto Nugraha
 
PPT
Datapreprocessing
Chandrika Sweety
 
PPTX
Data Mining: Data cube computation and data generalization
DataminingTools Inc
 
Data preprocessing
Manikandan Tamilselvan
 
Preprocessing.ppt
chatbot9
 
Preprocessing.ppt
Roshan575917
 
Preprocessing.ppt
Arumugam Prakash
 
Preprocessing.ppt
waseemchaudhry13
 
Data warehouse
sudhir Pawar
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
Saikiran Panjala
 
2 olap operaciones
Claudia Gomez
 
OLAP Basics and Fundamentals by Bharat Kalia
Bharat Kalia
 
data clean.ppt
chatbot9
 
Characterization and Comparison
Benjamin Franklin
 
On multi dimensional cubes of census data: designing and querying
Jaspreet Issaj
 
Characterization
Aiswaryadevi Jaganmohan
 
DWO -Pertemuan 1
Abrianto Nugraha
 
Datapreprocessing
Chandrika Sweety
 
Data Mining: Data cube computation and data generalization
DataminingTools Inc
 

Recently uploaded (20)

PPTX
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PDF
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
PPTX
Data-Driven Machine Learning for Rail Infrastructure Health Monitoring
Sione Palu
 
PDF
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
PDF
Company Presentation pada Perusahaan ADB.pdf
didikfahmi
 
PPTX
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
PPTX
Probability systematic sampling methods.pptx
PrakashRajput19
 
PDF
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
PDF
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
PDF
Technical Writing Module-I Complete Notes.pdf
VedprakashArya13
 
PPTX
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
PPTX
Data Security Breach: Immediate Action Plan
varmabhuvan266
 
PPTX
Introduction to computer chapter one 2017.pptx
mensunmarley
 
PPTX
Introduction to Data Analytics and Data Science
KavithaCIT
 
PPTX
Databricks-DE-Associate Certification Questions-june-2024.pptx
pedelli41
 
PPTX
Complete_STATA_Introduction_Beginner.pptx
mbayekebe
 
PPTX
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
PPTX
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
PPTX
Introduction-to-Python-Programming-Language (1).pptx
dhyeysapariya
 
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
Data-Driven Machine Learning for Rail Infrastructure Health Monitoring
Sione Palu
 
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
Company Presentation pada Perusahaan ADB.pdf
didikfahmi
 
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
Probability systematic sampling methods.pptx
PrakashRajput19
 
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
Technical Writing Module-I Complete Notes.pdf
VedprakashArya13
 
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
Data Security Breach: Immediate Action Plan
varmabhuvan266
 
Introduction to computer chapter one 2017.pptx
mensunmarley
 
Introduction to Data Analytics and Data Science
KavithaCIT
 
Databricks-DE-Associate Certification Questions-june-2024.pptx
pedelli41
 
Complete_STATA_Introduction_Beginner.pptx
mbayekebe
 
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
Introduction-to-Python-Programming-Language (1).pptx
dhyeysapariya
 

19CS3052R-CO1-7-S7 ECE

  • 1. DATA WAREHOUSING and MINING (19CS3052R) Session No: CO1-7 Session Topic: Multi-Dimensional Modeling, Attribute Oriented Induction and DW Implementation Prepared & Presented by: Dr. Vijaya Sri Kompalli
  • 2. Session Objective • Understand • Prediction Cubes: Data Mining in Multi-Dimensional Cube Space
  • 3. Data Mining in Cube Space • Data cube greatly increases the analysis bandwidth • Four ways to interact OLAP-styled analysis and data mining • Using cube space to define data space for mining • Using OLAP queries to generate features and targets for mining, e.g., multi- feature cube • Using data-mining models as building blocks in a multi-step mining process, e.g., prediction cube • Using data-cube computation techniques to speed up repeated model construction • Cube-space data mining may require building a model for each candidate data space • Sharing computation across model-construction for different candidates may lead to efficient mining
  • 4. Prediction Cubes •Prediction cube: A cube structure that stores prediction models in multidimensional data space and supports prediction in OLAP manner •Prediction models are used as building blocks to define the interestingness of subsets of data, i.e., to answer which subsets of data indicate better prediction
  • 5. How to Determine the Prediction Power of an Attribute? Ex. A customer table D: Two dimensions Z: Time (Month, Year ) and Location (State, Country) Two features X: Gender and Salary One class-label attribute Y: Valued Customer Q: “Are there times and locations in which the value of a customer depended greatly on the customers gender (i.e., Gender: predictiveness attribute V)?” Idea: Compute the difference between the model built on that using X to predict Y and that built on using X – V to predict Y If the difference is large, V must play an important role at predicting Y
  • 6. Efficient Computation of Prediction Cubes • Naïve method: Fully materialize the prediction cube, i.e., exhaustively build models and evaluate them for each cell and for each granularity • Better approach: Explore score function decomposition that reduces prediction cube computation to data cube computation
  • 7. Complex Aggregation at Multiple Granularities: Multi- Feature Cubes • Multi-feature cubes (Ross, et al. 1998): Compute complex queries involving multiple dependent aggregates at multiple granularities • Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 2010 for each group, and the total sales among all maximum price tuples select item, region, month, max(price), sum(R.sales) from purchases where year = 2010 cube by item, region, month: R such that R.price = max(price) • Continuing the last example, among the max price tuples, find the min and max shelf live, and find the fraction of the total sales due to tuple that have min shelf life within the set of all max price tuples
  • 8. Data Cube Technology: Summary • Multidimensional Data Analysis in Cube Space • Multi-feature Cubes • Prediction Cubes
  • 11. Attribute-Oriented Induction • Proposed in 1989 (KDD ‘89 workshop) • Not confined to categorical data nor particular measures • How it is done? • Collect the task-relevant data (initial relation) using a relational database query • Perform generalization by attribute removal or attribute generalization • Apply aggregation by merging identical, generalized tuples and accumulating their respective counts • Interaction with users for knowledge presentation
  • 12. Attribute-Oriented Induction: An Example Example: Describe general characteristics of graduate students in the University database • Step 1. Fetch relevant set of data using an SQL statement, e.g., Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa) from student where student_status in {“Msc”, “MBA”, “PhD” } • Step 2. Perform attribute-oriented induction • Step 3. Present results in generalized relation, cross-tab, or rule forms
  • 13. Class Characterization: An Example 13 Name Gender Major Birth-Place Birth_date Residence Phone # GPA Jim Woodman M CS Vancouver,BC, Canada 8-12-76 3511 Main St., Richmond 687-4598 3.67 Scott Lachance M CS Montreal, Que, Canada 28-7-75 345 1st Ave., Richmond 253-9106 3.70 Laura Lee … F … Physics … Seattle, WA, USA … 25-8-70 … 125 Austin Ave., Burnaby … 420-5232 … 3.83 … Removed Retained Sci,Eng, Bus Country Age range City Removed Excl, VG,.. Gender Major Birth_region Age_range Residence GPA Count M Science Canada 20-25 Richmond Very-good 16 F Science Foreign 25-30 Burnaby Excellent 22 … … … … … … … Birth_Region Gender Canada Foreign Total M 16 14 30 F 10 22 32 Total 26 36 62 Prime Generalized Relation Initial Relation
  • 14. Basic Principles of Attribute-Oriented Induction Data focusing: task-relevant data, including dimensions, and the result is the initial relation Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A Attribute-threshold control: typical 2-8, specified/default Generalized relation threshold control: control the final relation/rule size
  • 15. Attribute-Oriented Induction: Basic Algorithm InitialRel: Query processing of task-relevant data, deriving the initial relation. PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize? PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts. Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
  • 16. Summary • Data generalization: Attribute-oriented induction
  • 18. Objectives • Understand need of efficient Data Cube Computations • Perform indexing for given OLAP Data
  • 19. Efficient Data Cube Computation Data cube can be viewed as a lattice of cuboids The bottom-most cuboid is the base cuboid The top-most cuboid (apex) contains only one cell How many cuboids in an n-dimensional cube with L levels? Materialization of data cube Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization) Selection of which cuboids to materialize Based on size, sharing, access frequency, etc. ) 1 1 (     n i i L T
  • 20. The “Compute Cube” Operator Cube definition and computation in DMQL define cube sales [item, city, year]: sum (sales_in_dollars) compute cube sales Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96) SELECT item, city, year, SUM (amount) FROM SALES CUBE BY item, city, year Need compute the following Group-Bys (date, product, customer), (date,product),(date, customer), (product, customer), (date), (product), (customer) () (item) (city) () (year) (city, item) (city, year) (item, year) (city, item, year)
  • 21. Indexing OLAP Data: Bitmap Index • Index on a particular column • Each value in the column has a bit vector: bit-op is fast • The length of the bit vector: # of records in the base table • The i-th bit is set if the i-th row of the base table has the value for the indexed column not suitable for high cardinality domains • A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality domain as well [Wu, et al. TODS’06] Base table Index on Region Index on Type Cust Region Type C1 Asia Retail C2 Europe Dealer C3 Asia Dealer C4 America Retail C5 Europe Dealer RecIDAsia Europe America 1 1 0 0 2 0 1 0 3 1 0 0 4 0 0 1 5 0 1 0 RecID Retail Dealer 1 1 0 2 0 1 3 0 1 4 1 0 5 0 1
  • 22. Concept Mapping ALM- Create Base Table and Bitmap Index Tables Problem Statement: Consider the below query and represent the Bitmap indexing by creating base and index tables based on output of the query
  • 23. Example Solution – Create for all columns
  • 24. Indexing OLAP Data: Join Indices • Join index: JI(R-id, S-id) where R (R-id, …)  S (S-id, …) • Traditional indices map the values to a list of record ids • It materializes relational join in JI file and speeds up relational join • In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fact table. • E.g. fact table: Sales and two dimensions city and product • A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city • Join indices can span multiple dimensions
  • 26. Bitmap Join Index for Snowflake Schema You can create a bitmap join index on more than one table, in which the indexed column is joined to the indexed table by using another table. For example, you can build an index on countries.country_name, even though the countries table is not joined directly to the sales table. Instead, the countries table is joined to the customers table, which is joined to the sales table. This type of schema is commonly called a snowflake schema
  • 27. CREATE BITMAP INDEX sales_co_country_name ON sales(countries.country_name) FROM sales, customers, countries WHERE sales.cust_id = customers.cust_id AND customers.country_id = countries.country_id LOCAL NOLOGGING COMPUTE STATISTICS;
  • 28. Efficient Processing OLAP Queries • Determine which operations should be performed on the available cuboids • Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection • Determine which materialized cuboid(s) should be selected for OLAP op. • Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available: 1) {year, item_name, city} 2) {year, brand, country} 3) {year, brand, province_or_state} 4) {item_name, province_or_state} where year = 2004 Which should be selected to process the query? • Explore indexing structures and compressed vs. dense array structs in MOLAP
  • 29. Parallel Executions • Finding and presenting the right information in a timely fashion can be a challenge. • Parallel execution is the capability that addresses this challenge • Using processes, parallel execution (also called parallelism), terabytes of data can be processed in minutes, not hours or days. • Parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do part of the work at the same time. • An example of this is when four processes combine to calculate the total sales for a year, each process handles one quarter of the year instead of a single processing handling all four quarters by itself. • The improvement in performance can be quite significant.
  • 30. • Parallel execution improves processing for: • Queries requiring large table scans, joins, or partitioned index scans • Creation of large indexes • Creation of large tables (including materialized views) • Bulk inserts, updates, merges, and deletes • You can also use parallel execution to access object types within an Oracle database. • For example, you can use parallel execution to access large objects (LOBs). • Large data warehouses should always use parallel execution to achieve good performance.
  • 31. Why Parallel Execution • If you allocate twice the number of resources and achieve a processing time that is half of what it was with the original amount of resources, then the operation scales linearly. • Scaling linearly is the ultimate goal of parallel processing in delivering answers from a database query.
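The linear-scaling idea on this slide can be stated as a small formula: speedup = T1 / Tp, and scaling is linear when speedup equals the number of resources (efficiency = 1.0). The timing values below are made up for illustration.

```python
# Minimal sketch of the linear-scaling ideal from the slide.
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_workers):
    """Speedup per worker; 1.0 means perfectly linear scaling."""
    return speedup(t_serial, t_parallel) / n_workers

# Doubling resources (2 workers) halves the elapsed time: linear scaling.
linear = efficiency(60.0, 30.0, 2)       # 1.0
# Doubling resources but only cutting time from 60 to 40: sub-linear.
sublinear = efficiency(60.0, 40.0, 2)    # 0.75
```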
  • 33. MATERIALIZED VIEWS • Typically, data flows from one or more online transaction processing (OLTP) databases into a data warehouse on a monthly, weekly, or daily basis. • The data is normally processed in a staging file before being added to the data warehouse. • Data warehouses commonly range in size from tens of gigabytes to a few terabytes. • Usually, the vast majority of the data is stored in a few very large fact tables. • One technique employed in data warehouses to improve performance is the creation of summaries. • Summaries are special types of aggregate views.
  • 34. • In Oracle Database, these summaries or aggregates are created using a schema object called a materialized view. • A materialized view is a pre-computed table comprising aggregated or joined data from fact and possibly dimension tables. • It is also known as a summary or aggregate table. • A materialized view eliminates the overhead associated with expensive joins and aggregations for a large or important class of queries.
  • 35. Need for Materialized Views • Queries to large databases often involve joins between tables, aggregations such as SUM, or both. • These operations are expensive in terms of time and processing power. • The type of materialized view you create determines how the materialized view is refreshed and used by query rewrite. • The query optimizer automatically recognizes when an existing materialized view can and should be used to satisfy a request. • It then transparently rewrites the request to use the materialized view. • Queries go directly to the materialized view and not to the underlying detail tables. • In general, rewriting queries to use materialized views rather than detail tables improves response time.
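The optimizer's rewrite decision described above can be sketched as a coverage check: a query can be redirected to a materialized view when the view's grouping columns and aggregates cover those of the query. This is a deliberately simplified model, not Oracle's actual matching algorithm, and all names are hypothetical.

```python
# Hypothetical sketch of a query-rewrite eligibility check.
# A query grouping at the same or coarser level than the MV, with covered
# aggregates, can be answered from the MV (simplified rule for illustration).
def can_rewrite(query, mv):
    """query/mv are dicts with 'group_by' and 'aggregates' sets."""
    return (query["group_by"] <= mv["group_by"] and
            query["aggregates"] <= mv["aggregates"])

mv = {"group_by": {"prod_name", "month"},
      "aggregates": {"SUM(amount_sold)", "COUNT(*)"}}

# Same or coarser grouping, covered aggregate: rewritable (roll up the MV).
q1 = {"group_by": {"prod_name"}, "aggregates": {"SUM(amount_sold)"}}
# Groups by a column the MV has already aggregated away: not rewritable.
q2 = {"group_by": {"cust_id"}, "aggregates": {"SUM(amount_sold)"}}
```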
  • 36. Types of Materialized Views • The types of materialized views are: • Materialized Views with Aggregates • Materialized Views Containing Only Joins • Nested Materialized Views
  • 37. Materialized Views with Aggregates • In data warehouses, materialized views normally contain aggregates, as shown in the example that follows. • For fast refresh to be possible, the SELECT list must contain all of the GROUP BY columns (if present), and there must be a COUNT(*) and a COUNT(column) on any aggregated columns. • Also, materialized view logs must be present on all tables referenced in the query that defines the materialized view. • The valid aggregate functions are SUM, COUNT(x), COUNT(*), AVG, VARIANCE, STDDEV, MIN, and MAX, and the expression to be aggregated can be any SQL value expression.
  • 38. CREATE A MATERIALIZED VIEW
       CREATE MATERIALIZED VIEW LOG ON products
       WITH SEQUENCE, ROWID
       (prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc,
        prod_category, prod_category_desc, prod_weight_class,
        prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
        prod_list_price, prod_min_price)
       INCLUDING NEW VALUES;
  • 39. CREATE MATERIALIZED VIEW LOG ON sales
       WITH SEQUENCE, ROWID
       (prod_id, cust_id, time_id, channel_id, promo_id,
        quantity_sold, amount_sold)
       INCLUDING NEW VALUES;
  • 40. CREATE MATERIALIZED VIEW product_sales_mv
       PCTFREE 0 TABLESPACE demo
       STORAGE (INITIAL 8M)
       BUILD IMMEDIATE
       REFRESH FAST
       ENABLE QUERY REWRITE
       AS SELECT p.prod_name, SUM(s.amount_sold) AS dollar_sales,
          COUNT(*) AS cnt, COUNT(s.amount_sold) AS cnt_amt
       FROM sales s, products p
       WHERE s.prod_id = p.prod_id
       GROUP BY p.prod_name;
  • 41. • This example creates a materialized view product_sales_mv that computes the total number and value of sales for each product. • It is derived by joining the tables sales and products on the column prod_id. • The materialized view is populated with data immediately because the build method is IMMEDIATE, and it is available for use by query rewrite. • The refresh method is FAST, which is allowed because the appropriate materialized view logs have been created on tables products and sales.
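The reason FAST refresh requires both COUNT(*) and COUNT(column) can be sketched as follows: rows recorded in the materialized view log can be folded into the stored per-group SUM and counts without rescanning the detail table. This is a simplified model of incremental maintenance, not Oracle's refresh mechanism.

```python
# Illustrative sketch of incremental (fast) refresh for product_sales_mv.
# Materialized view state: prod_name -> [dollar_sales, cnt, cnt_amt]
mv = {"Widget": [100.0, 3, 3]}  # assumed starting state for the example

def fast_refresh(mv, log_rows):
    """Apply (prod_name, amount_sold) rows from the MV log incrementally."""
    for prod, amount in log_rows:
        sales, cnt, cnt_amt = mv.get(prod, [0.0, 0, 0])
        mv[prod] = [
            sales + (amount or 0.0),            # SUM ignores NULL amounts
            cnt + 1,                            # COUNT(*) counts every row
            cnt_amt + (amount is not None),     # COUNT(col) skips NULLs
        ]
    return mv

# Two inserts captured in the log, one with a NULL (None) amount:
fast_refresh(mv, [("Widget", 50.0), ("Widget", None)])
```

Keeping COUNT(*) alongside COUNT(column) is what makes deletes and NULL-valued inserts maintainable without touching the base table.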
  • 42. Summary Implementation: Efficient computation of data cubes Partial vs. full vs. no materialization Indexing OALP data: Bitmap index and join index OLAP query processing
  • 43. Summary Implementation: Efficient computation of data cubes Partial vs. full vs. no materialization Indexing OALP data: Bitmap index and join index OLAP query processing