DATA WAREHOUSING and MINING
(19CS3052R)
Session No: CO1-7
Session Topic: Multi-Dimensional Modeling, Attribute
Oriented Induction and DW Implementation
Prepared & Presented by:
Dr. Vijaya Sri Kompalli
Session Objective
• Understand
• Prediction Cubes: Data Mining in Multi-Dimensional Cube Space
Data Mining in Cube Space
• Data cube greatly increases the analysis bandwidth
• Four ways to combine OLAP-style analysis and data mining
• Using cube space to define data space for mining
• Using OLAP queries to generate features and targets for mining, e.g., multi-
feature cube
• Using data-mining models as building blocks in a multi-step mining process,
e.g., prediction cube
• Using data-cube computation techniques to speed up repeated model
construction
• Cube-space data mining may require building a model for each candidate data space
• Sharing computation across model-construction for different candidates may lead to
efficient mining
Prediction Cubes
•Prediction cube: A cube structure that stores prediction models in
multidimensional data space and supports prediction in OLAP
manner
•Prediction models are used as building blocks to define the
interestingness of subsets of data, i.e., to answer which subsets of
data indicate better prediction
How to Determine the Prediction Power of an Attribute?
Ex. A customer table D:
Two dimensions Z: Time (Month, Year ) and Location (State,
Country)
Two features X: Gender and Salary
One class-label attribute Y: Valued Customer
Q: “Are there times and locations in which the value of a customer
depended greatly on the customer’s gender (i.e., Gender:
predictiveness attribute V)?”
Idea:
Compute the difference between the model built using X to predict Y
and the model built using X − V to predict Y
If the difference is large, V must play an important role in
predicting Y
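The idea above can be sketched with a toy model (a hypothetical majority-class predictor in plain Python; the customer rows, attribute names, and accuracy measure are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

def accuracy(rows, features, target):
    """Evaluate a trivial predictor: for each feature combination,
    predict the majority class seen for that combination."""
    majority = defaultdict(Counter)
    for row in rows:
        majority[tuple(row[f] for f in features)][row[target]] += 1
    correct = sum(
        majority[tuple(row[f] for f in features)].most_common(1)[0][0] == row[target]
        for row in rows
    )
    return correct / len(rows)

# Hypothetical customer subset for one (time, location) cell
rows = [
    {"gender": "M", "salary": "high", "valued": "yes"},
    {"gender": "M", "salary": "low",  "valued": "yes"},
    {"gender": "F", "salary": "high", "valued": "no"},
    {"gender": "F", "salary": "low",  "valued": "no"},
]

with_v = accuracy(rows, ["gender", "salary"], "valued")   # model on X
without_v = accuracy(rows, ["salary"], "valued")          # model on X - V
print(with_v - without_v)  # a large gap means Gender is predictive in this cell
```

A prediction cube stores such differences cell by cell, so rolling up or drilling down locates the (time, location) subsets where V matters most.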
Efficient Computation of Prediction Cubes
• Naïve method: Fully materialize the prediction cube, i.e.,
exhaustively build models and evaluate them for each cell
and for each granularity
• Better approach: Explore score function decomposition that
reduces prediction cube computation to data cube
computation
Complex Aggregation at Multiple Granularities: Multi-
Feature Cubes
• Multi-feature cubes (Ross, et al. 1998): Compute complex queries involving
multiple dependent aggregates at multiple granularities
• Ex. Grouping by all subsets of {item, region, month}, find the maximum price in
2010 for each group, and the total sales among all maximum price tuples
select item, region, month, max(price), sum(R.sales)
from purchases
where year = 2010
cube by item, region, month: R
such that R.price = max(price)
• Continuing the last example: among the max-price tuples, find the min and max
shelf life, and find the fraction of the total sales due to tuples that have min
shelf life within the set of all max-price tuples
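Standard SQL has no “cube by … such that” operator, but the query above can be emulated for a single grouping with correlated subqueries. A sketch in SQLite (table contents hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (item TEXT, region TEXT, month TEXT,
                        year INTEGER, price REAL, sales REAL);
INSERT INTO purchases VALUES
  ('tv', 'east', 'jan', 2010, 400, 10),
  ('tv', 'east', 'jan', 2010, 500, 3),
  ('tv', 'east', 'jan', 2010, 500, 4),
  ('tv', 'west', 'feb', 2010, 450, 7);
""")

# For the (item, region, month) grouping: the max price, plus total
# sales restricted to the tuples that attain that max price.
rows = conn.execute("""
SELECT item, region, month, MAX(price),
       (SELECT SUM(sales) FROM purchases p2
        WHERE p2.item = p.item AND p2.region = p.region
          AND p2.month = p.month AND p2.year = 2010
          AND p2.price = (SELECT MAX(price) FROM purchases p3
                          WHERE p3.item = p.item AND p3.region = p.region
                            AND p3.month = p.month AND p3.year = 2010))
FROM purchases p
WHERE year = 2010
GROUP BY item, region, month
""").fetchall()
print(sorted(rows))
```

The multi-feature cube operator would repeat this for every subset of {item, region, month} in one pass.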
Data Cube Technology: Summary
• Multidimensional Data Analysis in Cube Space
• Multi-feature Cubes
• Prediction Cubes
Happy Learning
Data Generalization:
Attribute-Oriented Induction
Session-6.2
Attribute-Oriented Induction
• Proposed in 1989 (KDD ‘89 workshop)
• Not confined to categorical data nor particular measures
• How is it done?
• Collect the task-relevant data (initial relation) using a relational database
query
• Perform generalization by attribute removal or attribute generalization
• Apply aggregation by merging identical, generalized tuples and accumulating
their respective counts
• Interaction with users for knowledge presentation
Attribute-Oriented Induction: An Example
Example: Describe general characteristics of graduate students in the
University database
• Step 1. Fetch relevant set of data using an SQL statement, e.g.,
Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa)
from student
where student_status in {“Msc”, “MBA”, “PhD” }
• Step 2. Perform attribute-oriented induction
• Step 3. Present results in generalized relation, cross-tab, or rule forms
Class Characterization: An Example
Initial Relation:

Name           | Gender | Major   | Birth-Place           | Birth_date | Residence                | Phone #  | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
…              | …      | …       | …                     | …          | …                        | …        | …

Generalization plan: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth-Place to Country; Birth_date to Age range; Residence to City; Phone # removed; GPA to {Excl, VG, …}

Prime Generalized Relation:

Gender | Major   | Birth_region | Age_range | Residence | GPA       | Count
M      | Science | Canada       | 20-25     | Richmond  | Very-good | 16
F      | Science | Foreign      | 25-30     | Burnaby   | Excellent | 22
…      | …       | …            | …         | …         | …         | …

Cross-tab (Gender × Birth_Region):

Gender | Canada | Foreign | Total
M      | 16     | 14      | 30
F      | 10     | 22      | 32
Total  | 26     | 36      | 62
Basic Principles of Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions, and the result is the
initial relation
Attribute-removal: remove attribute A if there is a large set of distinct values
for A but (1) there is no generalization operator on A, or (2) A’s higher level
concepts are expressed in terms of other attributes
Attribute-generalization: If there is a large set of distinct values for A, and
there exists a set of generalization operators on A, then select an operator and
generalize A
Attribute-threshold control: typically 2–8, specified by the user or by default
Generalized relation threshold control: control the final relation/rule size
Attribute-Oriented Induction: Basic Algorithm
InitialRel: Query processing of task-relevant data, deriving the initial
relation.
PreGen: Based on the analysis of the number of distinct values in
each attribute, determine generalization plan for each attribute:
removal? or how high to generalize?
PrimeGen: Based on the PreGen plan, perform generalization to the
right level to derive a “prime generalized relation”, accumulating the
counts.
Presentation: User interaction: (1) adjust levels by drilling, (2)
pivoting, (3) mapping into rules, cross tabs, visualization
presentations.
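The PreGen/PrimeGen steps reduce to “generalize, then merge and count”. A minimal sketch (the concept hierarchies and tuples are hypothetical):

```python
from collections import Counter

# Hypothetical generalization operators (concept hierarchies)
to_region = {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "Foreign"}
to_area = {"CS": "Science", "Physics": "Science", "Finance": "Business"}

initial_relation = [
    ("M", "CS", "Vancouver"),
    ("M", "CS", "Montreal"),
    ("F", "Physics", "Seattle"),
]

def generalize(tup):
    """Climb each attribute one level up its concept hierarchy."""
    gender, major, birth_place = tup
    return (gender, to_area[major], to_region[birth_place])

# Merge identical generalized tuples, accumulating their counts.
prime_relation = Counter(generalize(t) for t in initial_relation)
print(prime_relation)  # ('M', 'Science', 'Canada') merges two input tuples
```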
Summary
• Data generalization: Attribute-oriented induction
Data Warehouse Implementation
Session-6.3
Objectives
• Understand need of efficient Data Cube Computations
• Perform indexing for given OLAP Data
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube with Li levels per dimension?
Materialization of data cube
Materialize every (cuboid) (full materialization), none (no materialization),
or some (partial materialization)
Selection of which cuboids to materialize
Based on size, sharing, access frequency, etc.
Total number of cuboids:
T = (L1 + 1) × (L2 + 1) × … × (Ln + 1)
where Li is the number of levels associated with dimension i (the extra 1 accounts for the virtual top level “all”)
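The count follows directly from the formula; a one-line check (the level counts are illustrative):

```python
from math import prod

def num_cuboids(levels):
    """T = product of (Li + 1): each dimension contributes its Li
    hierarchy levels plus the virtual top level 'all'."""
    return prod(L + 1 for L in levels)

# e.g. time with 4 levels, item with 3, location with 4
print(num_cuboids([4, 3, 4]))  # 100
```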
The “Compute Cube” Operator
Cube definition and computation in DMQL
define cube sales [item, city, year]: sum (sales_in_dollars)
compute cube sales
Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
Need to compute the following group-bys:
(date, product, customer),
(date, product), (date, customer), (product, customer),
(date), (product), (customer),
()
[Lattice of cuboids for the sales cube: apex (); one-dimensional (item), (city), (year); two-dimensional (city, item), (city, year), (item, year); base cuboid (city, item, year)]
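Without hierarchies, CUBE BY on n dimensions generates 2^n group-bys, which can be enumerated directly:

```python
from itertools import combinations

def all_groupbys(dims):
    """Every subset of the dimensions, from the base cuboid
    (all dims) down to the apex ()."""
    return [combo for r in range(len(dims), -1, -1)
            for combo in combinations(dims, r)]

groupbys = all_groupbys(["city", "item", "year"])
print(len(groupbys))  # 8 cuboids for 3 dimensions
print(groupbys)
```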
Indexing OLAP Data: Bitmap Index
• Index on a particular column
• Each value in the column has a bit vector: bit-op is fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• Not suitable for high-cardinality domains
• A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it
work for high cardinality domain as well [Wu, et al. TODS’06]
Base table Index on Region Index on Type
Cust Region Type
C1 Asia Retail
C2 Europe Dealer
C3 Asia Dealer
C4 America Retail
C5 Europe Dealer
RecID Asia Europe America
1 1 0 0
2 0 1 0
3 1 0 0
4 0 0 1
5 0 1 0
RecID Retail Dealer
1 1 0
2 0 1
3 0 1
4 1 0
5 0 1
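The bitmap tables above can be built and queried with bit operations; a sketch in Python (lists of bits stand in for packed words such as WAH):

```python
def bitmap_index(rows, column):
    """One bit vector per distinct column value; bit i is set when
    row i holds that value in the indexed column."""
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[column], [0] * len(rows))[i] = 1
    return index

base = [
    {"cust": "C1", "region": "Asia", "type": "Retail"},
    {"cust": "C2", "region": "Europe", "type": "Dealer"},
    {"cust": "C3", "region": "Asia", "type": "Dealer"},
    {"cust": "C4", "region": "America", "type": "Retail"},
    {"cust": "C5", "region": "Europe", "type": "Dealer"},
]

region = bitmap_index(base, "region")
rtype = bitmap_index(base, "type")
print(region["Asia"])  # [1, 0, 1, 0, 0]
# Bitwise AND answers "Asian dealers" without scanning the base table:
print([a & b for a, b in zip(region["Asia"], rtype["Dealer"])])  # [0, 0, 1, 0, 0]
```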
Concept Mapping ALM- Create Base Table and Bitmap
Index Tables
Problem Statement: Consider the query below and represent bitmap indexing by creating
base and index tables based on the output of the query
Example Solution – Create for all columns
Indexing OLAP Data: Join Indices
• Join index: JI(R-id, S-id) where R(R-id, …) ⋈ S(S-id, …)
• Traditional indices map the values to a list of record ids
• It materializes relational join in JI file and speeds up
relational join
• In data warehouses, a join index relates the values of the
dimensions of a star schema to rows in the fact table.
• E.g. fact table: Sales and two dimensions city and
product
• A join index on city maintains for each distinct city a
list of R-IDs of the tuples recording the Sales in the
city
• Join indices can span multiple dimensions
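A join index is just a materialized mapping from dimension values to fact-table R-IDs; a sketch with a hypothetical Sales fact table:

```python
from collections import defaultdict

# Hypothetical star schema: Sales fact tuples with their R-IDs
sales_fact = [
    {"rid": "R1", "city": "Vancouver", "product": "tv", "amount": 300},
    {"rid": "R2", "city": "Toronto", "product": "tv", "amount": 250},
    {"rid": "R3", "city": "Vancouver", "product": "dvd", "amount": 100},
]

def join_index(fact, dim_attr):
    """Map each distinct dimension value to the R-IDs of the fact
    tuples it joins with, materializing the join."""
    ji = defaultdict(list)
    for row in fact:
        ji[row[dim_attr]].append(row["rid"])
    return dict(ji)

print(join_index(sales_fact, "city"))  # {'Vancouver': ['R1', 'R3'], 'Toronto': ['R2']}
```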
19CS3052R-CO1-7-S7 ECE
Bitmap Join Index for Snowflake Schema
You can create a bitmap join index on more than one table, in which
the indexed column is joined to the indexed table by using another
table.
For example, you can build an index on countries.country_name, even
though the countries table is not joined directly to the sales table.
Instead, the countries table is joined to the customers table, which is
joined to the sales table.
This type of schema is commonly called a snowflake schema
CREATE BITMAP INDEX sales_co_country_name
ON sales(countries.country_name)
FROM sales, customers, countries
WHERE sales.cust_id = customers.cust_id
AND customers.country_id = countries.country_id
LOCAL NOLOGGING COMPUTE STATISTICS;
Efficient Processing OLAP Queries
• Determine which operations should be performed on the available cuboids
• Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection +
projection
• Determine which materialized cuboid(s) should be selected for OLAP op.
• Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and
there are 4 materialized cuboids available:
1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2004
Which should be selected to process the query?
• Explore indexing structures and compressed vs. dense array structures in MOLAP
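Cuboid selection can be framed as a containment check. The sketch below deliberately ignores concept-hierarchy roll-ups, so it finds only cuboid 3; with roll-ups (item_name to brand, city to province_or_state), cuboids 1 and 4 could also serve, and the cheapest candidate would win:

```python
def can_answer(cuboid, query_dims, query_conds):
    """A materialized cuboid can answer the query directly only if it
    contains every dimension the query groups or selects on
    (hierarchy roll-ups ignored in this sketch)."""
    return set(query_dims) | set(query_conds) <= set(cuboid)

cuboids = {
    1: {"year", "item_name", "city"},
    2: {"year", "brand", "country"},
    3: {"year", "brand", "province_or_state"},
    4: {"item_name", "province_or_state"},  # already sliced to year = 2004
}
query_dims, query_conds = {"brand", "province_or_state"}, {"year"}
usable = [k for k, c in cuboids.items() if can_answer(c, query_dims, query_conds)]
print(usable)  # [3]
```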
Parallel Executions
• Finding and presenting the right information in a timely fashion can
be a challenge.
• Parallel execution is the capability that addresses this challenge
• With parallel execution (also called parallelism), terabytes
of data can be processed in minutes, not hours or days.
• Parallelism is the idea of breaking down a task so that, instead of one
process doing all of the work in a query, many processes do part of
the work at the same time.
• An example of this is when four processes combine to calculate the
total sales for a year: each process handles one quarter of the year
instead of a single process handling all four quarters by itself.
• The improvement in performance can be quite significant.
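The quarters example maps directly onto a worker pool; a minimal sketch (the sales rows are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (quarter, amount) sales rows
sales = [("Q1", 10), ("Q2", 20), ("Q1", 5), ("Q3", 30), ("Q4", 15), ("Q2", 5)]

def quarter_total(q):
    """The share of work one process takes: a single quarter."""
    return sum(amount for quarter, amount in sales if quarter == q)

# Four workers, one quarter each, combined into the yearly total.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(quarter_total, ["Q1", "Q2", "Q3", "Q4"]))
print(total)  # 85
```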
• Parallel execution improves processing for:
• Queries requiring large table scans, joins, or partitioned index scans
• Creations of large indexes
• Creation of large tables (including materialized views)
• Bulk inserts, updates, merges, and deletes
• You can also use parallel execution to access object types within an
Oracle database.
• For example, you can use parallel execution to access large objects
(LOBs).
• Large data warehouses should always use parallel execution to
achieve good performance.
Why Parallel Executions
• If you allocate twice the number of resources and achieve a
processing time that is half of what it was with the original amount of
resources, then the operation scales linearly.
• Scaling linearly is the ultimate goal of parallel processing in delivering
answers from a database query.
Automatic Degree of Parallelism and Statement Queuing
MATERIALIZED VIEWS
• Typically, data flows from one or more online transaction processing
(OLTP) database into a data warehouse on a monthly, weekly, or daily
basis.
• The data is normally processed in a staging file before being added to
the data warehouse.
• Data warehouses commonly range in size from tens of gigabytes to a
few terabytes.
• Usually, the vast majority of the data is stored in a few very large fact
tables.
• One technique employed in data warehouses to improve
performance is the creation of summaries.
• Summaries are special types of aggregate views
• The summaries or aggregates are created in Oracle Database
using a schema object called a materialized view.
• A materialized view is a pre-computed table comprising aggregated or
joined data from a fact table and possibly dimension tables.
• Also known as a summary or aggregate table
• A materialized view eliminates the overhead associated with
expensive joins and aggregations for a large or important class of
queries.
Need for Materialized Views
• Queries to large databases often involve joins between tables,
aggregations such as SUM, or both.
• These operations are expensive in terms of time and processing
power.
• The type of materialized view you create determines how the
materialized view is refreshed and used by query rewrite.
• The query optimizer automatically recognizes when an existing
materialized view can and should be used to satisfy a request.
• It then transparently rewrites the request to use the materialized
view.
• Queries go directly to the materialized view and not to the underlying
detail tables.
• In general, rewriting queries to use materialized views rather than
detail tables improves response time
Types of Materialized Views
• The types of materialized views are:
• Materialized Views with Aggregates
• Materialized Views Containing Only Joins
• Nested Materialized Views
Materialized Views with Aggregates
• In data warehouses, materialized views normally contain aggregates
as shown in query.
• For fast refresh to be possible, the SELECT list must contain all of
the GROUP BY columns (if present), and there must be
a COUNT(*) and a COUNT(column) on any aggregated columns.
• Also, materialized view logs must be present on all tables referenced
in the query that defines the materialized view.
• The valid aggregate functions are:
• SUM, COUNT(x), COUNT(*), AVG, VARIANCE, STDDEV, MIN, and MAX,
and the expression to be aggregated can be any SQL value expression.
CREATE A MATERIALIZED VIEW
CREATE MATERIALIZED VIEW LOG ON products WITH SEQUENCE,
ROWID
(prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_desc,
prod_category, prod_category_desc, prod_weight_class,
prod_unit_of_measure,
prod_pack_size, supplier_id, prod_status, prod_list_price,
prod_min_price)
INCLUDING NEW VALUES;
CREATE MATERIALIZED VIEW LOG ON sales
WITH SEQUENCE, ROWID
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold,
amount_sold)
INCLUDING NEW VALUES;
CREATE MATERIALIZED VIEW product_sales_mv
PCTFREE 0 TABLESPACE demo
STORAGE (INITIAL 8M)
BUILD IMMEDIATE
REFRESH FAST
ENABLE QUERY REWRITE
AS SELECT p.prod_name, SUM(s.amount_sold) AS dollar_sales,
COUNT(*) AS cnt, COUNT(s.amount_sold) AS cnt_amt
FROM sales s, products p
WHERE s.prod_id = p.prod_id GROUP BY p.prod_name;
• This example creates a materialized view product_sales_mv that
computes total number and value of sales for a product.
• It is derived by joining the tables sales and products on the
column prod_id.
• The materialized view is populated with data immediately because
the build method is immediate and it is available for use by query
rewrite.
• In this example, the default refresh method is FAST, which is allowed
because the appropriate materialized view logs have been created on
tables products and sales.
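SQLite has no materialized views, but the effect of product_sales_mv can be sketched with a summary table built once and queried in place of the detail tables (schema trimmed to the columns the view uses):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (prod_id INTEGER, prod_name TEXT);
CREATE TABLE sales (prod_id INTEGER, amount_sold REAL);
INSERT INTO products VALUES (1, 'tv'), (2, 'dvd');
INSERT INTO sales VALUES (1, 300), (1, 200), (2, 100);

-- The summary table plays the role of the materialized view.
CREATE TABLE product_sales_mv AS
  SELECT p.prod_name, SUM(s.amount_sold) AS dollar_sales,
         COUNT(*) AS cnt, COUNT(s.amount_sold) AS cnt_amt
  FROM sales s, products p
  WHERE s.prod_id = p.prod_id
  GROUP BY p.prod_name;
""")

# Query rewrite, done by hand: read the summary, not the detail tables.
dollar_sales = conn.execute(
    "SELECT dollar_sales FROM product_sales_mv WHERE prod_name = 'tv'"
).fetchone()[0]
print(dollar_sales)  # 500.0
```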
Summary
Implementation: Efficient computation of data cubes
Partial vs. full vs. no materialization
Indexing OLAP data: Bitmap index and join index
OLAP query processing
Summary
Implementation: Efficient computation of data cubes
Partial vs. full vs. no materialization
Indexing OALP data: Bitmap index and join index
OLAP query processing
19CS3052R-CO1-7-S7 ECE

More Related Content

PDF
print mod 2.pdf
lathass5
 
PPTX
data mining and data warehousing PPT module 2
premajain3
 
PPTX
data generalization and summarization
janani thirupathi
 
PDF
Data Warehouse Implementation
omayva
 
PPT
Data ware housing- Introduction to olap .
Vibrant Technologies & Computers
 
PPT
Data preprocessing
Manikandan Tamilselvan
 
PPT
Data preprocessing in Data Mining
DHIVYADEVAKI
 
PPT
Cssu dw dm
sumit621
 
print mod 2.pdf
lathass5
 
data mining and data warehousing PPT module 2
premajain3
 
data generalization and summarization
janani thirupathi
 
Data Warehouse Implementation
omayva
 
Data ware housing- Introduction to olap .
Vibrant Technologies & Computers
 
Data preprocessing
Manikandan Tamilselvan
 
Data preprocessing in Data Mining
DHIVYADEVAKI
 
Cssu dw dm
sumit621
 

Similar to 19CS3052R-CO1-7-S7 ECE (20)

PPT
Data preprocessing
Manikandan Tamilselvan
 
PPT
Preprocessing.ppt
chatbot9
 
PPT
Preprocessing.ppt
Roshan575917
 
PPT
Preprocessing.ppt
Arumugam Prakash
 
PPT
Preprocessing.ppt
waseemchaudhry13
 
PPTX
Data warehouse
sudhir Pawar
 
PPTX
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
Saikiran Panjala
 
PDF
2 olap operaciones
Claudia Gomez
 
PDF
mod 2.pdf
ShivaprasadGouda3
 
PPTX
OLAP Basics and Fundamentals by Bharat Kalia
Bharat Kalia
 
PPT
data clean.ppt
chatbot9
 
PPT
Characterization and Comparison
Benjamin Franklin
 
PDF
On multi dimensional cubes of census data: designing and querying
Jaspreet Issaj
 
PDF
Characterization
Aiswaryadevi Jaganmohan
 
PPT
DWO -Pertemuan 1
Abrianto Nugraha
 
PPT
Datapreprocessing
Chandrika Sweety
 
PPTX
Data Mining: Data cube computation and data generalization
DataminingTools Inc
 
Data preprocessing
Manikandan Tamilselvan
 
Preprocessing.ppt
chatbot9
 
Preprocessing.ppt
Roshan575917
 
Preprocessing.ppt
Arumugam Prakash
 
Preprocessing.ppt
waseemchaudhry13
 
Data warehouse
sudhir Pawar
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
Saikiran Panjala
 
2 olap operaciones
Claudia Gomez
 
OLAP Basics and Fundamentals by Bharat Kalia
Bharat Kalia
 
data clean.ppt
chatbot9
 
Characterization and Comparison
Benjamin Franklin
 
On multi dimensional cubes of census data: designing and querying
Jaspreet Issaj
 
Characterization
Aiswaryadevi Jaganmohan
 
DWO -Pertemuan 1
Abrianto Nugraha
 
Datapreprocessing
Chandrika Sweety
 
Data Mining: Data cube computation and data generalization
DataminingTools Inc
 

Recently uploaded (20)

PPTX
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
PDF
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
PDF
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
PPTX
Data-Driven Machine Learning for Rail Infrastructure Health Monitoring
Sione Palu
 
PDF
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
PDF
Company Presentation pada Perusahaan ADB.pdf
didikfahmi
 
PPTX
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
PPTX
Probability systematic sampling methods.pptx
PrakashRajput19
 
PDF
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
PDF
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
PDF
Technical Writing Module-I Complete Notes.pdf
VedprakashArya13
 
PPTX
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
PPTX
Data Security Breach: Immediate Action Plan
varmabhuvan266
 
PPTX
Introduction to computer chapter one 2017.pptx
mensunmarley
 
PPTX
Introduction to Data Analytics and Data Science
KavithaCIT
 
PPTX
Databricks-DE-Associate Certification Questions-june-2024.pptx
pedelli41
 
PPTX
Complete_STATA_Introduction_Beginner.pptx
mbayekebe
 
PPTX
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
PPTX
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
PPTX
Introduction-to-Python-Programming-Language (1).pptx
dhyeysapariya
 
The whitetiger novel review for collegeassignment.pptx
DhruvPatel754154
 
SUMMER INTERNSHIP REPORT[1] (AutoRecovered) (6) (1).pdf
pandeydiksha814
 
717629748-Databricks-Certified-Data-Engineer-Professional-Dumps-by-Ball-21-03...
pedelli41
 
Data-Driven Machine Learning for Rail Infrastructure Health Monitoring
Sione Palu
 
WISE main accomplishments for ISQOLS award July 2025.pdf
StatsCommunications
 
Company Presentation pada Perusahaan ADB.pdf
didikfahmi
 
White Blue Simple Modern Enhancing Sales Strategy Presentation_20250724_21093...
RamNeymarjr
 
Probability systematic sampling methods.pptx
PrakashRajput19
 
202501214233242351219 QASS Session 2.pdf
lauramejiamillan
 
Practical Measurement Systems Analysis (Gage R&R) for design
Rob Schubert
 
Technical Writing Module-I Complete Notes.pdf
VedprakashArya13
 
Multiscale Segmentation of Survey Respondents: Seeing the Trees and the Fores...
Sione Palu
 
Data Security Breach: Immediate Action Plan
varmabhuvan266
 
Introduction to computer chapter one 2017.pptx
mensunmarley
 
Introduction to Data Analytics and Data Science
KavithaCIT
 
Databricks-DE-Associate Certification Questions-june-2024.pptx
pedelli41
 
Complete_STATA_Introduction_Beginner.pptx
mbayekebe
 
Future_of_AI_Presentation for everyone.pptx
boranamanju07
 
INFO8116 - Week 10 - Slides.pptx data analutics
guddipatel10
 
Introduction-to-Python-Programming-Language (1).pptx
dhyeysapariya
 

19CS3052R-CO1-7-S7 ECE

  • 1. DATA WAREHOUSING and MINING (19CS3052R) Session No: CO1-7 Session Topic: Multi-Dimensional Modeling, Attribute Oriented Induction and DW Implementation Prepared & Presented by: Dr. Vijaya Sri Kompalli
  • 2. Session Objective • Understand • Prediction Cubes: Data Mining in Multi-Dimensional Cube Space
  • 3. Data Mining in Cube Space • Data cube greatly increases the analysis bandwidth • Four ways to interact OLAP-styled analysis and data mining • Using cube space to define data space for mining • Using OLAP queries to generate features and targets for mining, e.g., multi- feature cube • Using data-mining models as building blocks in a multi-step mining process, e.g., prediction cube • Using data-cube computation techniques to speed up repeated model construction • Cube-space data mining may require building a model for each candidate data space • Sharing computation across model-construction for different candidates may lead to efficient mining
  • 4. Prediction Cubes •Prediction cube: A cube structure that stores prediction models in multidimensional data space and supports prediction in OLAP manner •Prediction models are used as building blocks to define the interestingness of subsets of data, i.e., to answer which subsets of data indicate better prediction
  • 5. How to Determine the Prediction Power of an Attribute? Ex. A customer table D: Two dimensions Z: Time (Month, Year ) and Location (State, Country) Two features X: Gender and Salary One class-label attribute Y: Valued Customer Q: “Are there times and locations in which the value of a customer depended greatly on the customers gender (i.e., Gender: predictiveness attribute V)?” Idea: Compute the difference between the model built on that using X to predict Y and that built on using X – V to predict Y If the difference is large, V must play an important role at predicting Y
  • 6. Efficient Computation of Prediction Cubes • Naïve method: Fully materialize the prediction cube, i.e., exhaustively build models and evaluate them for each cell and for each granularity • Better approach: Explore score function decomposition that reduces prediction cube computation to data cube computation
  • 7. Complex Aggregation at Multiple Granularities: Multi- Feature Cubes • Multi-feature cubes (Ross, et al. 1998): Compute complex queries involving multiple dependent aggregates at multiple granularities • Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 2010 for each group, and the total sales among all maximum price tuples select item, region, month, max(price), sum(R.sales) from purchases where year = 2010 cube by item, region, month: R such that R.price = max(price) • Continuing the last example, among the max price tuples, find the min and max shelf live, and find the fraction of the total sales due to tuple that have min shelf life within the set of all max price tuples
  • 8. Data Cube Technology: Summary • Multidimensional Data Analysis in Cube Space • Multi-feature Cubes • Prediction Cubes
  • 11. Attribute-Oriented Induction • Proposed in 1989 (KDD ‘89 workshop) • Not confined to categorical data nor particular measures • How it is done? • Collect the task-relevant data (initial relation) using a relational database query • Perform generalization by attribute removal or attribute generalization • Apply aggregation by merging identical, generalized tuples and accumulating their respective counts • Interaction with users for knowledge presentation
  • 12. Attribute-Oriented Induction: An Example Example: Describe general characteristics of graduate students in the University database • Step 1. Fetch relevant set of data using an SQL statement, e.g., Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa) from student where student_status in {“Msc”, “MBA”, “PhD” } • Step 2. Perform attribute-oriented induction • Step 3. Present results in generalized relation, cross-tab, or rule forms
  • 13. Class Characterization: An Example 13 Name Gender Major Birth-Place Birth_date Residence Phone # GPA Jim Woodman M CS Vancouver,BC, Canada 8-12-76 3511 Main St., Richmond 687-4598 3.67 Scott Lachance M CS Montreal, Que, Canada 28-7-75 345 1st Ave., Richmond 253-9106 3.70 Laura Lee … F … Physics … Seattle, WA, USA … 25-8-70 … 125 Austin Ave., Burnaby … 420-5232 … 3.83 … Removed Retained Sci,Eng, Bus Country Age range City Removed Excl, VG,.. Gender Major Birth_region Age_range Residence GPA Count M Science Canada 20-25 Richmond Very-good 16 F Science Foreign 25-30 Burnaby Excellent 22 … … … … … … … Birth_Region Gender Canada Foreign Total M 16 14 30 F 10 22 32 Total 26 36 62 Prime Generalized Relation Initial Relation
  • 14. Basic Principles of Attribute-Oriented Induction Data focusing: task-relevant data, including dimensions, and the result is the initial relation Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A Attribute-threshold control: typical 2-8, specified/default Generalized relation threshold control: control the final relation/rule size
  • 15. Attribute-Oriented Induction: Basic Algorithm InitialRel: Query processing of task-relevant data, deriving the initial relation. PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize? PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts. Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
  • 16. Summary • Data generalization: Attribute-oriented induction
  • 18. Objectives • Understand need of efficient Data Cube Computations • Perform indexing for given OLAP Data
  • 19. Efficient Data Cube Computation Data cube can be viewed as a lattice of cuboids The bottom-most cuboid is the base cuboid The top-most cuboid (apex) contains only one cell How many cuboids in an n-dimensional cube with L levels? Materialization of data cube Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization) Selection of which cuboids to materialize Based on size, sharing, access frequency, etc. ) 1 1 (     n i i L T
  • 20. The “Compute Cube” Operator Cube definition and computation in DMQL define cube sales [item, city, year]: sum (sales_in_dollars) compute cube sales Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96) SELECT item, city, year, SUM (amount) FROM SALES CUBE BY item, city, year Need compute the following Group-Bys (date, product, customer), (date,product),(date, customer), (product, customer), (date), (product), (customer) () (item) (city) () (year) (city, item) (city, year) (item, year) (city, item, year)
  • 21. Indexing OLAP Data: Bitmap Index • Index on a particular column • Each value in the column has a bit vector: bit-op is fast • The length of the bit vector: # of records in the base table • The i-th bit is set if the i-th row of the base table has the value for the indexed column not suitable for high cardinality domains • A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality domain as well [Wu, et al. TODS’06] Base table Index on Region Index on Type Cust Region Type C1 Asia Retail C2 Europe Dealer C3 Asia Dealer C4 America Retail C5 Europe Dealer RecIDAsia Europe America 1 1 0 0 2 0 1 0 3 1 0 0 4 0 0 1 5 0 1 0 RecID Retail Dealer 1 1 0 2 0 1 3 0 1 4 1 0 5 0 1
  • 22. Concept Mapping ALM- Create Base Table and Bitmap Index Tables Problem Statement: Consider the below query and represent the Bitmap indexing by creating base and index tables based on output of the query
  • 23. Example Solution – Create for all columns
  • 24. Indexing OLAP Data: Join Indices • Join index: JI(R-id, S-id) where R (R-id, …)  S (S-id, …) • Traditional indices map the values to a list of record ids • It materializes relational join in JI file and speeds up relational join • In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fact table. • E.g. fact table: Sales and two dimensions city and product • A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city • Join indices can span multiple dimensions
  • 26. Bitmap Join Index for Snowflake Schema You can create a bitmap join index on more than one table, in which the indexed column is joined to the indexed table by using another table. For example, you can build an index on countries.country_name, even though the countries table is not joined directly to the sales table. Instead, the countries table is joined to the customers table, which is joined to the sales table. This type of schema is commonly called a snowflake schema
  • 27. CREATE BITMAP INDEX sales_co_country_name ON sales(countries.country_name) FROM sales, customers, countries WHERE sales.cust_id = customers.cust_id AND customers.country_id = countries.country_id LOCAL NOLOGGING COMPUTE STATISTICS;
  • 28. Efficient Processing OLAP Queries • Determine which operations should be performed on the available cuboids • Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection • Determine which materialized cuboid(s) should be selected for OLAP op. • Let the query to be processed be on {brand, province_or_state} with the condition “year = 2004”, and there are 4 materialized cuboids available: 1) {year, item_name, city} 2) {year, brand, country} 3) {year, brand, province_or_state} 4) {item_name, province_or_state} where year = 2004 Which should be selected to process the query? • Explore indexing structures and compressed vs. dense array structs in MOLAP
  • 29. Parallel Executions • Finding and presenting the right information in a timely fashion can be a challenge. • Parallel execution is the capability that addresses this challenge • Using processes, parallel execution (also called parallelism), terabytes of data can be processed in minutes, not hours or days. • Parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do part of the work at the same time. • An example of this is when four processes combine to calculate the total sales for a year, each process handles one quarter of the year instead of a single processing handling all four quarters by itself. • The improvement in performance can be quite significant.
  • 30. • Parallel execution improves processing for: • Queries requiring large table scans, joins, or partitioned index scans • Creation of large indexes • Creation of large tables (including materialized views) • Bulk inserts, updates, merges, and deletes • You can also use parallel execution to access object types within an Oracle database. • For example, you can use parallel execution to access large objects (LOBs). • Large data warehouses should always use parallel execution to achieve good performance.
  • 31. Why Parallel Execution • If you allocate twice the number of resources and achieve a processing time that is half of what it was with the original amount of resources, then the operation scales linearly. • Scaling linearly is the ultimate goal of parallel processing in delivering answers from a database query.
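The linear-scaling idea on this slide can be stated as a small formula: speedup = T1 / Tp, and scaling is linear when speedup equals the number of resources (efficiency = 1.0). The timing values below are made up for illustration.

```python
# Minimal sketch of the linear-scaling ideal from the slide.
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_workers):
    """Speedup per worker; 1.0 means perfectly linear scaling."""
    return speedup(t_serial, t_parallel) / n_workers

# Doubling resources (2 workers) halves the elapsed time: linear scaling.
linear = efficiency(60.0, 30.0, 2)       # 1.0
# Doubling resources but only cutting time from 60 to 40: sub-linear.
sublinear = efficiency(60.0, 40.0, 2)    # 0.75
```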
  • 33. MATERIALIZED VIEWS • Typically, data flows from one or more online transaction processing (OLTP) databases into a data warehouse on a monthly, weekly, or daily basis. • The data is normally processed in a staging file before being added to the data warehouse. • Data warehouses commonly range in size from tens of gigabytes to a few terabytes. • Usually, the vast majority of the data is stored in a few very large fact tables. • One technique employed in data warehouses to improve performance is the creation of summaries. • Summaries are special types of aggregate views.
  • 34. • In Oracle Database, these summaries or aggregates are created using a schema object called a materialized view. • A materialized view is a pre-computed table comprising aggregated or joined data from fact and possibly dimension tables. • It is also known as a summary or aggregate table. • A materialized view eliminates the overhead associated with expensive joins and aggregations for a large or important class of queries.
  • 35. Need for Materialized Views • Queries to large databases often involve joins between tables, aggregations such as SUM, or both. • These operations are expensive in terms of time and processing power. • The type of materialized view you create determines how the materialized view is refreshed and used by query rewrite. • The query optimizer automatically recognizes when an existing materialized view can and should be used to satisfy a request. • It then transparently rewrites the request to use the materialized view. • Queries go directly to the materialized view and not to the underlying detail tables. • In general, rewriting queries to use materialized views rather than detail tables improves response time.
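The optimizer's rewrite decision described above can be sketched as a coverage check: a query can be redirected to a materialized view when the view's grouping columns and aggregates cover those of the query. This is a deliberately simplified model, not Oracle's actual matching algorithm, and all names are hypothetical.

```python
# Hypothetical sketch of a query-rewrite eligibility check.
# A query grouping at the same or coarser level than the MV, with covered
# aggregates, can be answered from the MV (simplified rule for illustration).
def can_rewrite(query, mv):
    """query/mv are dicts with 'group_by' and 'aggregates' sets."""
    return (query["group_by"] <= mv["group_by"] and
            query["aggregates"] <= mv["aggregates"])

mv = {"group_by": {"prod_name", "month"},
      "aggregates": {"SUM(amount_sold)", "COUNT(*)"}}

# Same or coarser grouping, covered aggregate: rewritable (roll up the MV).
q1 = {"group_by": {"prod_name"}, "aggregates": {"SUM(amount_sold)"}}
# Groups by a column the MV has already aggregated away: not rewritable.
q2 = {"group_by": {"cust_id"}, "aggregates": {"SUM(amount_sold)"}}
```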
  • 36. Types of Materialized Views • The types of materialized views are: • Materialized Views with Aggregates • Materialized Views Containing Only Joins • Nested Materialized Views
  • 37. Materialized Views with Aggregates • In data warehouses, materialized views normally contain aggregates, as shown in the example that follows. • For fast refresh to be possible, the SELECT list must contain all of the GROUP BY columns (if present), and there must be a COUNT(*) and a COUNT(column) on any aggregated columns. • Also, materialized view logs must be present on all tables referenced in the query that defines the materialized view. • The valid aggregate functions are SUM, COUNT(x), COUNT(*), AVG, VARIANCE, STDDEV, MIN, and MAX, and the expression to be aggregated can be any SQL value expression.
  • 38. CREATE A MATERIALIZED VIEW
       CREATE MATERIALIZED VIEW LOG ON products
       WITH SEQUENCE, ROWID
       (prod_id, prod_name, prod_desc, prod_subcategory, prod_subcategory_desc,
        prod_category, prod_category_desc, prod_weight_class,
        prod_unit_of_measure, prod_pack_size, supplier_id, prod_status,
        prod_list_price, prod_min_price)
       INCLUDING NEW VALUES;
  • 39. CREATE MATERIALIZED VIEW LOG ON sales
       WITH SEQUENCE, ROWID
       (prod_id, cust_id, time_id, channel_id, promo_id,
        quantity_sold, amount_sold)
       INCLUDING NEW VALUES;
  • 40. CREATE MATERIALIZED VIEW product_sales_mv
       PCTFREE 0 TABLESPACE demo
       STORAGE (INITIAL 8M)
       BUILD IMMEDIATE
       REFRESH FAST
       ENABLE QUERY REWRITE
       AS SELECT p.prod_name, SUM(s.amount_sold) AS dollar_sales,
          COUNT(*) AS cnt, COUNT(s.amount_sold) AS cnt_amt
       FROM sales s, products p
       WHERE s.prod_id = p.prod_id
       GROUP BY p.prod_name;
  • 41. • This example creates a materialized view product_sales_mv that computes the total number and value of sales for each product. • It is derived by joining the tables sales and products on the column prod_id. • The materialized view is populated with data immediately because the build method is IMMEDIATE, and it is available for use by query rewrite. • The refresh method is FAST, which is allowed because the appropriate materialized view logs have been created on tables products and sales.
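The reason FAST refresh requires both COUNT(*) and COUNT(column) can be sketched as follows: rows recorded in the materialized view log can be folded into the stored per-group SUM and counts without rescanning the detail table. This is a simplified model of incremental maintenance, not Oracle's refresh mechanism.

```python
# Illustrative sketch of incremental (fast) refresh for product_sales_mv.
# Materialized view state: prod_name -> [dollar_sales, cnt, cnt_amt]
mv = {"Widget": [100.0, 3, 3]}  # assumed starting state for the example

def fast_refresh(mv, log_rows):
    """Apply (prod_name, amount_sold) rows from the MV log incrementally."""
    for prod, amount in log_rows:
        sales, cnt, cnt_amt = mv.get(prod, [0.0, 0, 0])
        mv[prod] = [
            sales + (amount or 0.0),            # SUM ignores NULL amounts
            cnt + 1,                            # COUNT(*) counts every row
            cnt_amt + (amount is not None),     # COUNT(col) skips NULLs
        ]
    return mv

# Two inserts captured in the log, one with a NULL (None) amount:
fast_refresh(mv, [("Widget", 50.0), ("Widget", None)])
```

Keeping COUNT(*) alongside COUNT(column) is what makes deletes and NULL-valued inserts maintainable without touching the base table.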
  • 42. Summary Implementation: Efficient computation of data cubes Partial vs. full vs. no materialization Indexing OALP data: Bitmap index and join index OLAP query processing
  • 43. Summary Implementation: Efficient computation of data cubes Partial vs. full vs. no materialization Indexing OALP data: Bitmap index and join index OLAP query processing