ISOM
Data Mining and Warehousing
Outline
• Objectives/Motivation for Data Mining
• Data mining technique: Classification
• Data mining technique: Association
• Data Warehousing
• Summary – Effect on Society
Why Data mining?
• Data Growth Rate
• Twice as much information was created
in 2002 as in 1999 (~30% growth rate)
• Other growth rate estimates even higher
• Very little data will ever be looked at by a
human
• Knowledge Discovery is NEEDED to
make sense and use of data.
Data Mining for Customer Modeling
• Customer Tasks:
attrition prediction
targeted marketing:
• cross-sell, customer acquisition
credit-risk
fraud detection
• Industries
banking, telecom, retail sales, …
Customer Attrition: Case Study
• Situation: Attrition rate for mobile phone
customers is around 25-30% a year!
• Task:
• Given customer information for the past N
months, predict who is likely to attrite next
month.
• Also, estimate customer value and what is the
cost-effective offer to be made to this
customer.
Customer Attrition Results
• Verizon Wireless built a customer data
warehouse
• Identified potential attriters
• Developed multiple, regional models
• Targeted customers with high propensity to
accept the offer
• Reduced attrition rate from over 2%/month
to under 1.5%/month (huge impact, with
>30 M subscribers)
(Reported in 2003)
Assessing Credit Risk: Case Study
• Situation: Person applies for a loan
• Task: Should a bank approve the
loan?
• Note: People who have the best
credit don’t need the loans, and
people with the worst credit are not
likely to repay. The bank’s best
customers are in the middle
Credit Risk - Results
• Banks develop credit models using
a variety of machine learning
methods.
• Mortgage and credit card
proliferation are the results of being
able to successfully predict if a
person is likely to default on a loan
• Widely deployed in many countries
Successful e-commerce – Case
Study
• A person buys a book (product) at Amazon.com.
• Task: Recommend other books (products) this
person is likely to buy
• Amazon does clustering based on books
bought:
customers who bought “Advances in Knowledge
Discovery and Data Mining” also bought “Data
Mining: Practical Machine Learning Tools and
Techniques with Java Implementations”
• Recommendation program is quite successful
Major Data Mining Tasks
• Classification: predicting an item class
• Clustering: finding clusters in data
• Associations: e.g. A & B & C occur frequently
• Visualization: to facilitate human discovery
• Summarization: describing a group
• Deviation Detection: finding changes
• Estimation: predicting a continuous value
• Link Analysis: finding relationships
• …
Outline
• Objectives/Motivation for Data Mining
• Data mining technique: Classification
• Data mining technique: Association
• Data Warehousing
• Summary – Effect on Society
Classification
Learn a method for predicting the instance class
from pre-labeled (classified) instances
Many approaches:
Regression,
Decision Trees,
Bayesian,
Neural Networks,
...
Given a set of points from known classes,
what is the class of a new point?
Classification: Linear Regression
• Linear Regression
w0 + w1 x + w2 y >= 0
• Regression
computes wi from
data to minimize
squared error to
‘fit’ the data
• Not flexible enough
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
(Figure: points plotted on X and Y axes, with the splits X = 5, X = 2 and Y = 3 marked.)
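The nested rules on this slide translate directly into a small function. The sketch below is only an illustration: the function name `classify` and the four sample points are assumptions made for the example, not part of the original slide.

```python
def classify(x: float, y: float) -> str:
    """Apply the slide's nested if/else rules to one point (x, y)."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"


# Hypothetical sample points, one for each branch of the rules.
points = [(6, 1), (1, 4), (3, 2), (1, 1)]
labels = [classify(x, y) for x, y in points]
print(labels)
```

Each test compares one attribute with a constant, so every region the tree carves out is an axis-parallel rectangle in the X-Y plane.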
Classification: Neural Nets
• Can select more
complex regions
• Can be more
accurate
• Also can overfit the
data – find patterns
in random noise
Example: The weather problem
Outlook Temperature Humidity Windy Play
sunny 85 85 false no
sunny 80 90 true no
overcast 83 86 false yes
rainy 70 96 false yes
rainy 68 80 false yes
rainy 65 70 true no
overcast 64 65 true yes
sunny 72 95 false no
sunny 69 70 false yes
rainy 75 80 false yes
sunny 75 70 true yes
overcast 72 90 true yes
overcast 81 75 false yes
rainy 71 91 true no
Given past data,
Can you come up
with the rules for
Play/Not Play ?
What is the game?
The weather problem
• Conditions for playing
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild Normal False Yes
… … … … …
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
witten&eibe
Weather data with mixed attributes
• Some attributes have numeric values
Outlook Temperature Humidity Windy Play
Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes
… … … … …
If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
witten&eibe
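One way to read the numeric rules is as an ordered list in which the first matching rule decides. The sketch below applies them to the 14 instances of the full weather table shown earlier in the deck; the names `WEATHER` and `predict_play` are assumptions made for this example.

```python
# The 14 instances of the numeric weather table:
# (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", 85, 85, False, "no"),
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),
    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy", 71, 91, True, "no"),
]


def predict_play(outlook, humidity, windy):
    """Ordered rule list: the first rule whose condition holds decides."""
    if outlook == "sunny" and humidity > 83:
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity < 85:
        return "yes"
    return "yes"  # default rule: "none of the above"


correct = sum(predict_play(o, h, w) == play for o, _t, h, w, play in WEATHER)
print(f"{correct} of {len(WEATHER)} training instances classified correctly")
```

Temperature is never consulted by these rules. Fitting every training instance says nothing about accuracy on new cases, which is why the deck goes on to discuss pruning.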
A decision tree for this problem
witten&eibe
outlook
  sunny: humidity
    high: no
    normal: yes
  overcast: yes
  rainy: windy
    TRUE: no
    FALSE: yes
Building Decision Tree
• Top-down tree construction
At start, all training examples are at
the root.
Partition the examples recursively by
choosing one attribute each time.
• Bottom-up tree pruning
Remove subtrees or branches, in a
bottom-up manner, to improve the
estimated accuracy on new cases.
Choosing the Splitting Attribute
• At each node, available attributes
are evaluated on the basis of
separating the classes of the
training examples. A Goodness
function is used for this purpose.
• Typical goodness functions:
information gain (ID3/C4.5)
information gain ratio
gini index
witten&eibe
Which attribute to select?
witten&eibe
A criterion for attribute selection
• Which is the best attribute?
The one which will result in the smallest tree
Heuristic: choose the attribute that produces
the “purest” nodes
• Popular impurity criterion: information
gain
Information gain increases with the average
purity of the subsets that an attribute
produces
• Strategy: choose attribute that results in
greatest information gain
witten&eibe
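Information gain can be computed as the class entropy before a split minus the weighted average entropy of the subsets after it. The sketch below does this for the two nominal attributes (Outlook and Windy) of the 14-row weather table shown earlier; the helper names `entropy` and `information_gain` are assumptions for this illustration.

```python
from collections import Counter
from math import log2

# (outlook, windy, play) for the 14 weather instances listed earlier.
ROWS = [
    ("sunny", False, "no"), ("sunny", True, "no"),
    ("overcast", False, "yes"), ("rainy", False, "yes"),
    ("rainy", False, "yes"), ("rainy", True, "no"),
    ("overcast", True, "yes"), ("sunny", False, "no"),
    ("sunny", False, "yes"), ("rainy", False, "yes"),
    ("sunny", True, "yes"), ("overcast", True, "yes"),
    ("overcast", False, "yes"), ("rainy", True, "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr_index, class_index=-1):
    """Class entropy before the split minus the weighted entropy after it."""
    before = entropy([r[class_index] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr_index], []).append(r[class_index])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


gain_outlook = information_gain(ROWS, 0)
gain_windy = information_gain(ROWS, 1)
print(f"gain(outlook) = {gain_outlook:.3f} bits, gain(windy) = {gain_windy:.3f} bits")
```

Outlook yields the larger gain of the two because its "overcast" subset is pure (all "yes"), which matches the "purest nodes" heuristic.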
Outline
• Objectives/Motivation for Data Mining
• Data mining technique: Classification
• Data mining technique: Association
• Data Warehousing
• Summary – Effect on Society
Transactions Example
TID Produce
1 MILK, BREAD, EGGS
2 BREAD, SUGAR
3 BREAD, CEREAL
4 MILK, BREAD, SUGAR
5 MILK, CEREAL
6 BREAD, CEREAL
7 MILK, CEREAL
8 MILK, BREAD, CEREAL, EGGS
9 MILK, BREAD, CEREAL
Transaction database: Example
TID Products
1 A, B, E
2 B, D
3 B, C
4 A, B, D
5 A, C
6 B, C
7 A, C
8 A, B, C, E
9 A, B, C
ITEMS:
A = milk
B = bread
C = cereal
D = sugar
E = eggs
Instances = Transactions
Transaction database: Example
TID A B C D E
1 1 1 0 0 1
2 0 1 0 1 0
3 0 1 1 0 0
4 1 1 0 1 0
5 1 0 1 0 0
6 0 1 1 0 0
7 1 0 1 0 0
8 1 1 1 0 1
9 1 1 1 0 0
TID Products
1 A, B, E
2 B, D
3 B, C
4 A, B, D
5 A, C
6 B, C
7 A, C
8 A, B, C, E
9 A, B, C
Attributes converted to binary flags
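The conversion from item lists to binary flags can be sketched in a few lines. The names `TRANSACTIONS`, `ITEMS` and `to_flags` are assumptions for this illustration; the data is the nine-transaction example from the slide.

```python
# TID -> set of purchased items, as in the example database.
TRANSACTIONS = {
    1: {"A", "B", "E"}, 2: {"B", "D"}, 3: {"B", "C"},
    4: {"A", "B", "D"}, 5: {"A", "C"}, 6: {"B", "C"},
    7: {"A", "C"}, 8: {"A", "B", "C", "E"}, 9: {"A", "B", "C"},
}
ITEMS = ["A", "B", "C", "D", "E"]  # fixed column order


def to_flags(itemset, items=ITEMS):
    """Return one 0/1 flag per item, in the fixed column order."""
    return [1 if item in itemset else 0 for item in items]


flag_table = {tid: to_flags(itemset) for tid, itemset in TRANSACTIONS.items()}
for tid, flags in flag_table.items():
    print(tid, *flags)
```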
Definitions
• Item: attribute=value pair or simply value
usually attributes are converted to binary
flags for each value, e.g. product=“A” is
written as “A”
• Itemset I : a subset of possible items
Example: I = {A,B,E} (order unimportant)
• Transaction: (TID, itemset)
TID is transaction ID
Support and Frequent Itemsets
• Support of an itemset
sup(I ) = no. of transactions t that
support (i.e. contain) I
• In example database:
sup ({A,B,E}) = 2, sup ({B,C}) = 4
• Frequent itemset I is one with at
least the minimum support count
sup(I ) >= minsup
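The support count is simple to check by hand or in code. The sketch below counts the transactions that contain every item of an itemset, using the example database; the function names `sup` and `is_frequent` are assumptions for this illustration.

```python
# The nine transactions of the example database (TIDs omitted).
TRANSACTIONS = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]


def sup(itemset, transactions=TRANSACTIONS):
    """Support count: number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def is_frequent(itemset, minsup, transactions=TRANSACTIONS):
    """An itemset is frequent when its support count reaches minsup."""
    return sup(itemset, transactions) >= minsup


print(sup({"A", "B", "E"}), sup({"B", "C"}))
```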
SUBSET PROPERTY
• Every subset of a frequent set is
frequent!
• Q: Why is it so?
• A: Example: Suppose {A,B} is frequent.
Since each occurrence of A,B includes
both A and B, then both A and B must
also be frequent
• Similar argument for larger itemsets
• Almost all association rule algorithms are
based on this subset property
Association Rules
• Association rule R : Itemset1 =>
Itemset2
Itemset1, 2 are disjoint and Itemset2
is non-empty
meaning: if transaction includes
Itemset1 then it also has Itemset2
• Examples
A,B => E,C
A => B,C
From Frequent Itemsets to
Association Rules
• Q: Given frequent set {A,B,E},
what are possible association
rules?
A => B, E
A, B => E
A, E => B
B => A, E
B, E => A
E => A, B
__ => A,B,E (empty rule), or true
=> A,B,E
Classification vs Association Rules
Classification Rules
• Focus on one
target field
• Specify class in all
cases
• Measures:
Accuracy
Association Rules
• Many target fields
• Applicable in
some cases
• Measures:
Support,
Confidence, Lift
Rule Support and Confidence
• Suppose R : I => J is an association
rule
sup (R) = sup (I ∪ J) is the support count
• support of itemset I ∪ J (transactions containing every item of I and of J)
conf (R) = sup (R) / sup (I) is the confidence of R
• fraction of transactions containing I that also contain J
• Association rules that meet both the minimum support and the minimum confidence
are sometimes called “strong” rules
Association Rules Example:
• Q: Given frequent set {A,B,E},
what association rules have
minsup = 2 and minconf = 50%?
A, B => E : conf = 2/4 = 50%
A, E => B : conf = 2/2 = 100%
B, E => A : conf = 2/2 = 100%
E => A, B : conf = 2/2 = 100%
Don’t qualify:
A => B, E : conf = 2/6 = 33% < 50%
B => A, E : conf = 2/7 = 29% < 50%
__ => A,B,E : conf = 2/9 = 22% < 50%
TID List of items
1 A, B, E
2 B, D
3 B, C
4 A, B, D
5 A, C
6 B, C
7 A, C
8 A, B, C, E
9 A, B, C
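The confidences in this example can be reproduced by dividing the support of the whole itemset by the support of each candidate antecedent. The sketch below enumerates every rule derivable from {A,B,E}; the names `sup` and `rules_from` are assumptions for this illustration, and the code assumes every antecedent occurs in at least one transaction.

```python
from itertools import combinations

TRANSACTIONS = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]


def sup(itemset):
    """Support count of an itemset (the empty set is in every transaction)."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def rules_from(itemset, minconf):
    """Yield (antecedent, consequent, confidence) for rules meeting minconf."""
    items = sorted(itemset)
    total = sup(items)
    # Antecedent sizes 0 .. n-1, so the consequent is never empty.
    for size in range(len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            conf = total / sup(antecedent)
            if conf >= minconf:
                yield antecedent, consequent, conf


strong = list(rules_from({"A", "B", "E"}, minconf=0.5))
for antecedent, consequent, conf in strong:
    print(antecedent, "=>", consequent, f"{conf:.0%}")
```

Every rule built from one itemset has the same support, so only the confidence test differs from rule to rule.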
Find Strong Association Rules
• A rule has the parameters minsup
and minconf:
sup(R) >= minsup and conf (R) >=
minconf
• Problem:
Find all association rules with given
minsup and minconf
• First, find all frequent itemsets
Finding Frequent Itemsets
• Start by finding one-item sets (easy)
• Q: How?
• A: Simply count the frequencies of all
items
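Counting single items is a one-pass tally. A minimal sketch on the example database follows; the threshold `minsup = 2` is borrowed from the earlier rule example, and the variable names are assumptions for this illustration.

```python
from collections import Counter

TRANSACTIONS = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

# One pass over the data: tally how many transactions contain each item.
item_counts = Counter(item for t in TRANSACTIONS for item in t)

minsup = 2
frequent_items = sorted(i for i, c in item_counts.items() if c >= minsup)
print(dict(sorted(item_counts.items())), frequent_items)
```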
Finding itemsets: next level
• Apriori algorithm (Agrawal & Srikant)
• Idea: use one-item sets to generate two-
item sets, two-item sets to generate
three-item sets, …
If (A B) is a frequent item set, then (A) and
(B) have to be frequent item sets as well!
In general: if X is frequent k-item set, then all
(k-1)-item subsets of X are also frequent
⇒Compute k-item set by merging (k-1)-item
sets
An example
• Given: five three-item sets
(A B C), (A B D), (A C D), (A C E), (B C D)
• Lexicographic order improves efficiency
• Candidate four-item sets:
(A B C D) Q: OK?
A: yes, because all 3-item subsets are frequent
(A C D E) Q: OK?
A: No, because (C D E) is not frequent
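The join-and-prune step of this example can be sketched as follows. This is a simplified illustration of candidate generation rather than a full Apriori implementation; the function name `candidates` is an assumption.

```python
from itertools import combinations


def candidates(frequent_prev):
    """Join (k-1)-item sets that share their first k-2 items, then prune."""
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    prev_set = set(prev)
    result = []
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:  # join step: same prefix, different last item
            cand = a + (b[-1],)
            # Prune step: every (k-1)-item subset must itself be frequent.
            subsets = combinations(cand, len(cand) - 1)
            if all(s in prev_set for s in subsets):
                result.append(cand)
    return result


three_item_sets = [
    ("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
    ("A", "C", "E"), ("B", "C", "D"),
]
four_item_candidates = candidates(three_item_sets)
print(four_item_candidates)
```

Keeping the itemsets in lexicographic order means only sets with a common prefix need to be joined, which is the efficiency gain the slide mentions. Surviving candidates still have to be counted against the database to confirm they are frequent.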
Generating Association Rules
• Two stage process:
Determine frequent itemsets e.g. with
the Apriori algorithm.
For each frequent item set I
• for each subset J of I
–determine all association rules of
the form: I-J => J
• Main idea used in both stages :
subset property
Example: Generating Rules
from an Itemset
• Frequent itemset from golf data:
Humidity = Normal, Windy = False, Play = Yes (4)
• Seven potential rules (with confidence):
If Humidity = Normal and Windy = False then Play = Yes (4/4)
If Humidity = Normal and Play = Yes then Windy = False (4/6)
If Windy = False and Play = Yes then Humidity = Normal (4/6)
If Humidity = Normal then Windy = False and Play = Yes (4/7)
If Windy = False then Humidity = Normal and Play = Yes (4/8)
If Play = Yes then Humidity = Normal and Windy = False (4/9)
If True then Humidity = Normal and Windy = False and Play = Yes (4/12)
Rules for the weather data
• Rules with support > 1 and confidence = 100%:
• In total: 3 rules with support four, 5 with support
three, and 50 with support two
Association rule Sup. Conf.
1 Humidity=Normal Windy=False ⇒Play=Yes 4 100%
2 Temperature=Cool ⇒Humidity=Normal 4 100%
3 Outlook=Overcast ⇒Play=Yes 4 100%
4 Temperature=Cool Play=Yes ⇒Humidity=Normal 3 100%
... ... ... ... ...
58 Outlook=Sunny Temperature=Hot ⇒Humidity=High 2 100%
Outline
• Objectives/Motivation for Data Mining
• Data mining technique: Classification
• Data mining technique: Association
• Data Warehousing
• Summary – Effect on Society
Overview
• Traditional database systems are tuned
to many, small, simple queries.
• Some new applications use fewer, more
time-consuming, complex queries.
• New architectures have been developed
to handle complex “analytic” queries
efficiently.
The Data Warehouse
• The most common form of data
integration.
Copy sources into a single DB
(warehouse) and try to keep it up-to-
date.
Usual method: periodic reconstruction
of the warehouse, perhaps overnight.
Frequently essential for analytic
queries.
OLTP
• Most database operations involve
On-Line Transaction Processing (OLTP).
Short, simple, frequent queries and/or
modifications, each involving a small
number of tuples.
Examples: Answering queries from a Web
interface, sales at cash registers, selling
airline tickets.
OLAP
• Of increasing importance are
On-Line Analytical Processing
(OLAP) queries.
Few, but complex queries --- may run
for hours.
Queries do not depend on having an
absolutely up-to-date database.
OLAP Examples
1. Amazon analyzes purchases by its
customers to come up with an
individual screen with products of
likely interest to the customer.
2. Analysts at Wal-Mart look for items
with increasing sales in some
region.
Common Architecture
• Databases at store branches handle
OLTP.
• Local store databases copied to a
central warehouse overnight.
• Analysts use the warehouse for
OLAP.
Approaches to Building
Warehouses
1. ROLAP = “relational OLAP”: Tune
a relational DBMS to support star
schemas.
2. MOLAP = “multidimensional
OLAP”: Use a specialized DBMS
with a model such as the “data
cube.”
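To make the relational approach concrete, the sketch below builds a tiny star schema (one fact table joined to two dimension tables) in an in-memory SQLite database and runs one analytic query over it. The table names, column names and rows are all invented for this illustration; a real warehouse would be far larger and specifically tuned for such queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (          -- fact table: one row per sale
        store_id INTEGER REFERENCES store(store_id),
        product_id INTEGER REFERENCES product(product_id),
        amount REAL
    );
    INSERT INTO store VALUES (1, 'East'), (2, 'West');
    INSERT INTO product VALUES (1, 'milk'), (2, 'bread');
    INSERT INTO sales VALUES (1, 1, 3.0), (1, 2, 2.0), (2, 1, 4.0), (2, 1, 1.5);
""")

# An analytic query: aggregate the fact table, grouped by dimension attributes.
totals = conn.execute("""
    SELECT s.region, p.name, SUM(f.amount)
    FROM sales AS f
    JOIN store AS s ON s.store_id = f.store_id
    JOIN product AS p ON p.product_id = f.product_id
    GROUP BY s.region, p.name
    ORDER BY s.region, p.name
""").fetchall()
print(totals)
```

The query reads the whole fact table and touches no individual sale, which is the access pattern that separates OLAP work from short OLTP transactions.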
Outline
• Objectives/Motivation for Data Mining
• Data mining technique: Classification
• Data mining technique: Association
• Data Warehousing
• Summary – Effect on Society
Controversial Issues
• Data mining (or simple analysis) on people may produce
profiles that raise controversial issues of
Discrimination
Privacy
Security
• Examples:
Should males between 18 and 35 from countries that produced
terrorists be singled out for search before flight?
Can people be denied mortgage based on age, sex, race?
Women live longer. Should they pay less for life insurance?
Data Mining and Discrimination
• Can discrimination be based on features
like sex, age, national origin?
• In some areas (e.g. mortgages,
employment), some features cannot be
used for decision making
• In other areas, these features are needed
to assess the risk factors
E.g. people of African descent are more
susceptible to sickle cell anemia
Data Mining and Privacy
• Can information collected for one purpose be used
for mining data for another purpose?
In Europe, generally no, without explicit consent
In the US, generally yes
• Companies routinely collect information about
customers and use it for marketing, etc.
• People may be willing to give up some of their
privacy in exchange for some benefits
See Data Mining And Privacy Symposium,
www.kdnuggets.com/gpspubs/ieee-expert-9504-priv.html
Data Mining with Privacy
• Data Mining looks for patterns, not people!
• Technical solutions can limit privacy
invasion
Replace sensitive personal data with an anonymous ID
Give randomized outputs
• return salary + random()
• …
• See Bayardo & Srikant, Technological
Solutions for Protecting Privacy, IEEE
Computer, Sep 2003
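The "salary + random()" idea can be illustrated with zero-mean noise: each reported value is perturbed, yet aggregates over many records stay close to the truth. The sketch below is one possible reading of the slide, not the method of the cited paper; the noise range, the seed and the synthetic salaries are all assumptions.

```python
import random


def randomized_salary(salary, rng, spread=5000.0):
    """Return the salary plus zero-mean uniform noise in [-spread, spread]."""
    return salary + rng.uniform(-spread, spread)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Synthetic salaries, invented purely for the demonstration.
true_salaries = [40000.0 + 10.0 * i for i in range(10000)]
noisy_salaries = [randomized_salary(s, rng) for s in true_salaries]

true_mean = sum(true_salaries) / len(true_salaries)
noisy_mean = sum(noisy_salaries) / len(noisy_salaries)
print(f"true mean {true_mean:.0f}, mean of randomized outputs {noisy_mean:.0f}")
```

Individual values can be off by up to the chosen spread, while the average over many records moves very little, so pattern-level analysis remains possible.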
Criticism of analytic approach
to Threat Detection:
Data Mining will
• invade privacy
• generate millions of false positives
But can it be effective?
Is criticism sound ?
• Criticism: Databases have 5% errors, so
analyzing 100 million suspects will
generate 5 million false positives
• Reality: Analytical models correlate many
items of information to reduce false
positives.
• Example: Identify one biased coin from
1,000.
After one throw of each coin, we cannot tell which coin is biased.
After 30 throws, one biased coin will stand out
with high probability.
Can identify 19 biased coins out of 100 million
with sufficient number of throws
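A rough calculation shows why more throws help. The sketch below assumes, purely for illustration, that the biased coin always lands heads, and asks how likely it is that at least one of the 999 fair coins also shows heads on every throw (a false positive). The slide does not state the degree of bias, so this is an assumption, as is the function name.

```python
def false_alarm_probability(fair_coins, throws):
    """Probability that at least one fair coin shows heads on every throw."""
    p_single = 0.5 ** throws  # one fair coin, all heads
    return 1.0 - (1.0 - p_single) ** fair_coins


p1 = false_alarm_probability(999, 1)
p30 = false_alarm_probability(999, 30)
print(f"after 1 throw: {p1:.6f}, after 30 throws: {p30:.2e}")
```

With a single throw a false positive is practically certain. With 30 throws the chance that any fair coin mimics the biased one falls below one in a hundred thousand, which is the sense in which correlating many observations reduces false positives.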
Analytic technology can be effective
• Combining multiple models and link
analysis can reduce false positives
• Today there are millions of false
positives with manual analysis
• Data mining is just one additional
tool to help analysts
• Analytic technology has the
potential to reduce the current high
rate of false positives
Data Mining and Society
• No easy answers to controversial
questions
• Society and policy-makers need to
make an educated choice
Benefits and efficiency of data mining
programs vs. cost and erosion of
privacy
Data Mining Future Directions
• Currently, most data mining is on flat tables
• Richer data sources
text, links, web, images, multimedia, knowledge
bases
• Advanced methods
Link mining, Stream mining, …
• Applications
Web, Bioinformatics, Customer modeling, …

datamining and warehousing ppt

  • 1. ISOM Data Mining and Warehousing
  • 2. ISOM Outline • Objectives/Motivation for Data Mining • Data mining technique: Classification • Data mining technique: Association • Data Warehousing • Summary – Effect on Society
  • 3. ISOM Why Data mining? • Data Growth Rate • Twice as much information was created in 2002 as in 1999 (~30% growth rate) • Other growth rate estimates even higher • Very little data will ever be looked at by a human • Knowledge Discovery is NEEDED to make sense and use of data.
  • 4. ISOM Data Mining for Customer Modeling • Customer Tasks: attrition prediction targeted marketing: • cross-sell, customer acquisition credit-risk fraud detection • Industries banking, telecom, retail sales, …
  • 5. ISOM Customer Attrition: Case Study • Situation: Attrition rate at for mobile phone customers is around 25-30% a year! Task: • Given customer information for the past N months, predict who is likely to attrite next month. • Also, estimate customer value and what is the cost-effective offer to be made to this customer.
  • 6. ISOM Customer Attrition Results • Verizon Wireless built a customer data warehouse • Identified potential attriters • Developed multiple, regional models • Targeted customers with high propensity to accept the offer • Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact, with >30 M subscribers) (Reported in 2003)
  • 7. ISOM Assessing Credit Risk: Case Study • Situation: Person applies for a loan • Task: Should a bank approve the loan? • Note: People who have the best credit don’t need the loans, and people with worst credit are not likely to repay. Bank’s best customers are in the middle
  • 8. ISOM Credit Risk - Results • Banks develop credit models using variety of machine learning methods. • Mortgage and credit card proliferation are the results of being able to successfully predict if a person is likely to default on a loan • Widely deployed in many countries
  • 9. ISOM Successful e-commerce – Case Study • A person buys a book (product) at Amazon.com. • Task: Recommend other books (products) this person is likely to buy • Amazon does clustering based on books bought:  customers who bought “Advances in Knowledge Discovery and Data Mining”, also bought “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations” • Recommendation program is quite successful
  • 10. ISOM Major Data Mining Tasks • Classification: predicting an item class • Clustering: finding clusters in data • Associations: e.g. A & B & C occur frequently • Visualization: to facilitate human discovery • Summarization: describing a group • Deviation Detection: finding changes • Estimation: predicting a continuous value • Link Analysis: finding relationships • …
  • 11. ISOM Outline • Objectives/Motivation for Data Mining • Data mining technique: Classification • Data mining technique: Association • Data Warehousing • Summary – Effect on Society
  • 12. ISOM Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ... Given a set of points from classes what is the class of new point ?
  • 13. ISOM Classification: Linear Regression • Linear Regression w0 + w1 x + w2 y >= 0 • Regression computes wi from data to minimize squared error to ‘fit’ the data • Not flexible enough
  • 14. ISOM Classification: Decision Trees X Y if X > 5 then blue else if Y > 3 then blue else if X > 2 then green else blue 52 3
  • 15. ISOM Classification: Neural Nets • Can select more complex regions • Can be more accurate • Also can overfit the data – find patterns in random noise
  • 16. ISOM Example:The weather problem Outlook Temperature Humidity Windy Play sunny 85 85 false no sunny 80 90 true no overcast 83 86 false yes rainy 70 96 false yes rainy 68 80 false yes rainy 65 70 true no overcast 64 65 true yes sunny 72 95 false no sunny 69 70 false yes rainy 75 80 false yes sunny 75 70 true yes overcast 72 90 true yes overcast 81 75 false yes rainy 71 91 true no Given past data, Can you come up with the rules for Play/Not Play ? What is the game?
  • 17. ISOM The weather problem • Conditions for playing Outlook Temperature Humidity Windy Play Sunny Hot High False No Sunny Hot High True No Overcast Hot High False Yes Rainy Mild Normal False Yes … … … … … If outlook = sunny and humidity = high then play = no If outlook = rainy and windy = true then play = no If outlook = overcast then play = yes If humidity = normal then play = yes If none of the above then play = yes witten&eibe
  • 18. ISOM Weather data with mixed attributes • Some attributes have numeric values Outlook Temperature Humidity Windy Play Sunny 85 85 False No Sunny 80 90 True No Overcast 83 86 False Yes Rainy 75 80 False Yes … … … … … If outlook = sunny and humidity > 83 then play = no If outlook = rainy and windy = true then play = no If outlook = overcast then play = yes If humidity < 85 then play = yes If none of the above then play = yes witten&eibe
  • 19. ISOM A decision tree for this problem witten&eibe outlook humidity windyyes no yes no yes sunny overcast rainy TRUE FALSE high normal
  • 20. ISOM Building Decision Tree • Top-down tree construction At start, all training examples are at the root. Partition the examples recursively by choosing one attribute each time. • Bottom-up tree pruning Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
  • 21. ISOM Choosing the Splitting Attribute • At each node, available attributes are evaluated on the basis of separating the classes of the training examples. A Goodness function is used for this purpose. • Typical goodness functions: information gain (ID3/C4.5) information gain ratio gini index witten&eibe
  • 22. ISOM Which attribute to select? witten&eibe
  • 23. ISOM A criterion for attribute selection • Which is the best attribute? The one which will result in the smallest tree Heuristic: choose the attribute that produces the “purest” nodes • Popular impurity criterion: information gain Information gain increases with the average purity of the subsets that an attribute produces • Strategy: choose attribute that results in greatest information gain witten&eibe
  • 24. ISOM Outline • Objectives/Motivation for Data Mining • Data mining technique: Classification • Data mining technique: Association • Data Warehousing • Summary – Effect on Society
  • 25. ISOM Transactions Example TID Produce 1 MILK, BREAD, EGGS 2 BREAD, SUGAR 3 BREAD, CEREAL 4 MILK, BREAD, SUGAR 5 MILK, CEREAL 6 BREAD, CEREAL 7 MILK, CEREAL 8 MILK, BREAD, CEREAL, EGGS 9 MILK, BREAD, CEREAL
  • 26. ISOM Transaction database: Example TID Products 1 A, B, E 2 B, D 3 B, C 4 A, B, D 5 A, C 6 B, C 7 A, C 8 A, B, C, E 9 A, B, C ITEMS: A = milk B= bread C= cereal D= sugar E= eggs Instances = Transactions
  • 27. ISOM Transaction database: Example TID A B C D E 1 1 1 0 0 1 2 0 1 0 1 0 3 0 1 1 0 0 4 1 1 0 1 0 5 1 0 1 0 0 6 0 1 1 0 0 7 1 0 1 0 0 8 1 1 1 0 1 9 1 1 1 0 0 TID Products 1 A, B, E 2 B, D 3 B, C 4 A, B, D 5 A, C 6 B, C 7 A, C 8 A, B, C, E 9 A, B, C Attributes converted to binary flags
  • 28. ISOM Definitions • Item: attribute=value pair or simply value usually attributes are converted to binary flags for each value, e.g. product=“A” is written as “A” • Itemset I : a subset of possible items Example: I = {A,B,E} (order unimportant) • Transaction: (TID, itemset) TID is transaction ID
  • 29. ISOM Support and Frequent Itemsets • Support of an itemset sup(I ) = no. of transactions t that support (i.e. contain) I • In example database: sup ({A,B,E}) = 2, sup ({B,C}) = 4 • Frequent itemset I is one with at least the minimum support count sup(I ) >= minsup
  • 30. ISOM SUBSET PROPERTY • Every subset of a frequent set isEvery subset of a frequent set is frequent!frequent! • Q: Why is it so?Q: Why is it so? • A: Example: Suppose {A,B} is frequent.A: Example: Suppose {A,B} is frequent. Since each occurrence of A,B includesSince each occurrence of A,B includes both A and B, then both A and B mustboth A and B, then both A and B must also be frequentalso be frequent • Similar argument for larger itemsetsSimilar argument for larger itemsets • Almost all association rule algorithms areAlmost all association rule algorithms are based on this subset propertybased on this subset property
  • 31. ISOM Association Rules • Association rule R : Itemset1 => Itemset2. Itemset1 and Itemset2 are disjoint and Itemset2 is non-empty. Meaning: if a transaction includes Itemset1 then it also has Itemset2 • Examples: A,B => E,C; A => B,C
  • 32. ISOM From Frequent Itemsets to Association Rules • Q: Given frequent set {A,B,E}, what are possible association rules? A => B, E A, B => E A, E => B B => A, E B, E => A E => A, B __ => A,B,E (empty rule), or true => A,B,E
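The seven rules listed above can be enumerated mechanically: every proper subset of the itemset (including the empty set) serves as an antecedent, and the remaining items form the consequent. A minimal sketch, with `candidate_rules` as an assumed helper name:

```python
from itertools import combinations


def candidate_rules(itemset):
    """List (antecedent, consequent) pairs with a non-empty consequent."""
    items = sorted(itemset)
    rules = []
    # Antecedent sizes 0 .. n-1, so the consequent always keeps at least one item.
    for size in range(len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


for antecedent, consequent in candidate_rules({"A", "B", "E"}):
    left = ", ".join(antecedent) or "true"
    print(f"{left} => {', '.join(consequent)}")
```

An itemset with n items yields 2^n - 1 candidate rules under this scheme, which is 7 for a three-item set.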
  • 33. ISOM Classification vs Association Rules. Classification Rules: • Focus on one target field • Specify class in all cases • Measures: Accuracy. Association Rules: • Many target fields • Applicable in some cases • Measures: Support, Confidence, Lift
  • 34. ISOM Rule Support and Confidence • Suppose R : I => J is an association rule. sup (R) = sup (I ∪ J) is the support count • support of itemset I ∪ J (transactions containing both I and J). conf (R) = sup(R) / sup(I) is the confidence of R • fraction of transactions with I that also have J • Association rules with minimum support and confidence are sometimes called “strong” rules
  • 35. ISOM Association Rules Example: • Q: Given frequent set {A,B,E}, what association rules have minsup = 2 and minconf = 50%? Qualify: A, B => E : conf = 2/4 = 50%; A, E => B : conf = 2/2 = 100%; B, E => A : conf = 2/2 = 100%; E => A, B : conf = 2/2 = 100%. Don’t qualify: A => B, E : conf = 2/6 = 33% < 50%; B => A, E : conf = 2/7 = 29% < 50%; __ => A,B,E : conf = 2/9 = 22% < 50%. (TID: List of items) 1: A, B, E; 2: B, D; 3: B, C; 4: A, B, D; 5: A, C; 6: B, C; 7: A, C; 8: A, B, C, E; 9: A, B, C
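The figures in this example correspond to computing confidence as sup(I ∪ J) / sup(I). The sketch below reproduces them with exact fractions; `support` and `confidence` are assumed helper names, not part of the slides.

```python
from fractions import Fraction

# Example database from the slide.
transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]


def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def confidence(antecedent, consequent):
    """conf(I => J) = sup(I u J) / sup(I), kept as an exact fraction."""
    both = set(antecedent) | set(consequent)
    return Fraction(support(both), support(antecedent))


minconf = Fraction(1, 2)
rules = [
    ({"A", "B"}, {"E"}), ({"A", "E"}, {"B"}), ({"B", "E"}, {"A"}),
    ({"E"}, {"A", "B"}), ({"A"}, {"B", "E"}), ({"B"}, {"A", "E"}),
    (set(), {"A", "B", "E"}),  # empty antecedent: sup(I) is all 9 transactions
]
for antecedent, consequent in rules:
    conf = confidence(antecedent, consequent)
    verdict = "qualifies" if conf >= minconf else "does not qualify"
    print(sorted(antecedent), "=>", sorted(consequent), conf, verdict)
```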
  • 36. ISOM Find Strong Association Rules • A rule has the parameters minsup and minconf: sup(R) >= minsup and conf (R) >= minconf • Problem: Find all association rules with given minsup and minconf • First, find all frequent itemsets
  • 37. ISOM Finding Frequent Itemsets • Start by finding one-item sets (easy) • Q: How? • A: Simply count the frequencies of all items
  • 38. ISOM Finding itemsets: next level • Apriori algorithm (Agrawal & Srikant) • Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, … If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well! In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent ⇒ Compute k-item sets by merging (k-1)-item sets
  • 39. ISOM An example • Given: five three-item sets (A B C), (A B D), (A C D), (A C E), (B C D) • Lexicographic order improves efficiency • Candidate four-item sets: (A B C D) Q: OK? A: yes, because all 3-item subsets are frequent (A C D E) Q: OK? A: No, because (C D E) is not frequent
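The candidate step in this example can be sketched as a join followed by a prune. The version below assumes itemsets are stored as lexicographically sorted tuples, which is one common way to exploit the ordering mentioned on the slide; `generate_candidates` is an illustrative name.

```python
from itertools import combinations


def generate_candidates(frequent_k):
    """Build (k+1)-item candidates from frequent k-item sets (sorted tuples)."""
    frequent = sorted(frequent_k)
    if not frequent:
        return []
    frequent_set = set(frequent)
    k = len(frequent[0])
    candidates = []
    for i, a in enumerate(frequent):
        for b in frequent[i + 1:]:
            # Join only itemsets sharing the same first k-1 items. Because the
            # list is sorted, such itemsets are contiguous, so we can stop early.
            if a[:-1] != b[:-1]:
                break
            candidate = a + (b[-1],)
            # Prune: every k-item subset of the candidate must be frequent.
            if all(sub in frequent_set for sub in combinations(candidate, k)):
                candidates.append(candidate)
    return candidates


three_item_sets = [
    ("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
    ("A", "C", "E"), ("B", "C", "D"),
]
print(generate_candidates(three_item_sets))
```

Here (A C D E) is produced by the join of (A C D) and (A C E) but discarded by the prune, matching the slide's answer.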
  • 40. ISOM Generating Association Rules • Two-stage process: 1. Determine frequent itemsets, e.g. with the Apriori algorithm. 2. For each frequent itemset I • for each subset J of I – determine all association rules of the form: I-J => J • Main idea used in both stages: subset property
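A compact sketch of the two stages, run on the nine-transaction example from earlier slides. It is a simplified illustration rather than the original Apriori implementation: the join here is a plain pairwise union, the function names are assumptions, and rules with an empty antecedent are skipped.

```python
from itertools import combinations

transactions = [
    {"A", "B", "E"}, {"B", "D"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "C"}, {"B", "C"},
    {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]


def find_frequent_itemsets(transactions, minsup):
    """Stage 1: return {itemset: support count} for all frequent itemsets."""
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= minsup}
    frequent = dict(current)
    k = 2
    while current:
        # Join: unions of two frequent (k-1)-item sets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune with the subset property: all (k-1)-subsets must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            count = sum(1 for t in transactions if c <= t)
            if count >= minsup:
                current[c] = count
        frequent.update(current)
        k += 1
    return frequent


def generate_rules(frequent, minconf):
    """Stage 2: emit (I-J, J, support, confidence) for rules meeting minconf."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                # The antecedent is a subset of a frequent set, so it is frequent.
                conf = sup / frequent[antecedent]
                if conf >= minconf:
                    rules.append((antecedent, itemset - antecedent, sup, conf))
    return rules


frequent = find_frequent_itemsets(transactions, minsup=2)
strong_rules = generate_rules(frequent, minconf=0.5)
for antecedent, consequent, sup, conf in strong_rules:
    print(sorted(antecedent), "=>", sorted(consequent), sup, round(conf, 2))
```

The subset property is used twice: to prune candidates in stage 1, and to guarantee in stage 2 that the support of every antecedent has already been counted.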
  • 41. ISOM Example: Generating Rules from an Itemset • Frequent itemset from golf data: Humidity = Normal, Windy = False, Play = Yes (4) • Seven potential rules (with confidence): If Humidity = Normal and Windy = False then Play = Yes (4/4); If Humidity = Normal and Play = Yes then Windy = False (4/6); If Windy = False and Play = Yes then Humidity = Normal (4/6); If Humidity = Normal then Windy = False and Play = Yes (4/7); If Windy = False then Humidity = Normal and Play = Yes (4/8); If Play = Yes then Humidity = Normal and Windy = False (4/9); If True then Humidity = Normal and Windy = False and Play = Yes (4/12)
  • 42. ISOM Rules for the weather data • Rules with support > 1 and confidence = 100%: • In total: 3 rules with support four, 5 with support three, and 50 with support two. (No. Association rule: Sup., Conf.) 1 Humidity=Normal Windy=False ⇒ Play=Yes: 4, 100%; 2 Temperature=Cool ⇒ Humidity=Normal: 4, 100%; 3 Outlook=Overcast ⇒ Play=Yes: 4, 100%; 4 Temperature=Cool Play=Yes ⇒ Humidity=Normal: 3, 100%; ... ; 58 Outlook=Sunny Temperature=Hot ⇒ Humidity=High: 2, 100%
  • 43. ISOM Outline • Objectives/Motivation for Data Mining • Data mining technique: Classification • Data mining technique: Association • Data Warehousing • Summary – Effect on Society
  • 44. ISOM Overview • Traditional database systems are tuned to many, small, simple queries. • Some new applications use fewer, more time-consuming, complex queries. • New architectures have been developed to handle complex “analytic” queries efficiently.
  • 45. ISOM The Data Warehouse • The most common form of data integration. Copy sources into a single DB (warehouse) and try to keep it up-to-date. Usual method: periodic reconstruction of the warehouse, perhaps overnight. Frequently essential for analytic queries.
  • 46. ISOM OLTP • Most database operations involve On-Line Transaction Processing (OLTP). Short, simple, frequent queries and/or modifications, each involving a small number of tuples. Examples: Answering queries from a Web interface, sales at cash registers, selling airline tickets.
  • 47. ISOM OLAP • Of increasing importance are On-Line Analytical Processing (OLAP) queries. Few, but complex queries that may run for hours. Queries do not depend on having an absolutely up-to-date database.
  • 48. ISOM OLAP Examples 1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer. 2. Analysts at Wal-Mart look for items with increasing sales in some region.
  • 49. ISOM Common Architecture • Databases at store branches handle OLTP. • Local store databases copied to a central warehouse overnight. • Analysts use the warehouse for OLAP.
  • 50. ISOM Approaches to Building Warehouses 1. ROLAP = “relational OLAP”: Tune a relational DBMS to support star schemas. 2. MOLAP = “multidimensional OLAP”: Use a specialized DBMS with a model such as the “data cube.”
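To make the ROLAP idea concrete, here is a minimal sketch of a star schema and one analytic query, using Python's built-in sqlite3 module. The table and column names (`sales`, `stores`, `items`, `amount`) and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "who/what/where" of each sale.
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE items  (item_id  INTEGER PRIMARY KEY, name TEXT);
    -- The fact table holds one row per sale and points at the dimensions.
    CREATE TABLE sales  (store_id INTEGER REFERENCES stores(store_id),
                         item_id  INTEGER REFERENCES items(item_id),
                         amount   REAL);
""")
conn.executemany("INSERT INTO stores VALUES (?, ?)", [(1, "East"), (2, "West")])
conn.executemany("INSERT INTO items VALUES (?, ?)", [(10, "milk"), (11, "bread")])
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 10, 3.0), (1, 11, 2.0), (2, 10, 4.0), (2, 10, 1.0)],
)

# An OLAP-style query: total sales per region and item.
totals = conn.execute("""
    SELECT st.region, it.name, SUM(sa.amount)
    FROM sales sa
    JOIN stores st ON st.store_id = sa.store_id
    JOIN items  it ON it.item_id  = sa.item_id
    GROUP BY st.region, it.name
    ORDER BY st.region, it.name
""").fetchall()
print(totals)
```

Grouping by different combinations of dimension attributes (region only, item only, both) gives the cells that a MOLAP data cube would store precomputed.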
  • 51. ISOM Outline • Objectives/Motivation for Data Mining • Data mining technique: Classification • Data mining technique: Association • Data Warehousing • Summary – Effect on Society
  • 52. ISOM Controversial Issues • Data mining (or simple analysis) on people may come up with a profile that would raise controversial issues of ◦ Discrimination ◦ Privacy ◦ Security • Examples: ◦ Should males between 18 and 35 from countries that produced terrorists be singled out for search before flight? ◦ Can people be denied a mortgage based on age, sex, race? ◦ Women live longer. Should they pay less for life insurance?
  • 53. ISOM Data Mining and Discrimination • Can discrimination be based on features like sex, age, national origin? • In some areas (e.g. mortgages, employment), some features cannot be used for decision making • In other areas, these features are needed to assess the risk factors E.g. people of African descent are more susceptible to sickle cell anemia
  • 54. ISOM Data Mining and Privacy • Can information collected for one purpose be used for mining data for another purpose? ◦ In Europe, generally no, without explicit consent ◦ In US, generally yes • Companies routinely collect information about customers and use it for marketing, etc. • People may be willing to give up some of their privacy in exchange for some benefits ◦ See Data Mining And Privacy Symposium, www.kdnuggets.com/gpspubs/ieee-expert-9504-priv.html
  • 55. ISOM Data Mining with Privacy • Data Mining looks for patterns, not people! • Technical solutions can limit privacy invasion: Replacing sensitive personal data with an anonymous ID; Giving randomized outputs • return salary + random() • … • See Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
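The "return salary + random()" idea can be illustrated with a small simulation. This sketch assumes zero-mean uniform noise and made-up salary figures; the noise range, the helper name `randomized_salary`, and the data are all hypothetical, and it is not presented as the method from the cited paper.

```python
import random


def randomized_salary(salary, spread, rng):
    """Return the salary plus zero-mean uniform noise in [-spread, +spread]."""
    return salary + rng.uniform(-spread, spread)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_salaries = [rng.randint(30_000, 120_000) for _ in range(10_000)]
noisy_salaries = [randomized_salary(s, 10_000, rng) for s in true_salaries]

true_mean = sum(true_salaries) / len(true_salaries)
noisy_mean = sum(noisy_salaries) / len(noisy_salaries)

# Individual values are perturbed, but the aggregate pattern survives.
print("first true / reported salary:", true_salaries[0], round(noisy_salaries[0]))
print("true mean:", round(true_mean), "mean of reported values:", round(noisy_mean))
```

Because the noise averages out over many records, aggregate patterns remain usable while any single reported value reveals less about the individual.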
  • 56. ISOM Criticism of analytic approach to Threat Detection: Data Mining will • invade privacy • generate millions of false positives But can it be effective?
  • 57. ISOM Is criticism sound? • Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives • Reality: Analytical models correlate many items of information to reduce false positives. • Example: Identify one biased coin from 1,000. ◦ After one throw of each coin, we cannot tell which coin is biased ◦ After 30 throws, one biased coin will stand out with high probability. ◦ Can identify 19 biased coins out of 100 million with sufficient number of throws
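The coin example can be checked with exact binomial arithmetic. The sketch below assumes a biased coin that lands heads 90% of the time and flags any coin with at least 25 heads in 30 throws; both numbers are illustrative assumptions, since the slide does not specify the bias or a decision threshold.

```python
from math import comb


def tail_prob(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): at least k heads in n throws."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


FAIR_COINS = 999   # the other coins in a population of 1,000
P_BIASED = 0.9     # assumed bias of the one unfair coin (hypothetical)

# One throw: half of the fair coins show heads, so the biased coin is hidden.
false_pos_1 = FAIR_COINS * tail_prob(1, 1, 0.5)

# Thirty throws with an assumed flagging threshold of 25 heads.
false_pos_30 = FAIR_COINS * tail_prob(30, 25, 0.5)
detect_30 = tail_prob(30, 25, P_BIASED)

print("expected fair coins flagged after 1 throw:", false_pos_1)
print("expected fair coins flagged after 30 throws:", false_pos_30)
print("probability the biased coin is flagged after 30 throws:", detect_30)
```

Combining many observations per coin plays the role that correlating many items of information plays in an analytical model: the expected number of false positives drops from hundreds to below one.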
  • 58. ISOM Analytic technology can be effective • Combining multiple models and link analysis can reduce false positives • Today there are millions of false positives with manual analysis • Data mining is just one additional tool to help analysts • Analytic technology has the potential to reduce the current high rate of false positives
  • 59. ISOM Data Mining and Society • No easy answers to controversial questions • Society and policy-makers need to make an educated choice: benefits and efficiency of data mining programs vs. cost and erosion of privacy
  • 60. ISOM Data Mining Future Directions • Currently, most data mining is on flat tables • Richer data sources: text, links, web, images, multimedia, knowledge bases • Advanced methods: Link mining, Stream mining, … • Applications: Web, Bioinformatics, Customer modeling, …