CS501: DATABASE AND DATA MINING
Mining Frequent Patterns
WHAT IS FREQUENT PATTERN ANALYSIS?
 Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
WHY IS FREQ. PATTERN MINING IMPORTANT?
 Discloses an intrinsic and important property of data sets
 Forms the foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
 Classification: associative classification
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
BASIC CONCEPTS: FREQUENT PATTERNS AND ASSOCIATION RULES
 Itemset X = {x1, …, xk}
 Find all the rules X ⇒ Y with minimum support and confidence:
 support, s: the probability that a transaction contains X ∪ Y
 confidence, c: the conditional probability that a transaction containing X also contains Y
Let supmin = 50%, confmin = 50%

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (support 60%, confidence 100%)
D ⇒ A (support 60%, confidence 75%)

(figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both)
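To make the definitions concrete, here is a minimal Python sketch (helper names are illustrative, not from the slides) that computes support and confidence for the rules above over the five transactions:

```python
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(X, Y, transactions):
    """P(t contains Y | t contains X) = support(X ∪ Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

print(support({"A", "D"}, transactions))       # 0.6  -> 60%
print(confidence({"A"}, {"D"}, transactions))  # 1.0  -> 100%  (A ⇒ D)
print(confidence({"D"}, {"A"}, transactions))  # 0.75 -> 75%   (D ⇒ A)
```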
CLOSED PATTERNS AND MAX-PATTERNS
 A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
 An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
 Closed patterns are a lossless compression of frequent patterns
 Reducing the # of patterns and rules
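A small sketch of these two definitions, assuming the frequent itemsets and their support counts have already been mined (here, the values from the earlier example):

```python
def closed_and_max(freq):
    """freq: dict mapping frozenset itemsets to their support counts."""
    closed, maximal = [], []
    for X, sup in freq.items():
        supersets = [Y for Y in freq if X < Y]  # frequent proper supersets of X
        # Closed: no proper superset has the same support.
        if all(freq[Y] != sup for Y in supersets):
            closed.append(X)
        # Maximal: no proper superset is frequent at all.
        if not supersets:
            maximal.append(X)
    return closed, maximal

freq = {frozenset("A"): 3, frozenset("B"): 3, frozenset("D"): 4,
        frozenset("E"): 3, frozenset("AD"): 3}
closed, maximal = closed_and_max(freq)
# {A} is not closed (same support as {A,D}); {D} is closed but not maximal.
```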
SCALABLE METHODS FOR MINING FREQUENT PATTERNS
 The downward closure property of frequent patterns:
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
 Scalable mining methods: three major approaches
 Apriori
 Frequent pattern growth (FP-Growth)
 Vertical data format approach (ECLAT)
APRIORI: A CANDIDATE GENERATION-AND-TEST APPROACH
 Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
 Method:
 Initially, scan the DB once to get the frequent 1-itemsets
 Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
 Test the candidates against the DB
 Terminate when no frequent or candidate set can be generated
THE APRIORI ALGORITHM—AN EXAMPLE
Supmin = 2

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1 (drop {D}: sup < 2):
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (self-join of L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2:
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3: {B, C, E}

3rd scan → L3:
Itemset     sup
{B, C, E}   2
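A compact Python sketch of the loop just traced (a minimal illustration, not an optimized implementation); running it on the four-transaction TDB above reproduces L1, L2, and L3:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset itemset: support count} for all frequent itemsets."""
    items = {i for t in transactions for i in t}
    # First scan: frequent 1-itemsets.
    Lk = {frozenset([i]) for i in items}
    Lk = {c for c in Lk if sum(c <= t for t in transactions) >= min_sup}
    frequent, k = {}, 1
    while Lk:
        frequent.update({c: sum(c <= t for t in transactions) for c in Lk})
        # Candidate generation: union pairs of frequent k-itemsets, keep size k+1,
        # then prune any candidate with an infrequent k-subset (downward closure).
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck = {c for c in Ck
              if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Next scan: test candidates against the DB.
        Lk = {c for c in Ck if sum(c <= t for t in transactions) >= min_sup}
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_sup=2))  # includes frozenset({'B','C','E'}): 2
```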
IMPORTANT DETAILS OF APRIORI
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 How to count supports of candidates?
 Example of candidate generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}
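The same self-join and prune steps, sketched as a standalone Python function applied to the L3 example above. Note one simplification: classic Apriori joins only k-itemsets that share a (k−1)-prefix, while this sketch unions all pairs and lets the prune step discard the extras; the result is the same:

```python
from itertools import combinations

def apriori_gen(Lk, k):
    """Generate C(k+1) from frequent k-itemsets Lk: self-join, then prune."""
    Lk = set(Lk)
    # Self-join: pairs of k-itemsets whose union has exactly k+1 items.
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    # Prune: drop any candidate with a k-subset that is not frequent.
    return {c for c in candidates
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
print(apriori_gen(L3, 3))  # {frozenset({'a','b','c','d'})}; acde pruned (ade not in L3)
```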
HOW TO COUNT SUPPORTS OF CANDIDATES?
 Why is counting the supports of candidates a problem?
 The total number of candidates can be very large
 One transaction may contain many candidates
 Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets
and counts
Interior node contains a hash table
Subset function: finds all the candidates contained
in a transaction
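Here is a simplified sketch of the counting pass in which a plain dictionary stands in for the hash-tree (a real implementation uses the hash-tree so only candidates possibly contained in a transaction are visited): for each transaction, enumerate its k-subsets and increment matching candidate counters.

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Count each candidate k-itemset by enumerating k-subsets per transaction."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # The subset function: find all candidates contained in this transaction.
        for subset in combinations(sorted(t), k):
            fs = frozenset(subset)
            if fs in counts:
                counts[fs] += 1
    return counts

C2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "E")]}
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(count_supports(tdb, C2, 2))  # {A,C}: 2, {A,B}: 1, {B,E}: 3
```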
BOTTLENECK OF FREQUENT-PATTERN MINING
 Multiple database scans are costly
 Mining long patterns needs many passes of scanning
and generates lots of candidates
 To find the frequent itemset i1i2…i100:
 # of scans: 100
 # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !
 Bottleneck: candidate-generation-and-test
 Can we avoid candidate generation?
FREQUENT ITEMSET GENERATION
 Apriori: uses a generate-and-test approach – generates candidate itemsets and tests if they are frequent
 Generation of candidate itemsets is expensive (in both space and time)
 Support counting is expensive
 Subset checking (computationally expensive)
 Multiple database scans (I/O)
 FP-Growth: allows frequent itemset discovery without candidate itemset generation. Two-step approach:
 Step 1: Build a compact data structure called the FP-tree
 Built using 2 passes over the data set.
 Step 2: Extract frequent itemsets directly from the FP-tree
STEP 1: FP-TREE CONSTRUCTION
 The FP-tree is constructed using 2 passes over the data set:
 Pass 1:
 Scan the data and find the support for each item.
 Discard infrequent items.
 Sort the frequent items in decreasing order of support.
 Use this order when building the FP-tree, so common prefixes can be shared.
STEP 1: FP-TREE CONSTRUCTION
Pass 2:
Nodes correspond to items and carry a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (i.e., when they have the same prefix).
 In this case, counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (the dotted lines).
 The more paths overlap, the higher the compression; the FP-tree may then fit in memory.
4. Frequent itemsets are then extracted from the FP-tree, as sketched below.
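A minimal Python sketch of the two construction passes (class and field names are illustrative; the node-links between same-item nodes are kept in a header table rather than drawn as dotted lines):

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}  # item -> FPNode

def build_fp_tree(transactions, min_sup):
    # Pass 1: item supports; keep frequent items, ordered by decreasing support.
    support = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in support.items() if c >= min_sup}
    def order(t):
        return sorted((i for i in t if i in frequent),
                      key=lambda i: (-support[i], i))
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> list of its nodes (the node-links)
    # Pass 2: insert each transaction as a path, sharing common prefixes.
    for t in transactions:
        node = root
        for item in order(t):
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1  # shared prefixes just bump the counters
    return root, header

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
root, header = build_fp_tree(tdb, min_sup=2)
```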
STEP 1: FP-TREE CONSTRUCTION (EXAMPLE)
(figure: step-by-step FP-tree construction on an example data set)
FP-TREE SIZE
 The FP-Tree usually has a smaller size than the
uncompressed data - typically many transactions
share items (and hence prefixes).
 Best case scenario: all transactions contain the same set of
items.
 1 path in the FP-tree
 Worst case scenario: every transaction has a unique set of
items (no items in common)
 Size of the FP-tree is at least as large as the original data.
 Storage requirements for the FP-tree are higher - need to store
the pointers between the nodes and the counters.
 The size of the FP-tree depends on how the items are
ordered
 Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it is a heuristic).
STEP 2: FREQUENT ITEMSET GENERATION
 FP-Growth extracts frequent itemsets from the
FP-tree.
 Bottom-up algorithm - from the leaves towards
the root
 Divide and conquer: first look for frequent itemsets ending in e, then de, etc., then d, then cd, etc.
 First, extract the prefix-path sub-trees ending in an item(set) (hint: use the linked lists)
PREFIX PATH SUB-TREES (EXAMPLE)
(figure: prefix-path sub-trees for each item, extracted from the FP-tree)
STEP 2: FREQUENT ITEMSET GENERATION
 Each prefix path sub-tree is processed
recursively to extract the frequent
itemsets. Solutions are then merged.
 E.g., the prefix path sub-tree for e will be used to extract frequent itemsets ending in e, then in de, ce, be and ae, then in cde, bde, ade, etc.
 Divide and conquer approach
CONDITIONAL FP-TREE
 The FP-Tree that would be built if we only consider
transactions containing a particular itemset (and then
removing that itemset from all transactions).
 Example: the FP-tree conditional on e (figure in the original slides).
EXAMPLE
Let minSup = 2 and extract all frequent itemsets
containing e.
 1. Obtain the prefix path sub-tree for e:
EXAMPLE
 2. Check if e is a frequent item by adding the counts along the linked list (dotted line). If so, extract it.
 Yes, count = 3, so {e} is extracted as a frequent itemset.
 3. As e is frequent, find the frequent itemsets ending in e, i.e. de, ce, be and ae.
EXAMPLE
 4. Use the conditional FP-tree for e to find frequent itemsets ending in de, ce and ae
 Note that be is not considered, as b is not in the conditional FP-tree for e.
 For each of them (e.g. de), find the prefix paths from the conditional tree for e, extract frequent itemsets, generate the conditional FP-tree, etc. (recursive; see the sketch below)
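A sketch of this recursion, operating directly on conditional pattern bases (lists of (prefix-path, count) pairs) rather than on materialized trees; this keeps the example short while following the same divide-and-conquer logic. The pattern base for e below is hypothetical, standing in for the paths read off the linked list in the (lost) example figure:

```python
from collections import Counter

def fp_growth(pattern_base, suffix, min_sup, results):
    """pattern_base: list of (path, count) pairs, each path being the items
    preceding `suffix` (root to leaf) in one branch of the conditional tree."""
    # Count items in the conditional pattern base.
    support = Counter()
    for path, count in pattern_base:
        for item in path:
            support[item] += count
    for item, sup in support.items():
        if sup < min_sup:
            continue
        new_suffix = [item] + suffix
        results[frozenset(new_suffix)] = sup
        # Project the base onto `item` and recurse (the conditional FP-tree).
        projected = [(path[:path.index(item)], count)
                     for path, count in pattern_base if item in path]
        fp_growth(projected, new_suffix, min_sup, results)

# Hypothetical conditional pattern base for e, for illustration:
base_e = [(["a", "c", "d"], 1), (["a", "d"], 1), (["b", "c"], 1)]
results = {frozenset(["e"]): 3}
fp_growth(base_e, ["e"], min_sup=2, results=results)
print(results)  # for these paths: {a,e}: 2, {c,e}: 2, {d,e}: 2, {a,d,e}: 2
```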
EXAMPLE
 Example: e -> de -> ade ({d,e} and {a,d,e} are found to be frequent)
 Example: e -> ce ({c,e} is found to be frequent)
RESULT
Frequent itemsets found (ordered by suffix and the order in which they are found):
(table of results in the original slides)
DISCUSSION
 Advantages of FP-Growth
 only 2 passes over the data set
 “compresses” the data set
 no candidate generation
 much faster than Apriori
 Disadvantages of FP-Growth
 the FP-tree may not fit in memory!
 the FP-tree is expensive to build
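In practice, off-the-shelf implementations exist; for example, assuming the third-party mlxtend package is installed, FP-Growth can be run on a one-hot encoded transaction table roughly like this (a usage sketch, not part of the original slides):

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

dataset = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(dataset).transform(dataset), columns=te.columns_)

# min_support is a fraction of transactions: 2 of 4 -> 0.5
print(fpgrowth(df, min_support=0.5, use_colnames=True))
```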
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
 ECLAT: for each item, store a list of transaction ids (tids); vertical data layout (each item maps to its TID-list)
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
 Determine the support of any k-itemset by intersecting the tid-lists of two of its (k−1)-subsets.
 Advantage: very fast support counting
 Disadvantage: intermediate tid-lists may become too large for memory

Example: tid-list(A) = {1, 4, 5, 6, 7, 8, 9} and tid-list(B) = {1, 2, 5, 7, 8, 10};
A ∧ B → tid-list(AB) = tid-list(A) ∩ tid-list(B) = {1, 5, 7, 8}
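A minimal ECLAT-style sketch using Python sets as tid-lists (illustrative; real implementations typically use sorted arrays or bitmaps for the intersections):

```python
def eclat(tidlists, min_sup, prefix=(), results=None):
    """tidlists: dict mapping an item to the set of transaction ids containing it."""
    if results is None:
        results = {}
    items = sorted(tidlists)
    for i, item in enumerate(items):
        tids = tidlists[item]
        if len(tids) < min_sup:
            continue
        itemset = prefix + (item,)
        results[itemset] = len(tids)  # support = length of the tid-list
        # Extend with later items by intersecting tid-lists, then recurse.
        suffix = {other: tids & tidlists[other] for other in items[i + 1:]}
        eclat(suffix, min_sup, itemset, results)
    return results

tidlists = {"A": {1, 4, 5, 6, 7, 8, 9}, "B": {1, 2, 5, 7, 8, 10}}
print(eclat(tidlists, min_sup=2))  # ('A',): 7, ('B',): 6, ('A', 'B'): 4
```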
INTERESTINGNESS MEASURE: CORRELATIONS (LIFT)
 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%.
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000
LIFT
 Measure of dependent/correlated events: lift

  lift = P(A ∪ B) / (P(A) × P(B))

 Lift = 1: A and B are independent
 Lift > 1: A and B are positively correlated
 Lift < 1: A and B are negatively correlated

With B = plays basketball and C = eats cereal:
  lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
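The same numbers, verified with a few lines of Python (a worked check of the contingency table above):

```python
def lift(p_ab, p_a, p_b):
    """lift = P(A ∪ B) / (P(A) * P(B)); following the slides, P(A ∪ B) is the
    fraction of transactions containing both A and B."""
    return p_ab / (p_a * p_b)

n = 5000
print(round(lift(2000 / n, 3000 / n, 3750 / n), 2))  # 0.89 -> negatively correlated
print(round(lift(1000 / n, 3000 / n, 1250 / n), 2))  # 1.33 -> positively correlated
```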