Frequent Item Mining - Data Mining - Pattern Mining

What is data mining?
• Pattern Mining
• What patterns?
• Why are they useful?

Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
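
The definitions above can be checked directly on the five-transaction table. The following is a minimal sketch, not a reference implementation; the function names `support_count`, `support`, and `is_frequent` are chosen here for illustration.

```python
# Market-basket transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}


def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)


def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, db) / len(db)


def is_frequent(itemset, db, minsup):
    """True if the support of `itemset` is at least `minsup` (a fraction)."""
    return support(itemset, db) >= minsup


print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```

With minsup = 0.4 the itemset {Milk, Bread, Diaper} counts as frequent; with minsup = 0.6 it does not.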

Frequent Itemsets Mining
TID Transactions
100 { A, B, E }
200 { B, D }
300 { A, B, E }
400 { A, C }
500 { B, C }
600 { A, C }
700 { A, B }
800 { A, B, C, E }
900 { A, B, C }
1000 { A, C, E }
• Minimum support level: 50%
– Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}
• How can this be linked to the Data Cube?

Three Different Views of FIM
• Transactional Database
– How do we store a transactional database?
• Horizontal, Vertical, Transaction-Item Pair
• Binary Matrix
• Bipartite Graph
• How is FIM formulated in these different settings?
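
To make the views concrete, here is a small sketch that derives the transaction-item pairs, the vertical layout, and the binary matrix from one horizontal database. It reuses the market-basket data from the definition slide; the variable and function names are illustrative assumptions. The (tid, item) pairs double as the edge list of the bipartite graph between transactions and items.

```python
from collections import defaultdict

# Horizontal layout: one row per transaction.
horizontal = {
    1: ["Bread", "Milk"],
    2: ["Bread", "Diaper", "Beer", "Eggs"],
    3: ["Milk", "Diaper", "Beer", "Coke"],
    4: ["Bread", "Milk", "Diaper", "Beer"],
    5: ["Bread", "Milk", "Diaper", "Coke"],
}

# Transaction-item pairs: one (tid, item) row per occurrence.
# Read as edges, they form the bipartite graph view.
pairs = [(tid, item) for tid, items in horizontal.items() for item in items]

# Vertical layout: one tid-list per item.
vertical = defaultdict(list)
for tid, item in pairs:
    vertical[item].append(tid)

# Binary matrix: rows are transactions, columns are items in sorted order.
items = sorted(vertical)
matrix = [[1 if item in horizontal[tid] else 0 for item in items]
          for tid in sorted(horizontal)]


def matrix_support_count(itemset):
    """Support count in the matrix view: rows with a 1 in every chosen column."""
    cols = [items.index(i) for i in itemset]
    return sum(1 for row in matrix if all(row[c] for c in cols))


print(vertical["Beer"])                          # [2, 3, 4]
print(matrix_support_count(["Bread", "Milk"]))   # 3
```

In the matrix view a frequent itemset is a set of columns that are simultaneously 1 in enough rows; in the vertical view it is a set of items whose tid-lists have a large enough intersection.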

Frequent Itemset Generation
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
Given d items, there are 2^d possible candidate itemsets

Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw) => Expensive since M = 2^d !!!
[Figure: N transactions, each up to w items wide, matched against a list of M candidates]
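
A minimal sketch of the brute-force approach, assuming the market-basket data from the definition slide and a minimum support count of 3. The function name `brute_force_frequent` is illustrative. Every non-empty itemset in the lattice is generated and matched against every transaction, which is what makes the method exponential in the number of items.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def brute_force_frequent(db, minsup_count):
    """Enumerate every non-empty itemset and count it with a full scan."""
    items = sorted(set().union(*db))
    frequent = {}
    n_candidates = 0
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            n_candidates += 1
            # One pass over all N transactions per candidate.
            count = sum(1 for t in db if set(cand) <= t)
            if count >= minsup_count:
                frequent[cand] = count
    return frequent, n_candidates


frequent, n_candidates = brute_force_frequent(transactions, 3)
print(n_candidates)   # 63 = 2**6 - 1 non-empty candidates for d = 6 items
print(len(frequent))  # 8
```

Only 8 of the 63 candidates turn out to be frequent, so most of the counting work is wasted; this is the motivation for pruning the candidate set.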

Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
• Apriori principle holds due to the following property
of the support measure:
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
$$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)$$
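
The anti-monotone property can be verified exhaustively on a small database. The sketch below assumes the market-basket data from the definition slide and checks every pair of itemsets X ⊆ Y for a violation of s(X) ≥ s(Y); it is a sanity check, not a proof.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))


def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


# All 2**6 = 64 itemsets, including the empty set.
all_itemsets = [frozenset(c)
                for k in range(len(items) + 1)
                for c in combinations(items, k)]

# A violation would be a pair X <= Y with s(X) < s(Y).
violations = [(x, y) for x in all_itemsets for y in all_itemsets
              if x <= y and support(x) < support(y)]
print(len(violations))  # 0
```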




Illustrating Apriori Principle
[Figure: itemset lattice from null to ABCDE; once an itemset is found to be infrequent, all of its supersets are pruned]

Illustrating Apriori Principle
Minimum Support = 3
Items (1-itemsets)
Item Count
Bread 4
Coke 2
Milk 4
Beer 3
Diaper 4
Eggs 1
Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)
Itemset Count
{Bread,Milk} 3
{Bread,Beer} 2
{Bread,Diaper} 3
{Milk,Beer} 2
{Milk,Diaper} 3
{Beer,Diaper} 3
Triplets (3-itemsets)
Itemset Count
{Bread,Milk,Diaper} 2
If every subset is considered,
6C1 + 6C2 + 6C3 = 41
With support-based pruning,
6 + 6 + 1 = 13
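
The two candidate totals can be reproduced with a short level-wise sketch. It assumes the market-basket data from the definition slide and a minimum support count of 3; the pruning rule used here (keep a (k+1)-itemset only if all of its k-subsets are frequent) is the Apriori principle stated earlier, and the variable names are illustrative.

```python
from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
MINSUP = 3
items = sorted(set().union(*transactions))


def count(itemset):
    """Support count of `itemset` by a full scan."""
    return sum(1 for t in transactions if set(itemset) <= t)


# Without pruning: every 1-, 2- and 3-itemset is a candidate.
unpruned = sum(comb(len(items), k) for k in (1, 2, 3))

# With support-based pruning: build level k+1 only from frequent k-itemsets.
candidates_per_level = []
level = [(i,) for i in items]
for k in (1, 2, 3):
    candidates_per_level.append(len(level))
    frequent = {c for c in level if count(c) >= MINSUP}
    freq_items = sorted({i for c in frequent for i in c})
    level = [c for c in combinations(freq_items, k + 1)
             if all(s in frequent for s in combinations(c, k))]

print(unpruned)              # 41
print(candidates_per_level)  # [6, 6, 1]
```

The single 3-itemset candidate is {Bread, Diaper, Milk}. It is generated because all three of its pairs are frequent, but its own support count in the five transactions is 2, so it is not frequent at minimum support 3.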

Apriori
R. Agrawal and R. Srikant. Fast algorithms for
mining association rules. VLDB, 487-499, 1994

How to Generate Candidates?
• Suppose the items in L_{k-1} are listed in an order
• Step 1: self-joining L_{k-1}
    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1 and … and p.item_{k-2} = q.item_{k-2} and p.item_{k-1} < q.item_{k-1}
• Step 2: pruning
    forall itemsets c in C_k do
        forall (k-1)-subsets s of c do
            if (s is not in L_{k-1}) then delete c from C_k
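
The join and prune steps translate almost line for line into Python. This is a sketch under the assumption that each itemset is stored as a sorted tuple; the function name `apriori_gen` and the lowercase example itemsets are illustrative, not taken from the slides.

```python
from itertools import combinations


def apriori_gen(l_prev):
    """Generate candidate k-itemsets C_k from frequent (k-1)-itemsets L_{k-1}.

    `l_prev` is a collection of sorted tuples, all of length k-1.
    """
    l_prev = set(l_prev)
    ordered = sorted(l_prev)
    if not ordered:
        return set()
    k = len(ordered[0]) + 1

    # Step 1: self-join. p and q agree on the first k-2 items and
    # p's last item precedes q's last item.
    joined = set()
    for p in ordered:
        for q in ordered:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))

    # Step 2: prune. Drop c if any (k-1)-subset of c is not in L_{k-1}.
    return {c for c in joined
            if all(s in l_prev for s in combinations(c, k - 1))}


l3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(l3))  # {('a', 'b', 'c', 'd')}
```

In this hypothetical L_3, the join produces abcd and acde; acde is pruned because its subset ade is not in L_3. Applied to the frequent pairs of the market-basket example, the same function yields the single candidate (Bread, Diaper, Milk).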

Challenges of Frequent Itemset Mining
• Challenges
– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates

Alternative Methods for Frequent Itemset Generation
• Representation of Database
– horizontal vs vertical data layout
Horizontal Data Layout
TID Items
1 A,B,E
2 B,C,D
3 C,E
4 A,C,D
5 A,B,C,D
6 A,E
7 A,B
8 A,B,C
9 A,C,D
10 B
Vertical Data Layout
Item TIDs
A 1, 4, 5, 6, 7, 8, 9
B 1, 2, 5, 7, 8, 10
C 2, 3, 4, 5, 8, 9
D 2, 4, 5, 9
E 1, 3, 6

ECLAT
• For each item, store a list of transaction ids (tids), called its TID-list: one column of the vertical data layout on the previous slide

ECLAT
• Determine support of any k-itemset by intersecting tid-lists of two of its (k-1)-subsets.
• 3 traversal approaches:
– top-down, bottom-up and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for memory
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
A ∩ B → AB: 1, 5, 7, 8
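
The tid-list intersection idea can be sketched as a short depth-first recursion. This is a simplified illustration of the ECLAT idea, not the published algorithm in full: it assumes the ten-transaction database from the data layout slide, a minimum support count of 2, and one fixed traversal order; `to_vertical` and `eclat` are names chosen here.

```python
# Horizontal database from the data layout slide (items as characters).
horizontal = {1: "ABE", 2: "BCD", 3: "CE", 4: "ACD", 5: "ABCD",
              6: "AE", 7: "AB", 8: "ABC", 9: "ACD", 10: "B"}


def to_vertical(db):
    """Convert the horizontal layout into item -> set of tids."""
    tidlists = {}
    for tid, items in db.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists


def eclat(prefix, candidates, minsup_count, out):
    """Depth-first search; `candidates` is a list of (item, tidset) pairs,
    where each tidset already belongs to prefix + item."""
    for i, (item, tids) in enumerate(candidates):
        if len(tids) < minsup_count:
            continue
        itemset = prefix + (item,)
        out[itemset] = len(tids)  # support count = length of the tid-list
        # Extend with later items: intersect the two (k-1)-itemset tid-lists.
        extensions = [(other, tids & other_tids)
                      for other, other_tids in candidates[i + 1:]]
        eclat(itemset, extensions, minsup_count, out)


vertical = to_vertical(horizontal)
result = {}
eclat((), sorted(vertical.items()), 2, result)
print(sorted(vertical["A"] & vertical["B"]))  # [1, 5, 7, 8]
print(result[("A", "B")])                     # 4
```

No database scan is needed after the vertical layout is built; each support count comes from one set intersection. The memory cost is that every recursion level holds its own intermediate tid-lists.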

FP-growth Algorithm
• Use a compressed representation of the
database using an FP-tree
• Once an FP-tree has been constructed, it uses
a recursive divide-and-conquer approach to
mine the frequent itemsets

FP-tree construction
TID Items
1 {A,B}
2 {B,C,D}
3 {A,C,D,E}
4 {A,D,E}
5 {A,B,C}
6 {A,B,C,D}
7 {B,C}
8 {A,B,C}
9 {A,B,D}
10 {B,C,E}
After reading TID=1:
null
  A:1
    B:1
After reading TID=2:
null
  A:1
    B:1
  B:1
    C:1
      D:1

FP-Tree Construction
Transaction Database: the ten transactions from the previous slide
null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1
Header table (Item → Pointer): A, B, C, D, E
Pointers are used to assist frequent itemset generation
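
A minimal sketch of the construction, assuming items are inserted in the fixed order A, B, C, D, E as in the slide's tree (FP-growth implementations commonly order items by decreasing frequency and drop infrequent ones first; that step is omitted here). The class `FPNode`, the function `build_fp_tree`, and the list-based header table are illustrative choices.

```python
class FPNode:
    """One node of the FP-tree: an item, a count, and child links."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, order):
    """Insert each transaction as a path from the root, sharing prefixes."""
    root = FPNode(None, None)
    # Header table: for each item, the list of tree nodes carrying it.
    header = {item: [] for item in order}
    for items in transactions:
        node = root
        for item in sorted(items, key=order.index):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # one more transaction passes through this node
            node = child
    return root, header


transactions = ["AB", "BCD", "ACDE", "ADE", "ABC",
                "ABCD", "BC", "ABC", "ABD", "BCE"]
root, header = build_fp_tree(transactions, "ABCDE")
print(root.children["A"].count)                # 7
print(root.children["A"].children["B"].count)  # 5
print(root.children["B"].count)                # 3
```

The header lists play the role of the pointer chains in the slide: `header["D"]` reaches all five D nodes without traversing the whole tree.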

FP-growth
Prefix paths ending in D:
null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
    D:1
  B:1
    C:1
      D:1
Conditional Pattern base for D:
P = {(A:1,B:1,C:1),
(A:1,B:1),
(A:1,C:1),
(A:1),
(B:1,C:1)}
Recursively apply FP-growth on P
Frequent Itemsets found (with sup > 1):
AD, BD, CD, ABD, ACD, BCD
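
The recursion can be sketched without the tree itself by working on lists of (path, count) pairs, which is what a conditional pattern base is. This is a simplification made for brevity: an FP-tree implementation would obtain the same prefix paths by following the header-table pointers. The function names and the string encoding of itemsets are assumptions of this sketch.

```python
from collections import Counter

transactions = ["AB", "BCD", "ACDE", "ADE", "ABC",
                "ABCD", "BC", "ABC", "ABD", "BCE"]


def conditional_pattern_base(db, item):
    """Prefix paths that precede `item`, with counts. `db` holds (path, count)
    pairs whose paths list items in the fixed order A < B < C < D < E."""
    base = Counter()
    for path, count in db:
        if item in path:
            prefix = path[:path.index(item)]
            if prefix:
                base[prefix] += count
    return list(base.items())


def fp_growth(db, suffix, minsup_count, out):
    """Find frequent itemsets ending in `suffix` by recursing on pattern bases."""
    counts = Counter()
    for path, count in db:
        for item in path:
            counts[item] += count
    for item, c in counts.items():
        if c >= minsup_count:
            out["".join(sorted(item + suffix))] = c
            fp_growth(conditional_pattern_base(db, item), item + suffix,
                      minsup_count, out)


db = [(t, 1) for t in transactions]
base_d = conditional_pattern_base(db, "D")
found = {}
fp_growth(base_d, "D", 2, found)  # sup > 1 means a count of at least 2
print(sorted(base_d))
print(sorted(found))
```

On this database the pattern base for D matches the five prefix paths listed above, and the itemsets reported with a count above 1 are AD, BD, CD, ABD, ACD and BCD. ABD has count 2 because TIDs 6 and 9 both contain A, B and D.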

Compact Representation of Frequent Itemsets
• Some itemsets are redundant because they have identical support as their supersets
• Number of frequent itemsets $= 3 \times \sum_{k=1}^{10} \binom{10}{k} = 3 \times (2^{10} - 1) = 3069$
• Need a compact representation
TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
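
The count for this example is quick to confirm. The sketch assumes a minimum support count of at most 5, so that every non-empty subset of one 10-item block (A1..A10, B1..B10, or C1..C10) is frequent, while any itemset mixing blocks has support 0.

```python
from math import comb

GROUPS = 3   # the A, B and C blocks
WIDTH = 10   # items per block

# Every non-empty subset of one block is frequent with support count 5.
total = GROUPS * sum(comb(WIDTH, k) for k in range(1, WIDTH + 1))
print(total)                   # 3069
print(total == 3 * (2**10 - 1))  # True
```

All subsets within a block share the support of the full block, so only the three full 10-item blocks are closed. That is why a compact representation loses nothing here.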

Maximal Frequent Itemset
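An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice from null to ABCDE with a border separating the frequent itemsets from the infrequent itemsets; the maximal itemsets are the frequent itemsets adjacent to the border]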

Closed Itemset
• An itemset is closed if none of its immediate supersets has the
same support as the itemset
TID Items
1 {A,B}
2 {B,C,D}
3 {A,B,C,D}
4 {A,B,D}
5 {A,B,C,D}
Itemset Support
{A} 4
{B} 5
{C} 3
{D} 4
{A,B} 4
{A,C} 2
{A,D} 3
{B,C} 3
{B,D} 4
{C,D} 3
Itemset Support
{A,B,C} 2
{A,B,D} 3
{A,C,D} 2
{B,C,D} 3
{A,B,C,D} 2
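
The closedness test can be applied directly to the support table above. In this sketch the table is typed in as a dictionary with itemsets written as strings ("AB" for {A,B}); the helper names are illustrative.

```python
# Support counts copied from the two tables above.
support = {
    "A": 4, "B": 5, "C": 3, "D": 4,
    "AB": 4, "AC": 2, "AD": 3, "BC": 3, "BD": 4, "CD": 3,
    "ABC": 2, "ABD": 3, "ACD": 2, "BCD": 3, "ABCD": 2,
}
ITEMS = "ABCD"


def immediate_supersets(itemset):
    """All itemsets obtained by adding exactly one missing item."""
    return ["".join(sorted(itemset + i)) for i in ITEMS if i not in itemset]


def is_closed(itemset):
    """Closed: no immediate superset has the same support."""
    return all(support[s] != support[itemset]
               for s in immediate_supersets(itemset))


closed = [x for x in support if is_closed(x)]
print(closed)  # ['B', 'AB', 'BD', 'ABD', 'BCD', 'ABCD']
```

For example, {A} is not closed because {A,B} has the same support of 4, whereas {A,B} is closed because both {A,B,C} and {A,B,D} have lower support.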

Maximal vs Closed Itemsets
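TID Items
1 ABC
2 ABCD
3 BCE
4 ACDE
5 DE
Transaction Ids supporting each itemset in the lattice (null to ABCDE):
A: 124, B: 123, C: 1234, D: 245, E: 345
AB: 12, AC: 124, AD: 24, AE: 4, BC: 123, BD: 2, BE: 3, CD: 24, CE: 34, DE: 45
ABC: 12, ABD: 2, ACD: 24, ACE: 4, ADE: 4, BCD: 2, BCE: 3, CDE: 4
ABCD: 2, ACDE: 4
Not supported by any transactions: ABE, BDE, ABCE, ABDE, BCDE, ABCDE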

Maximal vs Closed Frequent Itemsets
(Lattice and transaction ids as on the previous slide)
Minimum support = 2
# Closed = 9
# Maximal = 4
Closed and maximal: CE, DE, ABC, ACD
Closed but not maximal: C, D, E, AC, BC
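
Both counts can be recomputed from the five transactions. This sketch assumes minimum support count 2 and writes itemsets as strings; checking only immediate supersets is enough because support is anti-monotone. The helper names are illustrative.

```python
from itertools import combinations

transactions = {1: "ABC", 2: "ABCD", 3: "BCE", 4: "ACDE", 5: "DE"}
MINSUP = 2
ITEMS = "ABCDE"


def sup(itemset):
    """Support count of `itemset` (any iterable of single-character items)."""
    return sum(1 for t in transactions.values() if set(itemset) <= set(t))


frequent = {"".join(c): sup(c)
            for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)
            if sup(c) >= MINSUP}


def supersets(x):
    """Immediate supersets: add exactly one missing item."""
    return ["".join(sorted(x + i)) for i in ITEMS if i not in x]


# Closed: no immediate superset with the same support.
closed = [x for x in frequent
          if all(sup(s) != frequent[x] for s in supersets(x))]
# Maximal: no immediate superset is frequent.
maximal = [x for x in frequent
           if all(sup(s) < MINSUP for s in supersets(x))]

print(len(frequent), len(closed), len(maximal))  # 14 9 4
print(maximal)                                   # ['CE', 'DE', 'ABC', 'ACD']
```

Every maximal itemset also appears in the closed list, since an itemset with no frequent superset cannot have a superset of equal support.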

Maximal vs Closed Itemsets
[Figure: nested sets] Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets

Beyond Itemsets
• Sequence Mining
– Finding frequent subsequences from a collection of sequences
• Graph Mining
– Finding frequent (connected) subgraphs from a collection of
graphs
• Tree Mining
– Finding frequent (embedded) subtrees from a set of
trees/graphs
• Geometric Structure Mining
– Finding frequent substructures from 3-D or 2-D geometric
graphs
• Among others…

Frequent Pattern Mining
[Figure: labeled graphs with node labels A to F]

Why Is Frequent Pattern Mining So Important?
• Application Domains
– Business, biology, chemistry, WWW, computer/networking security, …
• Summarizing the underlying datasets, providing key insights
• Basic tools for other data mining tasks
– Association rule mining
– Classification
– Clustering
– Change Detection
– etc…

Network motifs: recurring patterns that
occur significantly more than in
randomized nets
• Do motifs have specific roles in the network?
• Many possible distinct subgraphs

The 13 three-node connected
subgraphs

199 4-node directed connected subgraphs
And it grows fast for larger subgraphs : 9364 5-node subgraphs,
1,530,843 6-node…

Finding network motifs –
an overview
• Generation of a suitable random ensemble (reference
networks)
• Network motifs detection process:
– Count how many times each subgraph appears
– Compute statistical significance for each subgraph: the probability of it appearing in random networks as often as in the real network (P-value or Z-score)

Real = 5, Rand = 0.5 ± 0.6
Z-score (number of standard deviations) = 7.5
[Figure: ensemble of randomized networks]
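
The Z-score in this example follows from the usual definition: the difference between the count in the real network and the mean count in the random ensemble, divided by the ensemble's standard deviation. A minimal sketch with the slide's numbers (the function name `z_score` is chosen here):

```python
def z_score(real_count, rand_mean, rand_std):
    """Number of standard deviations by which the real count exceeds
    the mean count over the randomized networks."""
    return (real_count - rand_mean) / rand_std


z = z_score(real_count=5, rand_mean=0.5, rand_std=0.6)
print(round(z, 2))  # 7.5
```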

References
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD, 207-216, 1993.
• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, 487-499, 1994.
• R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD, 85-93, 1998.

References:
• Christian Borgelt. Efficient Implementations of Apriori and Eclat. FIMI’03.
• Ferenc Bodon. A fast APRIORI implementation. FIMI’03.
• Ferenc Bodon. A Survey on Frequent Itemset Mining. Technical Report, Budapest University of Technology and Economics, 2006.

Important websites:
• FIMI workshop
– Not only Apriori and FIM
• FP-tree, ECLAT, Closed, Maximal
– http://fimi.cs.helsinki.fi/
• Christian Borgelt’s website
– http://www.borgelt.net/software.html
• Ferenc Bodon’s website
– http://www.cs.bme.hu/~bodon/en/apriori/
