IT in Industry, vol. 3, no. 3, 2015. Published online 15 Oct 2015.
ISSN (Print): 2204-0595; ISSN (Online): 2203-1731
© Kabir, Xu, Kang, and Zhao, 2015
GeneticMax: An Efficient Approach to
Mining Maximal Frequent Itemsets Based on
Genetic Algorithms
Mir Md. Jahangir Kabir, Shuxiang Xu, Byeong Ho Kang, Zongyuan Zhao
School of Engineering and ICT
University of Tasmania
Launceston, Australia
{mmjkabir, Shuxiang.Xu, Byeong.Kang, Zongyuan.Zhao}@utas.edu.au
Abstract—This paper presents a new approach based on
genetic algorithms (GAs) to generate maximal frequent itemsets
(MFIs) from large datasets. The new algorithm, GeneticMax, is a
heuristic that mimics natural selection to find MFIs efficiently.
Its search strategy uses a lexicographic tree and avoids
level-by-level searching, which reduces the time required to mine
the MFIs. Our implementation of the search strategy includes a
bitmap representation of the nodes in the lexicographic tree and
identifies frequent itemsets (FIs) from the superset-subset
relationships between nodes. The algorithm uses the principles of
GAs to perform global searches; because it uses a non-deterministic
approach, its time complexity is lower than that of many other
algorithms. We separate the effect of each step of the algorithm by
experimental analysis on real datasets such as Tic-Tac-Toe, Zoo,
and a 10000×8 dataset. Our experimental results show that the
approach is efficient and scalable for itemsets of different sizes:
it accesses the main dataset to calculate support values for only a
small number of nodes when finding the FIs, even when the search
space is very large, dramatically reducing the search time. The
proposed algorithm shows how an evolutionary method can be used on
real datasets to find all the MFIs efficiently.
Keywords—data mining; genetic algorithm; maximal frequent
itemset; lexicographic tree
I. INTRODUCTION
Mining frequent itemsets is one of the fundamental and
essential issues in various data mining applications, such as the
consumer market-basket problem, the discovery of association
rules, the deduction of patterns and correlations, network
intrusion detection, and other important data mining tasks. The
problem is formulated as follows: given a set of items and a large
collection of transactions, where each transaction is a subset of
these items, find all frequent itemsets. Whether an itemset is
frequent is determined by a user-specified percentage (the support
value) of the dataset.
This problem of association rule mining includes two steps:
first, mining frequent itemsets from a large dataset, and second,
generating association rules or correlation among a large set of
data items. Nowadays, huge amounts of data are collected and
stored by industries, which are interested in mining frequent
itemsets from large datasets. The discovery of association rules
among a large number of business transactions helps industries
make business decisions [1–4]. A common real-life application
is market basket analysis, in which retailers seek to
understand the purchasing behavior of customers. The data
analysis attempts to find interesting hidden relationships
among products purchased by customers through association
rule mining or frequent pattern mining. For example, by using
frequent pattern mining, a shop manager may discover that
butter, milk and bread are frequently purchased together by
customers. In another example, an association rule may
indicate that a customer who buys coffee is also likely to buy
milk. Such information can then be used for cross-selling and
up-selling, sales promotions, store design, and discount plans.
Similarly, a video shop manager can use frequent pattern
mining and/or association rule mining to recommend related
videos or games when a customer has hired or bought a
specific video or game, to attract the customer to return. Web
administrators can use frequent pattern or association rule
mining to understand particular collections of web pages that
are viewed together by a group of web users. This sort of
interesting relationship (e.g., correlation, association) among
data items helps managers make relevant decisions [5].
Let D = {t1, t2, t3, …, ti, …, tn} be a dataset, where t1, t2, t3, …, tn
are transactions; there are n transactions in total. Each
transaction ti contains a subset of the items I = {i1, i2, i3, …,
ik, …, im}, where i1 is item number 1, i2 is item number 2, and so
on. Transaction ti is represented as a binary vector: ti[k] = 1
means that ti contains item ik, and otherwise ti[k] = 0. Let X be a
subset of the items in I, i.e. X ⊆ I. For a transaction ti, let
ti(X) = 1 if the examined itemset X appears as a subset of ti, and
ti(X) = 0 otherwise. The support value of an itemset is the fraction
of transactions in which it appears as a subset, i.e.

δ(X) = (t1(X) + t2(X) + t3(X) + … + tn-1(X) + tn(X)) / |D|

An itemset with 1 item is called a 1-itemset, and an itemset with k
items is called a k-itemset. An itemset is called frequent if its
support value is greater than or equal to a user-defined threshold
value, denoted min_supp (minimum support), i.e. δ(X) ≥ min_supp,
where δ(X) is the
support value of an examined itemset. We denote the frequent
itemsets by FI. If an itemset X is frequent and no superset of X is
frequent, then X is a maximal frequent itemset; we denote the set
of all maximal frequent itemsets by MFI [6].
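As a concrete illustration (not part of the original formulation), the support computation can be sketched in Python, with transactions held as sets of item numbers; the dataset below is the five-transaction example used later in Fig. 3.

    # A minimal sketch of the support computation delta(X)
    D = [
        {1, 2, 3},      # t1
        {1, 2, 3, 4},   # t2
        {1, 3, 4},      # t3
        {1, 3, 4},      # t4
        {1, 2, 3, 4},   # t5
    ]

    def support(X, D):
        # Fraction of transactions that contain every item of X
        return sum(1 for t in D if set(X) <= t) / len(D)

    min_supp = 0.3
    print(support({1, 3}, D))              # 1.0
    print(support({2, 4}, D) >= min_supp)  # delta({2, 4}) = 0.4 -> frequent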
Generating the MFIs from a large dataset is time-consuming. In
this paper, we present a novel approach to finding MFIs in large
datasets using a GA. The length of an FI depends on the
relationships among the itemsets. A GA-based approach that performs
global searches has several advantages: its time complexity is
significantly lower than that of many other algorithms, and it is
able to generate frequent itemsets from large datasets. This work
differs from existing research in the following aspects:
1. Our GeneticMax uses a lexicographic tree as a search
tree, and it does not need to enumerate frequent
itemsets level by level.
2. This approach uses the principles of a GA, which
randomly generates chromosomes. If a generated
chromosome is frequent, then all the subsets of this
chromosome are automatically pruned; if it is infrequent,
then all of its supersets are automatically pruned. This
technique dramatically reduces the time spent accessing a
large dataset to calculate the support values of
unnecessary chromosomes while finding frequent itemsets.
II. RELATED WORK
Frequent pattern mining has been one of the most challenging
and active areas of data mining research for over a decade. A
large body of literature has been dedicated to this area, and
significant progress has been made because of these efforts:
progress in sequential pattern mining and in correlation and
structured pattern mining, and the design of scalable and
efficient algorithms for mining frequent itemsets, among others
[6]. Research on frequent pattern mining has expanded the scope
of data analysis and has had a profound impact on the
methodologies and applications of data mining. Although plenty of
progress has been made, some challenging issues still need to be
resolved. These critical research issues include the following:
Scalable mining methods, a focal topic of frequent pattern
mining research, have been extensively studied. Current mining
methods derive sets of frequent patterns that are often too huge
to use effectively. To reduce these huge sets, researchers have
proposed several condensed forms, such as maximal patterns,
representative patterns, closed patterns, and condensed patterns.
However, it is still unclear, for a specific application, which
pattern set provides the best compactness and quality of
representation. Much investigation is needed to reduce the size
of the derived pattern sets while increasing the quality of the
preserved patterns.
Some applications prefer approximate frequent patterns, even
though current studies show that efficient methods are available
for mining complete and explicit sets of frequent patterns. In
bioinformatics, for example, one may be interested in searching
for long sequence patterns in DNA analysis to match biological
entities. Much investigation is needed to design efficient
methods that make this kind of mining more capable than the tools
presently available in bioinformatics.
Classification is another major task in data mining.
Classification using frequent patterns raises the question of
which frequent patterns are more adequate than others. In the
future, researchers should design methods in such a way that
effective frequent patterns are mined directly from the data [6].
Some applications require an in-depth understanding of patterns
and their interpretation. Most researchers have focused on
discovering frequent patterns but have given less attention to
analysing and interpreting them. The semantic analysis of a
pattern includes the meaning of that pattern, the typical
transactions it covers, and so on. Finding the reasons behind the
frequency of a specific pattern is termed contextual analysis of
a frequent pattern; for example, a pattern could be frequent for
a specific time period, location, or weather condition. To
improve the interpretability, effectiveness, and usability of
frequent patterns, it is necessary to have a deep understanding
of them [7].
It is well known that the Apriori algorithm generates a
candidate set and tests it in a breadth-first manner. It discovers
all the frequent itemsets at level k before moving to the next
level (k+1): it counts the support value of each node at level k
and prunes those nodes whose support values do not satisfy the
user-defined support value. Because it generates candidate
itemsets at each level and scans the dataset so frequently, it
becomes costly, especially when a long pattern exists [1].
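For reference, a minimal level-wise sketch of this scheme might look as follows; it omits Apriori's subset-based candidate pruning, and all names are illustrative rather than taken from [1].

    def apriori(D, min_supp):
        # Level-wise (breadth-first) mining over transactions given as sets
        items = {i for t in D for i in t}
        candidates = [frozenset([i]) for i in items]
        frequent = []
        while candidates:
            # One pass over the dataset per level to count candidate supports
            level = [c for c in candidates
                     if sum(1 for t in D if c <= t) / len(D) >= min_supp]
            frequent.extend(level)
            # Join frequent k-itemsets into (k+1)-candidates for the next level
            candidates = list({a | b for a in level for b in level
                               if len(a | b) == len(a) + 1})
        return frequent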
The Pincer-search algorithm [8] traverses a lattice through a
bi-directional method that follows both top-down and bottom-
up approaches. To find the maximal frequent itemsets, pruning
is applied based on the following rules:
1. All the subsets of frequent itemsets are pruned.
2. All the supersets of infrequent itemsets are pruned.
A breadth-first traversal strategy (a level-by-level search
over the search space) was used in the MaxMiner search algorithm.
To prune the branches of the tree, MaxMiner employs a look-ahead
method. It uses the breadth-first approach to limit the number of
passes over the dataset, although the look-ahead technique, which
involves superset pruning, works better with depth-first search
methods [9].
DepthProject performs a depth-first traversal of a
lexicographic tree along with variations of superset pruning. It
applies dynamic reordering methods to order the child nodes, and
it reduces the size of the search space by trimming infrequent
items out of each node's tail. To eliminate non-maximal frequent
itemsets, DepthProject requires post-pruning methods [10].
MAFIA, proposed by Burdick, Calimlim, and Gehrke [11],
extends the idea of DepthProject. Similar to the DepthProject,
MAFIA also uses a vertical bitmap representation, where the
support count of an itemset is obtained by bitwise AND operations
among the itemsets' bitvectors.
An example with four items per data tuple is given in Fig. 1.
The bitvectors for the 1-itemsets A, B, C, and D of Fig. 1 are
10111, 01001, 11110, and 11011, respectively. To get the support
count of an itemset, bitwise AND (&) operations are applied
between the bitvectors of its items. In this example, the bitwise
AND of the bitvectors of itemsets A and C is 10111 & 11110 =
10110. The support count of an itemset is the number of 1's in
its bitvector, so the support count of the 2-itemset {A, C} is 3.
ANDing the bitvector of D with the previous result gives 10110 &
11011 = 10010, so the support count of the 3-itemset {A, C, D} is
2. The search strategy of MAFIA integrates a depth-first traversal
of the tree with effective pruning methods to find the maximal
frequent itemsets, including the look-ahead pruning first used by
MaxMiner. MAFIA also includes a check that is easy to test and
that allows us to conclude, without counting A ∪ C, that {A, C}
is frequent; this technique is defined as Parent Equivalence
Pruning in [11].
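This bitvector arithmetic is easy to reproduce; the sketch below uses Python integers as bitmaps (one bit per transaction, values from Fig. 1), and the helper name is ours rather than MAFIA's.

    # Bitvectors of the 1-itemsets A, C, and D from Fig. 1
    A, C, D_item = 0b10111, 0b11110, 0b11011

    def support_count(first, *rest):
        v = first
        for b in rest:
            v &= b                    # bitwise AND of the item bitvectors
        return bin(v).count("1")      # number of 1's = support count

    print(support_count(A, C))          # 3 -> support count of {A, C}
    print(support_count(A, C, D_item))  # 2 -> support count of {A, C, D}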
Gouda and Zaki proposed a novel approach called GenMax to find
maximal frequent itemsets [12]. In their approach, they used a
novel technique called progressive focusing, which maintains
local maximal frequent itemsets (LMFIs) that are compared with
newly found frequent itemsets (FIs). Non-maximal frequent
itemsets are identified through this step, which decreases the
amount of subset testing. GenMax uses a vertical representation
of the dataset and stores a transaction identifier set (TIS) for
each itemset instead of a bitvector; the support value of an
itemset is defined by the cardinality of its TIS. Based on their
experimental results, the authors concluded that the algorithm
performs better than existing algorithms on different types of
datasets.
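A sketch of this tidset-based support count, with the TIS values read off the dataset of Fig. 3 and illustrative names, is shown below.

    # Each item maps to its transaction identifier set (TIS)
    tis = {
        "i1": {1, 2, 3, 4, 5},
        "i2": {1, 2, 5},
        "i3": {1, 2, 3, 4, 5},
        "i4": {2, 3, 4, 5},
    }

    def support_count(itemset):
        # Support = cardinality of the intersection of the items' tidsets
        return len(set.intersection(*(tis[i] for i in itemset)))

    print(support_count({"i1", "i4"}))  # 4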
Alataş and Akin designed an efficient genetic algorithm as a
search strategy to mine both positive and negative quantitative
association rules [13]. Association rules are usually deduced
from frequent patterns; differently from other methods, their
approach mines association rules without generating frequent
itemsets. The proposed genetic algorithm does not depend on
minimum support and confidence values, which are hard to define
for a dataset. A new genetic operator named the uniform operator
is used in this approach to ensure genetic diversity.
Another interesting problem in data mining is classification,
in which itemsets of different lengths are classified into
different groups based on their frequencies. Concise
A B C D
1 0 1 1
0 1 1 1
1 0 1 0
1 0 1 1
1 1 0 1
Fig. 1. Vertical bitmap representation.
symbolic rules with high accuracy can be mined using neural
networks. To reach the required accuracy, the network is initially
trained; a network pruning algorithm is then used to prune
redundant connections, and classification rules are generated by
analysing the activation values of the hidden layers. Researchers
noticed that the main drawback of using neural networks on data
mining test problems is the training time: although a neural
network provides a lower classification error rate than decision
trees, it requires a long training time [14, 15].
A quick-response data mining model based on a genetic
algorithm was designed in [16]. This approach gives the user more
flexibility in mining frequent itemsets. Long frequent itemsets
are generated when the relationships among data tuples are
strong; if the Apriori algorithm is used to mine frequent
itemsets from such tuples, it can take a huge amount of time
because of frequent access to the dataset and the large number of
candidate itemsets generated. The quick-response approach avoids
considering huge candidate itemsets: it uses a GA to mine
itemsets and shows them to the user, and only if the user is
interested does it scan the dataset to get the support values of
those itemsets.
To mine quantitative association rules, a GA-based
algorithm named QUANTMINER was proposed in [17, 18].
By optimizing the support and confidence values, this system
dynamically identifies good intervals in association rules.
The researchers applied this algorithm to different datasets and
showed its usefulness as a data mining tool.
R. J. Kuo and C. W. Shih used a new meta-heuristic technique,
the ant colony system [19], to mine large databases efficiently
for association rules. Multi-dimensional constraints are
considered in this approach, and user-assigned constraints are
also taken into account. The results showed that it gave more
condensed rules than the Apriori algorithm, with less
computational time. Although this system provides promising
results, it still faces some issues that need to be resolved:
after analysing the results, it was found that many similar rules
were generated, so a fuzzy approach was applied to merge those
similar rules into one class.
Most researchers working on mining association rules have
focused on improving computational efficiency. To determine the
threshold values of support and confidence, which are the key
factors in the association rule mining task, researchers proposed
a new approach based on the particle swarm optimization
technique. Suitable fitness values, and the corresponding support
and confidence values of the identified swarms, are searched
through this approach. Their results showed that the particle
swarm optimization algorithm quickly finds suitable threshold
fitness values of itemsets, and quality rules are obtained in
this way. Users can mine specific rules from a large database by
setting a support or fitness value. Since the technique is free
from the support constraint, the main problem with this approach
is that users have little control over the mining process. Apart
from this, their results only showed two- or three-dimensional
rules, rather than the higher-dimensional rules that could be
interesting for industry policy makers [20].
III. THE IDEA OF FAST RESPONSE
If the data tuples contain long itemsets, huge numbers of
candidate itemsets will be generated, which reduces the
efficiency of a solution. A long itemset enumerates a
combinatorial number of shorter, frequent sub-itemsets. For
example, a data tuple containing 50 items, {i1, i2, i3, …, i50},
enumerates C(50, 1) = 50 frequent 1-itemsets (i1, i2, …, i50),
C(50, 2) frequent 2-itemsets ((i1, i2), (i1, i3), …, (i1, i50),
(i2, i3), (i2, i4), …, (i2, i50)), and so on.
Lemma 1: If the length of an itemset is n, then it enumerates
2^n − 1 frequent sub-itemsets.
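Lemma 1 can be checked by brute force; the snippet below enumerates the non-empty sub-itemsets of a 4-item itemset.

    from itertools import combinations

    n = 4
    items = list(range(1, n + 1))
    subsets = [c for k in range(1, n + 1) for c in combinations(items, k)]
    print(len(subsets), 2 ** n - 1)  # 15 15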
If an itemset is long, these sub-itemsets can become too
numerous for a computer to compute and store. For each
sub-itemset, the Apriori algorithm scans the dataset to calculate
the sub-itemset's support value; this repeated scanning increases
computational time and decreases efficiency. To overcome this low
efficiency of the Apriori algorithm, we introduce a new approach
that takes superset-subset relationships into consideration. An
itemset I is called a maximal frequent itemset if I is frequent
and every superset I′ of I (I ⊆ I′) is infrequent with respect
to the user-defined support value.
Lemma 2: If an itemset I is a frequent itemset, then all the
subsets of I are frequent, based on the user-defined support
value.
For example, if the itemset I = {1,2,3} in the set S = {1,2,3,4}
is frequent, i.e., δ(I) ≥ min_supp, then all the subsets of I,
i.e., {1}, {2}, {3}, {1,2}, {1,3}, and {2,3}, are frequent
itemsets, based on the support value defined by the user. The
Apriori algorithm scans the dataset for all the subsets of I to
get their support values, which takes huge computational time if
the itemset is long. In our GeneticMax approach, if the generated
chromosome is I = {1,2,3} and it satisfies the user-defined
support value, then none of the subsets of I is tested, which
dramatically reduces the computational time for scanning the
dataset.
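This shortcut amounts to a membership test against the stored frequent itemsets; the sketch below, with illustrative names, shows a candidate covered by a stored frequent superset skipping the dataset scan.

    # {1, 2, 3} is already known to be frequent, so by Lemma 2 none of its
    # subsets needs a dataset scan
    FI_Superset_Member = [frozenset({1, 2, 3})]

    def needs_scan(candidate):
        # False when a stored frequent itemset is a superset of the candidate
        c = frozenset(candidate)
        return not any(c <= fi for fi in FI_Superset_Member)

    print(needs_scan({1, 3}))     # False: covered by {1, 2, 3}
    print(needs_scan({1, 3, 4}))  # True: must be scanned and tested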
IV. PRELIMINARIES
In this section, we will introduce some notations and
conceptual diagrams that will be used throughout this paper.
Initially, we describe datasets through bitmaps and bipartite
graphs. Then we graphically represent subsets of items using
lexicographic trees.
A. Bipartite Graph and Bitmap Representation
If U and V are two disjoint sets of vertices and E is the set
of edges connecting vertices of U to vertices of V, then a
bipartite graph can be represented as a triple G = (U, V, E),
where E ⊆ U×V.
In a binary matrix, all elements are either 0 or 1, and the
mapping between binary matrices and transaction datasets is
straightforward. Consider a dataset D consisting of m
transactions, {t1, t2, …, tm-1, tm}, corresponding to rows, and n
items, {i1, i2, …, in-1, in}, corresponding to columns. The
dataset D is an m×n matrix, where each entry is denoted aij. The
value of aij is 1 if transaction ti contains item ij; otherwise,
Fig. 2. Bipartite graph representation of the dataset D.
i1 i2 i3 i4
t1 1 1 1 0
t2 1 1 1 1
t3 1 0 1 1
t4 1 0 1 1
t5 1 1 1 1
Fig. 3. Binary matrix representation of the dataset D.
it is 0. We now map each transaction to a set of items from the
binary matrix. For example, consider a dataset D consisting of the
transactions t1, t2, t3, t4, t5 over the items i1, i2, i3, i4,
where t1 = {i1, i2, i3}, t2 = {i1, i2, i3, i4}, t3 = {i1, i3, i4},
t4 = {i1, i3, i4}, and t5 = {i1, i2, i3, i4}; all the items are
distinct. Fig. 2 and Fig. 3 show the bipartite graph and the
binary matrix of the dataset D.
From Fig. 2, mapping each transaction to its items gives the
following bit strings:
t1 = {1, 1, 1, 0} = 1110
t2 = {1, 1, 1, 1} = 1111
t3 = {1, 0, 1, 1} = 1011
t4 = {1, 0, 1, 1} = 1011
t5 = {1, 1, 1, 1} = 1111
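The mapping from transactions to bit strings is mechanical, as the following sketch shows (assuming the item order i1, …, i4).

    items = ["i1", "i2", "i3", "i4"]
    transactions = [
        {"i1", "i2", "i3"},
        {"i1", "i2", "i3", "i4"},
        {"i1", "i3", "i4"},
        {"i1", "i3", "i4"},
        {"i1", "i2", "i3", "i4"},
    ]

    def to_bits(t):
        return "".join("1" if i in t else "0" for i in items)

    for t in transactions:
        print(to_bits(t))  # 1110, 1111, 1011, 1011, 1111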
B. Lexicographic Tree
Our research problem is to find the maximal frequent itemsets
in large datasets using a GA. Itemset I consists of n items, i.e.
I = {i1, i2, i3, …, in}. Xk represents an itemset containing k
items, where k = 1, 2, …, n and Xk ⊆ I. If k = 1, then Xk
contains a 1-itemset, e.g. Xk = {i1}; if k = 2, then Xk contains
a 2-itemset, e.g. Xk = {i3, i4}; and so on. An itemset is called
frequent, and denoted FI, if its support value satisfies a
user-defined support value. An itemset X is called a maximal
frequent itemset, and denoted MFI, if it is frequent and no
superset of X satisfies the user-defined support value.
In this paper, we consider the search space that includes all
feasible solutions: a lexicographic tree [11, 21] is the search
space for GeneticMax. A lexicographic tree maintains a
lexicographic ordering of the items of I in a dataset D. If item
i occurs before item j in D, then the ordering i ≤L j holds. If
S1 and S2 are two subsets of S with S1 ⊆ S2, then the ordering
S1 ≤L S2 holds; if S1 and S2 are disjoint, there is no
lexicographic ordering relationship between them.
Fig. 4. Lexicographic tree of four items.
Fig. 5. Lexicographic tree of four items based on a user-defined support
value.
Fig. 4 shows an example of a lexicographic tree with
lexicographic ordering over four items. The root of the tree is
the empty set, and each level k contains the k-itemsets. Within
each level, the k-itemsets maintain lexicographic ordering, with
tail nodes containing items lexicographically larger than the
elements of the head node. The support value of a head node is
greater than that of its tail nodes, so the nodes closer to the
root are more frequent than those far from the root. A non-linear
line (called a cut) in the tree separates the frequent itemsets
from the infrequent ones: the nodes above the cut are frequent
itemsets and the nodes below it are infrequent.
For GeneticMax, we introduce a new tree based on user-defined
support values. The cut is defined by a user-defined support
value; the area above it is referred to as the positive area and
the area below it as the negative area. All the nodes in the
positive area are frequent, whereas those in the negative area
are infrequent. In Fig. 5, the nodes within the positive boundary
area have a minimum support value of 30%. GeneticMax maintains an
array to store the frequent itemsets (called FI). Among the
frequent itemsets, a set containing the largest number of items
is a maximal frequent itemset; the maximal frequent itemsets are
stored in another special array called MFI. The algorithm
searches the frequent nodes within the positive area and tries to
converge to a solution, i.e., to find the maximal frequent
itemsets, as early as possible. Fig. 5 illustrates Lemma 1: with
4 items, the tree enumerates 2^4 − 1 = 15 nodes in addition to
the empty root node. With the Apriori algorithm, one would test
all the nodes at each level and generate a candidate set, and
this candidate set generation takes a long time when finding
maximal frequent itemsets. For example, in Fig. 5 Apriori tests
the itemsets {1}, {2}, {3}, {4} at the first level and finds that
all of them are frequent, since these nodes meet the minimum
support value. It then goes to the next level and scans the
dataset to get the support values of {1, 2}, {1, 3}, {1, 4},
{2, 3}, {2, 4}, {3, 4}, and so on. At this level it prunes the
itemsets {1, 4}, {2, 4}, and {3, 4}, since these nodes have
support values less than the user-defined support value. With
GeneticMax, unlike the Apriori algorithm, we do not need to test
all the nodes, which saves a huge amount of time even when the
dataset is very large. In the current example, if the initially
generated itemset is {1, 2, 3}, GeneticMax scans the dataset and
calculates its support value; if the support value of {1, 2, 3}
is ≥ 30%, it stores this itemset in a frequent itemset array
called FI_Superset_Member.
Afterwards, GeneticMax will not scan the dataset for {1}, {2},
{3}, {1, 2}, or {1, 3}, since these itemsets are subsets of the
previously generated itemset {1, 2, 3}. Whenever an itemset such
as {1}, {2}, {3}, or {1, 2} is generated, GeneticMax checks the
array FI_Superset_Member; if it finds a superset there, it
discards the subset, which substantially reduces the time spent
scanning the dataset to calculate the corresponding support
values.
Lemma 3: If Y is a superset of an itemset X, i.e., X ⊆ Y and if
Y is a frequent itemset, then it can be claimed that X is a
frequent itemset.
For example, {1, 2, 3} is a superset of the itemsets {1}, {2},
{3}, {1, 2}, and {1, 3}. GeneticMax uses the principles of a GA
and follows a global search mechanism; therefore, a superset can
be generated before its subsets. In this example, if {1, 2, 3} is
generated before its subsets (and stored in the array
FI_Superset_Member), then all of those subsets, when generated
later, will be discarded.
Lemma 4: If Y is a superset of an itemset X, i.e., X ⊆ Y, and if
X is an infrequent itemset, then it can be claimed that Y is an
infrequent itemset.
For example, if the initially generated chromosome is {1, 4}
and the support value of this itemset is < 30%, then it is stored
in a non-frequent itemset array called NFI. If the next generated
itemset is {1, 3, 4}, the algorithm checks the NFI array;
finding the subset {1, 4} there, GeneticMax discards the itemset
{1, 3, 4} from any future calculations.
Lemma 5: If Z is a superset of itemsets X and Y, i.e., X, Y ⊆ Z,
and Z is an infrequent itemset, then we cannot determine whether
X or Y is infrequent.
Lemma 5 contrasts with Lemma 3. Continuing the previous
example: if {1, 2, 3} is a frequent itemset, then all of its
subsets must be frequent, i.e., {1}, {2}, {3}, {1,2}, {1,3},
{2,3} are all frequent itemsets. But if {1, 2, 3} is an
infrequent itemset, we cannot conclude that all of its subsets
are infrequent. In Fig. 5, {1, 4} is an infrequent itemset, but
its subsets {1} and {4} are frequent itemsets.
Lemma 6: If Z is a subset of itemsets X and Y, i.e. Z ⊆ X, Z ⊆
Y; and if Z is a frequent itemset, then its supersets X and Y
could be either frequent or infrequent itemsets.
For example, itemset {1} in Fig. 5 is frequent. Although its
superset {1, 3} is frequent, its superset {1, 4} is an infrequent
itemset.
The main idea of GeneticMax is to find maximal frequent
itemsets while converging to a solution as fast as possible. It
sub-divides a whole lexicographic tree into two sub-areas
based on a user-defined support value. GeneticMax can generate
any chromosome in either sub-region. If it finds a superset in
the positive boundary area, it follows Lemma 3 and prunes all of
that superset's subsets; if it finds a subset in the negative
boundary area, it follows Lemma 4 and prunes all of that subset's
supersets.
The main advantage of GeneticMax is its ability to converge
quickly to a solution, finding all the supersets in the positive
boundary area that lie closest to the cut as fast as possible. In
the above example, if {1, 2, 3} is generated before all of its
subsets ({1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}) and found to be a
frequent itemset, then those subsets (which are also frequent
itemsets) will be discarded. If the next generated itemset is
{1, 3, 4}, GeneticMax checks the NFI_Subset_Member array and,
finding no subset there, scans the dataset for this itemset,
computes its support value, and stores it in NFI. In the future,
all the supersets of {1, 3, 4} will be discarded.
V. DESCRIPTION OF GENETICMAX
A. Itemsets Mapping to Chromosomes
GeneticMax maps itemsets onto chromosome codes. Each node in
the lexicographic tree represents a different itemset, and every
node in the tree gets a unique chromosome code. The main features
of the chromosome coding are:
1. Support values are easy to calculate, since GeneticMax
uses a bitmap representation of the dataset.
2. All possible nodes can be generated. If there are n items,
the lexicographic tree enumerates 2^n − 1 itemsets or
nodes, and GeneticMax can generate all 2^n − 1 of them in
its lifetime if needed.
The length of a chromosome is fixed: if a dataset contains n
items, then the length of every generated chromosome is n, as
shown in Fig. 6.
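A sketch of this fixed-length encoding for n = 4 items is given below; the function names are ours.

    n = 4

    def encode(itemset):
        # Bit k is 1 exactly when item k belongs to the itemset
        return [1 if k in itemset else 0 for k in range(1, n + 1)]

    def decode(chromosome):
        return {k + 1 for k, bit in enumerate(chromosome) if bit}

    print(encode({1, 2, 3}))     # [1, 1, 1, 0]
    print(decode([1, 0, 1, 1]))  # {1, 3, 4}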
B. Lifetime of GeneticMax
The lifetime of GeneticMax depends on the user's choice of the
number of generations: the higher the number of generations, the
higher the probability of obtaining a correct solution. There is,
however, a threshold number of generations, after which the
solution remains the same.
VI. GENETICMAX FOR EFFICIENT MFI MINING
There are five main requirements in developing an efficient
maximal frequent itemset mining algorithm, and we need a set of
techniques that fulfill these requirements:
1. A dataset is not scanned more than once for a specific
itemset.
2. If X is an itemset in the positive boundary area, X has no
superset there, and X has already been tested, then all the
subsets of X are pruned and treated as invalid candidates.
3. If X is an itemset in the negative boundary area, X has no
subset there, and X has already been tested, then all the
supersets of X are pruned and treated as invalid candidates.
Fig. 6. Mapping items onto a chromosome (Vitem1, Vitem2, …,
Vitemn), where each Vitemk ∈ {0, 1}.
Fig. 7. The number of generations and maximal frequent itemsets.
4. The mining process should be interactive, so that users
can change the threshold to get different sets of MFIs.
5. It gives correct solutions for datasets of different sizes.
The Apriori algorithm and FP-tree do not satisfy requirements
1 to 4. GeneticMax fulfills all of the above requirements.
Genetic algorithms, developed by Holland in 1975, are random
search algorithms that generate populations iteratively [22].
They have been used to find approximate solutions for further
optimization. A GA is a search heuristic based on adaptive
methods that are used to solve search as well as optimization
problems. The algorithm is inspired by natural selection and the
"survival of the fittest" mechanism described by Charles Darwin
in The Origin of Species. The main concept of the
survival-of-the-fittest mechanism is based on fitness values:
only the stronger individuals survive in a competitive
environment. A GA simulates the processes in natural populations
that are essential for evolution, and a large number of
researchers have worked on GAs [23–26]. In nature, individuals
compete for food, shelter, water, and so on, and even members of
the same species often compete to attract partners. Individuals
that are successful in surviving and attracting partners are
referred to as strong and produce a large number of offspring,
while poorly performing individuals are referred to as weak and
have a lower probability of producing offspring. "Superfit"
offspring can be produced by combining good attributes from
different parents; that is, the fitness of the offspring is
higher than that of the parents. In this fashion, species become
better and better suited to their environment.
The GA plays a vital role in this study. Evolutionary
techniques are robust and can be used to solve a wide range of
problems, including those that are difficult to solve by other
methods. It is well known that a GA cannot guarantee an optimum
solution to a problem, but it can find "acceptably good"
solutions "quickly". Methods that already work well for a
particular problem can often be improved by hybridizing them with
a GA.
A traditional GA generates an initial population and then
computes the fitness value of each individual. Two individuals
are selected from the old generation, and crossover and mutation
operators are applied to produce two offspring. The survivors
with the best fitness values are inserted into the new
generation. When the population has converged to a solution, the
algorithm terminates. The fitness function provides the fitness
value of an offspring, which characterizes that offspring.
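A generic sketch of this loop is given below; the tournament-style selection, one-point crossover, and parameter values are illustrative choices, not the specific operators of GeneticMax.

    import random

    def traditional_ga(fitness, n_bits, pop_size=20, generations=100, p_mut=0.01):
        pop = [[random.randint(0, 1) for _ in range(n_bits)]
               for _ in range(pop_size)]
        for _ in range(generations):
            # Select two parents from the old generation (fitness-biased)
            parents = sorted(random.sample(pop, 4), key=fitness)[-2:]
            # One-point crossover produces two offspring
            cut = random.randrange(1, n_bits)
            kids = [parents[0][:cut] + parents[1][cut:],
                    parents[1][:cut] + parents[0][cut:]]
            # Mutation flips each bit with a small probability
            for kid in kids:
                for i in range(n_bits):
                    if random.random() < p_mut:
                        kid[i] ^= 1
            # Survivors with the best fitness enter the new generation
            pop = sorted(pop + kids, key=fitness)[-pop_size:]
        return max(pop, key=fitness)

    best = traditional_ga(sum, n_bits=10)  # toy fitness: number of 1 bits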
The main aim of using a GA for this problem is to reach good
results by discarding bad solutions during the generation of
populations [27]. The basic steps of the GeneticMax algorithm are
as follows.
A. Procedures of GeneticMax
There are eight steps in the GeneticMax algorithm; a Python
sketch of the loop follows the list.
1. Set the number of generations.
2. Generate a population.
3. Check the FI_Superset_Member and NFI_Subset_Member
arrays for supersets and subsets of the generated
chromosome.
4. If a superset is found in FI_Superset_Member, or a
subset in NFI_Subset_Member, go to Step 2.
5. Compute the fitness value of each individual according to
its support value in dataset D.
6. Perform FI_Member_Add; if any frequent itemsets are
found, update FI_Superset_Member.
7. Perform NFI_Member_Add; if any infrequent itemsets are
found, update NFI_Subset_Member.
8. Go to Step 3 with a newly generated chromosome until the
number of generations set in Step 1 is exceeded.
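The following self-contained sketch renders these eight steps, with a uniformly random chromosome generator standing in for the GA's selection, crossover, and mutation operators; apart from the two array names taken from the text, all identifiers are illustrative.

    import random

    def support(X, D):
        return sum(1 for t in D if X <= t) / len(D)

    def genetic_max(D, n_items, min_supp, generations):
        FI_Superset_Member, NFI_Subset_Member = [], []
        for _ in range(generations):                        # Steps 1 and 8
            c = frozenset(k for k in range(1, n_items + 1)  # Step 2
                          if random.random() < 0.5)
            if not c:
                continue
            if any(c <= fi for fi in FI_Superset_Member):   # Steps 3 and 4
                continue
            if any(nfi <= c for nfi in NFI_Subset_Member):
                continue
            if support(c, D) >= min_supp:                   # Step 5
                # Step 6: drop stored subsets of c, then add c
                FI_Superset_Member[:] = [s for s in FI_Superset_Member
                                         if not s <= c]
                FI_Superset_Member.append(c)
            else:
                # Step 7: drop stored supersets of c, then add c
                NFI_Subset_Member[:] = [s for s in NFI_Subset_Member
                                        if not c <= s]
                NFI_Subset_Member.append(c)
        return FI_Superset_Member  # the maximal frequent itemsets found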
B. Mining the Superset in a Positive Boundary Area
For an itemset X, if there is any subset of X in
FI_Superset_Member, then this method (Fig. 8) is called to
replace that subset with its superset X. The method also applies
when X is a new frequent itemset with no subset in
FI_Superset_Member.
C. Mining the Subset in Negative Boundary Area
For an itemset X, if there is any superset of X in
NFI_Subset_Member, then this method (Fig. 9) is called to
replace that superset with its subset X. The method also applies
when X is a new infrequent itemset with no superset in
NFI_Subset_Member.
D. GeneticMax Pruning Methods
The Check_Member_for_Item function (Fig. 10)
incorporates three techniques:
1) Superset Checking Technique
This checks whether a given chromosome has a superset in the
positive boundary area; if such a superset exists, the chromosome
is pruned.
2) Subset Checking Technique
This checks whether a given chromosome has a subset in the
negative boundary area; if such a subset exists, the chromosome
is pruned.
3) Unchecked Itemset Checking Technique
If an itemset has neither a superset in the positive boundary
area nor a subset in the negative boundary area, it is referred
to as an "unchecked" itemset and needs to be tested. For such an
itemset, GeneticMax scans the dataset and places the itemset in
FI_Superset_Member or NFI_Subset_Member according to the
user-defined support value.
//Invocation: FI_Member_Add(IF, FI_Superset_Member)
1. If any subset of IF is in FI_Superset_Member
2.     Delete that subset of IF
3.     Add IF to FI_Superset_Member
4. Else add IF to FI_Superset_Member
Fig. 8. The FI_Member_Add function.
//Invocation: NFI_Member_Add(I1F, NFI_Subset_Member)
1. If any superset of I1F is in NFI_Subset_Member
2.     Delete that superset of I1F
3.     Add I1F to NFI_Subset_Member
4. Else add I1F to NFI_Subset_Member
Fig. 9. The NFI_Member_Add function.
//Invocation: Check_Member_for_Item(I,
FI_Superset_Member, NFI_Subset_Member)
1. If any superset of I is in FI_Superset_Member
2. Discard I
3. Else if any subset of I is in
NFI_Subset_Member
4. Discard I
5. Else scan the database to calculate support
value for I
6. If support value ≥ user-defined support
value
7. Invoke FI_Member_Add
8. Else invoke NFI_Member_Add
Fig. 10. The Check_Member_for_Item function.
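Possible Python renderings of Figs. 8 to 10, treating itemsets as frozensets and the member arrays as lists, are sketched below; support() is the δ(X) of the Introduction.

    def support(X, D):
        return sum(1 for t in D if X <= t) / len(D)

    def FI_Member_Add(I_F, FI_Superset_Member):
        # Fig. 8: drop stored subsets of the new frequent itemset, then add it
        FI_Superset_Member[:] = [s for s in FI_Superset_Member if not s <= I_F]
        FI_Superset_Member.append(I_F)

    def NFI_Member_Add(I1F, NFI_Subset_Member):
        # Fig. 9: drop stored supersets of the new infrequent itemset, then add it
        NFI_Subset_Member[:] = [s for s in NFI_Subset_Member if not I1F <= s]
        NFI_Subset_Member.append(I1F)

    def Check_Member_for_Item(I, D, min_supp,
                              FI_Superset_Member, NFI_Subset_Member):
        # Fig. 10: superset checking, subset checking, then the unchecked case
        if any(I <= s for s in FI_Superset_Member):
            return  # discard I: a frequent superset is already stored
        if any(s <= I for s in NFI_Subset_Member):
            return  # discard I: an infrequent subset is already stored
        if support(I, D) >= min_supp:
            FI_Member_Add(I, FI_Superset_Member)
        else:
            NFI_Member_Add(I, NFI_Subset_Member)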
VII. EXPERIMENTAL RESULTS
The experiments were performed on an Intel(R) Core i5-3210M
CPU @ 2.50 GHz with 4 GB of RAM, running Windows 7 Enterprise.
Microsoft Visual Studio 2012 was used to compile the GeneticMax
code. Three datasets, Tic-Tac-Toe, 10000×8, and Zoo, were used to
test GeneticMax. Different support values were applied to these
datasets to check how many nodes were tested and how many
chromosomes were generated to obtain the exact set of maximal
frequent itemsets, the run times, and so on. Here, run time is
the total execution time. GeneticMax embeds two main features:
i) the superset-subset relationships in both the positive and
negative boundaries of a lexicographic tree, used for pruning
invalid chromosomes, and ii) the use of a GA, which provides a
global search mechanism. The purpose of this new approach is to
converge to a solution as fast as possible. A full experiment
with GeneticMax on these datasets was conducted, demonstrating
GeneticMax's ability to yield solutions rapidly by accessing the
datasets for only a small number of nodes of the lexicographic
tree.
As we saw in the previous discussion, the Apriori algorithm
tests all of the nodes at each level and prunes those that do not
satisfy the minimum support value. In GeneticMax, if a generated
chromosome X at any level satisfies the minimum support value,
then all the subsets of X at any level are automatically pruned;
this dramatically reduces the time spent accessing a large
dataset. The converse also holds: if GeneticMax generates a
chromosome Y at any level that does not satisfy the minimum
support value, then all the supersets of Y at any level are
automatically pruned.
We tested the algorithm on the Tic-Tac-Toe, Zoo, and 10000×8
datasets, taken from the University of California, Irvine (UCI)
machine learning repository
(http://archive.ics.uci.edu/ml/datasets.html). The experimental
results in Table 1 show the effect of increasing the number of
generations. For the 10000×8 dataset, generation 100 produced 9
frequent itemsets, whereas generation 140 produced 8. Comparing
these two generations, we conclude that generation 100 had still
not found some frequent itemsets; when we increased the number of
generations to 140, GeneticMax found itemsets that generation 100
had missed. Generation 150 gave the same result as generation
140, so users can take 140 generations as a threshold value for
the 10000×8 dataset. The same holds for Tic-Tac-Toe: generations
1200 and 1300 gave the same result, which contains the maximal
frequent itemsets, so for Tic-Tac-Toe 1200 generations can be
used as a threshold value.
Table 2 compares the number of nodes in the lexicographic tree
with the number of nodes tested to obtain the maximal frequent
itemsets. For 10000×8, there are 255 itemsets, and GeneticMax
accessed the main dataset for only 39 of them to find the maximal
frequent itemsets. Since GeneticMax uses the principles of a
genetic algorithm and prunes invalid chromosomes based on
superset-subset relationships, it dramatically reduces the number
of itemsets whose support values must be counted against the
dataset. The advantage of these principles is shown in Table 2:
255 − 39 = 216 nodes of the 10000×8 tree did not need their
support values computed from the dataset, and only 39 were
examined to reach the final solution. For TicTacToe, only 114
nodes were examined to get the final solution (the other 397
nodes were not required).
As Fig. 11 shows, the runtime of GeneticMax increases with the
number of generations of chromosomes. A lower support value,
which generates more frequent itemsets, needs a higher runtime,
whereas a higher support value, which generates fewer frequent
itemsets, needs less computational time, as shown in Fig. 12.
VIII. CONCLUSION AND SUMMARY
In this paper, we proposed GeneticMax, a new GA-based approach
to mining maximal frequent itemsets efficiently. We conducted
thorough experiments on different real datasets, and the results
demonstrated several advantages of our algorithm.
 It accesses a large dataset for only a small number of nodes
when calculating the support values needed to find maximal
frequent itemsets.
 It shows the power of using an evolutionary algorithm to
generate frequent itemsets from a lexicographic tree. The
whole dataset is projected onto a lexicographic tree based on
a user-defined support value.
Fig. 11. Run time versus generation for Tic-Tac-Toe.
Fig. 12. Run time of GeneticMax using different support values.
TABLE I. THE EXPERIMENTAL RESULTS OF GENETICMAX FOR TWO DIFFERENT DATASETS

Database   Records  Items  Support (%)  Generation  Frequent Itemsets  Time (s)  Remarks
10000×8    10000    8      20           100         9                  10.22
                                        140         8                  21.67     This generation contains the MFIs
                                        150         8                  25.10
TicTacToe  958      9      16           100         6                  10.13
                                        250         7                  17.53
                                        500         10                 43.83
                                        1100        23                 78.20
                                        1200        24                 95.60     Generations 1200 and 1300 provide the same result
                                        1300        24                 115.66
TABLE II. RESULTS SHOWING THE NUMBER OF TIMES THE DATABASE IS ACCESSED BY GENETICMAX

Database   Items  Support (%)  No. of nodes in the lexicographic tree (2^items − 1)  No. of nodes tested for getting MFI
10000×8    8      20           255                                                   39
TicTacToe  9      16           511                                                   114
Zoo        17     50           131071                                                361
 The experimental analysis of GeneticMax shows the effect of
the generations of chromosomes and of pruning all the subsets
and supersets in the positive and negative boundary areas,
which dramatically reduces the search space and the cost of
counting itemset support values.
 The above advantages increase the scalability of the
algorithm.
We have implemented the GeneticMax algorithm and studied its
performance. The performance study showed that the algorithm
mines patterns of different sizes from real datasets efficiently
and performs better than other candidate-generation and
evolutionary-based algorithms.
REFERENCES
[1] R. Agrawal and R. Srikant, “Fast algorithms for mining association
rules,” in 20th International Conference on Very Large Data Bases,
1994, pp. 487–499.
[2] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules
between sets of items in large databases,” Proc. 1993 ACM
SIGMOD Int. Conf. Manag. Data - SIGMOD ’93, May 1993, pp.
207–216.
[3] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, “A tree
projection algorithm for generation of frequent itemsets,” Parallel
Distrib. Comput. Spec. Issue High Perform. Data Min., vol. 61, no.
3, pp. 350–371, 2001.
[4] J. J. Cameron and C. K. Leung, “Mining frequent patterns from
precise and uncertain data,” Comput. Syst., vol. 1, no. 1, pp. 3–22,
2011.
[5] M. M. J. Kabir, S. Xu, B. H. O. Kang, and Z. Zhao, “Association
rule mining for both frequent and infrequent items using particle
swarm optimization algorithm,” Int. J. Comput. Sci. Eng., vol. 6, no.
7, pp. 221–231, 2014.
[6] J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent pattern mining:
current status and future directions,” Data Min. Knowl. Discov., vol.
15, no. 1, pp. 55–86, Jan. 2007.
[7] Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai, “Generating
semantic annotations for frequent patterns with context analysis,”
Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. -
KDD ’06, 2006, pp. 337–346.
[8] D.-I. Lin and Z. M. Kedem, “Pincer-Search: a new algorithm for
discovering the maximal frequent set,” in Proc. 6th International
Conference on Extending Database Technology, 1998, pp. 103–119.
[9] R. J. Bayardo, “Efficiently mining long patterns from databases,”
ACM SIGMOD, pp. 85–93, 1998.
[10] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, “Depth first
generation of long patterns,” Proc. Sixth ACM SIGKDD Int. Conf.
Knowl. Discov. Data Min. - KDD ’00, 2000, pp. 108–118.
[11] D. Burdick, M. Calimlim, and J. Gehrke, “MAFIA: a maximal
frequent itemset algorithm for transactional databases,” Proc. 17th
Int. Conf. Data Eng., 2001, pp. 443–452.
[12] K. Gouda and M. J. Zaki, “GenMax: an efficient algorithm for
mining maximal frequent itemsets,” Data Min. Knowl. Discov., vol.
11, no. 3, pp. 223–242, 2005.
[13] B. Alataş and E. Akin, “An efficient genetic algorithm for
automated mining of both positive and negative quantitative
association rules,” Soft Comput., vol. 10, no. 3, pp. 230–237, Apr.
2005.
[14] H. Lu, R. Setiono, and H. Liu, “Effective data mining using neural
networks,” IEEE Trans. Knowl. Data Eng., vol. 8, no. 6, pp. 957–
961, 1996.
[15] S. Ghosh, A. Nag, D. Biswas, J. P. Singh, S. Biswas, D. Sarkar, and
P. P. Sarkar, “Weather data mining using artificial neural network,”
2011 IEEE Recent Adv. Intell. Comput. Syst., pp. 192–195, Sep.
2011.
[16] W. Dou, J. Hu, K. Hirasawa, and G. Wu, “Quick response data
mining model using genetic algorithm,” 2008 SICE Annu. Conf.,
Aug. 2008, pp. 1214–1219.
[17] A. Salleb-Aouissi, C. Vrain, C. Nortet, X. Kong, and D. Cassard,
“QuantMiner for mining quantitative association rules,” J. Mach.
Learn. Res., vol. 14, no. 1, pp. 3153–3157, 2013.
[18] A. Salleb-Aouissi, C. Vrain, and C. Nortet, “QuantMiner: a genetic
algorithm for mining quantitative association rules,” in Proc. 20th
International Joint Conference on Artificial Intelligence, 2007, pp.
1035–1040.
[19] R. J. Kuo and C. W. Shih, “Association rule mining through the ant
colony system for national health insurance research database in
Taiwan,” Comput. Math. with Appl., vol. 54, no. 11–12, pp. 1303–
1318, Dec. 2007.
[20] R. J. Kuo, C. M. Chao, and Y. T. Chiu, “Application of particle
swarm optimization to association rule mining,” Appl. Soft Comput.,
vol. 11, no. 1, pp. 326–336, 2011.
[21] J. P. Huang, C. T. Yang, and C. H. Fu, “A Genetic Algorithm based
searching of maximal frequent itemsets,” in International
Conference on Artificial Intelligence, 2004, pp. 548–554.
[22] J. H. Holland, Adaptation in Natural and Artificial Systems. Ann
Arbor: University of Michigan Press, 1975.
[23] D. Beasley, D. R. Bull, and R. R. Martin, “An overview of genetic
algorithms: Part 1, fundamentals,” Univ. Comput., vol. 15, no. 2,
pp. 58–69, 1993.
[24] M. Srinivas and L. M. Patnaik, “Genetic Algorithms: a survey,”
Computer (Long. Beach. Calif)., vol. 27, no. 6, pp. 17–26, 1994.
[25] D. E. Goldberg, Genetic Algorithms in Search, Optimization and
Machine Learning. Addison-Wesley Longman Publishing Co., Inc.
Boston, MA, USA, 1989.
[26] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution
Programs. Berlin: Springer, 1992.
[27] A. A. Freitas, “A survey of evolutionary algorithms for data mining
and knowledge discovery,” in Advances in Evolutionary Computing,
Springer Berlin-Heidelberg, 2003, pp. 819–845.
JulieBinwag
 
Machine Learning Benefits Across Industries
SynapseIndia
 
Apache CloudStack 201: Let's Design & Build an IaaS Cloud
ShapeBlue
 
GITLAB-CICD_For_Professionals_KodeKloud.pdf
deepaktyagi0048
 
Building and Operating a Private Cloud with CloudStack and LINBIT CloudStack ...
ShapeBlue
 
CIFDAQ'S Token Spotlight for 16th July 2025 - ALGORAND
CIFDAQ
 
The Yotta x CloudStack Advantage: Scalable, India-First Cloud
ShapeBlue
 
python advanced data structure dictionary with examples python advanced data ...
sprasanna11
 
CIFDAQ Market Insight for 14th July 2025
CIFDAQ
 
CloudStack GPU Integration - Rohit Yadav
ShapeBlue
 
Ampere Offers Energy-Efficient Future For AI And Cloud
ShapeBlue
 
visibel.ai Company Profile – Real-Time AI Solution for CCTV
visibelaiproject
 
2025-07-15 EMEA Volledig Inzicht Dutch Webinar
ThousandEyes
 

GeneticMax: An Efficient Approach to Mining Maximal Frequent Itemsets Based on Genetic Algorithms

milk.
Such information can then be used for cross-selling and up-selling, sales promotions, store design, and discount plans. Similarly, a video shop manager can use frequent pattern mining and/or association rule mining to recommend related videos or games when a customer has hired or bought a specific video or game, to attract the customer to return. Web administrators can use frequent pattern or association rule mining to understand which collections of web pages are viewed together by a group of web users. This sort of interesting relationship (e.g., correlation, association) among data items helps managers make relevant decisions [5].

Let D = {t1, t2, t3, …, ti, …, tn} be a dataset, where t1, t2, t3, …, tn are transactions; there are n transactions in total. Each transaction ti contains a subset of the items I = {i1, i2, i3, …, ik, …, im}, where i1 is item number 1, i2 is item number 2, and so on. Transaction ti is represented as a binary vector: ti[k] = 1 means that ti bought item ik; otherwise ti[k] = 0. Let X be a subset of the items in I, i.e., X ⊆ I. For a transaction ti, the indicator ti(X) is 1 if every item of the examined itemset X appears in ti, and 0 otherwise. The support value of an itemset is the proportion of transactions in which the itemset appears as a subset, and is given by

δ(X) = (t1(X) + t2(X) + t3(X) + … + tn-1(X) + tn(X)) / |D|

An itemset with one item is called a 1-itemset, and an itemset with k items is called a k-itemset. An itemset is called frequent if its support value is greater than or equal to a user-defined threshold value, denoted min_supp (minimum support), i.e., δ(X) ≥ min_supp, where δ(X) is the support value of the examined itemset.
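To make the support computation concrete, the following minimal Python sketch (our illustration only; the function and variable names are not from the original implementation) counts the fraction of transactions that contain a candidate itemset X:

def support(X, transactions):
    # X <= t is Python's subset test, i.e., X ⊆ t.
    hits = sum(1 for t in transactions if X <= t)
    return hits / len(transactions)

# A toy dataset: five transactions over items 1..4.
D = [{1, 2, 3}, {1, 2, 3, 4}, {1, 3, 4}, {1, 3, 4}, {1, 2, 3, 4}]
print(support({1, 3}, D))  # 1.0: {1, 3} appears in every transaction
print(support({2, 4}, D))  # 0.4: frequent only if min_supp <= 0.4

With min_supp = 0.3, for example, δ({2, 4}) = 0.4 ≥ 0.3, so {2, 4} would count as frequent in this toy dataset.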
We denote the frequent itemsets by FI. If an itemset X is frequent and no superset of X is frequent, then X is a maximal frequent itemset; we denote the set of all maximal frequent itemsets by MFI [6]. Generating the MFIs from a large dataset is time consuming. In this paper, we present a novel approach to finding MFIs from large datasets by using a GA. The length of an FI depends on the relationships among the itemsets. A GA-based approach that performs global searches has several advantages: its time complexity is significantly less than that of many other algorithms, and it is able to generate frequent itemsets from a large dataset. This work differs from the existing research in the following aspects:

1. Our GeneticMax uses a lexicographic tree as a search tree, and it does not need to enumerate frequent itemsets level by level.

2. This approach uses the principles of a GA, which randomly generates chromosomes. If a generated chromosome is frequent, then all the subsets of this chromosome are automatically pruned; if a generated chromosome is infrequent, then all the supersets of this chromosome are automatically pruned. This technique dramatically reduces the time spent accessing a large dataset to calculate the support values of unnecessary chromosomes.

II. RELATED WORK

Frequent pattern mining has been one of the most challenging and intensively studied areas of data mining research for over a decade. A large body of literature has been dedicated to this area, and significant progress has been made as a result, including progress in sequential pattern mining, correlation and structured pattern mining, and the design of scalable and efficient algorithms for mining frequent itemsets [6]. This research has expanded the scope of data analysis and has had a profound impact on the methodologies and applications of data mining. Nevertheless, several challenging issues still need to be resolved.

Scalable mining methods, a focused topic in frequent pattern mining research, have been studied extensively. Current mining methods derive sets of frequent patterns that are too large to use effectively. To reduce these huge sets, researchers have proposed several condensed representations, such as maximal patterns, representative patterns, closed patterns, and condensed patterns. However, for a specific application it is still unclear which pattern set provides the best compactness and quality of representation. Much investigation is needed to reduce the size of the derived pattern sets while preserving the quality of the patterns.

Some applications prefer approximate frequent patterns, even though current studies show that efficient methods are available for mining a complete and explicit set of frequent patterns. In bioinformatics, for example, one could be interested in searching for long sequence patterns in DNA analysis that match biological entities.
Much investigation is needed to design efficient methods that make this kind of mining more competent than the tools presently available in bioinformatics.

Classification is another major task in data mining. Classification using frequent patterns asks which frequent patterns are more adequate than others. In the future, researchers should design methods that mine effective frequent patterns directly from data [6].

Some applications require an in-depth understanding of patterns and an interpretation of those patterns. Most researchers have focused on discovering frequent patterns but have given less attention to analysing and interpreting them. The semantic analysis of a pattern includes the meaning of that pattern, the typical transactions it covers, and so on. Finding the reasons behind the frequency of a specific pattern is termed contextual analysis of a frequent pattern; for example, a pattern could be frequent only for a specific time duration, location, or weather condition. To improve the interpretability, effectiveness, and usability of frequent patterns, a deep understanding of them is necessary [7].

It is well known that the Apriori algorithm generates a candidate set and tests it in a breadth-first manner. It discovers all the frequent itemsets at level k before moving to the next level (k+1). It counts the support value of each node at level k and prunes those nodes whose support values do not satisfy the user-defined support value. It generates candidate itemsets at each level and scans the dataset so frequently that it becomes costly, especially when a long pattern exists [1].

The Pincer-search algorithm [8] traverses a lattice through a bi-directional method that follows both top-down and bottom-up approaches. To find the maximal frequent itemsets, pruning is applied based on the following rules: (1) all the subsets of frequent itemsets are pruned, and (2) all the supersets of infrequent itemsets are pruned.

The breadth-first traversal strategy (a level-by-level search of the search space) was used in the MaxMiner search algorithm. To prune the branches of the tree, MaxMiner employs a look-ahead method. It uses the breadth-first approach to limit the number of passes over the dataset, but look-ahead, which involves superset pruning, works better with depth-first search methods [9].

DepthProject performs a depth-first traversal of a lexicographic tree along with variations of superset pruning. To order child nodes, it applies dynamic reordering methods, and it reduces the size of the search space by trimming infrequent items out of each node's tail. To eliminate non-maximal frequent itemsets, DepthProject requires post-pruning methods [10].

MAFIA, proposed by Burdick, Calimlim, and Gehrke [11], extends the idea of DepthProject. Similar to DepthProject,
MAFIA also uses a vertical bitmap representation, where the support count of an itemset is obtained through bitwise AND operations among the itemsets' bitvectors. An example with four items per data tuple is given in Fig. 1. The bitvectors for the 1-itemsets A, B, C, and D of Fig. 1 are 10111, 01001, 11110, and 11011, respectively.

Fig. 1. Vertical bitmap representation:

      A  B  C  D
  t1  1  0  1  1
  t2  0  1  1  1
  t3  1  0  1  0
  t4  1  0  1  1
  t5  1  1  0  1

To obtain the support count of an itemset, a bitwise AND (&) operation is applied between the bitvectors of its items. In the example above, the bitwise AND of the bitvectors of A and C is 10111 & 11110 = 10110. The support count of an itemset is the number of 1s in its bitvector, so the support count of the 2-itemset {A, C} is 3. ANDing the bitvector of D with the previous result for {A, C} gives 10110 & 11011 = 10010, so the support count of the 3-itemset {A, C, D} is 2.

The search strategy of MAFIA integrates a depth-first traversal of the tree to find maximal frequent itemsets with an effective pruning methodology. The look-ahead pruning methodology first used by MaxMiner is also used by MAFIA. The last checking method of MAFIA is easy to test: without counting A ∪ C, it allows us to conclude that {A, C} is frequent. This technique is defined as Parent Equivalence Pruning in [11].
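MAFIA's counting style is easy to reproduce. The sketch below (our own illustration, not MAFIA's code) stores each item's bitvector as a Python integer, one bit per transaction, and intersects them with bitwise AND:

# Item bitvectors from Fig. 1; the most significant bit is transaction t1,
# so the literals read exactly like the text (A = 10111, etc.).
bitvectors = {
    "A": 0b10111,
    "B": 0b01001,
    "C": 0b11110,
    "D": 0b11011,
}

def support_count(itemset, n_transactions=5):
    vec = (1 << n_transactions) - 1  # all-ones mask, the AND identity
    for item in itemset:
        vec &= bitvectors[item]      # bitwise AND of the item bitvectors
    return bin(vec).count("1")       # support count = surviving 1 bits

print(support_count({"A", "C"}))       # 3  (10111 & 11110 = 10110)
print(support_count({"A", "C", "D"}))  # 2  (10110 & 11011 = 10010)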
Gouda and Zaki proposed a novel approach called GenMax to find maximal itemsets [12]. Their approach uses a technique called progressive focusing, which maintains local maximal frequent itemsets (LMFIs) that are compared with newly found frequent itemsets (FIs). Non-maximal frequent itemsets are identified through this step, which decreases the amount of subset testing. GenMax uses a vertical representation of the dataset and stores a transaction identifier set (TIS) for each itemset instead of a bitvector; the support value of an itemset is the cardinality of its TIS. Based on experimental results, the authors concluded that the algorithm performs better than existing algorithms on different types of datasets.

Alataş and Akin designed an efficient genetic algorithm as a search strategy to mine both positive and negative quantitative association rules [13]. Association rules are deduced from frequent patterns; unlike other methods, their approach mines association rules without generating frequent itemsets. The proposed genetic algorithm does not depend on minimum support and confidence values, which are hard to define for a dataset. A new genetic operator named the uniform operator is used in this approach to ensure genetic diversity.

Another interesting problem in data mining is classification, in which itemsets of different lengths are classified into different groups based on their frequencies. Concise symbolic rules with high accuracy can be mined using neural networks: the network is first trained to the required accuracy, a network pruning algorithm removes redundant connections, and classification rules are then generated from an analysis of the activation values of the hidden layers. Researchers noticed that the main drawback of using neural networks in different data mining test problems is the training time: although a neural network provides a lower classification error rate than decision trees, it requires a long training time [14, 15].

A quick-response data mining model based on a genetic algorithm was designed in [16]. This approach gives the user more flexibility in mining frequent itemsets. Long frequent itemsets are generated because of the high degree of relationship among data tuples; if the Apriori algorithm were used to mine frequent itemsets from these tuples, it could take a huge amount of time due to frequent dataset access and the large number of candidate itemsets generated. This approach avoids considering huge candidate itemsets and only scans the dataset for those frequent itemsets in which users are most interested. The system uses a GA to mine itemsets and then shows them to the user; if the user is interested, it scans the dataset to get the support values of those itemsets.

To mine quantitative association rules, a GA-based algorithm named QuantMiner was proposed in [17, 18]. By optimizing the support and confidence values, this system dynamically identifies good intervals in association rules. The researchers applied the algorithm to different datasets and showed its usefulness as a data mining tool.

R. J. Kuo and C. W. Shih used a meta-heuristic technique, the ant colony system [19], to mine large databases efficiently for association rules. Multi-dimensional constraints, as well as user-assigned constraints, are considered in this approach. The results showed that it produced more condensed rules than the Apriori algorithm, with less computational time. Although the system provides promising results, some issues remain to be resolved: analysis of the results revealed that many similar rules were generated, so a fuzzy approach was applied to merge similar rules into one class.

To mine association rules, most researchers have focused on improving computational efficiency. To determine the threshold values of support and confidence, which are the key factors in the association rule mining task, researchers proposed a new approach based on the particle swarm optimization technique. Suitable fitness values and the corresponding support and confidence values of the identified swarms are found through this approach. The results showed that the particle swarm optimization algorithm quickly finds suitable threshold fitness values of itemsets and thereby obtains quality rules. Users can mine specific rules from a large database by setting a support or fitness value. Since this technique is free from the support constraint, its main problem is that users have no control over the mining process. Moreover, the reported results covered only two- or three-dimensional rules, rather than the higher-dimensional rules that could interest industry policy makers [20].
III. THE IDEA OF FAST RESPONSE

If the data tuples contain long itemsets, huge numbers of candidate itemsets will be generated, which reduces the efficiency of a solution. A long itemset enumerates a combinatorial number of shorter, frequent sub-itemsets. For example, a data tuple containing 50 items, {i1, i2, i3, …, i50}, enumerates C(50, 1) frequent 1-itemsets (i1, i2, …, i50), C(50, 2) frequent 2-itemsets ((i1, i2), (i1, i3), …, (i1, i50), (i2, i3), (i2, i4), …, (i2, i50)), and so on.

Lemma 1: If the length of an itemset is n, then it enumerates 2^n − 1 frequent sub-itemsets.

This can become too much for a computer to compute and store if the itemset is long. For each sub-itemset, the Apriori algorithm scans the dataset and calculates the support value of that itemset, which increases the computational time and decreases efficiency. To overcome this low efficiency of the Apriori algorithm, we introduce a new approach that takes the superset-subset relationships into consideration. An itemset I is called a maximal frequent itemset if I is frequent and every super-itemset Í of I (I ⊆ Í) is infrequent based on the user-defined support value.

Lemma 2: If an itemset I is a frequent itemset, then all the subsets of I are frequent, based on the user-defined support value.

For example, if the itemset I = {1, 2, 3} in the set S = {1, 2, 3, 4} is frequent, i.e., δ(I) ≥ min_supp, then all the subsets of I, i.e., {1}, {2}, {3}, {1, 2}, {1, 3}, and {2, 3}, are frequent itemsets based on the user-defined support value. The Apriori algorithm scans the dataset for all the subsets of an itemset to get their support values, which takes a huge amount of computational time when the itemset is long. In our GeneticMax approach, if the generated chromosome is I = {1, 2, 3} and it satisfies the user-defined support value, then none of the subsets of I will be tested, which dramatically reduces the computational time spent scanning the dataset.

IV. PRELIMINARIES

In this section, we introduce some notations and conceptual diagrams that will be used throughout this paper. We first describe datasets through bitmaps and bipartite graphs, and then graphically represent subsets of items using lexicographic trees.

A. Bipartite Graph and Bitmap Representation

If U and V are two disjoint sets of vertices and E is the set of edges connecting the vertices of U and V, then a bipartite graph can be represented as a triple G = (U, V, E), where E ⊆ U×V. In a binary matrix, all elements are either 0 or 1, and the mapping between binary matrices and transaction datasets is straightforward. Consider a dataset D consisting of m transactions, {t1, t2, …, tm-1, tm}, corresponding to rows, and n items, {i1, i2, …, in-1, in}, corresponding to columns. The dataset D is then an m×n matrix whose entries are denoted aij; the value of aij is 1 if transaction ti contains item ij, and 0 otherwise.

Fig. 2. Bipartite graph representation of the dataset D (transactions t1–t5 on one side, items i1–i4 on the other).

Fig. 3. Binary matrix representation of the dataset D:

      i1  i2  i3  i4
  t1   1   1   1   0
  t2   1   1   1   1
  t3   1   0   1   1
  t4   1   0   1   1
  t5   1   1   1   1

Each transaction can now be mapped to a set of items from the binary matrix.
For example, consider a dataset D consisting of the transactions t1, t2, t3, t4, t5 over the items i1, i2, i3, i4, where t1 = {i1, i2, i3}, t2 = {i1, i2, i3, i4}, t3 = {i1, i3, i4}, t4 = {i1, i3, i4}, and t5 = {i1, i2, i3, i4}; all items are distinct. (The original listing omitted t4; its contents follow from row t4 of the binary matrix in Fig. 3.) Fig. 2 and Fig. 3 show the bipartite graph and the binary matrix of the dataset D. From Fig. 2, mapping each transaction by items gives the following bit strings:

t1 = {1, 1, 1, 0} = 1110
t2 = {1, 1, 1, 1} = 1111
t3 = {1, 0, 1, 1} = 1011
t4 = {1, 0, 1, 1} = 1011
t5 = {1, 1, 1, 1} = 1111
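This row-to-bit-string mapping is mechanical, as the following small Python sketch shows (illustrative names of our own choosing):

def to_bitstring(transaction, n_items):
    # Item k contributes '1' when it occurs in the transaction.
    return "".join("1" if k in transaction else "0"
                   for k in range(1, n_items + 1))

D = [{1, 2, 3}, {1, 2, 3, 4}, {1, 3, 4}, {1, 3, 4}, {1, 2, 3, 4}]
for i, t in enumerate(D, start=1):
    print(f"t{i} = {to_bitstring(t, 4)}")
# Prints t1 = 1110, t2 = 1111, t3 = 1011, t4 = 1011, t5 = 1111,
# matching the list above.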
B. Lexicographic Tree

Our research problem is to find the maximal frequent itemsets from large datasets using a GA. An itemset I consists of n items, i.e., I = {i1, i2, i3, …, in}. Xk represents an itemset containing k items, where k = 1, 2, …, n and Xk ⊆ I. If k = 1, then Xk contains a 1-itemset, e.g., Xk = {i1}; if k = 2, then Xk contains a 2-itemset, e.g., Xk = {i3, i4}; and so on. An itemset is called frequent (FI) if its support value satisfies the user-defined support value. An itemset X is called a maximal frequent itemset (MFI) if it is frequent and no superset of X satisfies the user-defined support value.

In this paper, we consider the search space that includes all feasible solutions: a lexicographic tree [11, 21] is the search space for GeneticMax. A lexicographic tree maintains a lexicographic ordering of the items of I in a dataset D: if item i occurs before item j in D, then i ≤L j. For two subsets S1 and S2 with S1 ⊆ S2 and S1, S2 ⊆ S, the lexicographic order S1 ≤L S2 holds. There is no lexicographic ordering relationship between S1 and S2 if they are disjoint subsets.

Fig. 4. Lexicographic tree of four items.

Fig. 5. Lexicographic tree of four items based on a user-defined support value.

Fig. 4 shows an example of a lexicographic tree with lexicographic ordering for four items. The root of the tree is the empty set, and each level k contains the k-itemsets. In each level, the k-itemsets maintain lexicographic ordering, with the tail nodes containing items lexicographically larger than the elements of the head node. The support value of a head node is greater than that of its tail nodes, so the nodes closer to the root are more frequent than those far from the root. A non-linear line (called a cut) in the tree separates the frequent itemsets from the infrequent ones: the nodes above the cut are frequent, and the nodes below it are infrequent.

For GeneticMax, we introduce a new tree based on user-defined support values. The cut is defined by a user-defined support value; the area above the line is referred to as the positive area and the area below it as the negative area. All the nodes in the positive area are frequent, whereas those in the negative area are infrequent. In Fig. 5, the nodes within the positive boundary area have a minimum support value of 30%.

GeneticMax uses an array to store the frequent itemsets (the FIs). Among the frequent itemsets, a set containing the largest number of items is a maximal frequent itemset and is stored in another, special array (the MFI array). The algorithm searches the frequent nodes within the positive area and tries to converge to a solution, i.e., to find the maximal frequent itemsets, as early as possible. Fig. 5 illustrates Lemma 1: with 4 items, the tree enumerates 2^4 − 1 = 15 nodes, excluding the (empty) root node.

With the Apriori algorithm, one would test all the nodes at a specific level and generate a candidate set; this candidate set generation takes a long time when finding maximal frequent itemsets. For example, in Fig. 5 Apriori tests the itemsets {1}, {2}, {3}, {4} at the first level and finds that all of them are frequent, since these nodes meet the minimum support value. It then moves to the next level and scans the dataset to get the support values of {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}, and so on. At that level, it prunes the itemsets {1, 4}, {2, 4}, and {3, 4}, since their support values are less than the user-defined support value.

Unlike the Apriori algorithm, GeneticMax does not need to test all the nodes, which saves a huge amount of time even when the dataset is very large. In the current example, if the initially generated itemset is {1, 2, 3}, GeneticMax scans the dataset and calculates its support value. If the support value of {1, 2, 3} is ≥ 30%, it stores this itemset in a frequent itemset array called FI_Superset_Member. It will then never scan the dataset for {1}, {2}, {3}, {1, 2}, or {1, 3}, since these itemsets are subsets of the previously generated itemset {1, 2, 3}. Whenever an itemset such as {1}, {2}, {3}, or {1, 2} is generated, GeneticMax first checks the array FI_Superset_Member; if it finds any superset there, it discards the generated subset, which substantially reduces the time spent scanning the dataset to calculate support values.
Lemma 3: If Y is a superset of an itemset X, i.e., X ⊆ Y, and Y is a frequent itemset, then X is a frequent itemset.

For example, {1, 2, 3} is a superset of the itemsets {1}, {2}, {3}, {1, 2}, and {1, 3}. GeneticMax uses the principles of a GA and follows a global search mechanism, so a superset may be generated before any of its subsets. In this example, if {1, 2, 3} is generated before its subsets (and stored in the array FI_Superset_Member), then all of those subsets generated later will be discarded.

Lemma 4: If Y is a superset of an itemset X, i.e., X ⊆ Y, and X is an infrequent itemset, then Y is an infrequent itemset.

For example, if the initially generated chromosome is {1, 4} and the support value of this itemset is < 30%, then it is stored in a non-frequent itemset array called NFI. If the next generated itemset is {1, 3, 4}, the algorithm checks the NFI array; if it finds any subset of {1, 3, 4} there, GeneticMax discards the itemset {1, 3, 4} from any future calculations.

Lemma 5: If Z is a superset of itemsets X and Y, i.e., X, Y ⊆ Z, and Z is an infrequent itemset, then we cannot determine whether X or Y is infrequent.

Lemma 5 contrasts with Lemma 3. Continuing the previous example, if {1, 2, 3} is a frequent itemset, then all of its subsets must be frequent, i.e., {1}, {2}, {3}, {1, 2}, {1, 3}, and {2, 3} are all frequent itemsets. But if {1, 2, 3} is an infrequent itemset, we cannot conclude that all of its subsets are infrequent. In Fig. 5, {1, 4} is an infrequent itemset, but its subsets {1} and {4} are frequent itemsets.

Lemma 6: If Z is a subset of itemsets X and Y, i.e., Z ⊆ X and Z ⊆ Y, and Z is a frequent itemset, then its supersets X and Y could be either frequent or infrequent itemsets.

For example, the itemset {1} in Fig. 5 is frequent; its superset {1, 3} is frequent, but its superset {1, 4} is an infrequent itemset.
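Lemmas 3 and 4 translate directly into the two membership tests that GeneticMax runs before touching the dataset. A minimal sketch, assuming itemsets are stored as frozensets in two lists named after the paper's arrays:

def has_superset_in(itemset, fi_superset_member):
    # Lemma 3: a subset of a known frequent itemset is itself frequent.
    return any(itemset <= fi for fi in fi_superset_member)

def has_subset_in(itemset, nfi_subset_member):
    # Lemma 4: a superset of a known infrequent itemset is infrequent.
    return any(nfi <= itemset for nfi in nfi_subset_member)

FI_Superset_Member = [frozenset({1, 2, 3})]  # known frequent itemsets
NFI_Subset_Member = [frozenset({1, 4})]      # known infrequent itemsets

print(has_superset_in(frozenset({1, 2}), FI_Superset_Member))  # True: frequent, no scan needed
print(has_subset_in(frozenset({1, 3, 4}), NFI_Subset_Member))  # True: infrequent, no scan needed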
The main idea of GeneticMax is to find the maximal frequent itemsets while converging to a solution as fast as possible. It sub-divides the whole lexicographic tree into two sub-areas based on a user-defined support value. GeneticMax can generate a chromosome in either sub-region. If it finds a superset in the positive boundary area, it follows Lemma 3 and prunes all of that superset's subsets; if it finds a subset in the negative boundary area, it follows Lemma 4 and prunes all of that subset's supersets. The main advantage of GeneticMax is its ability to converge quickly to a solution, finding all the supersets in the positive boundary area closest to the cut as fast as possible. In the example above, if {1, 2, 3} is generated before all of its subsets ({1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}) and found to be a frequent itemset, then those subsets (which are also frequent itemsets) are discarded. If the next generated itemset is {1, 3, 4}, GeneticMax checks the NFI_Subset_Member array, finds no subset of it there, scans the dataset for this itemset to find its support value, and stores it in NFI. From then on, all the supersets of {1, 3, 4} are discarded.

V. DESCRIPTION OF GENETICMAX

A. Itemsets Mapping to Chromosomes

GeneticMax maps itemsets onto a chromosome code. Each node in the lexicographic tree represents a different itemset, and every node in the tree gets a unique chromosome code. The main features of this chromosome coding are:

1. Support values are easy to calculate, since GeneticMax uses a bitmap representation of the dataset.

2. All possible nodes can be generated. If there are n items, the lexicographic tree enumerates 2^n − 1 itemsets (nodes), and GeneticMax can generate up to 2^n − 1 nodes in its lifetime.

The length of a chromosome is fixed: if a dataset contains n items, then every generated chromosome has length n, as shown in Fig. 6.

Fig. 6. Mapping items onto chromosomes: a chromosome is the vector (Vitem1, Vitem2, …, Vitemn) with Vitem1…n ∈ {0, 1}.

B. Lifetime of GeneticMax

The lifetime of GeneticMax depends on the user's choice of the number of generations. The higher the generation number, the higher the probability of getting a correct solution. But there is a threshold value for the generation number: after the threshold is reached, the solution remains the same.

Fig. 7. The number of generations and maximal frequent itemsets.

VI. GENETICMAX FOR EFFICIENT MFI MINING

There are five main requirements for an efficient maximal frequent itemset mining algorithm, and we need a set of techniques that fulfills these requirements:

1. The dataset is not scanned more than once for a specific itemset.

2. If X is an itemset in the positive boundary area, there is no superset of X there, and X has already been tested, then all the subsets of X are pruned and marked as invalid.

3. If X is an itemset in the negative boundary area, there is no subset of X there, and X has already been tested, then all the supersets of X are pruned and marked as invalid.

4. The mining process is interactive: users can change the threshold to get different sets of MFIs.

5. It gives correct solutions for datasets of different sizes.

The Apriori algorithm and FP-Tree do not satisfy requirements 1 through 4; GeneticMax fulfills all of the above requirements.

Genetic algorithms, developed by Holland in 1975, are random search algorithms that generate populations iteratively [22].
GAs have been used to find approximate solutions for optimization problems. A GA is a search heuristic that uses adaptive methods to solve search as well as optimization problems. It is inspired by natural selection and the "survival of the fittest" mechanism described by Charles Darwin in The Origin of Species. The central idea of the survival-of-the-fittest mechanism is based on fitness values: only the stronger individuals survive in a competitive environment. A GA simulates the processes in natural populations that are essential for evolution, and a large number of researchers have worked on GAs [23–26].

In nature, individuals compete for food, shelter, water, and so on, and members of the same species often compete to attract partners. Individuals that succeed in surviving and attracting a partner are referred to as strong, and they produce a large number of offspring; poorly performing individuals are referred to as weak and are less likely to produce offspring. "Superfit" offspring can be produced by combining good attributes from different parents, so that the fitness of the offspring exceeds the fitness of the parents. In this fashion, a species becomes better and better suited to its present environment.
GAs play a vital role in this study. Evolutionary-algorithm-based techniques are robust and can be used to solve a wide range of problems, including problems that are difficult to solve by other methods. It is well known that a GA cannot guarantee an optimum solution to a problem, but it can find "acceptably good" solutions "quickly". Methods that already work well for a particular problem can also be improved by hybridizing them with a GA.

A traditional GA generates an initial population and then computes the fitness value of each individual. Two individuals are selected from the old generation, and crossover and mutation operators are applied to produce two offspring. The survivors with the best fitness values are inserted into the new generation. If the population has converged to a solution, the algorithm terminates. The fitness function provides the fitness value of an offspring, which characterizes that offspring. The main aim of using a GA for this problem is to reach good results by discarding bad solutions while generating populations [27].

A. Procedures of GeneticMax

The GeneticMax algorithm consists of the following eight steps (a code sketch of this loop follows the list):

1. Set the number of generations.

2. Generate a population.

3. Check the FI_Superset_Member and NFI_Subset_Member arrays for supersets and subsets of the generated chromosome.

4. If a superset is found in FI_Superset_Member, or a subset is found in NFI_Subset_Member, go to Step 2.

5. Compute the fitness value of each individual according to its support value in the dataset D.

6. Perform FI_Member_Add; if any frequent itemsets are found, update FI_Superset_Member.

7. Perform NFI_Member_Add; if any infrequent itemsets are found, update NFI_Subset_Member.

8. Go to Step 3 with a newly generated chromosome until the generation number set in Step 1 is exceeded.
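The following sketch reconstructs the eight steps under two simplifying assumptions that are ours, not the authors': a chromosome is a uniformly random subset of the n items, and the fitness of a chromosome is simply its support value:

import random

def genetic_max(transactions, n_items, min_supp, generations):
    fi_superset_member = []  # frontier of known frequent itemsets
    nfi_subset_member = []   # frontier of known infrequent itemsets
    for _ in range(generations):                      # Steps 1 and 8
        chrom = frozenset(k for k in range(1, n_items + 1)
                          if random.random() < 0.5)   # Step 2
        if not chrom:
            continue
        # Steps 3-4: prune via the boundary arrays, no dataset scan needed.
        if any(chrom <= fi for fi in fi_superset_member):
            continue
        if any(nfi <= chrom for nfi in nfi_subset_member):
            continue
        # Step 5: scan the dataset only for unchecked chromosomes.
        supp = sum(1 for t in transactions if chrom <= t) / len(transactions)
        if supp >= min_supp:
            # Step 6: evict subsets, keep the new frequent itemset.
            fi_superset_member = [fi for fi in fi_superset_member
                                  if not fi <= chrom] + [chrom]
        else:
            # Step 7: evict supersets, keep the new infrequent itemset.
            nfi_subset_member = [nfi for nfi in nfi_subset_member
                                 if not chrom <= nfi] + [chrom]
    return fi_superset_member  # after enough generations, the MFIs

D = [{1, 2, 3}, {1, 2, 3, 4}, {1, 3, 4}, {1, 3, 4}, {1, 2, 3, 4}]
print(genetic_max(D, 4, 0.3, 200))

Note that this sketch keeps only the pruning logic that distinguishes GeneticMax; the crossover and mutation operators of a full GA are omitted for brevity.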
B. Mining the Superset in a Positive Boundary Area

For an itemset X, if there is any subset of X in FI_Superset_Member, then the FI_Member_Add method (Fig. 8) is called to replace that subset by its superset X. This method also applies if X is a new frequent itemset with no subset in FI_Superset_Member.

C. Mining the Subset in a Negative Boundary Area

For an itemset X, if there is any superset of X in NFI_Subset_Member, then the NFI_Member_Add method (Fig. 9) is called to replace that superset by its subset X. This method also applies if X is a new infrequent itemset with no superset in NFI_Subset_Member.

D. GeneticMax Pruning Methods

The Check_Member_for_Item function (Fig. 10) incorporates three techniques:

1) Superset checking: check whether a given chromosome has a superset in the positive boundary area. Further pruning happens if the given itemset has no superset in the positive boundary area.

2) Subset checking: check whether a given chromosome has a subset in the negative boundary area. Further pruning happens if the given itemset has no subset in the negative boundary area.

3) Unchecked itemset checking: if an itemset has neither a superset in the positive boundary area nor a subset in the negative boundary area, then it is referred to as an "unchecked" itemset and needs to be tested. For such an unchecked itemset, GeneticMax scans the dataset and places the itemset in FI_Superset_Member or NFI_Subset_Member according to the user-defined support value.

//Invocation: FI_Member_Add(IF, FI_Superset_Member)
1. If any subset of IF is in FI_Superset_Member
2.     Delete that subset of IF
3.     Add IF to FI_Superset_Member
4. Else add IF to FI_Superset_Member

Fig. 8. The FI_Member_Add function.

//Invocation: NFI_Member_Add(INF, NFI_Subset_Member)
1. If any superset of INF is in NFI_Subset_Member
2.     Delete that superset of INF
3.     Add INF to NFI_Subset_Member
4. Else add INF to NFI_Subset_Member

Fig. 9. The NFI_Member_Add function.

//Invocation: Check_Member_for_Item(I, FI_Superset_Member, NFI_Subset_Member)
1. If any superset of I is in FI_Superset_Member
2.     Discard I
3. Else if any subset of I is in NFI_Subset_Member
4.     Discard I
5. Else scan the database to calculate the support value of I
6.     If support value ≥ user-defined support value
7.         Invoke FI_Member_Add
8.     Else invoke NFI_Member_Add

Fig. 10. The Check_Member_for_Item function.
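Figs. 8–10 map directly onto short functions. One possible Python rendering (our own, with frozensets for itemsets and plain lists for the two arrays):

def fi_member_add(itemset, fi_superset_member):
    # Fig. 8: insert a frequent itemset, evicting any of its subsets.
    fi_superset_member[:] = [fi for fi in fi_superset_member
                             if not fi <= itemset]
    fi_superset_member.append(itemset)

def nfi_member_add(itemset, nfi_subset_member):
    # Fig. 9: insert an infrequent itemset, evicting any of its supersets.
    nfi_subset_member[:] = [nfi for nfi in nfi_subset_member
                            if not itemset <= nfi]
    nfi_subset_member.append(itemset)

def check_member_for_item(itemset, fi_superset_member, nfi_subset_member,
                          transactions, min_supp):
    # Fig. 10: scan the database only when neither boundary array decides.
    if any(itemset <= fi for fi in fi_superset_member):
        return  # superset found: itemset is already known frequent, discard
    if any(nfi <= itemset for nfi in nfi_subset_member):
        return  # subset found: itemset is already known infrequent, discard
    supp = sum(1 for t in transactions if itemset <= t) / len(transactions)
    if supp >= min_supp:
        fi_member_add(itemset, fi_superset_member)
    else:
        nfi_member_add(itemset, nfi_subset_member)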
VII. EXPERIMENTAL RESULTS

The experiments were performed on an Intel(R) Core i5-3210M CPU @ 2.50 GHz with 4 GB of RAM, running Windows 7 Enterprise. Microsoft Visual Studio 2012 was used to compile the code of GeneticMax. Three datasets, Tic-Tac-Toe, 10000×8, and Zoo, were used to test GeneticMax; they were taken from the University of California, Irvine (UCI) machine learning repository (http://archive.ics.uci.edu/ml/datasets.html). Different support values were applied to these datasets to measure how many nodes were tested and how many chromosomes were generated to obtain the exact set of maximal frequent itemsets, the run times, and so on. Here, run time is the total execution time.

GeneticMax embeds two main features: (i) superset-subset relationships in both the positive and negative boundaries of a lexicographic tree, used for pruning invalid chromosomes, and (ii) a GA, which provides a global search mechanism. The purpose of this new approach is to converge to a solution as fast as possible. A full experiment of GeneticMax on these datasets was conducted, demonstrating GeneticMax's ability to yield solutions rapidly by accessing the dataset for only a small number of nodes in the lexicographic tree.

As discussed earlier, the Apriori algorithm tests all of the nodes in each level and prunes those that do not satisfy the minimum support value. In GeneticMax, if a generated chromosome X at any level satisfies the minimum support value, then all the subsets of X at any level are automatically pruned; this dramatically reduces the time spent accessing a large dataset. The converse also holds: if GeneticMax generates a chromosome Y at any level that does not satisfy the minimum support value, then all the supersets of Y at any level are automatically pruned.

The experimental results in Table I show the effect of the number of generations. For the 10000×8 dataset, generation 100 produced 9 frequent itemsets, whereas generation 140 produced 8. The count can decrease as generations increase because newly found supersets replace their non-maximal subsets in the FI array; generation 140 therefore found itemsets that generation 100 had missed. Generation 150 gave the same result as generation 140, so users can take generation 140 as the threshold value for the 10000×8 dataset. The same holds for Tic-Tac-Toe: generations 1200 and 1300 gave the same result, which contains the maximal frequent itemsets, so generation 1200 can be used as the threshold value for Tic-Tac-Toe.

Table II compares the number of nodes in the lexicographic tree with the number of nodes tested to obtain the maximal frequent itemsets. For 10000×8, there are 255 itemsets, and GeneticMax accessed the main dataset for only 39 of them to get the maximal frequent itemsets.
Since GeneticMax uses the principles of a genetic algorithm and prunes invalid chromosomes based on superset-subset relationships, it dramatically reduces the number of itemsets whose support values must be computed from the dataset when mining maximal frequent itemsets. The advantage of these principles is shown in Table II: (255 − 39) = 216 nodes did not have to be examined against the 10000×8 dataset to get the exact set of maximal frequent itemsets; only 39 were examined to get the final solution. For Tic-Tac-Toe, only 114 nodes were examined (the other 397 nodes were not required).

As can be seen from Fig. 11, the runtime of GeneticMax increases with the number of generations of chromosomes. A lower support value, which generates more frequent itemsets, needs a higher runtime, whereas a higher support value, which generates fewer frequent itemsets, needs less computational time, as shown in Fig. 12.

VIII. CONCLUSION AND SUMMARY

In this paper, we proposed a new GA-based approach, GeneticMax, to mine maximal frequent itemsets in an efficient way. We conducted thorough experiments on different real datasets, and the results demonstrated several advantages of our algorithm:

• It accesses a large dataset for only a small number of nodes when calculating support values to find the maximal frequent itemsets.

• It shows the power of using an evolutionary algorithm for generating frequent itemsets from a lexicographic tree. The whole dataset is projected onto a lexicographic tree based on a user-defined support value.

• The experimental analysis of GeneticMax shows the effect of the number of generations of chromosomes and of pruning all the subsets and supersets in both the positive and negative boundary areas, which dramatically reduces the search space and the cost of counting itemset support values.

• The above advantages increase the scalability of the algorithm.

We have implemented the GeneticMax algorithm and studied its performance. The performance study showed that this algorithm mines patterns of different sizes in real datasets in an efficient way and performs better than other candidate-generation-based and evolutionary-based algorithms.

Fig. 11. Run time versus generation for Tic-Tac-Toe.

Fig. 12. Run time of GeneticMax using different support values.
TABLE I. THE EXPERIMENTAL RESULTS OF GENETICMAX FOR TWO DIFFERENT DATASETS

  Database   Records  Items  Support (%)  Generation  Frequent Itemsets  Time (s)  Remarks
  10000×8    10000    8      20           100         9                  10.22
                                          140         8                  21.67     This generation contains the MFIs
                                          150         8                  25.10
  TicTacToe  958      9      16           100         6                  10.13
                                          250         7                  17.53
                                          500         10                 43.83
                                          1100        23                 78.20
                                          1200        24                 95.60     Generations 1200 and 1300 provide the same result
                                          1300        24                 115.66

TABLE II. THE NUMBER OF TIMES THE DATABASE WAS ACCESSED BY GENETICMAX

  Database   Items  Support (%)  Nodes in the lexicographic tree (2^items − 1)  Nodes tested to get the MFIs
  10000×8    8      20           255                                            39
  TicTacToe  9      16           511                                            114
  Zoo        17     50           131071                                         361

REFERENCES

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.
[2] R. Agrawal, T. Imieliński, and A. Swami, "Mining association rules between sets of items in large databases," in Proc. 1993 ACM SIGMOD International Conference on Management of Data, May 1993, pp. 207–216.
[3] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, "A tree projection algorithm for generation of frequent itemsets," J. Parallel Distrib. Comput., vol. 61, no. 3, pp. 350–371, 2001.
[4] J. J. Cameron and C. K. Leung, "Mining frequent patterns from precise and uncertain data," Comput. Syst., vol. 1, no. 1, pp. 3–22, 2011.
[5] M. M. J. Kabir, S. Xu, B. H. Kang, and Z. Zhao, "Association rule mining for both frequent and infrequent items using particle swarm optimization algorithm," Int. J. Comput. Sci. Eng., vol. 6, no. 7, pp. 221–231, 2014.
[6] J. Han, H. Cheng, D. Xin, and X. Yan, "Frequent pattern mining: current status and future directions," Data Min. Knowl. Discov., vol. 15, no. 1, pp. 55–86, Jan. 2007.
[7] Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai, "Generating semantic annotations for frequent patterns with context analysis," in Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 337–346.
[8] D.-I. Lin and Z. M. Kedem, "Pincer-Search: a new algorithm for discovering the maximal frequent set," in Proc. 6th International Conference on Extending Database Technology, 1998, pp. 103–119.
[9] R. J. Bayardo, "Efficiently mining long patterns from databases," in Proc. ACM SIGMOD, 1998, pp. 85–93.
[10] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, "Depth first generation of long patterns," in Proc. 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 108–118.
[11] D. Burdick, M. Calimlim, and J. Gehrke, "MAFIA: a maximal frequent itemset algorithm for transactional databases," in Proc. 17th International Conference on Data Engineering, 2001, pp. 443–452.
[12] K. Gouda and M. J. Zaki, "GenMax: an efficient algorithm for mining maximal frequent itemsets," Data Min. Knowl. Discov., vol. 11, no. 3, pp. 223–242, 2005.
[13] B. Alataş and E. Akin, "An efficient genetic algorithm for automated mining of both positive and negative quantitative association rules," Soft Comput., vol. 10, no. 3, pp. 230–237, Apr. 2005.
[14] H. Lu, R. Setiono, and H. Liu, "Effective data mining using neural networks," IEEE Trans. Knowl. Data Eng., vol. 8, no. 6, pp. 957–961, 1996.
[15] S. Ghosh, A. Nag, D. Biswas, J. P. Singh, S. Biswas, D. Sarkar, and P. P. Sarkar, "Weather data mining using artificial neural network," in 2011 IEEE Recent Advances in Intelligent Computational Systems, Sep. 2011, pp. 192–195.
[16] W. Dou, J. Hu, K. Hirasawa, and G. Wu, "Quick response data mining model using genetic algorithm," in 2008 SICE Annual Conference, Aug. 2008, pp. 1214–1219.
[17] A. Salleb-Aouissi, C. Vrain, C. Nortet, X. Kong, and D. Cassard, "QuantMiner for mining quantitative association rules," J. Mach. Learn. Res., vol. 14, no. 1, pp. 3153–3157, 2013.
[18] A. Salleb-Aouissi, C. Vrain, and C. Nortet, "QuantMiner: a genetic algorithm for mining quantitative association rules," in Proc. 20th International Joint Conference on Artificial Intelligence, 2007, pp. 1035–1040.
[19] R. J. Kuo and C. W. Shih, "Association rule mining through the ant colony system for national health insurance research database in Taiwan," Comput. Math. Appl., vol. 54, no. 11–12, pp. 1303–1318, Dec. 2007.
[20] R. J. Kuo, C. M. Chao, and Y. T. Chiu, "Application of particle swarm optimization to association rule mining," Appl. Soft Comput., vol. 11, no. 1, pp. 326–336, 2011.
[21] J. P. Huang, C. T. Yang, and C. H. Fu, "A genetic algorithm based searching of maximal frequent itemsets," in Proc. International Conference on Artificial Intelligence, 2004, pp. 548–554.
[22] J. H. Holland, Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press, 1975.
[23] D. Beasley, D. R. Bull, and R. R. Martin, "An overview of genetic algorithms: Part 1, fundamentals," Univ. Comput., vol. 15, no. 2, pp. 58–69, 1993.
[24] M. Srinivas and L. M. Patnaik, "Genetic algorithms: a survey," Computer, vol. 27, no. 6, pp. 17–26, 1994.
[25] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Boston, MA: Addison-Wesley, 1989.
[26] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs. Berlin: Springer, 1992.
[27] A. A. Freitas, "A survey of evolutionary algorithms for data mining and knowledge discovery," in Advances in Evolutionary Computing, Springer Berlin-Heidelberg, 2003, pp. 819–845.