Rough Sets in KDD
Tutorial Notes
Andrzej Skowron
Warsaw University
Ning Zhong
Maebashi Institute of Technology
Copyright 2000 by A. Skowron & N. Zhong
About the Speakers
 Andrzej Skowron received his Ph.D. from Warsaw University.
He is a professor in the Faculty of Mathematics, Computer Science and
Mechanics, Warsaw University, Poland. His research interests
include soft computing methods and applications, in particular
reasoning with incomplete information, approximate reasoning,
rough sets, rough mereology, granular computing, synthesis and
analysis of complex objects, intelligent agents, and knowledge
discovery and data mining, with over 200 journal and conference
publications. He is an editor of several international journals and
book series, including Fundamenta Informaticae (editor-in-chief) and
Data Mining and Knowledge Discovery. He is president of the
International Rough Set Society. He has been an invited speaker at many
international conferences, and has served or is currently serving on
the program committees of over 40 international conferences and
workshops, including ISMIS’97-99 (program chair), RSCTC’98-00
(program chair), and RSFDGrC’99 (program chair).
About the Speakers (2)
 Ning Zhong received his Ph.D. from the University of Tokyo.
He is director of the Knowledge Information Systems Laboratory and
an associate professor in the Department of Information Engineering,
Maebashi Institute of Technology, Japan. His research interests
include knowledge discovery and data mining, rough sets and
granular-soft computing, intelligent agents and databases,
knowledge-based systems, and hybrid systems, with over 80 journal
and conference publications. He is an editor of Knowledge and
Information Systems: An International Journal (Springer). He is a
member of the advisory board of the International Rough Set Society,
the ACM SIGKDD International Liaisons Board, and the Steering
Committee of the PAKDD conferences, and the advisory board
coordinator of BISC/SIGGrC. He has served or is currently serving
on the program committees of over 25 international conferences and
workshops, including PAKDD’99 (program chair), IAT’99 (program
chair), and RSFDGrC’99 (program chair).
Contents
 Introduction
 Basic Concepts of Rough Sets
 A Rough Set Based KDD process
 Rough Sets in ILP and GrC
 Concluding Remarks
(Summary, Advanced Topics, References
and Further Readings).
Introduction
 Rough set theory was developed by Zdzislaw
Pawlak in the early 1980’s.
 Representative Publications:
– Z. Pawlak, “Rough Sets”, International Journal
of Computer and Information Sciences, Vol.11,
341-356 (1982).
– Z. Pawlak, Rough Sets - Theoretical Aspects of
Reasoning about Data, Kluwer Academic
Publishers (1991).
Introduction (2)
 The main goal of rough set analysis is the
induction of approximations of concepts.
 Rough set theory constitutes a sound basis for
KDD. It offers mathematical tools to
discover patterns hidden in data.
 It can be used for feature selection, feature
extraction, data reduction, decision rule
generation, and pattern extraction
(templates, association rules) etc.
Introduction (3)
 Recent extensions of rough set theory have
developed new methods for decomposition of
large data sets, data mining in distributed and
multi-agent systems, and granular computing.
This presentation shows how several aspects of
the above problems are solved by the (classic)
rough set approach, discusses some advanced
topics, and gives further research directions.
Basic Concepts of Rough Sets
 Information/Decision Systems (Tables)
 Indiscernibility
 Set Approximation
 Reducts and Core
 Rough Membership
 Dependency of Attributes
Information Systems/Tables
 An IS is a pair (U, A), where
U is a non-empty finite set of objects, and
A is a non-empty finite set of attributes such that
a : U → V_a for every a ∈ A.
 V_a is called the value set of a.

     Age    LEMS
x1   16-30  50
x2   16-30  0
x3   31-45  1-25
x4   31-45  1-25
x5   46-60  26-49
x6   16-30  26-49
x7   46-60  26-49
Decision Systems/Tables
 DS: T = (U, A ∪ {d}), where d ∉ A
is the decision attribute.
 The elements of A are called the condition
attributes.

     Age    LEMS   Walk
x1   16-30  50     yes
x2   16-30  0      no
x3   31-45  1-25   no
x4   31-45  1-25   yes
x5   46-60  26-49  no
x6   16-30  26-49  yes
x7   46-60  26-49  no
Issues in the Decision Table
 The same or indiscernible objects may be
represented several times.
 Some of the attributes may be superfluous.
Indiscernibility
 The equivalence relation: a binary relation
R ⊆ X × X which is reflexive
(i.e. an object is in relation with itself: xRx),
symmetric (if xRy then yRx), and
transitive (if xRy and yRz then xRz).
 The equivalence class of an element x ∈ X
consists of all objects y ∈ X such that xRy.
Indiscernibility (2)
 Let IS = (U, A) be an information system; then
with any B ⊆ A there is associated an equivalence
relation:

IND_IS(B) = {(x, x') ∈ U² | a(x) = a(x') for every a ∈ B}

where IND_IS(B) is called the B-indiscernibility
relation.
 If (x, x') ∈ IND_IS(B), then objects x and x' are
indiscernible from each other by attributes from B.
 The equivalence classes of the B-indiscernibility
relation are denoted [x]_B.
An Example of Indiscernibility
 The non-empty subsets of
the condition attributes
are {Age}, {LEMS}, and
{Age, LEMS}.
 IND({Age}) = {{x1,x2,x6},
{x3,x4}, {x5,x7}}
 IND({LEMS}) = {{x1},
{x2}, {x3,x4}, {x5,x6,x7}}
 IND({Age,LEMS}) =
{{x1}, {x2}, {x3,x4},
{x5,x7}, {x6}}.
Age LEMS Walk
x1 16-30 50 yes
x2 16-30 0 no
x3 31-45 1-25 no
x4 31-45 1-25 yes
x5 46-60 26-49 no
x6 16-30 26-49 yes
x7 46-60 26-49 no
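The partitions above can be checked with a short script. This is a minimal sketch: the dictionary transcribes the Age/LEMS table from the slide, and `ind` is an illustrative helper name, not part of any rough-set library.

```python
# Compute the B-indiscernibility classes for the example table.
from collections import defaultdict

table = {
    "x1": {"Age": "16-30", "LEMS": "50"},
    "x2": {"Age": "16-30", "LEMS": "0"},
    "x3": {"Age": "31-45", "LEMS": "1-25"},
    "x4": {"Age": "31-45", "LEMS": "1-25"},
    "x5": {"Age": "46-60", "LEMS": "26-49"},
    "x6": {"Age": "16-30", "LEMS": "26-49"},
    "x7": {"Age": "46-60", "LEMS": "26-49"},
}

def ind(B):
    """Partition the universe into B-indiscernibility classes."""
    classes = defaultdict(set)
    for x, row in table.items():
        # Objects with the same value vector on B fall into one class.
        classes[tuple(row[a] for a in B)].add(x)
    return sorted(classes.values(), key=lambda c: sorted(c))

print([sorted(c) for c in ind(["Age"])])
# [['x1', 'x2', 'x6'], ['x3', 'x4'], ['x5', 'x7']]
```

Grouping by the value vector on B is exactly the definition of IND(B): two objects land in the same class iff they agree on every attribute of B.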
Observations
 An equivalence relation induces a partitioning
of the universe.
 The partitions can be used to build new subsets
of the universe.
 Subsets that are most often of interest have the
same value of the decision attribute.
It may happen, however, that a concept such as
“Walk” cannot be defined in a crisp manner.
Set Approximation
 Let T = (U, A), B ⊆ A and X ⊆ U.
We can approximate X using only the
information contained in B by constructing
the B-lower and B-upper approximations of
X, denoted \underline{B}X and \overline{B}X respectively, where

\underline{B}X = {x | [x]_B ⊆ X},
\overline{B}X = {x | [x]_B ∩ X ≠ ∅}.
Set Approximation (2)
 The B-boundary region of X,
BN_B(X) = \overline{B}X − \underline{B}X,
consists of those objects that we cannot
decisively classify into X on the basis of B.
 The B-outside region of X,
U − \overline{B}X,
consists of those objects that can with
certainty be classified as not belonging to X.
 A set is said to be rough if the boundary
region is non-empty.
An Example of Set Approximation
 Let W = {x | Walk(x) = yes}.

\underline{A}W = {x1, x6},
\overline{A}W = {x1, x3, x4, x6},
BN_A(W) = {x3, x4},
U − \overline{A}W = {x2, x5, x7}.

 The decision class Walk is rough since the
boundary region is not empty.

     Age    LEMS   Walk
x1   16-30  50     yes
x2   16-30  0      no
x3   31-45  1-25   no
x4   31-45  1-25   yes
x5   46-60  26-49  no
x6   16-30  26-49  yes
x7   46-60  26-49  no
An Example of
Set Approximation (2)

(Figure: the set X drawn over the partition U/R induced by a subset
R of attributes. The lower approximation \underline{R}X is labelled
"yes" (classes {x1}, {x6}), the boundary region "yes/no" (class
{x3, x4}), and the outside region "no" (classes {x2}, {x5, x7}).)
Lower & Upper Approximations
Lower & Upper Approximations
(2)
Lower approximation: \underline{R}X = ∪ {Y ∈ U/R : Y ⊆ X}
Upper approximation: \overline{R}X = ∪ {Y ∈ U/R : Y ∩ X ≠ ∅}
Lower & Upper Approximations
(3)
X1 = Flu(yes) = {u2, u3, u6, u7}
Lower approx.: \underline{R}X1 = {u2, u3}
Upper approx.: \overline{R}X1 = {u2, u3, u5, u6, u7, u8}

X2 = Flu(no) = {u1, u4, u5, u8}
Lower approx.: \underline{R}X2 = {u1, u4}
Upper approx.: \overline{R}X2 = {u1, u4, u5, u6, u7, u8}

U    Headache  Temp.      Flu
U1   Yes       Normal     No
U2   Yes       High       Yes
U3   Yes       Very-high  Yes
U4   No        Normal     No
U5   No        High       No
U6   No        Very-high  Yes
U7   No        High       Yes
U8   No        Very-high  No

The elementary sets of the indiscernibility relation
defined by R = {Headache, Temp.} are {u1},
{u2}, {u3}, {u4}, {u5, u7}, {u6, u8}.
Lower & Upper Approximations
(4)
R = {Headache, Temp.}
U/R = { {u1}, {u2}, {u3}, {u4}, {u5, u7}, {u6, u8}}
X1 = Flu(yes) = {u2,u3,u6,u7}
X2 = Flu(no) = {u1,u4,u5,u8}
\underline{R}X1 = {u2, u3}
\overline{R}X1 = {u2, u3, u5, u6, u7, u8}
\underline{R}X2 = {u1, u4}
\overline{R}X2 = {u1, u4, u5, u6, u7, u8}

(Figure: X1 and X2 drawn over the elementary sets {u1}, {u2}, {u3},
{u4}, {u5, u7}, {u6, u8}; the classes {u5, u7} and {u6, u8} straddle
the boundary between X1 and X2.)
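The approximations above follow mechanically from the partition. A minimal sketch, assuming the elementary sets of R = {Headache, Temp.} transcribed from the slide and X1 = Flu(yes):

```python
# Lower/upper approximation of X1 over the R-elementary sets.
partition = [{"u1"}, {"u2"}, {"u3"}, {"u4"}, {"u5", "u7"}, {"u6", "u8"}]
X1 = {"u2", "u3", "u6", "u7"}  # Flu = yes

# Lower: union of classes wholly inside X1; upper: union of classes
# that intersect X1.
lower = {x for c in partition if c <= X1 for x in c}
upper = {x for c in partition if c & X1 for x in c}
boundary = upper - lower

print(sorted(lower), sorted(upper))
# ['u2', 'u3'] ['u2', 'u3', 'u5', 'u6', 'u7', 'u8']
```

The boundary {u5, u6, u7, u8} is non-empty, which is exactly why both decision classes are rough here.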
Properties of Approximation
\underline{B}(X) ⊆ X ⊆ \overline{B}(X)
\underline{B}(∅) = \overline{B}(∅) = ∅,  \underline{B}(U) = \overline{B}(U) = U
\overline{B}(X ∪ Y) = \overline{B}(X) ∪ \overline{B}(Y)
\underline{B}(X ∩ Y) = \underline{B}(X) ∩ \underline{B}(Y)
X ⊆ Y implies \underline{B}(X) ⊆ \underline{B}(Y) and \overline{B}(X) ⊆ \overline{B}(Y)
Properties of Approximation (2)
\underline{B}(X ∪ Y) ⊇ \underline{B}(X) ∪ \underline{B}(Y)
\overline{B}(X ∩ Y) ⊆ \overline{B}(X) ∩ \overline{B}(Y)
\underline{B}(−X) = −\overline{B}(X)
\overline{B}(−X) = −\underline{B}(X)
\underline{B}(\underline{B}(X)) = \overline{B}(\underline{B}(X)) = \underline{B}(X)
\overline{B}(\overline{B}(X)) = \underline{B}(\overline{B}(X)) = \overline{B}(X)

where −X denotes U − X.
Four Basic Classes of Rough Sets
 X is roughly B-definable, iff \underline{B}(X) ≠ ∅ and \overline{B}(X) ≠ U.
 X is internally B-undefinable, iff \underline{B}(X) = ∅ and \overline{B}(X) ≠ U.
 X is externally B-undefinable, iff \underline{B}(X) ≠ ∅ and \overline{B}(X) = U.
 X is totally B-undefinable, iff \underline{B}(X) = ∅ and \overline{B}(X) = U.
Accuracy of Approximation
α_B(X) = |\underline{B}(X)| / |\overline{B}(X)|

where |X| denotes the cardinality of X ≠ ∅.
Obviously 0 ≤ α_B(X) ≤ 1.
If α_B(X) = 1, X is crisp with respect to B.
If α_B(X) < 1, X is rough with respect to B.
Issues in the Decision Table
 The same or indiscernible objects may be
represented several times.
 Some of the attributes may be superfluous
(redundant).
That is, their removal cannot worsen the
classification.
Reducts
 Keep only those attributes that preserve the
indiscernibility relation and, consequently,
set approximation.
 There are usually several such subsets of
attributes and those which are minimal are
called reducts.
Dispensable & Indispensable
Attributes
Let c ∈ C.
Attribute c is dispensable in T
if POS_C(D) = POS_{C−{c}}(D); otherwise
attribute c is indispensable in T.

The positive region:
POS_C(D) = ∪_{X ∈ U/D} \underline{C}X
Independent
 T = (U, A, C, D) is independent
if all c ∈ C are indispensable in T.
Reduct & Core
 The set of attributes R ⊆ C is called a reduct
of C if T' = (U, A, R, D) is independent and
POS_R(D) = POS_C(D).
 The set of all the condition attributes
indispensable in T is denoted by CORE(C):

CORE(C) = ∩ RED(C)

where RED(C) is the set of all reducts of C.
An Example of Reducts & Core
U Headache Muscle
pain
Temp. Flu
U1 Yes Yes Normal No
U2 Yes Yes High Yes
U3 Yes Yes Very-high Yes
U4 No Yes Normal No
U5 No No High No
U6 No Yes Very-high Yes
U Muscle
pain
Temp. Flu
U1,U4 Yes Normal No
U2 Yes High Yes
U3,U6 Yes Very-high Yes
U5 No High No
U Headache Temp. Flu
U1 Yes Normal No
U2 Yes High Yes
U3 Yes Very-high Yes
U4 No Normal No
U5 No High No
Reduct1 = {Muscle-pain,Temp.}
Reduct2 = {Headache, Temp.}
CORE = {Headache, Temp} ∩ {Muscle-pain, Temp} = {Temp}
Discernibility Matrix
 Let T = (U, A, C, D) be a decision table, with
U = {u_1, u_2, ..., u_n}.
By a discernibility matrix of T, denoted M(T),
we will mean the n × n matrix defined as:

m_ij = {c ∈ C : c(u_i) ≠ c(u_j)}   if d(u_i) ≠ d(u_j),
m_ij = λ                           if d(u_i) = d(u_j),

for i, j = 1, 2, ..., n.
Here λ denotes that this case does not need to be
considered; entries are filled only for objects u_i and u_j
that the decision attribute classifies into different classes.
Discernibility Function
 For any u_i ∈ U,

f_T(u_i) = ∧ {∨ m_ij : j ∈ {1, 2, ..., n}, j ≠ i}

where (1) ∨ m_ij is the disjunction of all variables a
such that a ∈ m_ij, if m_ij ≠ λ;
(2) ∨ m_ij = false, if m_ij = ∅;
(3) ∨ m_ij = true, if m_ij = λ.
Each logical product in the minimal disjunctive normal
form defines a reduct of instance u_i.
Examples of Discernibility Matrix
No   a    b    c    d
u1   a0   b1   c1   y
u2   a1   b1   c0   n
u3   a0   b2   c1   n
u4   a1   b1   c1   y

C = {a, b, c}
D = {d}

In order to discern the equivalence classes of the
decision attribute d, the conditions described by the
discernibility matrix for this table must be preserved:

     u1    u2    u3
u2   a,c
u3   b     λ
u4   λ     c     a,b

f = (a ∨ c) ∧ b ∧ c ∧ (a ∨ b) = b ∧ c

Reduct = {b, c}
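The matrix-and-reduct computation above can be sketched in a few lines. The data transcribes the four-row table; the brute-force search over attribute subsets is an illustrative sketch, not the Boolean-simplification algorithm itself.

```python
from itertools import combinations

rows = {
    "u1": ({"a": "a0", "b": "b1", "c": "c1"}, "y"),
    "u2": ({"a": "a1", "b": "b1", "c": "c0"}, "n"),
    "u3": ({"a": "a0", "b": "b2", "c": "c1"}, "n"),
    "u4": ({"a": "a1", "b": "b1", "c": "c1"}, "y"),
}
C = ["a", "b", "c"]

# Discernibility matrix entries: attribute sets separating pairs of
# objects with different decisions (the lambda cells are skipped).
entries = []
for i, j in combinations(rows, 2):
    (ri, di), (rj, dj) = rows[i], rows[j]
    if di != dj:
        entries.append({c for c in C if ri[c] != rj[c]})

# A reduct is a minimal attribute subset hitting every entry.
reducts = []
for k in range(1, len(C) + 1):
    for sub in combinations(C, k):
        s = set(sub)
        if all(s & e for e in entries) and not any(r <= s for r in reducts):
            reducts.append(s)
print([sorted(r) for r in reducts])  # [['b', 'c']]
```

Finding a minimal hitting set of the matrix entries is equivalent to turning the CNF f into its minimal DNF, which is why the single reduct here is {b, c}.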
Examples of Discernibility Matrix
(2)
     a  b  c  d  E
u1   1  0  2  1  1
u2   1  0  2  0  1
u3   1  2  0  0  2
u4   1  2  2  1  0
u5   2  1  0  0  2
u6   2  1  1  0  2
u7   2  1  2  1  1

     u1       u2      u3       u4       u5    u6
u2   λ
u3   b,c,d    b,c
u4   b        b,d     c,d
u5   a,b,c,d  a,b,c   λ        a,b,c,d
u6   a,b,c,d  a,b,c   λ        a,b,c,d  λ
u7   λ        λ       a,b,c,d  a,b      c,d   c,d

f = b ∧ (c ∨ d) = (b ∧ c) ∨ (b ∧ d)

Core = {b}
Reduct1 = {b, c}
Reduct2 = {b, d}
Rough Membership
 The rough membership function quantifies
the degree of relative overlap between the
set X and the equivalence class [x]_B to
which x belongs:

μ_X^B : U → [0, 1],   μ_X^B(x) = |[x]_B ∩ X| / |[x]_B|

 The rough membership function can be
interpreted as a frequency-based estimate of
Pr(x ∈ X | u), where u is the equivalence
class of x in IND(B).
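The definition above amounts to a ratio of set sizes per equivalence class. A minimal sketch, reusing the IND({Age, LEMS}) classes and the Walk = yes set from the earlier slides; `mu` is an illustrative helper name:

```python
# Rough membership over a fixed partition.
partition = [{"x1"}, {"x2"}, {"x3", "x4"}, {"x5", "x7"}, {"x6"}]
W = {"x1", "x4", "x6"}  # Walk = yes

def mu(x, X):
    """|[x]_B ∩ X| / |[x]_B| for the fixed partition above."""
    cls = next(c for c in partition if x in c)
    return len(cls & X) / len(cls)

print(mu("x3", W))  # 0.5: x3's class {x3, x4} overlaps W only in x4
```

Note that μ = 1 on the lower approximation (e.g. x1), μ = 0 outside the upper approximation (e.g. x2), and 0 < μ < 1 exactly on the boundary.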
Rough Membership (2)
 The formulae for the lower and upper
approximations can be generalized to an
arbitrary level of precision π ∈ (0.5, 1] by
means of the rough membership function:

\underline{B}_π X = {x | μ_X^B(x) ≥ π},
\overline{B}_π X = {x | μ_X^B(x) > 1 − π}.

 Note: the lower and upper approximations as
originally formulated are obtained as a special
case with π = 1.
Dependency of Attributes
 Discovering dependencies between attributes
is an important issue in KDD.
 A set of attributes D depends totally on a set
of attributes C, denoted C ⇒ D, if all values
of attributes from D are uniquely determined
by values of attributes from C.
Dependency of Attributes (2)
 Let D and C be subsets of A. We will say that
D depends on C in a degree k (0 ≤ k ≤ 1),
denoted C ⇒_k D, if

k = γ(C, D) = |POS_C(D)| / |U|

where POS_C(D) = ∪_{X ∈ U/D} \underline{C}(X) is called the
positive region of the partition U/D with respect to C.
Dependency of Attributes (3)
 Obviously

γ(C, D) = Σ_{X ∈ U/D} |\underline{C}(X)| / |U|.

 If k = 1, we say that D depends totally on C.
 If k < 1, we say that D depends partially
(in a degree k) on C.
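The dependency degree is just the fraction of objects whose condition class is decision-consistent. A minimal sketch for the Walk table with C = {Age, LEMS}, D = {Walk}, using the partition from the earlier slides:

```python
# k = |POS_C(D)| / |U| for the Walk example.
partition = [{"x1"}, {"x2"}, {"x3", "x4"}, {"x5", "x7"}, {"x6"}]
decision = {"x1": "yes", "x2": "no", "x3": "no", "x4": "yes",
            "x5": "no", "x6": "yes", "x7": "no"}

# POS_C(D): union of C-classes lying wholly inside one decision class.
pos = {x for c in partition
       if len({decision[y] for y in c}) == 1 for x in c}
k = len(pos) / len(decision)
print(round(k, 3))  # 0.714, i.e. 5/7: only {x3, x4} is contradictory
```

Here Walk depends on {Age, LEMS} only partially (k = 5/7 < 1), which matches the rough decision class seen in the approximation example.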
A Rough Set Based KDD Process
 Discretization based on RS and
Boolean Reasoning (RSBR).
 Attribute selection based on RS
with Heuristics (RSH).
 Rule discovery by GDT-RS.
What Are Real World Issues ?
 Very large data sets
 Uncertainty (noisy data)
 Incompleteness (missing, incomplete data)
 Data change
 Use of background knowledge
(Figure: a table matching the real-world issues above — very large
data sets, noisy data, incomplete instances, data change, use of
background knowledge — against representative methods: ID3 (C4.5),
Prism, Version Space, BP, and Dblearn, marking each combination as
okay or merely possible.)
Soft Techniques for KDD

(Figure: a triangle spanned by Probability, Logic, and Set theory,
locating stochastic processes, belief networks, connectionist
networks, GDT, deduction/induction/abduction, rough sets, and
fuzzy sets within it.)
Soft Techniques for KDD (2)

(Figure: a hybrid model combining GDT, RS&ILP, GrC, and TM around
deduction, induction, and abduction.)

A Hybrid Model
GDT : Generalization Distribution Table
RS : Rough Sets
TM : Transition Matrix
ILP : Inductive Logic Programming
GrC : Granular Computing
A Rough Set Based KDD Process
 Discretization based on RS and
Boolean Reasoning (RSBR).
 Attribute selection based on RS
with Heuristics (RSH).
 Rule discovery by GDT-RS.
Discretization based on RSBR
 In the discretization of a decision table
T = (U, A ∪ {d}), where V_a = [v_a, w_a) is an
interval of real values, we search for
a partition P_a of V_a for any a ∈ A.
 Any partition of V_a is defined by a sequence
of the so-called cuts v_1 < v_2 < ... < v_k from V_a.
 Any family of partitions {P_a}_{a ∈ A} can be
identified with a set of cuts.
Discretization Based on RSBR
(2)
In the discretization process, we search for a set
of cuts satisfying some natural conditions.
A    a     b    d         A    a  b  d
u1   0.8   2    1         u1   0  2  1
u2   1     0.5  0         u2   1  0  0
u3   1.3   3    0         u3   1  2  0
u4   1.4   1    1    P    u4   1  1  1
u5   1.4   2    0   -->   u5   1  2  0
u6   1.6   3    1         u6   2  2  1
u7   1.3   1    1         u7   1  1  1

P = {(a, 0.9), (a, 1.5), (b, 0.75), (b, 1.5)}
A Geometrical Representation of
Data and Cuts
(Figure: the seven objects x1, ..., x7 plotted in the (a, b) plane,
with a ∈ {0.8, 1, 1.3, 1.4, 1.6} and b ∈ {0.5, 1, 2, 3}.)
A Geometrical Representation of
Data and Cuts (2)
(Figure: the same plot with the cut lines of P drawn in.)
Discretization Based on RSBR
(3)
 The sets of possible values of a and b are
defined by
V_a = [0, 2);  V_b = [0, 4).
 The sets of values of a and b on objects
from U are given by
a(U) = {0.8, 1, 1.3, 1.4, 1.6};
b(U) = {0.5, 1, 2, 3}.
Discretization Based on RSBR
(4)
 The discretization process returns a partition
of the value sets of conditional attributes
into intervals.
A Discretization Process
 Step 1: define a set of Boolean variables
BV(U) = {p_1^a, p_2^a, p_3^a, p_4^a, p_1^b, p_2^b, p_3^b}, where
p_1^a corresponds to the interval [0.8, 1) of a,
p_2^a corresponds to the interval [1, 1.3) of a,
p_3^a corresponds to the interval [1.3, 1.4) of a,
p_4^a corresponds to the interval [1.4, 1.6) of a,
p_1^b corresponds to the interval [0.5, 1) of b,
p_2^b corresponds to the interval [1, 2) of b,
p_3^b corresponds to the interval [2, 3) of b.
The Set of Cuts on Attribute a

(Figure: the value axis of a with the points 0.8, 1.0, 1.3, 1.4, 1.6
and the cuts c_1, ..., c_4 placed inside the intervals corresponding
to p_1^a, ..., p_4^a.)
A Discretization Process (2)
 Step 2: create a new decision table by using
the set of Boolean variables defined in Step 1.
Let T = (U, A ∪ {d}) be a decision table and p_k^a
a propositional variable corresponding to the
interval [v_k^a, v_{k+1}^a) for any k ∈ {1, ..., n_a − 1}
and a ∈ A.
A Sample T Defined in Step 2
U*        p_1^a  p_2^a  p_3^a  p_4^a  p_1^b  p_2^b  p_3^b
(x1,x2)   1      0      0      0      1      1      0
(x1,x3)   1      1      0      0      0      0      1
(x1,x5)   1      1      1      0      0      0      0
(x4,x2)   0      1      1      0      1      0      0
(x4,x3)   0      0      1      0      0      1      1
(x4,x5)   0      0      0      0      0      1      0
(x6,x2)   0      1      1      1      1      1      1
(x6,x3)   0      0      1      1      0      0      0
(x6,x5)   0      0      0      1      0      0      1
(x7,x2)   0      1      0      0      1      0      0
(x7,x3)   0      0      0      0      0      1      1
(x7,x5)   0      0      1      0      0      1      0
The Discernibility Formula
 The discernibility formula

(x1, x2) = p_1^a ∨ p_1^b ∨ p_2^b

means that in order to discern objects x1 and
x2, at least one of the following cuts must
be set:
a cut between a(0.8) and a(1),
a cut between b(0.5) and b(1), or
a cut between b(1) and b(2).
The Discernibility Formulae for
All Different Pairs
(x1, x2) = p_1^a ∨ p_1^b ∨ p_2^b
(x1, x3) = p_1^a ∨ p_2^a ∨ p_3^b
(x1, x5) = p_1^a ∨ p_2^a ∨ p_3^a
(x4, x2) = p_2^a ∨ p_3^a ∨ p_1^b
(x4, x3) = p_3^a ∨ p_2^b ∨ p_3^b
(x4, x5) = p_2^b
The Discernibility Formulae for
All Different Pairs (2)
(x6, x2) = p_2^a ∨ p_3^a ∨ p_4^a ∨ p_1^b ∨ p_2^b ∨ p_3^b
(x6, x3) = p_3^a ∨ p_4^a
(x6, x5) = p_4^a ∨ p_3^b
(x7, x2) = p_2^a ∨ p_1^b
(x7, x3) = p_2^b ∨ p_3^b
(x7, x5) = p_3^a ∨ p_2^b
A Discretization Process (3)
 Step 3: find the minimal subset of P that
discerns all objects in different decision
classes.
The discernibility Boolean propositional
formula is defined as follows:

Φ_U = ∧ {∨ (x_i, x_j) : d(x_i) ≠ d(x_j)}.
The Discernibility Formula
in CNF Form
Φ_U = (p_1^a ∨ p_1^b ∨ p_2^b) ∧ (p_1^a ∨ p_2^a ∨ p_3^b)
    ∧ (p_1^a ∨ p_2^a ∨ p_3^a) ∧ (p_2^a ∨ p_3^a ∨ p_1^b)
    ∧ (p_3^a ∨ p_2^b ∨ p_3^b) ∧ p_2^b
    ∧ (p_2^a ∨ p_3^a ∨ p_4^a ∨ p_1^b ∨ p_2^b ∨ p_3^b)
    ∧ (p_3^a ∨ p_4^a) ∧ (p_4^a ∨ p_3^b)
    ∧ (p_2^a ∨ p_1^b) ∧ (p_2^b ∨ p_3^b) ∧ (p_3^a ∨ p_2^b).
The Discernibility Formula
in DNF Form
 We obtain four prime implicants:

Φ_U = (p_2^a ∧ p_4^a ∧ p_2^b)
    ∨ (p_2^a ∧ p_3^a ∧ p_2^b ∧ p_3^b)
    ∨ (p_3^a ∧ p_1^b ∧ p_2^b ∧ p_3^b)
    ∨ (p_1^a ∧ p_4^a ∧ p_1^b ∧ p_2^b).

{p_2^a, p_4^a, p_2^b} is the optimal result, because
it is the minimal subset of P.
The Minimal Set of Cuts
for the Sample DB

(Figure: the (a, b) plane again, now showing only the three minimal
cuts corresponding to p_2^a, p_4^a and p_2^b.)
A Result
A    a     b    d         A    a  b  d
u1   0.8   2    1         u1   0  1  1
u2   1     0.5  0         u2   0  0  0
u3   1.3   3    0         u3   1  1  0
u4   1.4   1    1    P    u4   1  0  1
u5   1.4   2    0   -->   u5   1  1  0
u6   1.6   3    1         u6   2  1  1
u7   1.3   1    1         u7   1  0  1

P = {(a, 1.2), (a, 1.5), (b, 1.5)}
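For a table this small the minimal cut set can also be found by brute force, which makes a compact illustration of what the Boolean reasoning computes. A sketch, assuming the sample table above; candidate cuts are placed at midpoints between consecutive attribute values, so the concrete cut positions differ slightly from the slide's 1.2/1.5/1.5 but fall in the same intervals.

```python
from itertools import combinations

# Sample table: object -> ((a, b), d).
data = {
    "u1": ((0.8, 2.0), 1), "u2": ((1.0, 0.5), 0), "u3": ((1.3, 3.0), 0),
    "u4": ((1.4, 1.0), 1), "u5": ((1.4, 2.0), 0), "u6": ((1.6, 3.0), 1),
    "u7": ((1.3, 1.0), 1),
}
# Candidate cuts: midpoints between consecutive values, per attribute.
cuts = []
for i in range(2):
    vals = sorted({v[i] for v, _ in data.values()})
    cuts += [(i, (x + y) / 2) for x, y in zip(vals, vals[1:])]

def discerns(subset, u, v):
    # A cut (i, c) separates u and v iff c lies between their i-values.
    return any(min(u[i], v[i]) < c <= max(u[i], v[i]) for i, c in subset)

pairs = [(u, v) for u in data.values() for v in data.values()
         if u[1] != v[1]]
for k in range(1, len(cuts) + 1):
    found = next((set(s) for s in combinations(cuts, k)
                  if all(discerns(s, a, b) for (a, _), (b, _) in pairs)),
                 None)
    if found:
        break
print(sorted(found))  # [(0, 1.15), (0, 1.5), (1, 1.5)]
```

Three cuts suffice, matching the prime-implicant result {p_2^a, p_4^a, p_2^b}: one a-cut in [1, 1.3), one a-cut in [1.4, 1.6), and one b-cut in [1, 2).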
A Rough Set Based KDD Process
 Discretization based on RS and
Boolean Reasoning (RSBR).
 Attribute selection based on RS
with Heuristics (RSH).
 Rule discovery by GDT-RS.
Attribute Selection
U Headache Muscle-pain Temp. Flu
U1 Yes Yes Normal No
U2 Yes Yes High Yes
U3 Yes Yes Very-high Yes
U4 No Yes Normal No
U5 No No High No
U6 No Yes Very-high Yes
U Muscle-pain Temp. Flu
U1 Yes Normal No
U2 Yes High Yes
U3 Yes Very-high Yes
U4 Yes Normal No
U5 No High No
U6 Yes Very-high Yes
U Headache Temp. Flu
U1 Yes Normal No
U2 Yes High Yes
U3 Yes Very-high Yes
U4 No Normal No
U5 No High No
U6 No Very-high Yes
Observations
 A database always contains a lot of attributes
that are redundant and not necessary for rule
discovery.
 If these redundant attributes are not removed,
not only does the time complexity of rule discovery
increase, but the quality of the discovered
rules may also be significantly degraded.
The Goal of Attribute Selection
Finding an optimal subset of attributes in a
database according to some criterion, so that
a classifier with the highest possible
accuracy can be induced by a learning
algorithm using only the information
available in that subset of attributes.
The Filter Approach
 Preprocessing
 The main strategies of attribute selection:
– The minimal subset of attributes
– Selection of the attributes with a higher rank
 Advantage
– Fast
 Disadvantage
– Ignoring the performance effects of the induction
algorithm
The Wrapper Approach
 Using the induction algorithm as a part of the search
evaluation function
 2^N − 1 possible attribute subsets (N is the number of attributes)
 The main search methods:
– Exhaustive/Complete search
– Heuristic search
– Non-deterministic search
 Advantage
– Taking into account the performance of the induction algorithm
 Disadvantage
– The time complexity is high
Basic Ideas:
Attribute Selection using RSH
 Take the attributes in CORE as the initial
subset.
 Select one attribute each time using the rule
evaluation criterion in our rule discovery
system, GDT-RS.
 Stop when the subset of selected attributes
is a reduct.
Why Heuristics ?
 The number of possible reducts can be up to
2^N − 1, where N is the number of attributes.
Selecting the optimal reduct from all
possible reducts is NP-hard, and heuristics
must be used.
The Rule Selection Criteria
in GDT-RS
 Selecting the rules that cover as many
instances as possible.
 Selecting the rules that contain as few
attributes as possible, if they cover the same
number of instances.
 Selecting the rules with larger strengths, if
they have the same number of condition
attributes and cover the same number of
instances.
Attribute Evaluation Criteria
 Selecting the attributes that cause the number
of consistent instances to increase faster
– To obtain the subset of attributes as small as
possible
 Selecting an attribute that has a smaller number
of different values
– To guarantee that the number of instances covered
by rules is as large as possible.
A Heuristic Algorithm
for Attribute Selection
 Let R be the set of selected attributes, P be the
set of unselected condition attributes, U be the
set of all instances, X be the set of contradictory
instances, and EXPECT be the threshold of
accuracy.
 In the initial state, R = CORE(C),
P = C − CORE(C), X = U − POS_R(D), and k = 0.
A Heuristic Algorithm
for Attribute Selection (2)
 Step 1. If k ≥ EXPECT, finish; otherwise
calculate the dependency degree

k = |POS_R(D)| / |U|.

 Step 2. For each p in P, calculate

v_p = |POS_{R ∪ {p}}(D)|,
m_p = max_size(POS_{R ∪ {p}}(D) / (R ∪ {p})),

where max_size denotes the cardinality of the maximal subset
(equivalence class).
A Heuristic Algorithm
for Attribute Selection (3)
 Step 3. Choose the best attribute p, i.e. the one
with the largest v_p · m_p, and let
R = R ∪ {p}, P = P − {p}.
 Step 4. Remove all consistent instances u in
POS_R(D) from X.
 Step 5. Go back to Step 1.
Main Features of RSH
 It can select a better subset of attributes
quickly and effectively from a large DB.
 The selected attributes do not noticeably
degrade the performance of induction.
An Example of
Attribute Selection
U a b c d e
u1 1 0 2 1 1
u2 1 0 2 0 1
u3 1 2 0 0 2
u4 1 2 2 1 0
u5 2 1 0 0 2
u6 2 1 1 0 2
u7 2 1 2 1 1
Condition Attributes:
a: Va = {1, 2}
b: Vb = {0, 1, 2}
c: Vc = {0, 1, 2}
d: Vd = {0, 1}
Decision Attribute:
e: Ve = {0, 1, 2}
U b c d e
u1 0 2 1 1
u2 0 2 0 1
u3 2 0 0 2
u4 2 2 1 0
u5 1 0 0 2
u6 1 1 0 2
u7 1 2 1 1
Searching for CORE
Removing attribute a
Removing attribute a does
not cause inconsistency.
Hence, a is not used as
CORE.
Searching for CORE (2)
Removing attribute b
U a c d e
u1 1 2 1 1
u2 1 2 0 1
u3 1 0 0 2
u4 1 2 1 0
u5 2 0 0 2
u6 2 1 0 2
u7 2 2 1 1
u1: a1 c2 d1 → e1
u4: a1 c2 d1 → e0

Removing attribute b causes inconsistency.
Hence, b is used as CORE.
Searching for CORE (3)
Removing attribute c
U a b d e
u1 1 0 1 1
u2 1 0 0 1
u3 1 2 0 2
u4 1 2 1 0
u5 2 1 0 2
u6 2 1 0 2
u7 2 1 1 1
Removing attribute c
does not cause inconsistency.
Hence, c is not used
as CORE.
Searching for CORE (4)
Removing attribute d
U a b c e
u1 1 0 2 1
u2 1 0 2 1
u3 1 2 0 2
u4 1 2 2 0
u5 2 1 0 2
u6 2 1 1 2
u7 2 1 2 1
Removing attribute d
does not cause inconsistency.
Hence, d is not used
as CORE.
Searching for CORE (5)
CORE(C)={b}
Initial subset R = {b}
Attribute b is the unique indispensable
attribute.
R={b}
U a b c d e
u1 1 0 2 1 1
u2 1 0 2 0 1
u3 1 2 0 0 2
u4 1 2 2 1 0
u5 2 1 0 0 2
u6 2 1 1 0 2
u7 2 1 2 1 1
b0 → e1 (consistent)
U’ b e
u1 0 1
u2 0 1
u3 2 2
u4 2 0
u5 1 2
u6 1 2
u7 1 1
The instances containing b0 will not be considered.
Attribute Evaluation Criteria
 Selecting the attributes that cause the number
of consistent instances to increase faster
– To obtain the subset of attributes as small as
possible
 Selecting the attribute that has a smaller number
of different values
– To guarantee that the number of instances covered
by a rule is as large as possible.
Selecting Attribute from {a,c,d}
U’   a  b  e
u3   1  2  2
u4   1  2  0
u5   2  1  2
u6   2  1  2
u7   2  1  1

1. Selecting {a}: R = {a, b}

u3: a1 b2 → e2
u4: a1 b2 → e0
u5, u6: a2 b1 → e2
u7: a2 b1 → e1

POS_{a,b}(X/{e}) = ∅

X/{e}: {u3, u5, u6}, {u4}, {u7}
X/{a, b}: {u3, u4}, {u5, u6, u7}
Selecting Attribute from {a,c,d} (2)
2. Selecting {c}: R = {b, c}

U’   b  c  e
u3   2  0  2
u4   2  2  0
u5   1  0  2
u6   1  1  2
u7   1  2  1

u3: b2 c0 → e2
u4: b2 c2 → e0
u5: b1 c0 → e2
u6: b1 c1 → e2
u7: b1 c2 → e1

POS_{b,c}(X/{e}) = {u3, u4, u5, u6, u7}

X/{e}: {u3, u5, u6}, {u4}, {u7}
Selecting Attribute from {a,c,d} (3)
3. Selecting {d}: R = {b, d}

U’   b  d  e
u3   2  0  2
u4   2  1  0
u5   1  0  2
u6   1  0  2
u7   1  1  1

u3: b2 d0 → e2
u4: b2 d1 → e0
u5, u6: b1 d0 → e2
u7: b1 d1 → e1

POS_{b,d}(X/{e}) = {u3, u4, u5, u6, u7}

X/{e}: {u3, u5, u6}, {u4}, {u7}
Selecting Attribute from {a,c,d} (4)
3. Selecting {d}: R = {b, d}

POS_{b,d}({u3, u5, u6}) / {b, d} = {{u3}, {u5, u6}}

max_size(POS_{b,d}({u3, u5, u6}) / {b, d}) = 2

X/{e}: {u3, u5, u6}, {u4}, {u7}
X/{b, d}: {u3}, {u5, u6}, {u4}, {u7}

Result: the selected subset of attributes is {b, d}.
Experimental Results
Data sets               Attribute  Instance  Attr. N.  Selected
                        Number     Number    in Core   Attr. N.
Monk1                   6          124       3         3
Monk3                   6          122       4         4
Mushroom                22         8124      0         4
Breast cancer           10         699       1         4
Earthquake              16         155       0         3
Meningitis              30         140       1         4
Bacterial examination   57         20920     2         9
Slope-collapse          23         3436      6         8
Gastric cancer          38         7520      2         19
A Rough Set Based KDD Process
 Discretization based on RS and
Boolean Reasoning (RSBR).
 Attribute selection based on RS
with Heuristics (RSH).
 Rule discovery by GDT-RS.
Main Features of GDT-RS
 Unseen instances are considered in the
discovery process, and the uncertainty of a
rule, including its ability to predict possible
instances, can be explicitly represented in the
strength of the rule.
 Biases can be flexibly selected for search
control, and background knowledge can be
used as a bias to control the creation of a GDT
and the discovery process.
A Sample DB
U    a    b    c    d
u1   a0   b0   c1   y
u2   a0   b1   c1   y
u3   a0   b0   c1   y
u4   a1   b1   c0   n
u5   a0   b0   c1   n
u6   a0   b2   c1   y
u7   a1   b1   c1   y

Condition attributes: a, b, c
a = {a0, a1}, b = {b0, b1, b2}, c = {c0, c1}
Decision attribute: d, d = {y, n}
A Sample Database (2)
 T = (U, A, C, D)
 Attributes A = C ∪ D = {a, b, c, d}
 Condition attributes C = {a, b, c}
a: Va = {a0, a1}
b: Vb = {b0, b1, b2}
c: Vc = {c0, c1}
 Decision attribute D = {d}
d: Vd = {y, n}
A Sample GDT
(Table: rows are the possible generalizations G(x), such as *b0c0,
*b0c1, ..., a0*c0, ..., a1b1*, a1b2*, **c0, ..., a0**, a1**; columns
are the possible instances F(x) from a0b0c0 through a1b2c1. A cell
holds p(PI | PG): 1/2 in each matching cell of a row like *b0c0
(two covered instances), 1/3 for rows like a0*c0, 1/6 for rows like
**c0 and a0**; empty cells are 0.)
Explanation for GDT
 F(x): the possible instances (PI)
 G(x): the possible generalizations (PG)
 G(x) → F(x): the probability relationships
between PI & PG.
Probabilistic Relationship
Between PIs and PGs

p(PI_j | PG_i) = 1 / N_{PG_i}   if PG_i is a generalization of PI_j,
p(PI_j | PG_i) = 0              otherwise,

where N_{PG_i} is the number of PIs satisfying the i-th PG:

N_{PG_i} = Π_{l ∈ {l | PG_i[l] = *}} n_l,

with n_l the number of values of attribute l.

Example: a0*c0 generalizes a0b0c0, a0b1c0 and a0b2c0, so
N_{a0*c0} = n_b = 3 and p = 1/3 for each of them.
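The cell probability is easy to compute directly. A minimal sketch for the attribute sets of the sample DB; `covers` and `p` are illustrative helper names:

```python
# p(PI | PG) for the sample attribute sets a, b, c.
values = {"a": ["a0", "a1"], "b": ["b0", "b1", "b2"], "c": ["c0", "c1"]}

def covers(pg, pi):
    """True if the generalization pg matches the instance pi."""
    return all(g == "*" or g == v for g, v in zip(pg, pi))

def p(pi, pg):
    """1/N_PG if pg generalizes pi, else 0; N_PG is the product of
    the value-set sizes of the wildcarded attributes."""
    if not covers(pg, pi):
        return 0.0
    n = 1
    for attr, g in zip(values, pg):
        if g == "*":
            n *= len(values[attr])
    return 1 / n

print(p(("a0", "b0", "c0"), ("a0", "*", "c0")))  # 0.3333... = 1/3
```

For a0*c0 only b is wildcarded, so N_PG = 3 and each of its three possible instances gets probability 1/3, as in the slide's example.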
Unseen Instances
U Headache Muscle-pain Temp. Flu
U1 Yes Yes Normal No
U2 Yes Yes High Yes
U3 Yes Yes Very-high Yes
U4 No Yes Normal No
U5 No No High No
U6 No Yes Very-high Yes
Unseen instances (Headache, Muscle-pain, Temp.):
(yes, no, normal), (yes, no, high), (yes, no, very-high),
(no, yes, high), (no, no, normal), (no, no, very-high)

(The observed table is the closed world; the unseen instances lie
in the open world.)
Rule Representation
X → Y with S
 X denotes the conjunction of the conditions
that a concept must satisfy
 Y denotes a concept that the rule describes
 S is a “measure of strength” with which the
rule holds
Rule Strength (1)
 The strength of the generalization X
(BK is not used):

s(X) = s(PG_k) = Σ_l p(PI_l | PG_k) = N_{ins-rel}(PG_k) · (1 / N_{PG_k}),

where N_{ins-rel}(PG_k) is the number of the observed
instances satisfying the k-th generalization.
 The strength of the rule X → Y:

S(X → Y) = s(X) · (1 − r(X → Y)).
Rule Strength (2)
 The strength of the generalization X
(BK is used):

s(X) = s(PG_k) = Σ_l p(PI_l | PG_k) · BKF(PI_l | PG_k),

where BKF(PI_l | PG_k) is the background-knowledge factor
that reweights the uniform distribution p(PI_l | PG_k) = 1 / N_{PG_k}.
Rule Strength (3)
 The rate of noises:

r(X → Y) = (N_{ins-rel}(X) − N_{ins-class}(X, Y)) / N_{ins-rel}(X),

where N_{ins-class}(X, Y) is the number of instances
belonging to the class Y within the instances
satisfying the generalization X.
Rule Discovery by GDT-RS
Condition Attrs.: a, b, c
a: Va = {a0, a1}
b: Vb = {b0, b1, b2}
c: Vc = {c0, c1}
Class: d:
d: Vd = {y,n}
U a b c d
u1 a0 b0 c1 y
u2 a0 b1 c1 y
u3 a0 b0 c1 y
u4 a1 b1 c0 n
u5 a0 b0 c1 n
u6 a0 b2 c1 n
u7 a1 b1 c1 y
Regarding the Instances
(Noise Rate = 0)
U            a    b    c    d
u1, u3, u5
(= u1')      a0   b0   c1   y, y, n
u2           a0   b1   c1   y
u4           a1   b1   c0   n
u6           a0   b2   c1   n
u7           a1   b1   c1   y

r_{y}(u1') = 1 − 2/3 = 0.33
r_{n}(u1') = 1 − 1/3 = 0.67

Let T_noise = 0. Since r_{y}(u1') > T_noise and
r_{n}(u1') > T_noise, the decision of u1' is treated as
unknown: d(u1') = ⊥.

U    a    b    c    d
u1'  a0   b0   c1   ⊥
u2   a0   b1   c1   y
u4   a1   b1   c0   n
u6   a0   b2   c1   n
u7   a1   b1   c1   y
Generating Discernibility Vector
for u2
U    a    b    c    d
u1'  a0   b0   c1   ⊥
u2   a0   b1   c1   y
u4   a1   b1   c0   n
u6   a0   b2   c1   n
u7   a1   b1   c1   y

m_{2,1'} = {b},  m_{2,2} = λ,  m_{2,4} = {a, c},
m_{2,6} = {b},  m_{2,7} = λ

     u1'   u2   u4    u6   u7
u2   b     λ    a,c   b    λ
Obtaining Reducts for u2
     u1'   u2   u4    u6   u7
u2   b     λ    a,c   b    λ

f_T(u2) = b ∧ (a ∨ c) ∧ b = (a ∧ b) ∨ (b ∧ c)
Generating Rules from u2
f_T(u2) = (a ∧ b) ∨ (b ∧ c)  →  {a0, b1}, {b1, c1}

{a0b1} covers a0b1c0 and a0b1c1 (u2):   s({a0b1}) = 0.5
{b1c1} covers a0b1c1 (u2) and a1b1c1 (u7):   s({b1c1}) = 1

r({a0b1} → y) = 0
r({b1c1} → y) = 0
Generating Rules from u2 (2)
{a0b1} → y with S = (1 · 1/2)(1 − 0) = 0.5
{b1c1} → y with S = (2 · 1/2)(1 − 0) = 1
Generating Discernibility Vector
for u4
U    a    b    c    d
u1'  a0   b0   c1   ⊥
u2   a0   b1   c1   y
u4   a1   b1   c0   n
u6   a0   b2   c1   n
u7   a1   b1   c1   y

m_{4,1'} = {a, b, c},  m_{4,2} = {a, c},  m_{4,4} = λ,
m_{4,6} = λ,  m_{4,7} = {c}

     u1'     u2    u4   u6   u7
u4   a,b,c   a,c   λ    λ    c
Obtaining Reducts for u4
     u1'     u2    u4   u6   u7
u4   a,b,c   a,c   λ    λ    c

f_T(u4) = (a ∨ b ∨ c) ∧ (a ∨ c) ∧ c = c
Generating Rules from u4
f_T(u4) = c  →  {c0}

{c0} covers the six possible instances with c = c0
(a0b0c0, ..., a1b1c0 (u4), a1b2c0), of which only
a1b1c0 (u4) is observed:

s({c0}) = 1/6
r({c0} → n) = 0
Generating Rules from u4 (2)
{c0} → n with S = (1/6)(1 − 0) = 0.167
Generating Rules from All
Instances
U    a    b    c    d
u1'  a0   b0   c1   ⊥
u2   a0   b1   c1   y
u4   a1   b1   c0   n
u6   a0   b2   c1   n
u7   a1   b1   c1   y

u2: {a0b1} → y, S = 0.5
    {b1c1} → y, S = 1
u4: {c0} → n, S = 0.167
u6: {b2} → n, S = 0.25
u7: {a1c1} → y, S = 0.5
    {b1c1} → y, S = 1
Rule Selection
 Selecting the rules that cover as many
instances as possible.
 Selecting the rules at levels of
generalization as high as possible.
 Selecting the rules with larger strengths
at the same level of generalization.
Generalization belonging to
Class y
         a0b1c1(y)  a1b1c1(y)
           (u2)       (u7)
*b1c1    1/2        1/2
a1*c1               1/3
a0b1*    1/2

{b1c1} → y with S = 1      u2, u7
{a1c1} → y with S = 1/2    u7
{a0b1} → y with S = 1/2    u2
Generalization belonging to
Class n
         a0b2c1(n)  a1b1c0(n)
           (u6)       (u4)
**c0                1/6
*b2*     1/4

{c0} → n with S = 1/6    u4
{b2} → n with S = 1/4    u6
Results from the Sample DB
(Noise Rate = 0)
 Certain Rules:              Instances Covered
{c0} → n with S = 1/6        u4
{b2} → n with S = 1/4        u6
{b1c1} → y with S = 1        u2, u7
 Possible Rules:
b0 → y with S = (1/4)(1/2)
a0 & b0 → y with S = (1/2)(2/3)
a0 & c1 → y with S = (1/3)(2/3)
b0 & c1 → y with S = (1/2)(2/3)
Instances Covered: u1, u3, u5
Results from the Sample DB (2)
(Noise Rate > 0)
Regarding Instances
(Noise Rate > 0)
U               a  b  c  d
u1’(u1,u3,u5)   a0 b0 c1 y,y,n
u2              a0 b1 c1 y
u4              a1 b1 c0 n
u6              a0 b2 c1 n
u7              a1 b1 c1 y

r(u1’ → {y}) = 1 − 2/3 = 0.33
r(u1’ → {n}) = 1 − 1/3 = 0.67

Let Tnoise = 0.5.
Since r(u1’ → {y}) < Tnoise, d(u1’) = y:

U    a  b  c  d
u1’  a0 b0 c1 y
u2   a0 b1 c1 y
u4   a1 b1 c0 n
u6   a0 b2 c1 n
u7   a1 b1 c1 y
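The noise-rate step above can be sketched directly: a compound instance keeps the class whose noise rate stays below the threshold Tnoise. A minimal sketch (helper name assumed, not from the tutorial):

```python
def noise_rates(decisions):
    """decisions: decision values of the raw instances merged into one
    compound instance, e.g. u1' = {u1, u3, u5}."""
    total = len(decisions)
    return {d: 1 - decisions.count(d) / total for d in set(decisions)}

rates = noise_rates(["y", "y", "n"])   # compound instance u1'
T_noise = 0.5
label = min(rates, key=rates.get)      # class with the lowest noise rate
print(rates)                           # r(y) = 1/3, r(n) = 2/3
if rates[label] <= T_noise:
    print(f"d(u1') = {label}")         # y, since 1/3 < 0.5
```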
Rules Obtained from All
Instances
U    a  b  c  d
u1’  a0 b0 c1 y
u2   a0 b1 c1 y
u4   a1 b1 c0 n
u6   a0 b2 c1 n
u7   a1 b1 c1 y
u2: {a0b1} → y, S = 0.5
    {b1c1} → y, S = 1
u4: {c0} → n, S = 0.167
u6: {b2} → n, S = 0.25
u7: {a1c1} → y, S = 0.5
    {b1c1} → y, S = 1
u1’: {b0} → y, S = 1/4 × 2/3 = 0.167
Example of Using BK
        a0b0c0  a0b0c1  a0b1c0  a0b1c1  a0b2c0  a0b2c1  …  a1b2c1
a0b0*   1/2     1/2
a0b1*                   1/2     1/2
a0*c1           1/3             1/3             1/3
a0**    1/6     1/6     1/6     1/6     1/6     1/6

BK: a0 => c1, 100%

        a0b0c0  a0b0c1  a0b1c0  a0b1c1  a0b2c0  a0b2c1  …  a1b2c1
a0b0*   0       1
a0b1*                   0       1
a0*c1           1/3             1/3             1/3
a0**    0       1/3     0       1/3     0       1/3
Changing Strength of
Generalization by BK
U    a  b  c  d
u1’  a0 b0 c1
u2   a0 b1 c1 y
u4   a1 b1 c0 n
u6   a0 b2 c1 n
u7   a1 b1 c1 y

fT(u2) = (a ∧ b) ∨ (b ∧ c)
{a0,b1}  {b1,c1}

Without BK:                     With BK (a0 => c1, 100%):
{a0b1}                          {a0b1}
  a0b1c0      (1/2)               a0b1c0      (0%)
  a0b1c1(u2)  (1/2)               a0b1c1(u2)  (100%)
s({a0b1}) = 0.5                 s({a0b1}) = 1
r({a0b1} → y) = 0               r({a0b1} → y) = 0
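The effect of background knowledge can be sketched as redistributing the generalization probabilities: instantiations ruled out by the BK rule get probability zero and the remaining mass is renormalized, so s({a0b1}) rises from 0.5 to 1. A minimal sketch (function name and encoding assumed):

```python
def apply_bk(dist, impossible):
    """Zero out instantiations ruled out by BK and renormalize the rest."""
    kept = {k: v for k, v in dist.items() if k not in impossible}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}

# uniform distribution over the instantiations of {a0b1}, without BK
dist = {"a0b1c0": 0.5, "a0b1c1": 0.5}

# BK "a0 => c1, 100%" rules out the c0 instantiation
adjusted = apply_bk(dist, impossible={"a0b1c0"})
print(adjusted)   # all mass moves to a0b1c1
```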
Algorithm 1
Optimal Set of Rules
 Step 1. Consider the instances with the same
condition attribute values as one instance,
called a compound instance.
 Step 2. Calculate the rate of noises r for each
compound instance.
 Step 3. Select one instance u from U and
create a discernibility vector for u.
 Step 4. Calculate all reducts for the instance u
by using the discernibility function.
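Steps 1 to 4 can be sketched end to end on the sample database. This is a self-contained illustration under my own encoding (attributes as tuple indices 0, 1, 2 for a, b, c), not the tutorial's implementation; the compound instance u1’ is resolved to class y as in the noise-rate example above.

```python
from itertools import combinations

rows = [  # ((a, b, c), d) for u1..u7; u1, u3, u5 share the same conditions
    ((0, 0, 1), "y"), ((0, 1, 1), "y"), ((0, 0, 1), "y"),
    ((1, 1, 0), "n"), ((0, 0, 1), "n"), ((0, 2, 1), "n"), ((1, 1, 1), "y"),
]

# Step 1: merge rows with equal conditions into compound instances
compound = {}
for cond, d in rows:
    compound.setdefault(cond, []).append(d)

# Step 2: noise rate of each compound instance, keeping the majority class
labels = {}
for cond, ds in compound.items():
    best = max(set(ds), key=ds.count)
    labels[cond] = (best, 1 - ds.count(best) / len(ds))

# Step 3: discernibility vector for u = u2 = (0, 1, 1)
u = (0, 1, 1)
attrs = range(3)
vec = [{i for i in attrs if u[i] != v[i]}
       for v, (d, _) in labels.items() if v != u and d != labels[u][0]]

# Step 4: reducts of u = minimal attribute sets hitting every vector entry
reducts = []
for r in (1, 2, 3):
    for s in combinations(attrs, r):
        if all(set(s) & m for m in vec) and not any(h <= set(s) for h in reducts):
            reducts.append(set(s))
print(reducts)   # index sets: {0, 1} is {a, b}, {1, 2} is {b, c}
```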

E5-roughsets unit-V.pdf

  • 1. Rough Sets in KDD Tutorial Notes Andrzej Skowron Warsaw University Ning Zhong Maebashi Institute of Technolgy Copyright 2000 by A. Skowron & N. Zhong
  • 2. About the Speakers  Andrzej Skowron received his Ph.D. from Warsaw University. He is a professor in Faculty of Mathematics, Computer Science and Mechanics, Warsaw University, Poland. His research interests include soft computing methods and applications, in particular, reasoning with incomplete information, approximate reasoning, rough sets, rough mereology, granular computing, synthesis and analysis of complex objects, intelligent agents, knowledge discovery and data mining, etc, with over 200 journal and conference publications. He is an editor of several international journals and book series including Fundamenta Informaticae (editor in chief), Data Mining and Knowledge Discovery. He is president of International Rough Set Society. He was an invited speaker at many international conferences, and has served or is currently serving on the program committees of over 40 international conferences and workshops, including ISMIS’97-99 (program chair), RSCTC’98-00 (program chair), RSFDGrC’99 (program chair).
  • 3. About the Speakers (2)  Ning Zhong received his Ph.D. from the University of Tokyo. He is director of Knowledge Information Systems Laboratory, and an associate professor in Department of Information Engineering, Maebashi Institute of Technology, Japan. His research interests include knowledge discovery and data mining, rough sets and granular-soft computing, intelligent agents and databases, knowledge-based systems and hybrid systems, with over 80 journal and conference publications. He is an editor of Knowledge and Information Systems: an international journal (Springer). He is a member of the advisory board of International Rough Set Society, ACM SIGKDD International Liaisons Board, the Steering Committee of PAKDD conferences, the advisory board and coordinator of BISC/SIGGrC. He has served or is currently serving on the program committees of over 25 international conferences and workshops, including PAKDD’99 (program chair), IAT’99 (program chair), and RSFDGrC’99 (program chair).
  • 4. Contents  Introduction  Basic Concepts of Rough Sets  A Rough Set Based KDD process  Rough Sets in ILP and GrC  Concluding Remarks (Summary, Advanced Topics, References and Further Readings).
  • 5. Introduction  Rough set theory was developed by Zdzislaw Pawlak in the early 1980’s.  Representative Publications: – Z. Pawlak, “Rough Sets”, International Journal of Computer and Information Sciences, Vol.11, 341-356 (1982). – Z. Pawlak, Rough Sets - Theoretical Aspect of Reasoning about Data, Kluwer Academic Pubilishers (1991).
  • 6. Introduction (2)  The main goal of the rough set analysis is induction of approximations of concepts.  Rough sets constitutes a sound basis for KDD. It offers mathematical tools to discover patterns hidden in data.  It can be used for feature selection, feature extraction, data reduction, decision rule generation, and pattern extraction (templates, association rules) etc.
  • 7. Introduction (3)  Recent extensions of rough set theory have developed new methods for decomposition of large data sets, data mining in distributed and multi-agent systems, and granular computing. This presentation shows how several aspects of the above problems are solved by the (classic) rough set approach, discusses some advanced topics, and gives further research directions.
  • 8. Basic Concepts of Rough Sets  Information/Decision Systems (Tables)  Indiscernibility  Set Approximation  Reducts and Core  Rough Membership  Dependency of Attributes
  • 9. Information Systems/Tables  IS is a pair (U, A)  U is a non-empty finite set of objects.  A is a non-empty finite set of attributes such that for every  is called the value set of a. a V U a  : . A a a V Age LEMS x1 16-30 50 x2 16-30 0 x3 31-45 1-25 x4 31-45 1-25 x5 46-60 26-49 x6 16-30 26-49 x7 46-60 26-49
  • 10. Decision Systems/Tables  DS:  is the decision attribute.  The elements of A are called the condition attributes. Age LEMS Walk x1 16-30 50 yes x2 16-30 0 no x3 31-45 1-25 no x4 31-45 1-25 yes x5 46-60 26-49 no x6 16-30 26-49 yes x7 46-60 26-49 no }) { , ( d A U T   A d 
  • 11. Issues in the Decision Table  The same or indiscernible objects may be represented several times.  Some of the attributes may be superfluous.
  • 12. Indiscernibility  The equivalence relation A binary relation which is reflexive (i.e. an object is in relation with itself xRx) , symmetric (if xRy then yRx) and transitive (if xRy and yRz then xRz).  The equivalence class of an element consists of all objects such that xRy. X X R   X x X y
  • 13. Indiscernibility (2)  Let IS = (U, A) be an information system, then with any there is associated an equivalence relation: where is called the B-indiscernibility relation.  If then objects x and x’are indiscernible from each other by attributes from B.  The equivalence classes of the B-indiscernibility relation are denoted A B  )} ' ( ) ( , | ) ' , {( ) ( 2 x a x a B a U x x B INDIS      ) (B INDIS ), ( ) ' , ( B IND x x IS  . ] [ B x
  • 14. An Example of Indiscernibility  The non-empty subsets of the condition attributes are {Age}, {LEMS}, and {Age, LEMS}.  IND({Age}) = {{x1,x2,x6}, {x3,x4}, {x5,x7}}  IND({LEMS}) = {{x1}, {x2}, {x3,x4}, {x5,x6,x7}}  IND({Age,LEMS}) = {{x1}, {x2}, {x3,x4}, {x5,x7}, {x6}}. Age LEMS Walk x1 16-30 50 yes x2 16-30 0 no x3 31-45 1-25 no x4 31-45 1-25 yes x5 46-60 26-49 no x6 16-30 26-49 yes x7 46-60 26-49 no
  • 15. Observations  An equivalence relation induces a partitioning of the universe.  The partitions can be used to build new subsets of the universe.  Subsets that are most often of interest have the same value of the decision attribute. It may happen, however, that a concept such as “Walk” cannot be defined in a crisp manner.
  • 16. Set Approximation  Let T = (U, A) and let and We can approximate X using only the information contained in B by constructing the B-lower and B-upper approximations of X, denoted and respectively, where A B  . U X  X B X B }, ] [ | { X x x X B B   }. 0 ] [ | {    X x x X B B
  • 17. Set Approximation (2)  B-boundary region of X, consists of those objects that we cannot decisively classify into X in B.  B-outside region of X, consists of those objects that can be with certainty classified as not belonging to X.  A set is said to be rough if the boundary region is non-empty. , ) ( X B X B X BNB   , X B U 
  • 18. An Example of Set Approximation  Let W = {x | Walk(x) = yes}.  The decision class, Walk, is rough since the boundary region is not empty. Age LEMS Walk x1 16-30 50 yes x2 16-30 0 no x3 31-45 1-25 no x4 31-45 1-25 yes x5 46-60 26-49 no x6 16-30 26-49 yes x7 46-60 26-49 no }. 7 , 5 , 2 { }, 4 , 3 { ) ( }, 6 , 4 , 3 , 1 { }, 6 , 1 { x x x W A U x x W BN x x x x W A x x W A A     
  • 19. An Example of Set Approximation (2) yes yes/no no {{x1,{x6}} {{x3,x4}} {{x2}, {x5,x7}}
  • 20. U setX U/R R : subset of attributes X R X R Lower & Upper Approximations
  • 21. Lower & Upper Approximations (2) } : / { X Y R U Y X R     } 0 : / {     X Y R U Y X R  Lower Approximation: Upper Approximation:
  • 22. Lower & Upper Approximations (3) X1 = Flu(yes) = {u2, u3, u6, u7} Lower approx., RX1 {u2, u3} Upper approx., {u2, u3, u6, u7, u8, u5} X2 = Flu(no) = {u1, u4, u5, u8} Lower approx., RX2 {u1, u4} Upper approx., {u1, u4, u5, u8, u7, u6} X1 R X2 R U Headache Temp. Flu U1 Yes Normal No U2 Yes High Yes U3 Yes Very-high Yes U4 No Normal No U5 N N No o o H H Hi i ig g gh h h N N No o o U6 No Very-high Yes U7 N N No o o H H Hi i ig g gh h h Y Y Ye e es s s U8 No Very-high No Elementary sets of indiscernibility relations defined by R = {Headache, Temp.} are {u1}, {u2}, {u3}, {u4}, {u5, u7}, {u6, u8}.
  • 23. Lower & Upper Approximations (4) R = {Headache, Temp.} U/R = { {u1}, {u2}, {u3}, {u4}, {u5, u7}, {u6, u8}} X1 = Flu(yes) = {u2,u3,u6,u7} X2 = Flu(no) = {u1,u4,u5,u8} RX1 = {u2, u3} = {u2, u3, u6, u7, u8, u5} RX2 = {u1, u4} = {u1, u4, u5, u8, u7, u6} X1 R X2 R u1 u4 u3 X1 X2 u5 u7 u2 u6 u8
  • 25. Properties of Approximation (2) ) ( )) ( ( )) ( ( ) ( )) ( ( )) ( ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( X B X B B X B B X B X B B X B B X B X B X B X B Y B X B Y X B Y B X B Y X B                 where -X denotes U - X.
  • 26. Four Basic Classes of Rough Sets  X is roughly B-definable, iff and  X is internally B-undefinable, iff and  X is externally B-undefinable, iff and  X is totally B-undefinable, iff and 0 ) (  X B , ) ( U X B  0 ) (  X B , ) ( U X B  0 ) (  X B , ) ( U X B  0 ) (  X B . ) ( U X B 
  • 27. Accuracy of Approximation where |X| denotes the cardinality of Obviously If X is crisp with respect to B. If X is rough with respect to B. | ) ( | | ) ( | ) ( X B X B X B   . 0  X . 1 0   B  , 1 ) (  X B  , 1 ) (  X B 
  • 28. Issues in the Decision Table  The same or indiscernible objects may be represented several times.  Some of the attributes may be superfluous (redundant). That is, their removal cannot worsen the classification.
  • 29. Reducts  Keep only those attributes that preserve the indiscernibility relation and, consequently, set approximation.  There are usually several such subsets of attributes and those which are minimal are called reducts.
  • 30. Dispensable & Indispensable Attributes Let Attribute c is dispensable in T, if , otherwise attribute c is indispensable in T. . C c ) ( ) ( }) { ( D POS D POS c C C   X C D POS D U X C / ) (   The positive region:
  • 31. Independent  T = (U, A, C, D) is independent if all are indispensable in T. C c
  • 32. Reduct & Core  The set of attributes is called a reduct of C, if T’= (U, A, R, D) is independent and  The set of all the condition attributes indispensable in T is denoted by CORE(C). where RED(C) is the set of all reducts of C. C R  ). ( ) ( D POS D POS C R  ) ( ) ( C RED C CORE  
  • 33. An Example of Reducts & Core U Headache Muscle pain Temp. Flu U1 Yes Yes Normal No U2 Yes Yes High Yes U3 Yes Yes Very-high Yes U4 No Yes Normal No U5 No No High No U6 No Yes Very-high Yes U Muscle pain Temp. Flu U1,U4 Yes Normal No U2 Yes High Yes U3,U6 Yes Very-high Yes U5 No High No U Headache Temp. Flu U1 Yes Norlmal No U2 Yes High Yes U3 Yes Very-high Yes U4 No Normal No U5 No High No Reduct1 = {Muscle-pain,Temp.} Reduct2 = {Headache, Temp.} CORE = {Headache,Temp} {MusclePain, Temp} = {Temp} 
  • 34. Discernibility Matrix  Let T = (U, A, C, D) be a decision table, with By a discernibility matrix of T, denoted M(T), we will mean matrix defined as: for i, j = 1,2,…,n. Here denotes that this case does not need to be considered. They classify objects and into different classes. }. ,..., , { 2 1 n u u u U  n n  )] ( ) ( [ )} ( ) ( : { )] ( ) ( [ j i j i j i u d u d D d if u c u c C c u d u d D d if ij m           i u j u , } { A a a V  
  • 35. Discernibility Function  For any , U ui  }} ,..., 2 , 1 { , : { ) ( n j i j m u f ij j i T      where (1) is the disjunction of all variables a such that if (2) if (3) if ij m  , ij m c .   ij m ), ( false mij    .   ij m ), (true t mij   .   ij m Each logical product in the minimal disjunctive normal form defines a reduct of instance . i u
  • 36. Examples of Discernibility Matrix No a b c d u1 a0 b1 c1 y u2 a1 b1 c0 n u3 a0 b2 c1 n u4 a1 b1 c1 y C = {a, b, c} D = {d} In order to discern equivalence classes of the decision attribute d, to preserve conditions described by the discernibility matrix for this table u1 u2 u3 u2 u3 u4 a,c b c a,b Reduct = {b, c} ) ( ) ( b a c b c a       
  • 37. Examples of Discernibility Matrix (2) a b c d E u1 1 0 2 1 1 u2 1 0 2 0 1 u3 1 2 0 0 2 u4 1 2 2 1 0 u5 2 1 0 0 2 u6 2 1 1 0 2 u7 2 1 2 1 1 u1 u2 u3 u4 u5 u6 u2 u3 u4 u5 u6 u7 b,c,d b,c b b,d c,d a,b,c,d a,b,c a,b,c,d a,b,c,d a,b,c a,b,c,d a,b,c,d a,b c,d c,d Core = {b} Reduct1 = {b,c} Reduct2 = {b,d}      
  • 38. Rough Membership  The rough membership function quantifies the degree of relative overlap between the set X and the equivalence class to which x belongs.  The rough membership function can be interpreted as a frequency-based estimate of where u is the equivalence relation of IND(B). B x] [ ] 1 , 0 [ :  U B X  | ] [ | | ] [ | B B B X x X x    ), | ( u X x P 
  • 39. Rough Membership (2)  The formulae for the lower and upper approximations can be generalized to some arbitrary level of precision by means of the rough membership function  Note: the lower and upper approximations as originally formulated are obtained as a special case with ] 1 , 5 . 0 (   }. 1 ) ( | { } ) ( | {            x x X B x x X B B X B X . 1  
  • 40. Dependency of Attributes  Discovering dependencies between attributes is an important issue in KDD.  A set of attribute D depends totally on a set of attributes C, denoted if all values of attributes from D are uniquely determined by values of attributes from C. , D C 
  • 41. Dependency of Attributes (2)  Let D and C be subsets of A. We will say that D depends on C in a degree k denoted if where called a positive region of the partition U/D with respect to C. ), 1 0 (   k , D C k  | | | ) ( | ) , ( U D POS D C k C    ), ( ) ( / X C D POS D U X C   
  • 42. Dependency of Attributes (3)  Obviously  If k = 1 we say that D depends totally on C.  If k < 1 we say that D depends partially (in a degree k) on C. . | | | ) ( | ) , ( /    D U X U X C D C 
  • 43. A Rough Set Based KDD Process  Discretization based on RS and Boolean Reasoning (RSBR).  Attribute selection based RS with Heuristics (RSH).  Rule discovery by GDT-RS.
  • 44. What Are Real World Issues ?  Very large data sets  Uncertainty (noisy data)  Incompleteness (missing, incomplete data)  Data change  Use of background knowledge
  • 45. very large data set noisy data incomplete instances data change use of background knowledge Real world issues Methods ID3 Prism Version BP Dblearn (C4.5) Space Okay possible
  • 47. Stoch. Proc. Belief Nets Conn. Nets GDT Deduction Induction Abduction RoughSets Fuzzy Sets Soft Techniques for KDD (2)
  • 49. GDT : Generalization Distribution Table RS : Rough Sets TM: Transition Matrix ILP : Inductive Logic Programming GrC : Granular Computing
  • 50. A Rough Set Based KDD Process  Discretization based on RS and Boolean Reasoning (RSBR).  Attribute selection based RS with Heuristics (RSH).  Rule discovery by GDT-RS.
  • 51. Discretization based on RSBR  In the discretization of a decision table = where is an interval of real-valued values, we search for a partition of for any  Any partition of is defined by a sequence of the so-called cuts from  Any family of partitions can be identified with a set of cuts. }), { , ( d A U  ) , [ a a a w v V  a P a V . A a a V 1 v k v v v    ... 2 1 . a V A a a P  } {
  • 52. Discretization Based on RSBR (2) In the discretization process, we search for a set of cuts satisfying some natural conditions. A a b d u1 0.8 2 1 u2 1 0.5 0 u3 1.3 3 0 u4 1.4 1 1 u5 1.4 2 0 u6 1.6 3 1 u7 1.3 1 1 A a b d u1 0 2 1 u2 1 0 0 u3 1 2 0 u4 1 1 1 u5 1 2 0 u6 2 2 1 u7 1 1 1 P P P = {(a, 0.9), (a, 1.5), (b, 0.75), (b, 1.5)}
  • 53. A Geometrical Representation of Data and Cuts 0 0.8 1 1.3 1.4 1.6 a b 3 2 1 0.5 x1 x2 x3 x4 x7 x5 x6
  • 54. A Geometrical Representation of Data and Cuts (2) 0 0.8 1 1.3 1.4 1.6 a b 3 2 1 0.5 x1 x2 x3 x4 x5 x6 x7
  • 55. Discretization Based on RSBR (3)  The sets of possible values of a and b are defined by  The sets of values of a and b on objects from U are given by a(U) = {0.8, 1, 1.3, 1.4, 1.6}; b(U) = {0.5, 1, 2, 3}. ); 2 , 0 [  a V ). 4 , 0 [  b V
  • 56. Discretization Based on RSBR (4)  The discretization process returns a partition of the value sets of conditional attributes into intervals.
  • 57. A Discretization Process  Step 1: define a set of Boolean variables, where corresponds to the interval [0.8, 1) of a corresponds to the interval [1, 1.3) of a corresponds to the interval [1.3, 1.4) of a corresponds to the interval [1.4, 1.6) of a corresponds to the interval [0.5, 1) of b corresponds to the interval [1, 2) of b corresponds to the interval [2, 3) of b } , , , , , , { ) ( 3 2 1 4 3 2 1 b b b a a a a p p p p p p p U BV  b b b a a a a p p p p p p p 3 2 1 4 3 2 1
  • 58. The Set of Cuts on Attribute a 0.8 1.0 1.3 1.4 1.6 a a p1 a p2 a p3 a p4 1 c 2 c 3 c 4 c
  • 59. A Discretization Process (2)  Step 2: create a new decision table by using the set of Boolean variables defined in Step 1. Let be a decision table, be a propositional variable corresponding to the interval for any and }) { , ( d A U T   a k p ) , [ 1 a k a k v v  } 1 ,..., 1 {   a n k . A a
  • 60. A Sample T Defined in Step 2 U* a p1 a p3 a p2 a p4 b p1 b p2 b p3 (x1,x2) (x1,x3) (x1,x5) (x4,x2) (x4,x3) (x4,x5) (x6,x2) (x6,x3) (x6,x5) (x7,x2) (x7,x3) (x7,x5) 1 0 0 0 1 1 0 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0
  • 61. The Discernibility Formula  The discernibility formula means that in order to discern object x1 and x2, at least one of the following cuts must be set, a cut between a(0.8) and a(1) a cut between b(0.5) and b(1) a cut between b(1) and b(2). b b a p p p x x 2 1 1 2 1 ) , (    
  • 62. The Discernibility Formulae for All Different Pairs b b a p p p x x 2 1 1 2 1 ) , (     b a a p p p x x 3 1 1 3 1 ) , (     a a a p p p x x 3 2 1 5 1 ) , (     b a a p p p x x 1 3 2 2 4 ) , (     b b a p p p x x 3 2 2 3 4 ) , (     b p x x 2 5 4 ) , (  
  • 63. The Discernibility Formulae for All Different Pairs (2) b b b a a a p p p p p p x x 3 2 1 4 3 2 2 6 ) , (        a a p p x x 4 3 3 6 ) , (    b a p p x x 3 4 5 6 ) , (    b a p p x x 1 2 2 7 ) , (    b b p p x x 3 2 3 7 ) , (    b a p p x x 2 3 5 7 ) , (   
  • 64. A Discretization Process (3)  Step 3: find the minimal subset of p that discerns all objects in different decision classes. The discernibility boolean propositional formula is defined as follows, )}. ( ) ( : ) . ( { j i U x d x d j i     
  • 65. The Discernibility Formula in CNF Form ) ( ) ( 3 2 1 2 1 1 b a a b b a U p p p p p p        ) ( ) ( 3 2 2 1 3 2 b b a b a a p p p p p p       ) ( 3 2 1 4 3 2 b b b a a a p p p p p p       ) ( ) ( ) ( 1 2 3 4 4 3 b a b a a a p p p p p p       . ) ( ) ( 2 2 3 3 2 b b a b b p p p p p     
  • 66. The Discernibility Formula in DNF Form  We obtain four prime implicants, is the optimal result, because it is the minimal subset of P. } , , { 2 4 2 b a a p p p ) ( ) ( 3 2 3 2 2 4 2 b b a a b a a U p p p p p p p         ). ( ) ( 2 1 4 1 3 2 1 3 b b a a b b b a p p p p p p p p        
  • 67. The Minimal Set Cuts for the Sample DB 0 0.8 1 1.3 1.4 1.6 a b 3 2 1 0.5 x1 x2 x3 x4 x5 x6 x7
  • 68. A Result A a b d u1 0.8 2 1 u2 1 0.5 0 u3 1.3 3 0 u4 1.4 1 1 u5 1.4 2 0 u6 1.6 3 1 u7 1.3 1 1 A a b d u1 0 1 1 u2 0 0 0 u3 1 1 0 u4 1 0 1 u5 1 1 0 u6 2 1 1 u7 1 0 1 P P P = {(a, 1.2), (a, 1.5), (b, 1.5)}
  • 69. A Rough Set Based KDD Process  Discretization based on RS and Boolean Reasoning (RSBR).  Attribute selection based RS with Heuristics (RSH).  Rule discovery by GDT-RS.
  • 70. Attribute Selection U Headache Muscle-pain Temp. Flu U1 Yes Yes Normal No U2 Yes Yes High Yes U3 Yes Yes Very-high Yes U4 No Yes Normal No U5 No No High No U6 No Yes Very-high Yes U Muscle-pain Temp. Flu U1 Yes Normal No U2 Yes High Yes U3 Yes Very-high Yes U4 Yes Normal No U5 No High No U6 Yes Very-high Yes U Headache Temp. Flu U1 Yes Normal No U2 Yes High Yes U3 Yes Very-high Yes U4 No Normal No U5 No High No U6 No Very-high Yes
  • 71. Observations  A database always contains a lot of attributes that are redundant and not necessary for rule discovery.  If these redundant attributes are not removed, not only the time complexity of rule discovery increases, but also the quality of the discovered rules may be significantly depleted.
  • 72. The Goal of Attribute Selection Finding an optimal subset of attributes in a database according to some criterion, so that a classifier with the highest possible accuracy can be induced by learning algorithm using information about data available only from the subset of attributes.
  • 73. The Filter Approach  Preprocessing  The main strategies of attribute selection: – The minimal subset of attributes – Selection of the attributes with a higher rank  Advantage – Fast  Disadvantage – Ignoring the performance effects of the induction algorithm
  • 74. The Wrapper Approach  Using the induction algorithm as a part of the search evaluation function  Possible attribute subsets (N-number of attributes)  The main search methods: – Exhaustive/Complete search – Heuristic search – Non-deterministic search  Advantage – Taking into account the performance of the induction algorithm  Disadvantage – The time complexity is high 1 2  N
  • 75. Basic Ideas: Attribute Selection using RSH  Take the attributes in CORE as the initial subset.  Select one attribute each time using the rule evaluation criterion in our rule discovery system, GDT-RS.  Stop when the subset of selected attributes is a reduct.
  • 76. Why Heuristics ?  The number of possible reducts can be where N is the number of attributes. Selecting the optimal reduct from all of possible reducts is NP-hard and heuristics must be used. 1 2  N
  • 77. The Rule Selection Criteria in GDT-RS  Selecting the rules that cover as many instances as possible.  Selecting the rules that contain as little attributes as possible, if they cover the same number of instances.  Selecting the rules with larger strengths, if they have same number of condition attributes and cover the same number of instances.
  • 78. Attribute Evaluation Criteria  Selecting the attributes that cause the number of consistent instances to increase faster – To obtain the subset of attributes as small as possible  Selecting an attribute that has smaller number of different values – To guarantee that the number of instances covered by rules is as large as possible.
  • 79. A Heuristic Algorithm for Attribute Selection  Let R be a set of the selected attributes, P be the set of unselected condition attributes, U be the set of all instances, X be the set of contradictory instances, and EXPECT be the threshold of accuracy.  In the initial state, R = CORE(C), k = 0. ) (D POS U X R   ), (C CORE C P  
  • 80. A Heuristic Algorithm for Attribute Selection (2)  Step 1. If k >= EXPECT, finish, otherwise calculate the dependency degree, k,  Step 2. For each p in P, calculate )) } { /( ) ( ( max_ | ) ( | }) { ( }) { ( D p R D POS size m D POS v p R p p R p       . | | | ) ( | U D POS k R  where max_size denotes the cardinality of the maximal subset.
  • 81. A Heuristic Algorithm for Attribute Selection (3)  Step 3. Choose the best attribute p, i.e., the one with the largest v_p and m_p, and let R = R ∪ {p}, P = P − {p}.  Step 4. Remove all consistent instances u in POS_R(D) from X.  Step 5. Go back to Step 1.
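The five steps above can be sketched in a few lines of Python. This is an illustration only, not the GDT-RS implementation: the decision table is the worked example from the following slides, the helper names (`pos`, `select`) are ours, and the tie-break by the pair (v_p, m_p) is our reading of Steps 2–3.

```python
# Greedy sketch of the RSH heuristic (illustrative; helper names are ours).
# Decision table from the worked example: conditions a,b,c,d, decision e.
DATA = {
    "u1": {"a": 1, "b": 0, "c": 2, "d": 1, "e": 1},
    "u2": {"a": 1, "b": 0, "c": 2, "d": 0, "e": 1},
    "u3": {"a": 1, "b": 2, "c": 0, "d": 0, "e": 2},
    "u4": {"a": 1, "b": 2, "c": 2, "d": 1, "e": 0},
    "u5": {"a": 2, "b": 1, "c": 0, "d": 0, "e": 2},
    "u6": {"a": 2, "b": 1, "c": 1, "d": 0, "e": 2},
    "u7": {"a": 2, "b": 1, "c": 2, "d": 1, "e": 1},
}

def pos(attrs):
    """POS_R(D): instances whose equivalence class w.r.t. attrs is decision-consistent."""
    classes = {}
    for u, row in DATA.items():
        classes.setdefault(tuple(row[a] for a in attrs), []).append(u)
    result = set()
    for members in classes.values():
        if len({DATA[v]["e"] for v in members}) == 1:
            result |= set(members)
    return result

def select(cond, core, expect=1.0):
    R, P = list(core), [a for a in cond if a not in core]
    X = set(DATA) - pos(R)                         # contradictory instances
    while P and len(pos(R)) / len(DATA) < expect:  # Step 1: k < EXPECT
        def score(p):
            new = pos(R + [p]) & X                 # newly consistent instances
            sizes = {}
            for u in new:                          # Step 2: v_p and m_p
                key = tuple(DATA[u][a] for a in R + [p])
                sizes[key] = sizes.get(key, 0) + 1
            return (len(new), max(sizes.values(), default=0))
        best = max(P, key=score)                   # Step 3: largest (v_p, m_p)
        R.append(best); P.remove(best)
        X -= pos(R)                                # Step 4: drop consistent u
    return R

print(select(["a", "c", "d"], core=["b"]))  # ['b', 'd'] on this table
```

On the example table this reproduces the result derived by hand on the later slides: starting from CORE = {b}, attribute d wins the tie with c because its positive region contains a larger equivalence class.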
  • 82. Main Features of RSH  It can select a good subset of attributes quickly and effectively from a large DB.  The selected attributes cause little loss of induction performance.
  • 83. An Example of Attribute Selection
  U   a  b  c  d  e
  u1  1  0  2  1  1
  u2  1  0  2  0  1
  u3  1  2  0  0  2
  u4  1  2  2  1  0
  u5  2  1  0  0  2
  u6  2  1  1  0  2
  u7  2  1  2  1  1
  Condition attributes: a: Va = {1, 2}; b: Vb = {0, 1, 2}; c: Vc = {0, 1, 2}; d: Vd = {0, 1}. Decision attribute: e: Ve = {0, 1, 2}
  • 84. Searching for CORE: Removing attribute a
  U   b  c  d  e
  u1  0  2  1  1
  u2  0  2  0  1
  u3  2  0  0  2
  u4  2  2  1  0
  u5  1  0  0  2
  u6  1  1  0  2
  u7  1  2  1  1
  Removing attribute a does not cause inconsistency. Hence, a is not in the CORE.
  • 85. Searching for CORE (2): Removing attribute b
  U   a  c  d  e
  u1  1  2  1  1
  u2  1  2  0  1
  u3  1  0  0  2
  u4  1  2  1  0
  u5  2  0  0  2
  u6  2  1  0  2
  u7  2  2  1  1
  u1: a1 c2 d1 → e1 and u4: a1 c2 d1 → e0 conflict. Removing attribute b causes inconsistency. Hence, b belongs to the CORE.
  • 86. Searching for CORE (3): Removing attribute c
  U   a  b  d  e
  u1  1  0  1  1
  u2  1  0  0  1
  u3  1  2  0  2
  u4  1  2  1  0
  u5  2  1  0  2
  u6  2  1  0  2
  u7  2  1  1  1
  Removing attribute c does not cause inconsistency. Hence, c is not in the CORE.
  • 87. Searching for CORE (4): Removing attribute d
  U   a  b  c  e
  u1  1  0  2  1
  u2  1  0  2  1
  u3  1  2  0  2
  u4  1  2  2  0
  u5  2  1  0  2
  u6  2  1  1  2
  u7  2  1  2  1
  Removing attribute d does not cause inconsistency. Hence, d is not in the CORE.
  • 88. Searching for CORE (5) CORE(C)={b} Initial subset R = {b} Attribute b is the unique indispensable attribute.
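The CORE search on the preceding slides (delete each attribute in turn and test whether the table becomes inconsistent) can be sketched directly; the table encoding and function names below are ours.

```python
# CORE by the deletion test (a sketch; the encoding is ours).
# Decision table from the example: columns a,b,c,d and decision e.
DATA = {
    "u1": (1, 0, 2, 1, 1), "u2": (1, 0, 2, 0, 1), "u3": (1, 2, 0, 0, 2),
    "u4": (1, 2, 2, 1, 0), "u5": (2, 1, 0, 0, 2), "u6": (2, 1, 1, 0, 2),
    "u7": (2, 1, 2, 1, 1),
}
COND = [0, 1, 2, 3]     # indices of a, b, c, d; index 4 holds the decision e
NAMES = "abcd"

def consistent(attrs):
    """True iff no two instances agree on attrs but disagree on the decision."""
    seen = {}
    for row in DATA.values():
        key = tuple(row[i] for i in attrs)
        if seen.setdefault(key, row[4]) != row[4]:
            return False
    return True

# An attribute is in the CORE iff removing it makes the table inconsistent.
core = [NAMES[i] for i in COND if not consistent([j for j in COND if j != i])]
print(core)  # ['b']
```

Only removing b introduces a conflict (u1 and u4 collide), matching CORE(C) = {b} above.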
  • 89. R = {b}
  Projecting U onto {b, e} gives U':
  U'  b  e
  u1  0  1
  u2  0  1
  u3  2  2
  u4  2  0
  u5  1  2
  u6  1  2
  u7  1  1
  Rule: b0 → e1. The instances containing b0 will not be considered further.
  • 90. Attribute Evaluation Criteria  Selecting the attributes that cause the number of consistent instances to increase fastest – to obtain a subset of attributes that is as small as possible  Selecting the attribute that has a smaller number of different values – to guarantee that the number of instances covered by a rule is as large as possible.
  • 91. Selecting an Attribute from {a,c,d}
  U'  a  b  e
  u3  1  2  2
  u4  1  2  0
  u5  2  1  2
  u6  2  1  2
  u7  2  1  1
  1. Selecting {a}: R = {a, b}. a1 b2 → e2 (u3) and a1 b2 → e0 (u4) conflict; a2 b1 → e2 (u5, u6) and a2 b1 → e1 (u7) conflict. Hence POS_{a,b}(X/{e}) = ∅.
  (U/{e}: {u3, u5, u6}, {u4}, {u7}; U/{a,b}: {u3, u4}, {u5, u6, u7})
  • 92. Selecting an Attribute from {a,c,d} (2)
  2. Selecting {c}: R = {b, c}.
  U'  b  c  e
  u3  2  0  2
  u4  2  2  0
  u5  1  0  2
  u6  1  1  2
  u7  1  2  1
  b2 c0 → e2, b2 c2 → e0, b1 c0 → e2, b1 c1 → e2, b1 c2 → e1: all consistent.
  (U/{e}: {u3, u5, u6}, {u4}, {u7})
  POS_{b,c}(X/{e}) = {u3, u4, u5, u6, u7}
  • 93. Selecting an Attribute from {a,c,d} (3)
  3. Selecting {d}: R = {b, d}.
  U'  b  d  e
  u3  2  0  2
  u4  2  1  0
  u5  1  0  2
  u6  1  0  2
  u7  1  1  1
  b2 d0 → e2, b2 d1 → e0, b1 d0 → e2, b1 d1 → e1: all consistent.
  (U/{e}: {u3, u5, u6}, {u4}, {u7})
  POS_{b,d}(X/{e}) = {u3, u4, u5, u6, u7}
  • 94. Selecting an Attribute from {a,c,d} (4)
  3. Selecting {d} (continued): both {c} and {d} give a positive region of size 5, so the tie is broken by max_size:
  POS_{b,d}({u3, u5, u6}) / {b, d} = {{u3}, {u5, u6}}, hence max_size_{b,d}(POS_{b,d}({u3, u5, u6}) / {b, d}) = 2.
  (U/{e}: {u3, u5, u6}, {u4}, {u7}; U/{b,d}: {u3}, {u4}, {u5, u6}, {u7})
  Result: subset of attributes = {b, d}
  • 95. Experimental Results
  Data set               | Attribute Number | Instance Number | Attri. N. in CORE | Selected Attri. N.
  Monk1                  | 6                | 124             | 3                 | 3
  Monk3                  | 6                | 122             | 4                 | 4
  Mushroom               | 22               | 8124            | 0                 | 4
  Breast cancer          | 10               | 699             | 1                 | 4
  Earthquake             | 16               | 155             | 0                 | 3
  Meningitis             | 30               | 140             | 1                 | 4
  Bacterial examination  | 57               | 20920           | 2                 | 9
  Slope-collapse         | 23               | 3436            | 6                 | 8
  Gastric cancer         | 38               | 7520            | 2                 | 19
  • 96. A Rough Set Based KDD Process  Discretization based on RS and Boolean Reasoning (RSBR).  Attribute selection based on RS with Heuristics (RSH).  Rule discovery by GDT-RS.
  • 97. Main Features of GDT-RS  Unseen instances are considered in the discovery process, and the uncertainty of a rule, including its ability to predict possible instances, can be explicitly represented in the strength of the rule.  Biases can be flexibly selected for search control, and background knowledge can be used as a bias to control the creation of a GDT and the discovery process.
  • 98. A Sample DB
  U   a   b   c   d
  u1  a0  b0  c1  y
  u2  a0  b1  c1  y
  u3  a0  b0  c1  y
  u4  a1  b1  c0  n
  u5  a0  b0  c1  n
  u6  a0  b2  c1  y
  u7  a1  b1  c1  y
  Condition attributes: a, b, c; a = {a0, a1}, b = {b0, b1, b2}, c = {c0, c1}. Decision attribute: d = {y, n}
  • 99. A Sample Database (2)  T = (U, A, C, D), V = ∪_{a ∈ A} Va  Attributes A = C ∪ D = {a, b, c, d}  Condition attributes C = {a, b, c}: a: Va = {a0, a1}, b: Vb = {b0, b1, b2}, c: Vc = {c0, c1}  Decision attribute D = {d}: d: Vd = {y, n}
  • 100. A Sample GDT  Rows G(x) are the possible generalizations: *b0c0, *b0c1, *b1c0, *b1c1, *b2c0, *b2c1, a0*c0, …, a1b1*, a1b2*, **c0, …, a0**, a1**.  Columns F(x) are the possible instances: a0b0c0, a0b0c1, …, a1b0c0, …, a1b2c1.  Each entry is p(PI | PG): e.g., *b0c0 gives 1/2 to each of a0b0c0 and a1b0c0; a0*c0 gives 1/3 to each of its 3 instances; a0** gives 1/6 to each of its 6 instances.
  • 101. Explanation of the GDT  F(x): the possible instances (PI)  G(x): the possible generalizations (PG)  p(F(x) | G(x)): the probabilistic relationship between PIs and PGs.
  • 102. Probabilistic Relationship Between PIs and PGs  N_{PG_i} = ∏_{k ∈ {l | PG_i[l] = *}} n_k, where n_k is the number of values of attribute k; N_{PG_i} is the number of PIs satisfying the ith PG.  p(PI_j | PG_i) = 1 / N_{PG_i} if PG_i is a generalization of PI_j, and 0 otherwise.  Example: a0*c0 covers a0b0c0, a0b1c0, a0b2c0; N_{a0*c0} = n_b = 3, so p = 1/3 for each.
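The two formulas above can be sketched in Python for the sample attribute sets (the function names are ours):

```python
from itertools import product

# Attribute value sets from the sample DB.
VALUES = {"a": ["a0", "a1"], "b": ["b0", "b1", "b2"], "c": ["c0", "c1"]}

def n_pg(pg):
    """N_PG: product of the domain sizes n_l over the wildcard positions."""
    n = 1
    for attr, g in zip("abc", pg):
        if g == "*":
            n *= len(VALUES[attr])
    return n

def p(pi, pg):
    """p(PI | PG) = 1/N_PG if PG is a generalization of PI, else 0."""
    if all(g == "*" or g == v for g, v in zip(pg, pi)):
        return 1 / n_pg(pg)
    return 0.0

pg = ("a0", "*", "c0")   # the generalization a0*c0 from the slide
for pi in product(*[VALUES[a] if g == "*" else [g] for a, g in zip("abc", pg)]):
    print(pi, p(pi, pg))  # three covered PIs, each with p = 1/3
```

Running this reproduces the example: a0*c0 has N_PG = 3 and distributes 1/3 to each of a0b0c0, a0b1c0, a0b2c0.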
  • 103. Unseen Instances
  U   Headache  Muscle-pain  Temp.      Flu
  U1  Yes       Yes          Normal     No
  U2  Yes       Yes          High       Yes
  U3  Yes       Yes          Very-high  Yes
  U4  No        Yes          Normal     No
  U5  No        No           High       No
  U6  No        Yes          Very-high  Yes
  Closed world: only the observed instances above. Open world: the unseen instances are also possible: (yes, no, normal), (yes, no, high), (yes, no, very-high), (no, yes, high), (no, no, normal), (no, no, very-high).
  • 104. Rule Representation  X → Y with S  X denotes the conjunction of conditions that a concept must satisfy  Y denotes the concept that the rule describes  S is a measure of the strength with which the rule holds
  • 105. Rule Strength (1)  The strength of the generalization X (no BK is used): s(X) = s(PG_k) = Σ_l p(PI_l | PG_k) = N_{ins-rel}(PG_k) / N_{PG_k}, where N_{ins-rel}(PG_k) is the number of observed instances satisfying the kth generalization.  S(X → Y) = s(X) · (1 − r(X → Y))
  • 106. Rule Strength (2)  The strength of the generalization X (BK is used): s_bk(X) = s_bk(PG_k) = Σ_l p(PI_l | PG_k) · BKF(PI_l | PG_k), where BKF(PI_l | PG_k) is the background-knowledge factor over the PIs covered by PG_k.
  • 107. Rule Strength (3)  The rate of noise: r(X → Y) = (N_{ins-rel}(X) − N_{ins-class}(X, Y)) / N_{ins-rel}(X), where N_{ins-class}(X, Y) is the number of instances belonging to class Y among the instances satisfying the generalization X.
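Putting the formulas of slides 105 and 107 together, S(X → Y) can be computed as below. This is a sketch without BK; the observed instances are those of the resolved sample DB used in the following walkthrough, and the helper names are ours.

```python
N_VALUES = {"a": 2, "b": 3, "c": 2}    # |Va|, |Vb|, |Vc|

# Observed instances and their classes after resolving u1' (noise rate = 0).
OBSERVED = {
    ("a0", "b1", "c1"): "y",   # u2
    ("a1", "b1", "c0"): "n",   # u4
    ("a0", "b2", "c1"): "n",   # u6
    ("a1", "b1", "c1"): "y",   # u7
}

def covers(pg, pi):
    return all(g == "*" or g == v for g, v in zip(pg, pi))

def s(pg):
    """s(X) = N_ins-rel(PG) / N_PG."""
    n_pg = 1
    for attr, g in zip("abc", pg):
        if g == "*":
            n_pg *= N_VALUES[attr]
    return sum(covers(pg, pi) for pi in OBSERVED) / n_pg

def r(pg, cls):
    """r(X -> Y): fraction of covered observed instances outside class Y."""
    rel = [d for pi, d in OBSERVED.items() if covers(pg, pi)]
    return (len(rel) - rel.count(cls)) / len(rel) if rel else 0.0

def S(pg, cls):
    """S(X -> Y) = s(X) * (1 - r(X -> Y))."""
    return s(pg) * (1 - r(pg, cls))

print(S(("*", "b1", "c1"), "y"))   # 1.0    ({b1c1} -> y)
print(S(("a0", "b1", "*"), "y"))   # 0.5    ({a0b1} -> y)
print(S(("*", "*", "c0"), "n"))    # 0.1666...  ({c0} -> n)
```

The three printed values match the strengths derived by hand for {b1c1} → y, {a0b1} → y, and {c0} → n on the following slides.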
  • 108. Rule Discovery by GDT-RS
  U   a   b   c   d
  u1  a0  b0  c1  y
  u2  a0  b1  c1  y
  u3  a0  b0  c1  y
  u4  a1  b1  c0  n
  u5  a0  b0  c1  n
  u6  a0  b2  c1  n
  u7  a1  b1  c1  y
  Condition attributes: a, b, c; a: Va = {a0, a1}, b: Vb = {b0, b1, b2}, c: Vc = {c0, c1}. Class d: Vd = {y, n}
  • 109. Regarding the Instances (Noise Rate = 0)
  u1, u3, u5 share the condition values a0 b0 c1 with decisions y, y, n; they are merged into the compound instance u1'.
  r_{y}(u1') = 1 − 2/3 = 0.33, r_{n}(u1') = 1 − 1/3 = 0.67.
  Let T_noise = 0. Since r_{y}(u1') > T_noise and r_{n}(u1') > T_noise, no decision is assigned: d(u1') = ⊥.
  Resulting table:
  U    a   b   c   d
  u1'  a0  b0  c1  ⊥
  u2   a0  b1  c1  y
  u4   a1  b1  c0  n
  u6   a0  b2  c1  n
  u7   a1  b1  c1  y
  • 110. Generating the Discernibility Vector for u2
  U    a   b   c   d
  u1'  a0  b0  c1  ⊥
  u2   a0  b1  c1  y
  u4   a1  b1  c0  n
  u6   a0  b2  c1  n
  u7   a1  b1  c1  y
  m_{2,1'} = {b}, m_{2,2} = λ, m_{2,4} = {a, c}, m_{2,6} = {b}, m_{2,7} = λ.
  Vector for u2: (u1': b; u2: λ; u4: a, c; u6: b; u7: λ)
  • 111. Obtaining Reducts for u2
  Discernibility vector for u2: (u1': b; u2: λ; u4: a, c; u6: b; u7: λ).
  f_T(u2) = b ∧ (a ∨ c) ∧ b = b ∧ (a ∨ c) = (a ∧ b) ∨ (b ∧ c)
  • 112. Generating Rules from u2
  f_T(u2) = (a ∧ b) ∨ (b ∧ c) gives two reducts for u2: {a0, b1} and {b1, c1}.
  {a0b1}: covers a0b1c0 and a0b1c1 (u2); s({a0b1}) = 0.5, r({a0b1} → y) = 0.
  {b1c1}: covers a0b1c1 (u2) and a1b1c1 (u7); s({b1c1}) = 1, r({b1c1} → y) = 0.
  • 113. Generating Rules from u2 (2)
  {a0b1} → y with S = (1 × 1/2) × (1 − 0) = 0.5
  {b1c1} → y with S = (2 × 1/2) × (1 − 0) = 1
  • 114. Generating the Discernibility Vector for u4
  m_{4,1'} = {a, b, c}, m_{4,2} = {a, c}, m_{4,4} = λ, m_{4,6} = λ, m_{4,7} = {c}.
  Vector for u4: (u1': a, b, c; u2: a, c; u4: λ; u6: λ; u7: c)
  • 115. Obtaining Reducts for u4
  Discernibility vector for u4: (u1': a, b, c; u2: a, c; u4: λ; u6: λ; u7: c).
  f_T(u4) = (a ∨ b ∨ c) ∧ (a ∨ c) ∧ c = c
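The discernibility-vector steps for u2 and u4 above can be mechanized. The sketch below builds each vector and then enumerates minimal hitting sets of its entries, which is equivalent to simplifying the discernibility function f_T; the table encoding and function names are ours, and the brute-force enumeration is only suitable for small examples.

```python
from itertools import combinations

# Resolved sample DB: u1' has no decision, so it must be discerned
# from every decided instance (encoding is ours).
TABLE = {
    "u1'": (("a0", "b0", "c1"), None),
    "u2":  (("a0", "b1", "c1"), "y"),
    "u4":  (("a1", "b1", "c0"), "n"),
    "u6":  (("a0", "b2", "c1"), "n"),
    "u7":  (("a1", "b1", "c1"), "y"),
}

def discernibility_vector(u):
    """m_{u,v}: attributes on which u and v differ, for v of a different class."""
    row, d = TABLE[u]
    return {v: {a for a, (x, y) in zip("abc", zip(row, row2)) if x != y}
            for v, (row2, d2) in TABLE.items() if v != u and d2 != d}

def reducts(u):
    """Minimal attribute sets hitting every non-empty entry of the vector."""
    entries = [e for e in discernibility_vector(u).values() if e]
    attrs = sorted(set().union(*entries))
    found = []
    for k in range(1, len(attrs) + 1):
        for cand in combinations(attrs, k):
            hits_all = all(set(cand) & e for e in entries)
            minimal = not any(set(f) <= set(cand) for f in found)
            if hits_all and minimal:
                found.append(cand)
    return found

print(reducts("u2"), reducts("u4"))
# [('a', 'b'), ('b', 'c')] [('c',)]
```

This reproduces the hand derivations: f_T(u2) simplifies to (a ∧ b) ∨ (b ∧ c) and f_T(u4) to c.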
  • 116. Generating Rules from u4
  f_T(u4) = c gives the reduct {c0} for u4.
  {c0}: covers the six c0 instances a0b0c0, a0b1c0, a0b2c0, a1b0c0, a1b1c0 (u4), a1b2c0; s({c0}) = 1/6, r({c0} → n) = 0.
  • 117. Generating Rules from u4 (2)
  {c0} → n with S = (1 × 1/6) × (1 − 0) = 0.167
  • 118. Generating Rules from All Instances
  u2: {a0b1} → y, S = 0.5; {b1c1} → y, S = 1
  u4: {c0} → n, S = 0.167
  u6: {b2} → n, S = 0.25
  u7: {a1c1} → y, S = 0.5; {b1c1} → y, S = 1
  • 119. Rule Selection  Selecting the rules that cover as many instances as possible.  Selecting the rules at levels of generalization as high as possible.  Selecting the rules with larger strengths at the same level of generalization.
  • 120. Generalizations Belonging to Class y
  a0b1c1 (y, u2) and a1b1c1 (y, u7) generalize to:
  {b1c1} → y with S = 1 (covers u2, u7; p = 1/2 each)
  {a1c1} → y with S = 1/2 (covers u7; p = 1/3)
  {a0b1} → y with S = 1/2 (covers u2; p = 1/2)
  • 121. Generalizations Belonging to Class n
  a0b2c1 (n, u6) and a1b1c0 (n, u4) generalize to:
  {c0} → n with S = 1/6 (covers u4; p = 1/6)
  {b2} → n with S = 1/4 (covers u6; p = 1/4)
  • 122. Results from the Sample DB (Noise Rate = 0)  Certain rules (instances covered): {c0} → n with S = 1/6 (u4); {b2} → n with S = 1/4 (u6); {b1c1} → y with S = 1 (u2, u7)
  • 123. Results from the Sample DB (2) (Noise Rate > 0)  Possible rules: b0 → y with S = (1/4)(1/2); a0 & b0 → y with S = (1/2)(2/3); a0 & c1 → y with S = (1/3)(2/3); b0 & c1 → y with S = (1/2)(2/3). Instances covered: u1, u3, u5
  • 124. Regarding Instances (Noise Rate > 0)
  u1, u3, u5 (a0 b0 c1; decisions y, y, n) form the compound instance u1'.
  r_{y}(u1') = 1 − 2/3 = 0.33, r_{n}(u1') = 1 − 1/3 = 0.67.
  Let T_noise = 0.5. Since r_{y}(u1') < T_noise, d(u1') = y.
  Resulting table:
  U    a   b   c   d
  u1'  a0  b0  c1  y
  u2   a0  b1  c1  y
  u4   a1  b1  c0  n
  u6   a0  b2  c1  n
  u7   a1  b1  c1  y
  • 125. Rules Obtained from All Instances
  u1': {b0} → y, S = 1/4 × 2/3 = 0.167
  u2: {a0b1} → y, S = 0.5; {b1c1} → y, S = 1
  u4: {c0} → n, S = 0.167
  u6: {b2} → n, S = 0.25
  u7: {a1c1} → y, S = 0.5; {b1c1} → y, S = 1
  • 126. Example of Using BK
  Without BK:
  PG \ PI  a0b0c0  a0b0c1  a0b1c0  a0b1c1  a0b2c0  a0b2c1
  a0b0*    1/2     1/2
  a0b1*                    1/2     1/2
  a0*c1            1/3             1/3             1/3
  a0**     1/6     1/6     1/6     1/6     1/6     1/6
  With BK "a0 => c1, 100%":
  a0b0*    0       1
  a0b1*                    0       1
  a0*c1            1/3             1/3             1/3
  a0**     0       1/3     0       1/3     0       1/3
  • 127. Changing the Strength of a Generalization by BK
  Without BK: {a0b1} covers a0b1c0 and a0b1c1 (u2) with p = 1/2 each; s({a0b1}) = 0.5, r({a0b1} → y) = 0.
  With BK "a0 => c1, 100%": p(a0b1c0 | a0b1*) = 0, p(a0b1c1 | a0b1*) = 1; s({a0b1}) = 1, r({a0b1} → y) = 0.
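One way to realize this adjustment is to zero out the PIs that the background knowledge forbids and renormalize the distribution over the PIs a PG covers. The sketch below does exactly that; encoding the BK "a0 => c1, 100%" as a filter function is our assumption, not the notation of the original system.

```python
from itertools import product

VALUES = {"a": ["a0", "a1"], "b": ["b0", "b1", "b2"], "c": ["c0", "c1"]}

def instances(pg):
    """All possible instances covered by a generalization."""
    return list(product(*[VALUES[a] if g == "*" else [g]
                          for a, g in zip("abc", pg)]))

def bk_allows(pi):
    """BK 'a0 => c1, 100%': any instance with a0 must have c1."""
    return not (pi[0] == "a0" and pi[2] != "c1")

def p_bk(pg):
    """p(PI | PG) after zeroing the forbidden PIs and renormalizing."""
    pis = instances(pg)
    w = [1.0 if bk_allows(pi) else 0.0 for pi in pis]
    total = sum(w)
    return {pi: wi / total for pi, wi in zip(pis, w)}

print(p_bk(("a0", "b1", "*")))
# all probability mass moves to a0b1c1, so s({a0b1}) rises from 0.5 to 1
```

For a0b1* this shifts the whole distribution onto a0b1c1, reproducing the change of strength shown on the slide.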
  • 128. Algorithm 1: Optimal Set of Rules  Step 1. Consider the instances with the same condition attribute values as one instance, called a compound instance.  Step 2. Calculate the rate of noise r for each compound instance.  Step 3. Select one instance u from U and create a discernibility vector for u.  Step 4. Calculate all reducts for the instance u by using the discernibility function.