By : Mohd. Noor Abdul Hamid, Ph.D
(Universiti Utara Malaysia)
Introduction to Classification
Classification is the task of assigning objects to one of several predefined categories or classes.
Given a collection of records (training set), each record contains a set of attributes, one of which is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
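As a small illustration of this train/validate workflow (a minimal sketch, not from the original slides; it assumes scikit-learn is installed and uses the Iris data set as a stand-in for the records):

# Sketch: split labelled records into a training set and a test set,
# learn a classification model on the training set, validate on the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # attributes X, class labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model on the training set
y_pred = model.predict(X_test)                           # apply the model to unseen records
print("Test-set accuracy:", accuracy_score(y_test, y_pred))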
Introduction to Classification
[Figure : general approach for building a classification model – a learning algorithm induces (learns) a model from the Training Set; the model is then applied, by deduction, to the Test Set.]

Training Set :
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set :
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Introduction to Classification
Examples of Classification techniques:
- Decision Tree
- Neural Network
- Rule-Based
- naïve Bayes Classifier, etc.
Classification techniques are most suited for
predicting data sets with binary or nominal
categories. They are less effective for ordinal
categories since they do not consider the implicit
order among the categories.
Introduction to Classification
Performance of a classification model is evaluated based on the counts of test records correctly and incorrectly predicted by the model, summarized in a Confusion Matrix.
Figure : Confusion matrix for a 2-class problem

                            Predicted Class
                            Class = 1    Class = 0
Actual Class   Class = 1    f11          f10
               Class = 0    f01          f00

Based on the entries of the confusion matrix, the total number of :
- correct predictions made by the model is (f11 + f00)
- incorrect predictions is (f01 + f10).
Introduction to Classification
Therefore, we can evaluate the performance of a classification model by looking at the accuracy of its predictions. Equivalently, the performance of a model can be expressed in terms of its error rate:

Accuracy = (Number of correct predictions) / (Total number of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)

Error Rate = (Number of wrong predictions) / (Total number of predictions) = (f01 + f10) / (f11 + f10 + f01 + f00)
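For illustration only (a sketch, not part of the original slides; the counts below are made up), the two measures in a few lines of Python:

# Sketch: accuracy and error rate from a 2-class confusion matrix
# (f11, f10, f01, f00 follow the notation in the confusion matrix above).
def accuracy(f11, f10, f01, f00):
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    return (f10 + f01) / (f11 + f10 + f01 + f00)

# Hypothetical counts: 50 + 35 correct, 5 + 10 wrong, out of 100 test records
print(accuracy(50, 5, 10, 35))    # 0.85
print(error_rate(50, 5, 10, 35))  # 0.15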
Example of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
What is a Decision Tree?
A decision tree is a structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules.
With each successive division, the members of the resulting sets become more and more similar to each other.
A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous (mutually exclusive) groups with respect to a particular target.
Hence, the algorithm used to construct decision trees is referred to as recursive partitioning.
What is a Decision Tree?
The target variable is usually categorical and the decision tree is used either to:
calculate the probability that a given record belongs to each of the categories, or
to classify the record by assigning it to the most likely class (or category).
Note : Decision trees can also be used to estimate the value of a continuous target variable. However, regression models and neural networks are generally more appropriate for estimation.
What is a Decision Tree?
A Decision Tree has three types of nodes:
Root Node : top (or left-most) node with no incoming edges and zero or more outgoing edges.
Child or Internal Node : descendant node which has exactly one incoming edge and two or more outgoing edges.
Leaf Node : terminal node which has exactly one incoming edge and no outgoing edges.
In a Decision Tree, each leaf node is assigned a class label.
The rules or branches are the unique paths (edges) with a set of conditions (attributes) that divide the observations into smaller subsets.
Decision Tree Diagram
[Figure : a decision tree whose ROOT NODE tests Gender; each branch leads to an INTERNAL NODE that tests Height, and the BRANCHES end in LEAF NODES carrying the class labels Short, Medium or Tall.
  Gender = Female : Height < 1.3m -> Short ; 1.3m–1.8m -> Medium ; > 1.8m -> Tall
  Gender = Male   : Height < 1.5m -> Short ; 1.5m–2.0m -> Medium ; > 2.0m -> Tall]
Types of Decision Tree
Balanced Tree
Bushy Tree
Deep Tree
How to Build Decision Tree?
Generally, building a decision tree involves 2 steps:
Tree construction : recursively split the tree according to selected attributes (conditions),
Tree pruning : identify and remove the irrelevant branches (that might be due to outliers) – to increase classification accuracy.
How to Build Decision Tree?
In principle, there are exponentially many decision trees that can be constructed from a given set of attributes, so finding the optimal tree is computationally infeasible because of the exponential size of the search space.
Efficient algorithms have been developed to induce reasonably accurate, albeit suboptimal, decision trees in a reasonable amount of time.
These algorithms usually employ a greedy strategy : making a series of locally optimal decisions about which attribute to use for partitioning the data.
One such algorithm is Hunt’s Algorithm, which is the basis of many existing decision tree algorithms including ID3, C4.5 and CART.
Hunt’s Algorithm
Let Dt be the set of training records that are associated with node t, and let y = {y1, y2, …, yc} be the set of class labels of the target variable, where c is the number of classes.
The following is a recursive definition of Hunt’s algorithm:
Step 1 :
If all the records in Dt belong to the same class yt, then node t is a
leaf node labeled as yt.
Step 2 :
If Dt contains records that belong to more than one class, an
attribute test condition is selected to partition the records into
smaller subsets. A child node is created for each outcome of the
test condition and the records in Dt are distributed to the children
based on the outcomes. The algorithm is then recursively applied
to each child node.
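A rough Python sketch of these two steps (assumed, not from the slides; select_test_condition and split_records are hypothetical helpers that choose an attribute test and partition the records by its outcomes):

from collections import Counter

def hunt(records, labels):
    # Step 1: all records in Dt belong to the same class -> leaf node with that label.
    if len(set(labels)) == 1:
        return {"type": "leaf", "label": labels[0]}
    # Practical extra stopping rule: identical attribute values but mixed labels
    # -> leaf labelled with the majority class.
    if all(r == records[0] for r in records):
        return {"type": "leaf", "label": Counter(labels).most_common(1)[0][0]}
    # Step 2: select an attribute test condition, create a child per outcome,
    # distribute the records to the children, and recurse on each child.
    test = select_test_condition(records, labels)                 # hypothetical helper
    children = {}
    for outcome, (recs, labs) in split_records(records, labels, test).items():  # hypothetical helper
        children[outcome] = hunt(recs, labs)
    return {"type": "internal", "test": test, "children": children}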
Example of a Decision Tree
Training Data :
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model : Decision Tree (splitting attributes : Refund, MarSt, TaxInc)
  Refund = Yes               -> NO
  Refund = No :
    MarSt = Married          -> NO
    MarSt = Single, Divorced :
      TaxInc < 80K           -> NO
      TaxInc > 80K           -> YES
Another Example of a Decision Tree
(Same training data as on the previous slide.)

Model : Decision Tree
  MarSt = Married            -> NO
  MarSt = Single, Divorced :
    Refund = Yes             -> NO
    Refund = No :
      TaxInc < 80K           -> NO
      TaxInc > 80K           -> YES

There could be more than one tree that fits the same data!
Apply Model to Test Data
Test Data :
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree.
Apply Model to Test Data
For this test record (Refund = No, Marital Status = Married, Taxable Income = 80K) : Refund = No leads to the MarSt node, and MarSt = Married leads to a leaf labelled NO.
Assign Cheat to “No”.
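As an illustration (a sketch, not from the slides), the tree above written as plain if/else rules and applied to this test record:

# Sketch: the example decision tree expressed as nested rules, then applied to
# the test record (Refund = No, Marital Status = Married, Taxable Income = 80K).
def classify_cheat(refund, marital_status, taxable_income_k):
    if refund == "Yes":
        return "No"
    # Refund = No
    if marital_status == "Married":
        return "No"
    # Single or Divorced
    if taxable_income_k < 80:
        return "No"
    return "Yes"

print(classify_cheat("No", "Married", 80))   # -> "No"  (Cheat assigned to "No")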
Design Issues of Decision Tree Induction
How should the training records be split?
Which attribute test condition works better to classify the records?
What are the objective measures for evaluating the goodness of each test condition?
How should the splitting procedure stop?
What is the condition to stop splitting the records?
One strategy is to continue expanding a node until all the records belong to the same class or all the records have identical attribute values.
Other criteria can also be imposed to allow the tree-growing
procedure to terminate earlier.
Methods for Expressing Attribute Test Condition
a) Binary Attributes : generate two possible outcomes (binary split).
Example : GENDER splits into Male and Female.
Methods for Expressing Attribute Test Condition
b) Nominal Attributes : Multiway split
Marital Status
Single Divorced Married
Methods for Expressing Attribute Test Condition
b) Nominal Attributes : Binary split (e.g., in CART)
Marital Status : {Single} vs {Married, Divorced}
OR Marital Status : {Married} vs {Single, Divorced}
OR Marital Status : {Divorced} vs {Married, Single}
Methods for Expressing Attribute Test Condition
c) Ordinal Attributes : Multiway split
Example : Shirt Size splits into Small, Medium, Large and Extra Large.
Methods for Expressing Attribute Test Condition
c) Ordinal Attributes : Binary split – as long as it does not violate
the order property of the attribute values.
Shirt Size : {S, M} vs {L, XL}
OR Shirt Size : {S} vs {M, L, XL}
OR Shirt Size : {S, L} vs {M, XL} (note: this last grouping violates the order property of the values, so it would not be a valid ordinal split).
Methods for Expressing Attribute Test Condition
d) Continuous Attributes : Binary split.
Example : Annual Income > 80K, with outcomes Yes and No.
Methods for Expressing Attribute Test Condition
d) Continuous Attributes : Multiway split.
Example : Annual Income binned into < 10K, {10K, 25K}, {25K, 50K}, {50K, 80K}, > 80K.
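Purely as an illustration (assumed code, not from the slides; attribute names follow the examples above and the record encoding is hypothetical), the four kinds of test conditions written as simple predicates on a record r:

# a) Binary attribute: one test, two outcomes
def gender_test(r):
    return "Male" if r["Gender"] == "Male" else "Female"

# b) Nominal attribute: multiway split (one branch per value) or a binary grouping
def marital_multiway(r):
    return r["MaritalStatus"]                            # Single / Divorced / Married

def marital_binary(r):
    return r["MaritalStatus"] in {"Married", "Divorced"}  # {Married, Divorced} vs {Single}

# c) Ordinal attribute: binary grouping that respects the order S < M < L < XL
def shirt_binary(r):
    return r["ShirtSize"] in {"S", "M"}                  # {S, M} vs {L, XL}

# d) Continuous attribute: binary split on a threshold, or a multiway split by binning
def income_binary(r):
    return r["AnnualIncome"] > 80_000

def income_multiway(r):
    bins = [10_000, 25_000, 50_000, 80_000]
    return sum(r["AnnualIncome"] > b for b in bins)      # 0..4: <10K, 10-25K, 25-50K, 50-80K, >80K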
Measures for selecting the Best Split
Let p(i|t) denote the fraction of records belonging to class i at a given node t.
In a two-class problem, the class distribution at any node can be written as (p0, p1), where p1 = 1 – p0.
The measures developed for selecting the best split are often based on the degree of impurity of the child nodes: a node with class distribution (0, 1) has zero impurity, whereas a node with a uniform class distribution (0.5, 0.5) has the highest impurity.
Examples of such measures include:
- For a categorical target : Entropy, Gini, Classification error, Information Gain Ratio, Chi-Square Test
- For an interval target : Variance Reduction, F-Test
Measures of Impurity (I)

Entropy(t) = - \sum_{i=0}^{c-1} p(i|t) \log_2 p(i|t)

Gini(t) = 1 - \sum_{i=0}^{c-1} [p(i|t)]^2

Classification error(t) = 1 - \max_i [p(i|t)]

where c is the number of classes and 0 log2 0 = 0 in the entropy calculations.
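A small Python sketch of these three measures (assumed, not from the slides), applied to the node counts used in the worked example on the next slides:

import math

def probabilities(counts):
    n = sum(counts)
    return [c / n for c in counts]

def entropy(counts):
    # 0 * log2(0) is treated as 0, as noted above
    return -sum(p * math.log2(p) for p in probabilities(counts) if p > 0)

def gini(counts):
    return 1 - sum(p ** 2 for p in probabilities(counts))

def classification_error(counts):
    return 1 - max(probabilities(counts))

for counts in [(0, 6), (1, 5), (3, 3)]:      # N1, N2, N3 from the example below
    print(counts, round(gini(counts), 3), round(entropy(counts), 3),
          round(classification_error(counts), 3))
# (0, 6) 0.0 0.0 0.0
# (1, 5) 0.278 0.65 0.167
# (3, 3) 0.5 1.0 0.5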
Example : Measures of Impurity
Parent Node Count
Class = 0 4
Class = 1 14
Node N1 Count
Class = 0 0
Class = 1 6
Node N2 Count
Class = 0 1
Class = 1 5
Node N3 Count
Class = 0 3
Class = 1 3
Node N1:
Gini    = 1 - (0/6)^2 - (6/6)^2 = 0
Entropy = -(0/6) log2(0/6) - (6/6) log2(6/6) = 0
Error   = 1 - max[(0/6), (6/6)] = 0
Example : Measures of Impurity
Parent Node Count
Class = 0 4
Class = 1 14
Node N1 Count
Class = 0 0
Class = 1 6
Node N2 Count
Class = 0 1
Class = 1 5
Node N3 Count
Class = 0 3
Class = 1 3
Node N2:
Gini    = 1 - (1/6)^2 - (5/6)^2 = 0.278
Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
Error   = 1 - max[(1/6), (5/6)] = 0.167
Example : Measures of Impurity
Parent Node Count
Class = 0 4
Class = 1 14
Node N1 Count
Class = 0 0
Class = 1 6
Node N2 Count
Class = 0 1
Class = 1 5
Node N3 Count
Class = 0 3
Class = 1 3
Node N3:
Gini    = 1 - (3/6)^2 - (3/6)^2 = 0.5
Entropy = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1
Error   = 1 - max[(3/6), (3/6)] = 0.5
Example : Measures of Impurity
Parent Node Count
Class = 0 4
Class = 1 14
Node N1 Count
Class = 0 0
Class = 1 6
Gini 0
Entropy 0
Error 0
Node N2 Count
Class = 0 1
Class = 1 5
Gini 0.278
Entropy 0.650
Error 0.167
Node N3 Count
Class = 0 3
Class = 1 3
Gini 0.5
Entropy 1
Error 0.5
N1 has the lowest impurity value, followed by N2 and N3
Measures of Impurity
To determine how well a test condition performs, we need to compare the degree of impurity of the parent node (before splitting) with the degree of impurity of the child nodes (after splitting).
The larger the difference, the better the test condition.
The gain, ∆, is a criterion that can be used to determine the goodness of a split:

\Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N} \, I(v_j)

Where:
I(.) is the impurity measure of a given node,
N is the total number of records at the parent node,
k is the number of attribute values (i.e., the number of child nodes),
N(vj) is the number of records associated with the child node vj.
The summation term is the weighted average impurity of the child nodes.
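A minimal sketch of this gain computation (not from the slides), with the Gini index plugged in as the impurity measure:

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_impurity(children_counts, impurity=gini):
    # weighted average impurity of the child nodes
    n = sum(sum(c) for c in children_counts)
    return sum(sum(c) / n * impurity(c) for c in children_counts)

def gain(parent_counts, children_counts, impurity=gini):
    # Delta = I(parent) - weighted average impurity of the children
    return impurity(parent_counts) - weighted_impurity(children_counts, impurity)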
Measures of Impurity : Info. Gain Ratio
Since I(parent) is the same for all test conditions, maximizing the gain is equivalent to minimizing the weighted average impurity measure of the child nodes.
When entropy is used as the impurity measure, the difference in entropy is known as the Information Gain; the Information Gain Ratio (IGR) normalizes this gain by the entropy of the split itself.
Decision trees built using entropy tend to be quite bushy. Bushy trees with many multi-way splits are undesirable, as these splits lead to small numbers of records in each node.
Splitting Binary Attributes (using Gini)
Example :
Suppose there are two ways (A and B) to split the data into smaller subsets.

Parent : C0 = 6, C1 = 6
Gini(parent) = 1 – (6/12)^2 – (6/12)^2 = 0.5

Split A : N1 (C0 = 4, C1 = 3), Gini Index = 0.4898 ; N2 (C0 = 2, C1 = 3), Gini Index = 0.480
Split B : N1 (C0 = 1, C1 = 4), Gini Index = 0.320 ; N2 (C0 = 5, C1 = 2), Gini Index = 0.4082

Which one is a better split? Compute the weighted average of the Gini index of the child nodes for each attribute.
Splitting Binary Attributes (using Gini)
Example (continued) :

Split A : N1 (C0 = 4, C1 = 3), Gini = 0.4898 ; N2 (C0 = 2, C1 = 3), Gini = 0.480
Weighted Average of Gini Index : [(7/12) x 0.4898] + [(5/12) x 0.480] = 0.486
Gain, ∆ = 0.5 – 0.486 = 0.014

Split B : N1 (C0 = 1, C1 = 4), Gini = 0.320 ; N2 (C0 = 5, C1 = 2), Gini = 0.4082
Weighted Average of Gini Index : [(5/12) x 0.320] + [(7/12) x 0.4082] = 0.3715
Gain, ∆ = 0.5 – 0.3715 = 0.1285

Therefore, B is preferred (it gives the larger gain).
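The same comparison in a few lines of Python (a sketch, not from the slides; the node counts are those from the example above):

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = (6, 6)                                            # Gini(parent) = 0.5
splits = {"A": [(4, 3), (2, 3)], "B": [(1, 4), (5, 2)]}    # (C0, C1) counts of N1 and N2

for name, children in splits.items():
    w = weighted_gini(children)
    print(name, "weighted Gini =", round(w, 4), " gain =", round(gini(parent) - w, 4))
# A weighted Gini = 0.4857  gain = 0.0143
# B weighted Gini = 0.3714  gain = 0.1286   -> B is preferred (matches the slide, up to rounding)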
Splitting Nominal Attributes (using Gini)
Example : Which split is better – binary or multi-way?

Binary split (Car Type) : {Sports, Luxury} (C0 = 9, C1 = 7) vs {Family} (C0 = 1, C1 = 3)
Weighted Average Gini = 0.468

Binary split (Car Type) : {Sports} (C0 = 8, C1 = 0) vs {Family, Luxury} (C0 = 2, C1 = 10)
Weighted Average Gini = 0.167

Multi-way split (Car Type) : Family (C0 = 1, C1 = 3), Sports (C0 = 8, C1 = 0), Luxury (C0 = 1, C1 = 7)
Weighted Average Gini = 0.163

The multi-way split gives the lowest weighted Gini index.
Splitting Continuous Attributes (using Gini)
A brute-force method is used to find the best split position (v) for
a continuous attribute (eg: Annual Income).
To reduce complexity, the training records are sorted based on
the annual income.
Class                    No   No   No   Yes  Yes  Yes  No   No   No   No
Annual Income (sorted)   60   70   75   85   90   95   100  120  125  220

Candidate split positions (v) are identified by taking the midpoints between two adjacent sorted values:

Split position (v)   55   65   72   80   87   92   97   110  122  172  230
Yes : <= v            0    0    0    0    1    2    3    3    3    3    3
Yes : >  v            3    3    3    3    2    1    0    0    0    0    0
No  : <= v            0    1    2    3    3    3    3    4    5    6    7
No  : >  v            7    6    5    4    4    4    4    3    2    1    0
Splitting Continuous Attributes (using Gini)
A brute-force method is used to find the best split position (v) for
a continuous attribute (eg: Annual Income).
We then compute the Gini index for each candidate and
choose the one that gives the lowest value.
Class                    No   No   No   Yes  Yes  Yes  No   No   No   No
Annual Income (sorted)   60   70   75   85   90   95   100  120  125  220

Split position (v)   55    65    72    80    87    92    97    110   122   172   230
Yes : <= v            0     0     0     0     1     2     3     3     3     3     3
Yes : >  v            3     3     3     3     2     1     0     0     0     0     0
No  : <= v            0     1     2     3     3     3     3     4     5     6     7
No  : >  v            7     6     5     4     4     4     4     3     2     1     0
Gini                 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The lowest Gini index (0.300) is obtained at v = 97, so Annual Income <= 97K gives the best split.
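The brute-force search over these candidate positions could be sketched as follows (assumed code, not from the slides; the ten records are those listed above, and exact midpoints such as 97.5 are used where the slide rounds them):

def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1 - sum((c / n) ** 2 for c in counts)

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat   = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]

# candidate split positions: midpoints between adjacent sorted values (plus the two ends)
candidates = [55] + [(a + b) / 2 for a, b in zip(incomes, incomes[1:])] + [230]

best = None
for v in candidates:
    left  = [c for x, c in zip(incomes, cheat) if x <= v]      # records with income <= v
    right = [c for x, c in zip(incomes, cheat) if x > v]       # records with income >  v
    left_counts  = (left.count("Yes"),  left.count("No"))
    right_counts = (right.count("Yes"), right.count("No"))
    n = len(incomes)
    weighted = len(left) / n * gini(left_counts) + len(right) / n * gini(right_counts)
    if best is None or weighted < best[1]:
        best = (v, weighted)

print("Best split position and Gini:", best)   # around v = 97 (97.5 exactly), Gini = 0.3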