Classification : Decision
Tree (DT)
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2021)
Outline
 What is decision tree (DT) algorithm
 Why we need DT
 Pros and Cons of DT
 Information Theory
 Some Issues in DT
 Assignment II
Decision Tree (DT)
 Decision trees: splitting datasets one feature at a time.
 The decision tree is one of the most commonly used classification techniques.
 It has decision blocks (rectangles) and
 termination blocks (ovals).
 The right and left arrows are called branches.
 The kNN algorithm can do a great job of classification, but it doesn't lead to
any major insight about the data.
Decision Tree (DT)
A decision Tree
Decision Tree (DT)
 The best part of the DT (decision tree) algorithm is that humans can easily
understand the data:
 The Decision Tree algorithm:
 Takes a set of data (training examples),
 Builds a decision tree (model), and draws it.
 It can also be re-represented as sets of if-then rules to improve human readability.
 The DT does a great job of distilling data into knowledge.
 It takes a set of unfamiliar data and extracts a set of rules.
 DT is often used in expert system development.
Decision Tree (DT)
 The DT can be expressed using the following expression:
 (Outlook = Sunny ∧ Humidity = Normal) → Yes
 ∨ (Outlook = Overcast) → Yes
 ∨ (Outlook = Rain ∧ Wind = Weak) → Yes
Decision Tree (DT)
 The pros and cons of DT:
 Pros of DT:
 Computationally cheap to use,
 Easy for humans to understand the learned results,
 Missing values OK (robust to errors),
 Can deal with irrelevant features.
 Cons of DT:
 Prone to overfitting.
 Works with: numeric values, nominal values.
Decision Tree (DT)
 Appropriate problems for DT learning:
 Instances are represented by attribute-value pairs (a fixed set of attributes and
their values),
 The target function has discrete output values,
 Disjunctive descriptions may be required,
 The training data may contain errors,
 The training data may contain missing attribute values.
Decision Tree (DT)
 The mathematics that is used by DT to split the dataset is called
information theory:
 The first decision you need to make is:
 Which feature should be used to split the data?
 You need to try every feature and measure which split will give the
best result.
 Then split the dataset into subsets.
 The subsets will then traverse down the branches of the decision
node.
 If the data on the branch is the same, stop; else repeat the splitting.
Decision Tree (DT)
Pseudo-code for the splitting function
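The original figure is not reproduced here; a plausible reconstruction of the splitting pseudo-code, following the description on the previous slide, is:

    check whether every example in the dataset has the same class label
        if so, return that class label (a leaf node)
        otherwise:
            find the feature that best splits the data
            split the dataset into subsets, one per value of that feature
            create a decision node for that feature
            for each subset, call this procedure recursively and
                attach the returned subtree to the decision node
            return the decision node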
Decision Tree (DT)
 General approach to decision trees:
 Collect: Any method.
 Prepare: The ID3 algorithm works only on nominal values, so any
continuous values will need to be quantized.
 Analyze: Any method. You should visually inspect the tree after it
is built.
 Train: Construct a tree data structure. (DT)
 Test: Calculate the error rate with the learned tree.
 Use: This can be used in any supervised learning task. Often, to
better understand the data.
Decision Tree (DT)
 We would like to classify the following animals into two
classes:
Fish and not Fish
Marine animal data
Decision Tree (DT)
 Need to decide whether we should split the data based on the
first feature or the second feature:
To better organize the unorganized data.
One way to do this is to measure the information.
Measure the information before and after the split.
 Information theory is a branch of science that is concerned with
quantifying information.
 The change in information before and after the split is known
as the information gain.
Decision Tree (DT)
 The split with the highest information gain is the best option.
The measure of information of a set is known as the Shannon
entropy, or simply entropy.
 The change in information before and after the split is known
as the information gain.
Decision Tree (DT)
 To calculate entropy, you need the expected value of the information over
all possible values of our class.
 This is given by:
H(S) = − Σ (i = 1 to n) p(x_i) · log2 p(x_i)
 Where n is the number of classes and p(x_i) is the proportion of examples belonging to class i.
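As an illustration, a minimal Python sketch of this entropy computation (assuming each example is a list whose last element is the class label; the data layout is an assumption for illustration, not something fixed by the slides):

    from math import log2
    from collections import Counter

    def calc_entropy(dataset):
        # Shannon entropy of the class labels; the label is assumed to be
        # the last element of each example.
        total = len(dataset)
        label_counts = Counter(example[-1] for example in dataset)
        entropy = 0.0
        for count in label_counts.values():
            p = count / total            # proportion of examples in this class
            entropy -= p * log2(p)
        return entropy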
Decision Tree (DT)
 The higher the entropy, the more mixed up the data.
 Another common measure of disorder in a set is the Gini
impurity:
the probability that an item chosen at random from the set would be
misclassified if it were labeled at random according to the class distribution.
Calculate the Shannon entropy of a dataset.
Split the dataset on a given feature.
Choose the best feature to split on (sketched below).
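A minimal sketch of the splitting and feature-selection steps, reusing the calc_entropy function from the previous sketch (the function names and data layout are illustrative assumptions):

    def split_dataset(dataset, feature_index, value):
        # Keep the examples whose feature at feature_index equals value,
        # removing that feature column (ID3 does not reuse an attribute).
        subset = []
        for example in dataset:
            if example[feature_index] == value:
                subset.append(example[:feature_index] + example[feature_index + 1:])
        return subset

    def choose_best_feature(dataset):
        # Return the index of the feature whose split gives the highest
        # information gain relative to the current entropy.
        num_features = len(dataset[0]) - 1     # last column is the class label
        base_entropy = calc_entropy(dataset)
        best_gain, best_feature = 0.0, 0
        for i in range(num_features):
            new_entropy = 0.0
            for value in {example[i] for example in dataset}:
                subset = split_dataset(dataset, i, value)
                new_entropy += len(subset) / len(dataset) * calc_entropy(subset)
            gain = base_entropy - new_entropy  # information gain of feature i
            if gain > best_gain:
                best_gain, best_feature = gain, i
        return best_feature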
Decision Tree (DT)
 Recursively building the tree.
Start with the dataset and split it based on the best attribute.
The data will traverse down the branches of the tree to another
node.
This node will then split the data again (recursively).
Stop under the following conditions: we run out of attributes, or the
instances in a branch all belong to the same class (see the sketch below).
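A sketch of the recursion described above, building on the calc_entropy, split_dataset and choose_best_feature sketches from the earlier slides (the nested-dict tree representation and the tiny example dataset are assumptions for illustration):

    def majority_class(class_list):
        # Most common class label, used when no attributes are left to split on.
        return Counter(class_list).most_common(1)[0][0]

    def create_tree(dataset, feature_names):
        class_list = [example[-1] for example in dataset]
        if class_list.count(class_list[0]) == len(class_list):
            return class_list[0]                 # all examples share one class: leaf
        if len(dataset[0]) == 1:
            return majority_class(class_list)    # ran out of attributes
        best = choose_best_feature(dataset)
        best_name = feature_names[best]
        tree = {best_name: {}}
        remaining = feature_names[:best] + feature_names[best + 1:]
        for value in {example[best] for example in dataset}:
            tree[best_name][value] = create_tree(split_dataset(dataset, best, value),
                                                 remaining)
        return tree

    # Tiny made-up dataset in the spirit of the marine-animal example:
    # [can survive without surfacing?, has flippers?, label]
    data = [[1, 1, 'fish'], [1, 1, 'fish'], [1, 0, 'not fish'],
            [0, 1, 'not fish'], [0, 1, 'not fish']]
    print(create_tree(data, ['no surfacing', 'flippers']))
    # -> {'no surfacing': {0: 'not fish', 1: {'flippers': {0: 'not fish', 1: 'fish'}}}}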
Decision Tree (DT)
Table 2: Example training sets
Decision Tree (DT)
Figure 3: Data path while splitting
Decision Tree (DT)
 ID3 uses the information gain measure to select among the
candidate attributes.
Start with the dataset and split it based on the best attribute.
Given a collection S, containing positive and negative examples
of some target concept,
the entropy of S relative to this Boolean classification is:
Entropy(S) = − p+ · log2(p+) − p− · log2(p−)
where p+ is the proportion of positive examples in S and p− the proportion of negative examples.
Decision Tree (DT)
 Example:
The target attribute is PlayTennis (yes/no).
Table 3: Example training sets
Decision Tree (DT)
 Suppose S is a collection of 14 examples of some Boolean
concept, including 9 positive and 5 negative examples.
 Then the entropy of S relative to this Boolean classification is:
Entropy(S) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940
Decision Tree (DT)
 Note that the entropy is 0 if all members of S belong to the
same class.
 For example: if all the members are positive (p+ = 1), then p− = 0, and
Entropy(S) = −1·log2(1) − 0·log2(0) = 0 (taking 0·log2(0) to be 0).
 Note that the entropy is one (1) when the collection contains an
equal number of positive and negative examples.
 If the collection contains unequal numbers of positive and
negative examples, the entropy is between 0 and 1.
Decision Tree (DT)
 Suppose S is a collection of training-example days described by the
attribute Wind, which can have the values Weak and Strong.
 The information gain is the measure used by ID3 to select the
best attribute at each step in growing the tree.
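The formula from the original slide is not reproduced here; the standard ID3 definition is:
Gain(S, A) = Entropy(S) − Σ over v in Values(A) of (|S_v| / |S|) · Entropy(S_v)
where Values(A) is the set of possible values of attribute A and S_v is the subset of S for which attribute A has value v.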
Decision Tree (DT)
Information gain of the two attributes: Humidity and Wind.
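The worked figure is not reproduced here; assuming the usual PlayTennis counts from Table 3, the two gains work out as follows (consistent with the values quoted on the next slide):
Humidity: High → [3+, 4−], Entropy = 0.985; Normal → [6+, 1−], Entropy = 0.592
Gain(S, Humidity) = 0.940 − (7/14)(0.985) − (7/14)(0.592) ≈ 0.151
Wind: Weak → [6+, 2−], Entropy = 0.811; Strong → [3+, 3−], Entropy = 1.000
Gain(S, Wind) = 0.940 − (8/14)(0.811) − (6/14)(1.000) ≈ 0.048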
Decision Tree (DT)
 Example:
ID3 determines the information gain for each attribute
(Outlook, Temperature, Humidity and Wind),
then selects the one with the highest information gain.
The information gain values for all four attributes are:
Gain (S, Outlook) = 0.246
Gain (S, Humidity) = 0.151
Gain (S, Wind) = 0.048
Gain (S, Temperature) = 0.029
Outlook provides greater information gain than the other attributes.
Decision Tree (DT)
 Example:
According to the information gain measure, the Outlook attribute is
selected as the root node.
Branches are created below the root for each of its possible
values. (Sunny, Overcast, and Rain)
Decision Tree (DT)
 The partially learned decision tree resulting from the first step of ID3.
Decision Tree (DT)
 The Overcast descendant has only positive examples and
therefore becomes a leaf node with classification Yes.
 The other two nodes will be expanded by selecting the attribute
with the highest information gain relative to the new subsets.
Decision Tree (DT)
 Decision Tree learning can be:
Classification tree: the target variable takes a finite set of values.
Regression tree: the target variable takes continuous values.
There are many specific Decision Tree algorithms:
ID3 (Iterative Dichotomiser 3)
C4.5 (successor of ID3)
CART (Classification and Regression Trees)
CHAID (Chi-squared Automatic Interaction Detector)
MARS (Multivariate Adaptive Regression Splines): extends DTs to handle numerical data better
Decision Tree (DT)
 Different Decision Tree algorithms use different metrics for
measuring the “best attribute” :
Information gain: used by ID3, C4.5 and C5.0
Gini impurity: used by CART
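For reference, the Gini impurity of a set S with class proportions p_i is:
Gini(S) = 1 − Σ_i (p_i)²
A pure node (all examples in one class) has Gini impurity 0; like entropy, it is largest when the classes are evenly mixed.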
Decision Tree (DT)
 ID3 in terms of its search space and search strategy:
ID3's hypothesis space of all decision trees is a complete space of
finite discrete-valued functions.
ID3 maintains only a single current hypothesis as it searches
through the space of decision trees.
ID3 in its pure form performs no backtracking in its search (a form of
backtracking can be added later by post-pruning the decision tree).
ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current
hypothesis (making it much less sensitive to errors in individual examples).
Decision Tree (DT)
 Inductive bias in Decision Tree learning (ID3) :
Inductive bias is the set of assumptions a learner uses to generalize
beyond the training data.
ID3 selects in favor of shorter trees over longer ones (a breadth-first
style preference).
It selects trees that place the attributes with the highest information
gain closest to the root.
Decision Tree (DT)
 Issues in Decision Tree learning:
How deeply to grow the decision tree,
Handling continuous attributes,
Choosing an appropriate attribute selection measure,
Handling training data with missing attribute values,
Handling attributes with differing costs, and
Improving computational efficiency.
 ID3 was extended to address most of these issues, resulting in C4.5.
Decision Tree (DT)
 Avoiding overfitting the data:
 Noisy data and too few training examples are problems.
 Overfitting is a practical problem for decision trees and many other
learning algorithms.
 Overfitting was found to decrease the accuracy of the learned tree
by 10-25%.
 Approaches to avoid overfitting:
 Stop growing the tree before it overfits (direct, but less practical).
Allow the tree to overfit, and then post-prune it (the most successful approach
in practice).
Decision Tree (DT)
 Incorporating continuous-valued attributes:
In the initial definition of ID3, the attributes and the target value must
take discrete sets of values.
The attributes tested in the decision nodes of the tree must therefore be
discrete-valued:
 Create a new Boolean attribute by thresholding the continuous value, or
 use multiple intervals rather than just two (a threshold-finding sketch follows).
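A minimal sketch of the usual threshold-finding procedure for a single continuous attribute (candidate thresholds are midpoints between adjacent sorted values whose class labels differ); it reuses the calc_entropy sketch from earlier, and the names are illustrative assumptions:

    def best_threshold(values, labels):
        # Return the threshold t maximizing the information gain of the
        # Boolean test "attribute > t" for one continuous attribute.
        pairs = sorted(zip(values, labels))
        base = calc_entropy([[label] for _, label in pairs])
        best_gain, best_t = 0.0, None
        for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
            if l1 == l2:
                continue                        # only split where the class changes
            t = (v1 + v2) / 2
            left = [[l] for v, l in pairs if v <= t]
            right = [[l] for v, l in pairs if v > t]
            gain = (base
                    - len(left) / len(pairs) * calc_entropy(left)
                    - len(right) / len(pairs) * calc_entropy(right))
            if gain > best_gain:
                best_gain, best_t = gain, t
        return best_t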
Decision Tree (DT)
 Alternative measure for selecting attributes:
Information gain favors attributes with many values.
One alternative measure that has been used successfully is the
gain ratio.
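For reference, the gain ratio used by C4.5 is defined as:
SplitInformation(S, A) = − Σ_i (|S_i| / |S|) · log2(|S_i| / |S|), where S_1 … S_c are the subsets produced by splitting S on the c values of attribute A
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation is large for attributes with many evenly populated values, so dividing by it penalizes exactly the attributes that plain information gain favors.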
Decision Tree (DT)
 Handling training examples with missing attribute values:
Assign it the most common value among the training examples at
node n.
Assign a probability to each of the possible values of the attribute.
The second approach is used in C4.5.
Decision Tree (DT)
 Handling attributes with different costs:
Prefer low-cost attributes over high-cost attributes.
ID3 can be modified to take costs into account by introducing a cost
term into the attribute selection measure:
for example, divide the gain by the cost of the attribute.
Question & Answer
Thank You !!!
Assignment II
 Answer the given questions by considering the following set of training examples.
Assignment II
(a) What is the entropy of this collection of training examples with respect to the target function classification?
 (b) What is the information gain of a2 relative to these training examples?
Decision Tree (DT)
 Do some research on the following Decision Tree algorithms:
ID3 (Iterative Dichotomiser 3)
C4.5 (successor of ID3)
CART (Classification and Regression Trees)
CHAID (Chi-squared Automatic Interaction Detector)
MARS (Multivariate Adaptive Regression Splines): extends DTs to handle numerical data better