Machine Learning Chapter 3. Decision Tree Learning Tom M. Mitchell
Abstract Decision tree representation ID3 learning algorithm Entropy, Information gain Overfitting
Decision Tree for  PlayTennis
A Tree to Predict C-Section Risk Learned from medical records of 1000 women. Negative examples are C-sections.
Decision Trees Decision tree representation: Each internal node tests an attribute. Each branch corresponds to an attribute value. Each leaf node assigns a classification. How would we represent: ∧, ∨, XOR, (A ∧ B) ∨ (C ∧ ¬D ∧ E), M of N?
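To make the representation question concrete, here is a minimal sketch (illustrative only, not from the slides) encoding A ∧ B and XOR as decision trees over boolean attributes, using nested dicts for internal nodes:

```python
# Minimal sketch: decision trees over boolean attributes as nested dicts.
# Internal nodes are {"attr": name, "branches": {value: subtree}}; leaves are labels.

AND_TREE = {  # A AND B: B only needs to be tested when A is true
    "attr": "A",
    "branches": {
        False: "-",
        True: {"attr": "B", "branches": {False: "-", True: "+"}},
    },
}

XOR_TREE = {  # A XOR B: B must be tested under both branches of A
    "attr": "A",
    "branches": {
        False: {"attr": "B", "branches": {False: "-", True: "+"}},
        True:  {"attr": "B", "branches": {False: "+", True: "-"}},
    },
}

def classify(tree, example):
    """Walk the tree until a leaf (a class label string) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attr"]]]
    return tree

print(classify(XOR_TREE, {"A": True, "B": False}))  # -> "+"
```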
When to Consider Decision Trees Instances describable by attribute-value pairs Target function is discrete valued Disjunctive hypothesis may be required Possibly noisy training data Examples: Equipment or medical diagnosis Credit risk analysis Modeling calendar scheduling preferences
Top-Down Induction of Decision Trees Main loop: 1. A ← the “best” decision attribute for the next node 2. Assign A as decision attribute for node 3. For each value of A, create a new descendant of node 4. Sort training examples to leaf nodes 5. If training examples perfectly classified, Then STOP, Else iterate over new leaf nodes Which attribute is best?
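The main loop can be sketched as follows. This is a hedged illustration, not the slides' own code: the list-of-dicts data layout, the nested-dict tree format, and the `choose_best_attribute` helper are assumptions; the selection criterion itself (information gain) is developed on the next slides.

```python
from collections import Counter

def id3(examples, target, attributes, choose_best_attribute):
    """Sketch of the ID3 main loop above. `examples` is a list of dicts,
    `target` is the class key, `attributes` the candidate attribute names,
    and `choose_best_attribute` the selection criterion (e.g. information gain)."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:            # training examples perfectly classified: STOP
        return labels[0]
    if not attributes:                   # no attributes left: fall back to majority vote
        return Counter(labels).most_common(1)[0][0]

    best = choose_best_attribute(examples, target, attributes)    # steps 1-2
    tree = {"attr": best,
            "majority": Counter(labels).most_common(1)[0][0],     # kept for pruning later
            "branches": {}}
    for value in {ex[best] for ex in examples}:                   # step 3
        subset = [ex for ex in examples if ex[best] == value]     # step 4
        tree["branches"][value] = id3(subset, target,             # step 5: recurse
                                      [a for a in attributes if a != best],
                                      choose_best_attribute)
    return tree
```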
Entropy(1/2) S is a sample of training examples. p⊕ is the proportion of positive examples in S; p⊖ is the proportion of negative examples in S. Entropy measures the impurity of S: Entropy(S) ≡ -p⊕ log2 p⊕ - p⊖ log2 p⊖
Entropy(2/2) Entropy(S) = expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code). Why? Information theory: an optimal-length code assigns -log2 p bits to a message having probability p. So the expected number of bits to encode ⊕ or ⊖ of a random member of S is p⊕(-log2 p⊕) + p⊖(-log2 p⊖), i.e. Entropy(S) ≡ -p⊕ log2 p⊕ - p⊖ log2 p⊖
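A direct transcription of the entropy formula, as a small self-contained sketch (class labels are assumed to be encoded as "+" and "-"):

```python
import math

def entropy(labels):
    """Entropy(S) = -p_plus * log2(p_plus) - p_minus * log2(p_minus),
    with 0 * log2(0) taken as 0 (a class that never occurs contributes nothing)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# The [9+, 5-] PlayTennis sample used in the chapter has entropy of about 0.940:
print(round(entropy(["+"] * 9 + ["-"] * 5), 3))   # -> 0.94
```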
Information Gain Gain(S, A) = expected reduction in entropy due to sorting on A: Gain(S, A) ≡ Entropy(S) - Σ over v ∈ Values(A) of (|S_v| / |S|) Entropy(S_v), where S_v is the subset of S for which attribute A has value v
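The gain formula translates directly into code. This sketch is self-contained (entropy is redefined so the snippet runs on its own); the list-of-dicts data layout is an assumption of the illustration.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, target, attribute):
    """Gain(S, A) = Entropy(S) - sum over values v of A of (|S_v|/|S|) * Entropy(S_v)."""
    total = entropy([ex[target] for ex in examples])
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        total -= (len(subset) / len(examples)) * entropy(subset)
    return total
```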
Training Examples
Selecting the Next Attribute(1/2) Which attribute is the best classifier?
Selecting the Next Attribute(2/2) S_sunny = {D1, D2, D8, D9, D11} Gain(S_sunny, Humidity) = .970 - (3/5) 0.0 - (2/5) 0.0 = .970 Gain(S_sunny, Temperature) = .970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = .570 Gain(S_sunny, Wind) = .970 - (2/5) 1.0 - (3/5) .918 = .019
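The numbers above can be reproduced directly. The five Outlook = Sunny rows below are taken from the standard PlayTennis table in the chapter; since the table image did not survive in this transcript, treat these values as an assumption of the sketch.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, target, attribute):
    total = entropy([ex[target] for ex in examples])
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        total -= (len(subset) / len(examples)) * entropy(subset)
    return total

S_sunny = [  # D1, D2, D8, D9, D11 (assumed from the chapter's PlayTennis table)
    {"Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Temperature": "Hot",  "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
    {"Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Temperature": "Mild", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
]

for attr in ("Humidity", "Temperature", "Wind"):
    print(attr, round(gain(S_sunny, "PlayTennis", attr), 3))
# Humidity 0.971, Temperature 0.571, Wind 0.02 -- the slide's .970 / .570 / .019,
# up to rounding, so Humidity is chosen under the Sunny branch.
```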
Hypothesis Space Search by ID3(1/2)
Hypothesis Space Search by ID3(2/2) Hypothesis space is complete! Target function surely in there... Outputs a single hypothesis (which one?) Can’t play 20 questions... No back tracking Local minima... Statistically-based search choices Robust to noisy data... Inductive bias: approx “prefer shortest tree”
Inductive Bias in ID3 Note H is the power set of instances X ->  Unbiased? Not really... Preference for short trees, and for those with high information gain attributes near the root Bias is a  preference  for some hypotheses, rather than a  restriction  of hypothesis space H Occam's razor: prefer the shortest hypothesis that fits the data
Occam’s Razor Why prefer short hypotheses? Argument in favor : Fewer short hyps. than long hyps. ->   a short hyp that fits data unlikely to be coincidence ->   a long hyp that fits data might be coincidence Argument opposed : There are many ways to define small sets of hyps e.g., all trees with a prime number of nodes that use attributes beginning with “Z” What's so special about small sets based on  size  of hypothesis??
Overfitting in Decision Trees Consider adding noisy training example #15: Sunny, Hot, Normal, Strong,   PlayTennis  =  No What effect on earlier tree?
Overfitting Consider the error of hypothesis h over training data: error_train(h) entire distribution D of data: error_D(h) Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that error_train(h) < error_train(h') and error_D(h) > error_D(h')
Overfitting in Decision Tree Learning
Avoiding Overfitting How can we avoid overfitting? stop growing when data split not statistically significant grow full tree, then post-prune How to select “best” tree: Measure performance over training data Measure performance over separate validation data set MDL: minimize size(tree) + size(misclassifications(tree))
Reduced-Error Pruning Split data into training and validation set. Do until further pruning is harmful: 1. Evaluate impact on validation set of pruning each possible node (plus those below it) 2. Greedily remove the one that most improves validation set accuracy This produces the smallest version of the most accurate subtree. What if data is limited?
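A minimal sketch of this loop, under stated assumptions: trees use the nested-dict format from the earlier sketches, each internal node carries a "majority" training-set label (as in the ID3 sketch), and "harmful" is read strictly, so pruning stops when no node strictly improves validation accuracy.

```python
import copy

def classify(tree, example):
    """Follow branches to a leaf; fall back to the node's majority label for unseen values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["attr"]], tree["majority"])
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def internal_nodes(tree, path=()):
    """Yield the path (sequence of branch values) to every internal node."""
    if isinstance(tree, dict):
        yield path
        for value, sub in tree["branches"].items():
            yield from internal_nodes(sub, path + (value,))

def prune_at(tree, path):
    """Return a copy of `tree` with the node at `path` replaced by its majority leaf."""
    pruned = copy.deepcopy(tree)
    if not path:                          # pruning the root collapses the whole tree
        return pruned["majority"]
    parent = pruned
    for value in path[:-1]:
        parent = parent["branches"][value]
    parent["branches"][path[-1]] = parent["branches"][path[-1]]["majority"]
    return pruned

def reduced_error_prune(tree, validation, target):
    """Greedily prune whichever node most improves validation accuracy; stop
    when no pruning strictly helps."""
    while isinstance(tree, dict):
        base = accuracy(tree, validation, target)
        best_improvement, best_path = 0.0, None
        for path in internal_nodes(tree):
            improvement = accuracy(prune_at(tree, path), validation, target) - base
            if improvement > best_improvement:
                best_improvement, best_path = improvement, path
        if best_path is None:             # further pruning would be harmful (or useless)
            return tree
        tree = prune_at(tree, best_path)
    return tree
```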
Effect of Reduced-Error Pruning
Rule Post-Pruning 1. Convert tree to equivalent set of rules 2. Prune each rule independently of others 3. Sort final rules into desired sequence for use Perhaps most frequently used method (e.g., C4.5 )
Converting A Tree to Rules IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes ...
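Step 1 of rule post-pruning is a straightforward path enumeration. A sketch, assuming the nested-dict tree format used in the earlier illustrations (rule pruning itself, step 2, is not shown here):

```python
def tree_to_rules(tree, target="PlayTennis", conditions=()):
    """Emit one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):                      # leaf reached: emit a rule
        body = " AND ".join(f"({a} = {v})" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN {target} = {tree}"]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree, target, conditions + ((tree["attr"], value),))
    return rules
```

Each resulting rule would then be pruned independently, dropping preconditions whose removal does not hurt estimated accuracy, before sorting the rules for use.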
Continuous Valued Attributes Create a discrete attribute to test the continuous one: Temperature = 82.5, (Temperature > 72.3) = t, f
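One common way to pick the threshold is to consider midpoints between adjacent sorted values whose class labels differ, and keep the candidate with the highest information gain. A sketch; the Temperature values below follow the example discussed in the chapter text and are otherwise an assumption of this snippet.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (gain, threshold) for the best boolean test (value > threshold).
    Candidate thresholds are midpoints between adjacent sorted values whose labels differ."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (-1.0, None)
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2 or v1 == v2:
            continue
        t = (v1 + v2) / 2
        below = [l for v, l in pairs if v <= t]
        above = [l for v, l in pairs if v > t]
        gain = base - (len(below) * entropy(below) + len(above) * entropy(above)) / len(pairs)
        best = max(best, (gain, t))
    return best

print(best_threshold([40, 48, 60, 72, 80, 90],
                     ["No", "No", "Yes", "Yes", "Yes", "No"]))
# -> (0.459..., 54.0): the candidate at (48 + 60) / 2 wins over the one at 85.
```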
Attributes with Many Values Problem: If an attribute has many values, Gain will select it. Imagine using Date = Jun_3_1996 as an attribute. One approach: use GainRatio instead: GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A), SplitInformation(S, A) ≡ -Σ (|S_i| / |S|) log2(|S_i| / |S|), where S_i is the subset of S for which A has value v_i
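In code, SplitInformation is just the entropy of S with respect to the values of A rather than the target class. A self-contained sketch:

```python
import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, target, attribute):
    total = _entropy([ex[target] for ex in examples])
    for value in {ex[attribute] for ex in examples}:
        part = [ex[target] for ex in examples if ex[attribute] == value]
        total -= (len(part) / len(examples)) * _entropy(part)
    return total

def split_information(examples, attribute):
    # SplitInformation(S, A): entropy of S with respect to the *values* of A
    return _entropy([ex[attribute] for ex in examples])

def gain_ratio(examples, target, attribute):
    # GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
    return gain(examples, target, attribute) / split_information(examples, attribute)
```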
Attributes with Costs Consider medical diagnosis, where BloodTest has cost $150, or robotics, where Width_from_1ft has cost 23 sec. How to learn a consistent tree with low expected cost? One approach: replace gain by Gain²(S, A) / Cost(A) (Tan and Schlimmer, 1990) or (2^Gain(S, A) - 1) / (Cost(A) + 1)^w (Nunez, 1988), where w ∈ [0, 1] determines the importance of cost.
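The two cost-sensitive measures are simple rescalings of gain. A brief sketch (the argument names are illustrative; `gain` is an already-computed Gain(S, A) and `cost` is Cost(A)):

```python
def tan_schlimmer(gain, cost):
    """Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w):
    """Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w
```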
Unknown Attribute Values What if some examples are missing values of A? Use the training example anyway, and sort it through the tree. If node n tests A: assign the most common value of A among the other examples sorted to node n; or assign the most common value of A among the other examples with the same target value; or assign probability p_i to each possible value v_i of A and pass fraction p_i of the example to each descendant in the tree. Classify new examples in the same fashion.
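A minimal sketch of the first strategy only (most common value at the node); using None to mark a missing value is an assumption of this snippet, not part of the slide.

```python
from collections import Counter

def fill_with_most_common(examples, attribute):
    """Replace missing values (None) of `attribute` with the most common
    observed value among the other examples at this node."""
    observed = [ex[attribute] for ex in examples if ex[attribute] is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [dict(ex, **{attribute: fill}) if ex[attribute] is None else ex
            for ex in examples]
```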
