Machine Learning
Decision Tree and Random Forest
Machine Learning
• Introduction
• What is ML, DL, AI?
• Decision Tree
Definition
Why Decision Tree?
Basic Terminology
Challenges
• Random Forest
Definition
Why Random Forest
How it works?
• Advantages & Disadvantages
Machine Learning
According to Arthur Samuel (1959), “Machine Learning is a field of study that gives computers the
ability to learn without being explicitly programmed”.
Machine learning is the study and design of algorithms that can learn by processing input data
(learning samples).
The most widely used definition of machine learning is that of Carnegie Mellon University Professor
Tom Mitchell: “A computer program is said to learn from experience ‘E’, with respect to some
class of tasks ‘T’ and performance measure ‘P’, if its performance at tasks in ‘T’, as measured by
‘P’, improves with experience ‘E’”.
Machine Learning
Figure: Nested relationship between AI, Machine Learning, Deep Learning, and Data Science
Decision Tree & Random Forest
• Decision Tree
 Definition
 Why Decision Tree?
 Basic Terminology
 Challenges
• Random Forest
 Definition
 Why Random Forest
 How it works?
Decision Tree
A decision tree is a supervised machine learning algorithm that can be used
for classification as well as regression problems. It represents the target on
its leaf nodes as the result of inferences made along a tree-like structure.
Why Decision Tree?
 Helpful in solving more complex problems where a linear prediction line does not
perform well
 Gives a clear graphical presentation of each possible result
Decision Tree & Random Forest
Why Decision Tree?
Prediction can be done with a linear regression line.
(Figure: Effectiveness vs. Dose (mg), showing a linear fit)

Dose (mg)   Age   Sex   Effect
10          25    F     95
20          78    M     0
35          52    F     98
5           12    M     44
…           …     …     …

Prediction cannot be done with a linear regression line.
Why Decision Tree?
Dose (mg)   Age   Sex   Effect
10          25    F     95
20          78    M     0
35          52    F     98
5           12    M     44
…           …     …     …

Figure: Sample dataset (left) and the corresponding sample decision tree (right)
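A tiny sketch (not from the slides) fitting a regression tree to the sample rows above, just to illustrate that decision trees also handle regression targets; the column names and the encoding of Sex are illustrative:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# The four sample rows from the table above (Sex encoded as F=0, M=1)
data = pd.DataFrame({
    "dose_mg": [10, 20, 35, 5],
    "age":     [25, 78, 52, 12],
    "sex":     [0, 1, 0, 1],
    "effect":  [95, 0, 98, 44],
})

reg = DecisionTreeRegressor(random_state=0)
reg.fit(data[["dose_mg", "age", "sex"]], data["effect"])

# Predict the effect for a new patient (values are illustrative)
print(reg.predict(pd.DataFrame({"dose_mg": [15], "age": [30], "sex": [0]})))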
Decision Tree
Figure: Decision tree structure with a root node, intermediate nodes, and leaf nodes
Root Node: The top-most node of a decision tree. It does not have any parent node and represents
the entire population or sample.
Intermediate Node: A node that has both a parent and child nodes; it represents a test on a feature
that splits the data further.
Leaf / Terminal Nodes: Nodes that do not have any child nodes are known as terminal or leaf nodes;
they hold the predicted outcome.
Challenges in building a Decision Tree
1. How to decide the splitting criterion?
• Categorical target variable
• Continuous target variable
2. How to decide the depth of the decision tree / when to stop?
• Reasonably all data points have been covered
• Check for node purity / homogeneity
3. Overfitting
• Pre-pruning
• Post-pruning
How to build a decision tree using these criteria

Love Popcorn   Love Soda   Gender   Love Ice cream
Y              Y           M        N
Y              N           F        N
N              Y           M        Y
N              Y           M        Y
Y              Y           F        Y
Y              N           F        N
N              N           M        N
Y              N           M        ?

Which feature should be the root node?
How to decide the splitting criterion?
1. If the target variable is categorical:
Gini Impurity: Indicates the impurity of a split; the feature with the lowest weighted impurity is the best
candidate for the root or split node.
Entropy / Information Gain: Entropy measures the impurity of a split, and information gain moves in the
opposite direction: the higher the entropy, the lower the information gain. The feature with the highest
information gain is the most likely to become the root node.
Chi-Square: Measures the statistical significance of the difference between a parent node and its child
nodes; features with higher chi-square values make better split candidates.
2. If the target variable is continuous:
• Reduction in variance: The variance of the target within each candidate split is computed, and the
feature that reduces the variance the most is chosen as the splitting node.
A minimal sketch of these criteria is shown below.
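A minimal sketch of these criteria (not from the slides), assuming plain NumPy arrays of labels and feature values; the function names are illustrative:

import numpy as np

def gini_impurity(labels):
    # Gini impurity of a set of class labels: 1 - sum(p_k^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy of a set of class labels: -sum(p_k * log2(p_k))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_impurity(feature, target, criterion=gini_impurity):
    # Weighted impurity of the children obtained by splitting on `feature`
    score = 0.0
    for value in np.unique(feature):
        mask = feature == value
        score += mask.sum() / len(target) * criterion(target[mask])
    return score

def information_gain(feature, target):
    # Entropy of the parent minus the weighted entropy of the children
    return entropy(target) - weighted_impurity(feature, target, criterion=entropy)

def variance_reduction(feature, target):
    # Reduction in variance, used when the target variable is continuous
    child_var = 0.0
    for value in np.unique(feature):
        mask = feature == value
        child_var += mask.sum() / len(target) * np.var(target[mask])
    return np.var(target) - child_var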
How to build a decision tree using the Gini index / impurity

Love Popcorn   Love Soda   Gender   Love Ice cream
Y              Y           M        N
Y              N           F        N
N              Y           M        Y
N              Y           M        Y
Y              Y           F        Y
Y              N           F        N
N              N           M        N
Y              N           M        ?

Which feature should be the root node?
Gini impurity of leaf Love Popcorn (Yes): 0.375
Gini impurity of leaf Love Popcorn (No): 0.444
Gini impurity of feature Love Popcorn: 0.404
Gini impurity of leaf Love Soda (Yes): 0.375
Gini impurity of leaf Love Soda (No): 0
Gini impurity of feature Love Soda: 0.214
Gini impurity of leaf Gender (Male): 0.5
Gini impurity of leaf Gender (Female): 0.444
Gini impurity of feature Gender: 0.476
Love Soda has the lowest weighted Gini impurity (0.214), so it becomes the root node.
Figure: Feature-wise Gini impurity with respect to the target variable, i.e. Love Ice cream
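A small sketch (not from the slides) that reproduces these values from the toy table, assuming pandas; the column names are chosen for illustration:

import pandas as pd

# Toy dataset from the slides (target: love_icecream)
df = pd.DataFrame({
    "love_popcorn":  list("YYNNYYN"),
    "love_soda":     list("YNYYYNN"),
    "gender":        list("MFMMFFM"),
    "love_icecream": list("NNYYYNN"),
})

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2)
    p = labels.value_counts(normalize=True)
    return 1.0 - (p ** 2).sum()

def feature_gini(data, feature, target="love_icecream"):
    # Weighted Gini impurity of splitting on `feature`
    return sum(
        len(group) / len(data) * gini(group[target])
        for _, group in data.groupby(feature)
    )

for col in ["love_popcorn", "love_soda", "gender"]:
    print(col, round(feature_gini(df, col), 3))
# ≈ 0.405, 0.214, 0.476 (matching the slide values up to rounding)
# -> Love Soda has the lowest impurity and is chosen as the root node.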
Decision Tree
Figure: Initial decision tree with Love Soda as the root node. Which feature becomes the next node?
Decision Tree
Love Soda   Love Popcorn   Gender   Love Ice cream
Y           Y              M        N
Y           N              M        Y
Y           N              M        Y
Y           Y              F        Y

Figure: Subset of the data reaching the intermediate node (Love Soda = Yes), described with the
target variable, i.e. Love Ice cream
Decision Tree
Gini impurity of leaf Love Popcorn (Yes): 0.5
Gini impurity of leaf Love Popcorn (No): 0
Gini impurity of feature Love Popcorn: 0.25
Gini impurity of leaf Gender (Male): 0.444
Gini impurity of leaf Gender (Female): 0
Gini impurity of feature Gender: 0.333
Love Popcorn has the lower weighted impurity (0.25), so it becomes the next split node.
Figure: Feature-wise Gini impurity within the Love Soda = Yes subset, with respect to the target variable Love Ice cream
Decision Tree
Love Soda   Love Popcorn   Gender   Love Ice cream
Y           Y              M        N
Y           Y              F        Y

Subset where Love Soda = Yes and Love Popcorn = Yes
Decision Tree
Figure: Final decision tree

Love Popcorn   Love Soda   Gender   Love Ice cream
Y              Y           M        N
Y              N           F        N
N              Y           M        Y
N              Y           M        Y
Y              Y           F        Y
Y              N           F        N
N              N           M        N
Y              Y           M        ?
Decision Tree
Overfitting problem: Decision trees are prone to overfitting because of the high variance in the
outcomes they produce, which makes their results unstable. This can be overcome with the
following methods:
Pre-pruning: Tune hyperparameters (such as maximum depth or minimum samples per leaf) while fitting the
decision tree classifier; a minimal sketch follows.
Post-pruning: Grow the tree first, then choose an alpha value and prune it with the cost-complexity
pruning (CCP) alpha parameter, as the hands-on section below demonstrates.
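A minimal pre-pruning sketch (not from the slides), using the same breast-cancer dataset as the hands-on section below; the specific hyperparameter values are illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: constrain the tree while it is being grown
pruned = DecisionTreeClassifier(
    max_depth=4,           # limit the depth of the tree
    min_samples_leaf=5,    # require at least 5 samples in each leaf
    min_samples_split=10,  # require at least 10 samples to split a node
    random_state=0,
)
pruned.fit(X_train, y_train)
print(accuracy_score(y_test, pruned.predict(X_test)))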
Hands-On Decision Tree
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the breast cancer dataset and split it into train and test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit an unpruned decision tree and evaluate it on the test set
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_score(y_test, pred)
Hands-On Decision Tree
from sklearn import tree

# Visualize the fitted (unpruned) tree
plt.figure(figsize=(15, 10))
tree.plot_tree(clf, filled=True)
Hands-On Decision Tree
# Compute the effective alphas for cost-complexity (post) pruning
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
ccp_alphas  # inspect the candidate alphas

# Fit one tree per candidate alpha
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))

# Compare training and test accuracy across the candidate alphas
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test", drawstyle="steps-post")
ax.legend()
plt.show()
Hands-On Decision Tree
# Refit with the alpha chosen from the accuracy-vs-alpha plot and evaluate
clf = DecisionTreeClassifier(random_state=0, ccp_alpha=0.012)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_score(y_test, pred)

# Visualize the pruned tree
plt.figure(figsize=(15, 10))
tree.plot_tree(clf, filled=True)
Random Forest
Definition: Random forest is an ensemble technique based on bagging (Bootstrap
Aggregation). It works on the principle of the “wisdom of the crowd”.
Why Random Forest?
Random forests are mostly used to overcome the overfitting issue of a single decision
tree classifier: they reduce the variance problem of decision trees and produce more
stable, accurate results.
Random Forest
How does it work?
1. Draw multiple bootstrap samples (random samples with replacement) from the training data.
2. Train a separate decision tree on each bootstrap sample, considering only a random subset of features at each split.
3. Aggregate the predictions of all trees: majority vote for classification, average for regression.
A minimal sketch is shown below.
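A minimal sketch (not from the slides) of a random forest on the same breast-cancer split used earlier; the n_estimators value is illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample and considers a random
# subset of features at every split; predictions are aggregated by majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(accuracy_score(y_test, rf.predict(X_test)))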
Decision Tree & Random Forest
Decision Tree:
Advantages:
1. Simple to implement, essentially a set of IF-ELSE statements
2. Easy to visualize and understand
3. Can be used for classification as well as regression
Disadvantages:
1. Overfitting
2. Unstable results
3. Sensitive to noisy data
4. Less effective on large datasets
Decision Tree & Random Forest
Random Forest:
Advantages:
1. Overcomes the overfitting problem of a single decision tree
2. Can be used for classification as well as regression
Disadvantages:
1. Higher training time than a decision tree
2. Less effective on small datasets
3. Requires more computational power and resources
Decision Tree & Random Forest
References:
• https://en.wikipedia.org/wiki/Decision_tree
• https://en.wikipedia.org/wiki/Random_forest
• https://en.wikipedia.org/wiki/Random_forest#Bagging
• https://en.wikipedia.org/wiki/Decision_tree#Association_rule_induction
• https://en.wikipedia.org/wiki/Decision_tree#Advantages_and_disadvantages
• https://en.wikipedia.org/wiki/Machine_learning
• https://en.wikipedia.org/wiki/Machine_learning#Artificial_intelligence
• https://en.wikipedia.org/wiki/Machine_learning#Overfitting
• https://www.abrisconsult.com/artificial-intelligence-and-data-science/
Decision Tree & Random Forest