Machine Learning with Python
Machine Learning Algorithms - RANDOM FOREST
Prof. Shibdas Dutta,
Associate Professor,
DCG, Data-Core Systems India Pvt Ltd
Kolkata
Company Confidential: Data-Core Systems, Inc. | datacoresystems.com
Machine Learning Algorithms – Classification Algorithm: RANDOM FOREST
Introduction - RANDOM FOREST
As the name suggests, a random forest is a “forest” of trees, i.e. decision trees.
A random forest is a tree-based machine learning algorithm that randomly selects subsets of
features to build multiple decision trees.
The random forest then combines the outputs of the individual decision trees to generate
the final output.
Each decision tree greedily selects the best split point in its data at each step.
We can use random forests for classification as well as regression problems.
If the total number of columns (features) in the training dataset is denoted by p:
we consider sqrt(p) columns at each split for classification;
for regression, we consider p/3 columns at each split.
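For reference, these defaults map directly onto scikit-learn’s max_features parameter; the snippet below is a minimal illustrative sketch (model settings are assumptions, not part of the original slides).
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: consider sqrt(p) features at each split
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)

# Regression: consider roughly p/3 features at each split
# (a float max_features is interpreted as a fraction of the total number of columns)
reg = RandomForestRegressor(n_estimators=100, max_features=1/3, random_state=42)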
WHEN TO USE RANDOM FOREST?
When we care more about accuracy than about interpretability.
When we want better accuracy on unseen (validation) data.
HOW TO USE RANDOM FOREST?
Select random (bootstrap) samples from the given dataset.
Construct a decision tree from each sample and obtain its prediction.
Perform a vote over the predicted results.
The most-voted prediction is selected as the final result.
Random Forest
The following diagram illustrates how it works:
[Diagram: the training set is drawn into random bootstrap samples (Training Sample 1 … Training Sample n); a decision tree is built on each sample; the trees’ predictions are combined by voting; the voted result is the final prediction for the test set.]
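A minimal from-scratch sketch of the sample-train-vote procedure above, using scikit-learn decision trees on a toy dataset (the dataset, tree count, and binary-label voting rule are illustrative assumptions, not from the slides):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data; in practice this would be your own training set
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):
    # 1. Select a random (bootstrap) sample from the dataset
    idx = rng.integers(0, len(X), size=len(X))
    # 2. Construct a decision tree from that sample
    trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))

# 3./4. Collect every tree's prediction and take the majority vote
all_preds = np.array([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
voted = (all_preds.mean(axis=0) >= 0.5).astype(int)   # majority vote for 0/1 labels
print("ensemble training accuracy:", (voted == y).mean())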
STOCK PREDICTION USING RANDOM FOREST - EXAMPLE
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
data = pd.read_csv('data.csv')
data.head()
Here, we will be using a dataset (available below) which contains seven columns: date, open, high, low, close,
volume, and the name of the company.
In this case Google is the only company in the data.
Open is the price at which the stock first trades when the exchange opens for the period.
Low is the lowest price the stock (or index) reaches during the period.
High is the highest price the stock (or index) reaches during the period.
Close is the last price at which the stock trades during the period; this is the value we will predict.
Volume is the number of shares traded during the period, often quoted as average daily trading volume.
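If data.csv is not at hand, a small synthetic frame with the same kind of columns can stand in for it so the example runs end to end (the values below are purely made up):
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv: one year of made-up prices for a single stock
rng = np.random.default_rng(0)
n = 250
close = 1000 + rng.normal(0, 5, n).cumsum()
data = pd.DataFrame({
    "Date": pd.date_range("2019-01-02", periods=n, freq="B").strftime("%Y-%m-%d"),
    "Open": close + rng.normal(0, 2, n),
    "High": close + rng.uniform(0, 5, n),
    "Low": close - rng.uniform(0, 5, n),
    "Close": close,
    "Volume": rng.integers(1_000_000, 2_000_000, n),
    # The company-name column is omitted here: with a single company it carries no
    # information, and scikit-learn needs numeric features.
})
data.head()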
# Strip the dashes from the 'Date' strings (e.g. '2019-01-02' -> '20190102') and
# cast to int so the column can be used as a numeric feature
data['Date'] = data['Date'].str.replace('-', '').astype(int)
Using the above dataset, we now try to predict the ‘Close’ value from all the other attributes. Let’s split the data into
training and test sets.
# Features: every column except 'Close' (the label).
# If a company-name column is present as a string, drop or encode it first,
# since the regressor needs numeric features.
X_1 = data.drop('Close', axis=1)
Y_1 = data['Close']
# Using scikit-learn to split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_1, Y_1, test_size=0.33, random_state=42)
Now, let’s instantiate the model and train it on the training dataset:
rfg = RandomForestRegressor(n_estimators=10, random_state=42)
rfg.fit(X_train_1, y_train_1)
# Show the predicted values next to the actual test labels
pd.concat([pd.Series(rfg.predict(X_test_1)), y_test_1.reset_index(drop=True)], axis=1)
Let’s rank the features by importance, using the model’s numerical feature importances:
# Saving feature names for later use
feature_list = list(X_1.columns)
print(feature_list)
# Get numerical feature importances
importances = list(rfg.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
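Since matplotlib is already imported at the top, the importances can also be plotted as a bar chart (a small optional addition, not in the original slides):
# Visualise the importances as a bar chart (most important feature first)
plt.figure(figsize=(8, 4))
plt.bar([name for name, _ in feature_importances],
        [imp for _, imp in feature_importances])
plt.ylabel('Importance')
plt.title('Random forest feature importances')
plt.tight_layout()
plt.show()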
rfg.score(X_test_1, y_test_1)
For a regressor, .score() returns the R² value, which here is about 0.99 on the test set. We then display the actual and the predicted values.
pd.concat([pd.Series(rfg.predict(X_test_1)), y_test_1.reset_index(drop=True)], axis=1)
[Output: table of predicted vs. actual ‘Close’ values]
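Because this is a regression task, the R² from .score() can be supplemented with error metrics from sklearn.metrics (an optional check, not part of the original slides):
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

preds = rfg.predict(X_test_1)
print("R^2 :", r2_score(y_test_1, preds))
print("MAE :", mean_absolute_error(y_test_1, preds))
print("RMSE:", np.sqrt(mean_squared_error(y_test_1, preds)))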
ADVANTAGES OF RANDOM FOREST
It reduces overfitting because the final prediction is based on majority voting (or averaging) across many trees.
Random forest can be used for classification as well as regression.
It works well on a wide range of datasets.
Random forest generalizes well to unseen data and remains reasonably robust even when some data is missing.
Data normalization isn’t required, since it is a rule-based (threshold-splitting) approach.
DISADVANTAGES
Random forest requires much more computational power and memory than a single tree, because it builds many decision trees.
Because it is an ensemble of decision trees, it loses interpretability, and the contribution of each variable to the prediction is harder to explain.
Random forests can be hard to reason about intuitively when the collection of trees is large.
With bagging alone, the trees can end up correlated with one another: the same greedy algorithm is applied to overlapping bootstrap samples, so the trees tend to choose the same or very similar split points, which undercuts the variance reduction that bagging is meant to provide. (Random forests counteract this by also subsampling features at each split; the sketch below illustrates the difference.)
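The comparison below is a small illustrative sketch on synthetic data (not from the slides): the same forest class is configured once as plain bagged trees (max_features=None, every split sees all features) and once with feature subsampling (max_features=1/3); how large the gap is depends on the data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# "Bagging only": every split may consider all 20 features, so the trees tend to agree
bagged_trees = RandomForestRegressor(n_estimators=100, max_features=None, random_state=0)
# Random forest proper: each split sees only about a third of the features
random_forest = RandomForestRegressor(n_estimators=100, max_features=1/3, random_state=0)

print("bagged trees  R^2:", cross_val_score(bagged_trees, X, y, cv=5).mean())
print("random forest R^2:", cross_val_score(random_forest, X, y, cv=5).mean())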
Thank You