Machine Learning - V
Random Forest - III
Feature Selection & Variable Importance
Rupak Roy
Random forest offers an important feature-selection capability: an explicit/implicit ranking of the predictor variables, which helps us quickly understand which features are and are not affecting the final result/model.
The splitting criterion is again based on node impurity, as in decision trees: Gini/information gain for classification and variance for regression.
Both regression and classification random forests also make use of out-of-bag (OOB) samples.
Thus, while training a tree, we can compute how much each feature decreases the weighted impurity of that tree. For the forest as a whole, the impurity decreases contributed by each feature can be averaged across the trees to rank the features.
Feature Selection & Variable Importance
There are two ways Random Forest measures variable importance:
1) Mean Decrease Gini 2) Mean Decrease Accuracy
1) Mean Decrease Gini is the average (mean) of a variable's total decrease in node impurity across the trees; in simple words, it is a measure of node impurity in tree-based classification. A low Gini score means that the particular feature/predictor variable plays a greater role in splitting the data into the classes.
2) Mean Decrease Accuracy is the permutation-based measure reported by R's randomForest package. It measures how much the predictive accuracy (on the OOB samples) drops when the variable's values are permuted; that loss of accuracy is the Mean Decrease in Accuracy.
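As a quick reference (a sketch added here, not part of the original slides, using the sample data and target introduced on the following slides): once a forest is fitted with importance = TRUE, both measures can be read from the matrix returned by importance().
> rf.imp <- randomForest(target ~ ., data = sample, importance = TRUE)   #importance = TRUE fills in Mean Decrease Accuracy
> imp <- importance(rf.imp)                                              #columns MeanDecreaseAccuracy and MeanDecreaseGini
> imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]             #rank predictors by Mean Decrease Gini
> varImpPlot(rf.imp)                                                     #plots both measures side by side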
Random Forest
> install.packages("randomForest", dependencies = TRUE)
> library(randomForest)
For a binary target variable, convert it to a factor:
> sample$target <- as.factor(sample$target)
#To apply random forest
> model1 <- randomForest(target ~ ., data = sample)
Random Forest
> model1
Call:
 randomForest(formula = target ~ ., data = sample)
Type of random forest: Classification
Number of trees: 500
No. of variables tried at each split: 4 (default: sqrt(k) for classification and k/3 for regression, where k is the number of predictors)
OOB estimate of error rate
(Each tree is built on a bootstrap sample of roughly two-thirds of the data; the remaining out-of-bag observations are used for validation.)
            Predicted
             0     1
Class 0     93     1
Class 1     28     6
Random Forest
Sources of randomness:
1) The bootstrap sample of observations used to grow each tree
2) The random subset of variables tried at each split (mtry)
> Accuracy = (TP + TN) / total no. of obs. = (93 + 6) / 128 = 77.34%
OOB estimate of error rate: 100 - 77.34 = 22.66%
The class error for class 1 is much higher than for class 0.
The reason could be the class imbalance problem (only 34 observations of class 1 versus 94 of class 0).
             0     1    Class error
Class 0     93     1    0.0106
Class 1     28     6    0.823
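Since class imbalance is named as the likely cause, one common remedy is a stratified, balanced bootstrap via the strata and sampsize arguments of randomForest. The sketch below is illustrative only; the per-class sample size of 30 is an assumption, not a value from the slides.
> set.seed(100)
#Draw a balanced bootstrap sample per tree: 30 observations from each class
#(30 is illustrative; it must not exceed the size of the minority class)
> rf.bal <- randomForest(x = sample[,-1], y = sample$target, ntree = 500,
                         strata = sample$target, sampsize = c(30, 30))
> rf.bal$confusion   #check whether the class-1 error improves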
Random Forest
#Identify the best value of mtry using tuneRF
> mtry <- tuneRF(sample[,-1], sample[,1], mtryStart = 1, stepFactor = 2,
                 ntreeTry = 500, improve = 0.01)
mtryStart: starting number of randomly selected features/predictors tried at each split
ntreeTry: number of trees to grow at each step of the search
stepFactor: the factor by which mtry is inflated (or deflated) at each iteration
improve: the (relative) improvement in OOB error that must be achieved for the search to continue
#For more information on the tuneRF parameters
> ?tuneRF
#Save the mtry value with the minimum OOB error
> best.mtry <- mtry[mtry[,2] == min(mtry[,2]), 1]
> set.seed(100)
> rf <- randomForest(target ~ ., data = sample, mtry = best.mtry, ntree = 500,
                     importance = TRUE)
importance: if TRUE, the importance of the predictors is assessed, which we use for feature selection.
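For reference, tuneRF returns (and by default plots) a small matrix of the mtry values it tried and their OOB errors; a quick way to inspect it:
> mtry   #first column: mtry value tried, second column: OOB error
> plot(mtry[,1], mtry[,2], type = "b", xlab = "mtry", ylab = "OOB error")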
Validation Metric
For Classification:
1. OOB Error Rate
2. Confusion Matrix
3. ROC Curve
4. AUC
For Regression:
1. Mean of squared residuals (MSE) / % variance explained
> set.seed(100)
> index <- sample(nrow(data), 0.70 * nrow(data), replace = F)
> data_train <- data[index, ]
> data_test <- data[-index, ]
#Build the model
> model1.rf <- randomForest(target ~ ., data = data_train, mtry = 6, ntree = 500,
                            keep.forest = TRUE, importance = TRUE)
keep.forest: if set to FALSE, the forest is not retained in the output object.
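Before moving to the hold-out test set, the OOB-based metrics listed above can be read directly from the fitted object; a minimal sketch:
> model1.rf$confusion          #OOB confusion matrix with a class.error column
> tail(model1.rf$err.rate, 1)  #OOB (and per-class) error rate after the last tree
> plot(model1.rf)              #OOB error as a function of the number of trees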
Validation Metric
> library(ROCR)
> predicted <- predict(model1.rf, type = "prob", newdata = data_test)
> p <- prediction(predicted[,2], data_test$target)
> perf <- performance(p, "tpr", "fpr")
> plot(perf, col = "red")
#Plot the diagonal reference line
> abline(0, 1, lty = 8, col = "grey")
The closer the curve follows the left border and the top border of the ROC space, the more accurate the model is.
> auc1 <- performance(p, "auc")
> auc1 <- unlist(slot(auc1, "y.values"))
> auc1   #now choose the model with the highest AUC value
#Alternative way
> auc <- as.numeric(performance(p, "auc")@y.values)
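As an alternative sketch (assuming the pROC package, which the slides do not mention), the AUC can also be computed directly from the test-set class probabilities:
> library(pROC)
> roc_obj <- roc(response = data_test$target, predictor = predicted[,2])
> auc(roc_obj)    #area under the ROC curve
> plot(roc_obj)   #ROC curve (note: pROC puts specificity on the x-axis by default)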
Purpose of Random Forest
The two important purposes of Random Forest are
1) Classification or Regression
2) Feature Selection
Feature Selection (Importance of Variables)
 Using the importance() function we can get the Mean Decrease Gini which, in simple words, is a measure of node impurity in tree-based classification.
 A low Gini score means that the particular feature/predictor variable plays a greater role in splitting the data into the classes.
> importance(model1.rf)
> varImpPlot(model1.rf)   #to plot
However, Random Forest doesn't provide an explicit ranking of the features/variables. For explicit ranking we have other packages, called Boruta and varSelRF.
Explicit ranking using Boruta
Using the Boruta algorithm:
- It's a wrapper around random forest. The only difference is that it adds more randomness to the data by creating randomized copies of all the features (also known as shadow features) and applies a feature-importance measure, i.e. Mean Decrease Accuracy, where higher means more important.
- Every feature is compared with the randomized (shadow) copies of the features to check whether it has a higher importance, i.e. a higher Z-score.
- Finally, it confirms only those features whose importance is higher than that of the randomized (shadow) copies of the features.
> attStats(model.boruta)
• gives the Z-scores, i.e. the mean importance of each individual feature.
Explicit ranking using Boruta
> library(Boruta)
> set.seed(100)
> model.boruta <- Boruta(target ~ ., data = sample, doTrace = 2, ntree = 500)
> model.boruta
> getSelectedAttributes(model.boruta)
> feat.stats <- attStats(model.boruta)   #gives the Z-scores, i.e. the mean importance of each feature
> feat.stats <- data.frame(feat.stats)
> plot(model.boruta)   #plot the features
Red-marked features are unimportant, yellow-marked features are tentative, and green-marked features are important.
> getSelectedAttributes(model.boruta, withTentative = FALSE)
> getSelectedAttributes(model.boruta, withTentative = TRUE)
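As a possible follow-up (a sketch using another helper from the Boruta package, not shown in the slides), the confirmed features can be turned straight into a model formula and refitted:
> conf.formula <- getConfirmedFormula(model.boruta)   #formula containing only the confirmed features
> rf.confirmed <- randomForest(conf.formula, data = sample, ntree = 500)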
Explicit ranking using Boruta
 Tentative features are those where the Boruta algorithm is unable to make a decision. Hence we can include them in our model and later remove them as unimportant, for example based on p-values.
However, we can also try
> TentativeRoughFix(model.boruta)   #performs an additional test to judge the tentative features
Alternatively, we can use withTentative = TRUE/FALSE:
> getSelectedAttributes(model.boruta, withTentative = FALSE)
> getSelectedAttributes(model.boruta, withTentative = TRUE)
Explicit Ranking using varSelRF
varSelRF is a variable selection method for random forests that uses both backwards variable elimination (for the selection of small sets of non-redundant variables) and selection based on the importance spectrum (somewhat similar to scree plots; for the selection of large sets of potentially highly correlated variables).
Steps:
> It starts with all the features/variables and gets the Out-of-Bag (OOB) samples.
> Then it sequentially reduces the number of variables, comparing the OOB error at each step.
> The group of variables with the least OOB error is selected.
This is similar to backward elimination in regression for selecting the important variables.
#Feature selection using varSelRF
> library(varSelRF)
> set.seed(100)
> model.vrf <- varSelRF(sample[,-1], sample$target, vars.drop.frac = 0.2)
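To inspect the result (a sketch based on the components a varSelRF fit exposes):
> model.vrf$selected.vars   #names of the predictors retained after backward elimination
> model.vrf                 #printing the object summarizes the selection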
Next
Let's perform all of this in our lab video.