Machine Learning - V
Random Forest - III
Feature Selection & Variable Importance
Rupak Roy
Random forest offers an important feature-selection capability: an explicit/implicit ranking of the predictor variables, which helps us quickly understand which features are and are not affecting the final result/model.
The splitting criterion is again based on node impurity, as in decision trees: Gini/information gain for classification and variance for regression.
Both regression and classification random forests also make use of out-of-bag (OOB) samples.
Thus, while training a tree, we can compute how much each feature decreases the weighted impurity of that tree. For the forest as a whole, the impurity decreases contributed by each feature can be averaged across the trees to rank the features.
Feature Selection & Variable Importance
There are two ways Random Forest measures variable importance:
1) Mean Decrease Gini 2) Mean Decrease Accuracy
1) Mean Decrease Gini is the average (mean) of a variable's total decrease in node impurity across the trees; in simple words, it is a measure of node impurity in tree-based classification. A low Gini score means that the particular feature/predictor variable plays a greater role in splitting the data into the classes.
2) Mean Decrease Accuracy is the permutation-based measure reported by R's randomForest package. It measures how much the predictive accuracy (on the OOB samples) drops when the variable's values are permuted; that loss of accuracy is the Mean Decrease in Accuracy.
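As a quick reference (a sketch added here, not part of the original slides, using the sample data and target introduced on the following slides): once a forest is fitted with importance = TRUE, both measures can be read from the matrix returned by importance().
> rf.imp <- randomForest(target ~ ., data = sample, importance = TRUE)   #importance = TRUE fills in Mean Decrease Accuracy
> imp <- importance(rf.imp)                                              #columns MeanDecreaseAccuracy and MeanDecreaseGini
> imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]             #rank predictors by Mean Decrease Gini
> varImpPlot(rf.imp)                                                     #plots both measures side by side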
Random Forest
> install.packages("randomForest", dependencies = TRUE)
> library(randomForest)
For a binary target variable, convert it to a factor:
> sample$target <- as.factor(sample$target)
#To apply random forest
> model1 <- randomForest(target ~ ., data = sample)
Random Forest
> model1
Call:
 randomForest(formula = target ~ ., data = sample)
Type of random forest: Classification
Number of trees: 500
No. of variables tried at each split: 4 (default: sqrt(k) for classification and k/3 for regression, where k is the number of predictors)
OOB estimate of error rate
(Each tree is built on a bootstrap sample of roughly two-thirds of the data; the remaining out-of-bag observations are used for validation.)
            Predicted
             0     1
Class 0     93     1
Class 1     28     6
Random Forest
Sources of randomness:
1) The bootstrap sample of observations used to grow each tree
2) The random subset of variables tried at each split (mtry)
> Accuracy = (TP + TN) / total no. of obs. = (93 + 6) / 128 = 77.34%
OOB estimate of error rate: 100 - 77.34 = 22.66%
The class error for class 1 is much higher than for class 0.
The reason could be the class imbalance problem (only 34 observations of class 1 versus 94 of class 0).
             0     1    Class error
Class 0     93     1    0.0106
Class 1     28     6    0.823
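Since class imbalance is named as the likely cause, one common remedy is a stratified, balanced bootstrap via the strata and sampsize arguments of randomForest. The sketch below is illustrative only; the per-class sample size of 30 is an assumption, not a value from the slides.
> set.seed(100)
#Draw a balanced bootstrap sample per tree: 30 observations from each class
#(30 is illustrative; it must not exceed the size of the minority class)
> rf.bal <- randomForest(x = sample[,-1], y = sample$target, ntree = 500,
                         strata = sample$target, sampsize = c(30, 30))
> rf.bal$confusion   #check whether the class-1 error improves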
Random Forest
#Identify the best value of mtry using tuneRF
> mtry <- tuneRF(sample[,-1], sample[,1], mtryStart = 1, stepFactor = 2,
                 ntreeTry = 500, improve = 0.01)
mtryStart: starting number of randomly selected features/predictors tried at each split
ntreeTry: number of trees to grow at each step of the search
stepFactor: the factor by which mtry is inflated (or deflated) at each iteration
improve: the (relative) improvement in OOB error that must be achieved for the search to continue
#For more information on the tuneRF parameters
> ?tuneRF
#Save the mtry value with the minimum OOB error
> best.mtry <- mtry[mtry[,2] == min(mtry[,2]), 1]
> set.seed(100)
> rf <- randomForest(target ~ ., data = sample, mtry = best.mtry, ntree = 500,
                     importance = TRUE)
importance: if TRUE, the importance of the predictors is assessed, which we use for feature selection.
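For reference, tuneRF returns (and by default plots) a small matrix of the mtry values it tried and their OOB errors; a quick way to inspect it:
> mtry   #first column: mtry value tried, second column: OOB error
> plot(mtry[,1], mtry[,2], type = "b", xlab = "mtry", ylab = "OOB error")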
Validation Metric
For Classification:
1. OOB Error Rate
2. Confusion Matrix
3. ROC Curve
4. AUC
For Regression:
1. Mean of squared residuals (MSE) / % variance explained
> set.seed(100)
> index <- sample(nrow(data), 0.70 * nrow(data), replace = F)
> data_train <- data[index, ]
> data_test <- data[-index, ]
#Build the model
> model1.rf <- randomForest(target ~ ., data = data_train, mtry = 6, ntree = 500,
                            keep.forest = TRUE, importance = TRUE)
keep.forest: if set to FALSE, the forest is not retained in the output object.
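Before moving to the hold-out test set, the OOB-based metrics listed above can be read directly from the fitted object; a minimal sketch:
> model1.rf$confusion          #OOB confusion matrix with a class.error column
> tail(model1.rf$err.rate, 1)  #OOB (and per-class) error rate after the last tree
> plot(model1.rf)              #OOB error as a function of the number of trees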
Validation Metric
> library(ROCR)
> predicted <- predict(model1.rf, type = "prob", newdata = data_test)
> p <- prediction(predicted[,2], data_test$target)
> perf <- performance(p, "tpr", "fpr")
> plot(perf, col = "red")
#Plot the diagonal reference line
> abline(0, 1, lty = 8, col = "grey")
The closer the curve follows the left border and the top border of the ROC space, the more accurate the model is.
> auc1 <- performance(p, "auc")
> auc1 <- unlist(slot(auc1, "y.values"))
> auc1   #now choose the model with the highest AUC value
#Alternative way
> auc <- as.numeric(performance(p, "auc")@y.values)
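As an alternative sketch (assuming the pROC package, which the slides do not mention), the AUC can also be computed directly from the test-set class probabilities:
> library(pROC)
> roc_obj <- roc(response = data_test$target, predictor = predicted[,2])
> auc(roc_obj)    #area under the ROC curve
> plot(roc_obj)   #ROC curve (note: pROC puts specificity on the x-axis by default)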
Purpose of Random Forest
The two important purposes of Random Forest are
1) Classification or Regression
2) Feature Selection
Feature Selection (Importance of Variables)
 Using the importance() function we can get the Mean Decrease Gini which, in simple words, is a measure of node impurity in tree-based classification.
 A low Gini score means that the particular feature/predictor variable plays a greater role in splitting the data into the classes.
> importance(model1.rf)
> varImpPlot(model1.rf)   #to plot
However, Random Forest doesn't provide an explicit ranking of the features/variables. For explicit ranking we have other packages, called Boruta and varSelRF.
Explicit ranking using Boruta
Using the Boruta algorithm:
- It's a wrapper around random forest. The only difference is that it adds more randomness to the data by creating randomized copies of all the features (also known as shadow features) and applies a feature-importance measure, i.e. Mean Decrease Accuracy, where higher means more important.
- Every feature is compared with the randomized (shadow) copies of the features to check whether it has a higher importance, i.e. a higher Z-score.
- Finally, it confirms only those features whose importance is higher than that of the randomized (shadow) copies of the features.
> attStats(model.boruta)
• gives the Z-scores, i.e. the mean importance of each individual feature.
Explicit ranking using Boruta
> library(Boruta)
> set.seed(100)
> model.boruta <- Boruta(target ~ ., data = sample, doTrace = 2, ntree = 500)
> model.boruta
> getSelectedAttributes(model.boruta)
> feat.stats <- attStats(model.boruta)   #gives the Z-scores, i.e. the mean importance of each feature
> feat.stats <- data.frame(feat.stats)
> plot(model.boruta)   #plot the features
Red-marked features are unimportant, yellow-marked features are tentative, and green-marked features are important.
> getSelectedAttributes(model.boruta, withTentative = FALSE)
> getSelectedAttributes(model.boruta, withTentative = TRUE)
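As a possible follow-up (a sketch using another helper from the Boruta package, not shown in the slides), the confirmed features can be turned straight into a model formula and refitted:
> conf.formula <- getConfirmedFormula(model.boruta)   #formula containing only the confirmed features
> rf.confirmed <- randomForest(conf.formula, data = sample, ntree = 500)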
Explicit ranking using Boruta
 Tentative features are those where the Boruta algorithm is unable to make a decision. Hence we can include them in our model and later remove them as unimportant, for example based on p-values.
However, we can also try
> TentativeRoughFix(model.boruta)   #performs an additional test to judge the tentative features
Alternatively, we can use withTentative = TRUE/FALSE:
> getSelectedAttributes(model.boruta, withTentative = FALSE)
> getSelectedAttributes(model.boruta, withTentative = TRUE)
Explicit Ranking using varSelRF
varSelRF is a variable selection method for random forests that uses both backwards variable elimination (for the selection of small sets of non-redundant variables) and selection based on the importance spectrum (somewhat similar to scree plots; for the selection of large sets of potentially highly correlated variables).
Steps:
> It starts with all the features/variables and gets the Out-of-Bag (OOB) samples.
> Then it sequentially reduces the number of variables, comparing the OOB error at each step.
> The group of variables with the least OOB error is selected.
This is similar to backward elimination in regression for selecting the important variables.
#Feature selection using varSelRF
> library(varSelRF)
> set.seed(100)
> model.vrf <- varSelRF(sample[,-1], sample$target, vars.drop.frac = 0.2)
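To inspect the result (a sketch based on the components a varSelRF fit exposes):
> model.vrf$selected.vars   #names of the predictors retained after backward elimination
> model.vrf                 #printing the object summarizes the selection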
Next
Let's perform all of this in our lab video.