Solving the Multicollinearity Problem with Decision Trees
Last Updated: 26 Feb, 2024
Multicollinearity is a common issue in data science, affecting various types of models, including decision trees. This article explores what multicollinearity is, why it's problematic for decision trees, and how to address it.
Multicollinearity in Decision Trees
What is Multicollinearity?
Multicollinearity is a problem in statistical analysis that arises when two or more independent variables in a regression model are highly correlated with each other. This correlation can cause issues in model estimation and interpretation.
What are Decision Trees?
A decision tree is a flowchart-like tree structure in which internal nodes represent features, branches represent decision rules, and leaf nodes represent outcomes. It is a flexible supervised machine-learning method that can be applied to both regression and classification problems, and it is one of the most widely used algorithms in practice.
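For illustration, the following minimal sketch fits a shallow tree on scikit-learn's built-in Iris dataset and prints its flowchart-like structure; the dataset and the max_depth=2 setting are only illustrative choices.
Python3
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow decision tree on the Iris dataset
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Internal nodes test a feature, branches encode the rules,
# and leaves hold the predicted outcome
print(export_text(clf, feature_names=list(iris.feature_names)))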
Multicollinearity in Decision Trees:
While multicollinearity is a well-known issue in linear regression models, its implications for decision trees have received less attention. This is primarily because decision trees, in contrast to linear regression models, do not assume any particular relationship between the independent variables. As a result, decision trees can generate accurate predictions even when some variables are highly correlated.
In decision trees, multicollinearity is handled implicitly through the feature selection process.
- Feature Importance: Decision trees evaluate the importance of features based on how well they split the data. If two features are highly correlated (multicollinear), they will essentially provide redundant information for splitting the data. In such cases, the decision tree will select one of the correlated features for splitting and may not consider the other one, as including both would not provide additional benefit in reducing impurity.
- Splitting Criteria: Decision trees use splitting criteria such as information gain or Gini impurity to determine the best feature to split at each node. If two features are highly correlated, they are likely to have similar information gain or impurity reduction. In such cases, the decision tree may choose either feature for splitting, but not both.
- Tree Structure: As the decision tree grows, it naturally filters out redundant or correlated features. If one feature has already been used for splitting at an earlier node and has effectively reduced impurity, the decision tree is less likely to select a correlated feature for splitting at subsequent nodes, as it would not provide additional information gain.
However, it's important to note that decision trees are sensitive to small changes in the dataset, and multicollinearity can still impact their performance. Ensemble methods like random forests are often used to mitigate this sensitivity by building multiple trees on different subsets of the data and averaging the results.
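The feature-importance behaviour described above can be seen in a small experiment. The sketch below is a minimal illustration (the names x1 and x2 are assumptions, with x2 a noisy copy of x1): because the two features carry essentially the same information, the second one adds little extra impurity reduction, and in many runs most of the importance lands on just one of them.
Python3
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two highly correlated (illustrative) features: x2 is a noisy copy of x1
rng = np.random.RandomState(0)
x1 = rng.rand(300)
x2 = x1 + rng.normal(0, 0.1, 300)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(0, 0.1, 300)

print("Correlation between x1 and x2:", np.corrcoef(x1, x2)[0, 1])

# The tree picks whichever feature gives the better split at each node;
# the redundant copy contributes little additional impurity reduction
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print("Feature importances [x1, x2]:", tree.feature_importances_)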
Detecting Multicollinearity
Detecting multicollinearity is an important step in ensuring the reliability of your regression model. Here are two common methods for detecting multicollinearity:
- Correlation Matrix:
- Calculate the correlation coefficient between each pair of predictor variables.
- Values close to 1 or -1 indicate a high degree of correlation.
- Identify pairs of variables with high correlation coefficients (e.g., greater than 0.7 or less than -0.7).
- Variance Inflation Factor (VIF):
- VIF measures how much the variance of an estimated regression coefficient is inflated by multicollinearity; for predictor j it equals 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others (see the sketch after this list).
- Calculate the VIF for each predictor variable.
- VIF values greater than 5 or 10 are often used as thresholds to indicate multicollinearity.
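To make the VIF definition above concrete, here is a minimal sketch that computes VIF by hand as 1 / (1 - R^2); the helper name vif_manual and the toy features x1, x2, x3 are illustrative assumptions, not part of any library API.
Python3
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_manual(X, j):
    """VIF for column j: regress X[:, j] on the remaining columns,
    then apply VIF = 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

rng = np.random.RandomState(0)
x1 = rng.rand(100)
x2 = 0.9 * x1 + rng.normal(0, 0.05, 100)  # strongly correlated with x1
x3 = rng.rand(100)                        # independent feature
X = np.column_stack([x1, x2, x3])

# Expect large VIFs for the correlated pair and a VIF near 1 for x3
for j in range(X.shape[1]):
    print(f"VIF for column {j}: {vif_manual(X, j):.2f}")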
Python Implementation to Detect Multicollinearity
Detecting multicollinearity can be done using the correlation matrix and VIF (Variance Inflation Factor) in Python.
Python3
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Sample dataset
data = {
'X1': [1, 2, 3, 4, 5],
'X2': [2, 4, 6, 8, 10],
'X3': [3, 6, 9, 12, 15]
}
df = pd.DataFrame(data)
# Calculate the correlation matrix
correlation_matrix = df.corr()
# Display the correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)
# Calculate VIF for each feature
vif = pd.DataFrame()
vif["Feature"] = df.columns
vif["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
# Display VIF
print("\nVariance Inflation Factor (VIF):")
print(vif)
Output:
Correlation Matrix:
     X1   X2   X3
X1  1.0  1.0  1.0
X2  1.0  1.0  1.0
X3  1.0  1.0  1.0

Variance Inflation Factor (VIF):
  Feature  VIF
0      X1  inf
1      X2  inf
2      X3  inf
The correlation matrix and VIF values above show that all three variables (X1, X2, X3) are perfectly correlated with each other, resulting in infinite VIF values.
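When perfect collinearity like this appears, a common remedy for regression models is to drop the redundant columns before fitting. The sketch below reuses the same toy data and removes any column whose absolute correlation with an earlier column exceeds 0.9; the 0.9 threshold is an illustrative choice, not a fixed rule.
Python3
import pandas as pd

data = {
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 4, 6, 8, 10],
    'X3': [3, 6, 9, 12, 15]
}
df = pd.DataFrame(data)

# Drop every column that is highly correlated (> 0.9) with an earlier column
corr = df.corr().abs()
to_drop = [col for i, col in enumerate(corr.columns)
           if any(corr.iloc[:i][col] > 0.9)]
df_reduced = df.drop(columns=to_drop)
print("Dropped:", to_drop)
print(df_reduced)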
Stepwise Guide: How Decision Trees Handle Multicollinearity
- Generating Synthetic Data:
  - We use np.random.rand(100, 1) * 10 to generate 100 random numbers between 0 and 10, which serves as our feature X.
  - We use np.sin(X) to create the target variable y as a sine wave of the feature X.
  - We add some random noise using np.random.normal(0, 0.1, size=(100, 1)) to make the relationship more realistic.
- Splitting the Dataset:
  - We split the dataset into training and test sets using train_test_split, with 80% of the data used for training and 20% for testing.
- Linear Regression Model:
  - We fit a Linear Regression model (lr.fit(X_train, y_train)) to the training data and make predictions on the test data (lr.predict(X_test)).
  - We calculate the Mean Squared Error (MSE) between the predicted and actual values using mean_squared_error.
- Decision Tree Regression Model:
  - We fit a Decision Tree Regression model (dtr.fit(X_train, y_train)) to the training data and make predictions on the test data (dtr.predict(X_test)).
  - We calculate the Mean Squared Error (MSE) between the predicted and actual values using mean_squared_error.
- Printing Results:
  - We print the MSE for both the Linear Regression and Decision Tree Regression models to compare their performance.
Python3
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# Create a synthetic dataset with a non-linear relationship
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = np.sin(X) + np.random.normal(0, 0.1, size=(100, 1))
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Calculate the correlation matrix between X and y
corr_matrix = np.corrcoef(X.squeeze(), y.squeeze())
# Display the correlation matrix
print("Correlation Matrix between X and y:")
print(corr_matrix)
Output:
Correlation Matrix between X and y:
[[ 1. -0.94444709]
[-0.94444709 1. ]]
Fitting Linear Regression and Decision Tree Models for Comparison
Python3
# Fit a Linear Regression model
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_pred)
# Fit a Decision Tree Regression model
dtr = DecisionTreeRegressor(random_state=42)
dtr.fit(X_train, y_train)
dtr_pred = dtr.predict(X_test)
dtr_mse = mean_squared_error(y_test, dtr_pred)
print("Linear Regression MSE:", lr_mse)
print("Decision Tree Regression MSE:", dtr_mse)
Output:
Linear Regression MSE: 0.4352358901582881
Decision Tree Regression MSE: 0.01578187903036423
A lower MSE indicates a better fit of the model to the data. In this example, the Decision Tree Regression model has a much lower MSE than the Linear Regression model, which shows that the decision tree captures the non-linear (sine-shaped) relationship between X and y far better than a straight-line fit can.
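As noted earlier, ensemble methods such as random forests are often used to reduce a single tree's sensitivity to small changes in the data. The minimal sketch below extends the comparison by fitting a RandomForestRegressor on the same split; it assumes X_train, X_test, y_train, and y_test from the previous snippet are still in scope, and the exact MSE will vary with the data and settings.
Python3
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Fit a Random Forest on the same train/test split as above
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train.ravel())
rf_pred = rf.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_pred)
print("Random Forest Regression MSE:", rf_mse)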