Muhammad Ali
Lecturer in Statistics
GPGC Mardan.
M.sc(Peshawar)
M.phil(AIOU Islamabad)
Multicollinearity
Definition of Multicollinearity
One of the assumptions of the classical linear regression model (CLRM) is that there is no
exact linear relationship among the regressors (independent variables) included in the regression
model. However, in most practical situations the predictor variables are nearly perfectly
related. When there is a near-linear dependency among the predictor variables, the
problem of multicollinearity is said to exist. The term multicollinearity is used to denote
the presence of a linear relationship among some or all of the predictor variables of a regression
model.
Example on Multicollinearity
Consider an economic model designed to explain the weekly sales of a specific item by a
supermarket. The factors that affect the sales of the item include, among other things, the price of the
product, the prices of competitive and complementary goods, and the extent of marketing or
advertising efforts devoted to the product. When the experimental design is determined by the
uncontrolled operation of the market, one group of variables that may move together in a systematic
way is the prices of the product and its competitors. In this situation, it would not be a surprise
to find in a sample of data that when one price goes up all the prices go up, and when one
price falls all of them fall.
These systematic, collinear relationships between the prices are the kind that create a potential problem,
and this is how the problem of multicollinearity arises; it has serious effects on the
regression, as discussed below.
Nature of Multicollinearity
Consider the general linear regression model in matrix form:
Y = Xβ + ε

where Y is an (n × 1) vector of responses, X is an (n × (p+1)) matrix of predictors, β is a ((p+1) × 1)
vector of regression coefficients, and ε is an (n × 1) vector of disturbances such that
ε ~ NID(0, σ²I).
Let the jth column of the matrix X be denoted by xj. Exact or perfect collinearity exists when there is an
exact linear relationship among the explanatory variables (the columns of X), that is, when one or more
relations of the form

c1x1 + c2x2 + c3x3 + ... + cpxp = 0

exist, where the constants cj are not all zero. For example, if x2 = 2x3 + x4, then the variables x2, x3
and x4 are exactly linearly related.
Now consider the case where the X variables are intercorrelated, but not perfectly:
c1x1 + c2x2 + c3x3 + ... + cpxp + ε = 0
where ε is a stochastic error term.
To see the difference between exact and non-exact multicollinearity, consider the following numerical
example.
X1      X2      X2*
10      20      22
20      40      40
30      60      67
40      80      89
50      100     102
It is clear from the above table that X2 = 2X1, so there is perfect collinearity between X1 and X2.
The variable X2* is created from X2 by simply adding the random numbers 2, 0, 7, 9, and 2 taken
from a random number table. There is now no longer perfect multicollinearity between X1
and X2*; however, the two variables are highly correlated.
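The contrast between exact and near collinearity can be checked numerically. The following is a minimal sketch, assuming NumPy is available; it rebuilds the small table above and compares the pairwise correlations.

```python
import numpy as np

# Data from the table above
x1 = np.array([10, 20, 30, 40, 50], dtype=float)
x2 = 2 * x1                                  # exact collinearity: X2 = 2*X1
x2_star = x2 + np.array([2, 0, 7, 9, 2])     # X2* = X2 plus small random numbers

print(np.corrcoef(x1, x2)[0, 1])       # exactly 1.0 -> perfect collinearity
print(np.corrcoef(x1, x2_star)[0, 1])  # slightly below 1 -> near (imperfect) collinearity
```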
Note that multicollinearity is a question of degree and not of kind. The meaningful distinction is not
between the presence and the absence of multicollinearity, but between its various degrees.
Since multicollinearity refers to the condition of the explanatory variables, which are assumed to be
nonstochastic, it is a feature of the sample and not of the population.
Sources of multicollinearity
There are four primary sources of multicollinearity:
1. The data collection method employed
2. Constraints on the model or in the population
3. Model specification
4. An over-defined model
1. The data collection method
The data collection method can lead to multicollinearity problems when the researcher
samples only a subset of the region of the regressors.
2. Constraints on the model or in the population
Constraints on the model or in the population being sampled can cause multicollinearity. For example,
suppose that an electric utility is investigating the effect of family income (x1) and house size (x2) on
residential electricity consumption. Here a physical constraint in the population has caused
the phenomenon, namely that families with higher incomes generally have larger homes than families
with lower incomes. When physical constraints such as this are present, multicollinearity will exist
regardless of the sampling method employed.
3. Model Specification
We often encounter situations in which two or more regressors are nearly linearly dependent, and
retaining all of these regressors may contribute to multicollinearity.
4. An over-defined model
An over-defined model has more regressor variables than observations. Such models are
sometimes encountered in medical and behavioral research, where there may be only a small
number of subjects (sample units) available and information is collected on a large number of
regressors for each subject. The usual approach to dealing with multicollinearity in this context
is to eliminate some of the regressor variables from consideration.
Effects of Multicollinearity
The presence of multicollinearity has a number of potentially serious effects on the least-
squares estimates of the regression coefficients. Some of these effects may be easily
demonstrated.
1. The least-squares estimates are indeterminate in the case of perfect multicollinearity:
Consider a model with only two predictor variables, say X1 and X2. The regression model
can then be written as:

Yi = β0 + β1X1i + β2X2i + εi    - - - - - - (i)
Summing equation (i) over the n observations and dividing by n gives:

(1/n)ΣYi = β0 + β1(1/n)ΣX1i + β2(1/n)ΣX2i + (1/n)Σεi

Ȳ = β0 + β1X̄1 + β2X̄2 + ε̄    - - - - - - (ii)
Subtracting equation (ii) from equation (i):

Yi − Ȳ = β1(X1i − X̄1) + β2(X2i − X̄2) + (εi − ε̄)

yi = β1x1i + β2x2i + εi    - - - - - - (iii)
Equation (iii) is the deviation form.
The least-squares normal equations are given by:

(X′X)β̂ = X′Y

so that

β̂ = (X′X)⁻¹X′Y

In the case of only two explanatory variables in deviation form, X = [x1  x2], that is,

X = [ x11  x21 ]
    [ x12  x22 ]
    [  ...     ]
    [ x1n  x2n ]

so that

X′X = [ Σx1i²      Σx1i x2i ]
      [ Σx1i x2i   Σx2i²    ]

To obtain the β̂ vector we need (X′X)⁻¹.
Now

(X′X)⁻¹ = [1 / (Σx1i² Σx2i² − (Σx1i x2i)²)] [ Σx2i²      −Σx1i x2i ]
                                             [ −Σx1i x2i   Σx1i²    ]

and therefore

β̂ = (X′X)⁻¹X′Y
   = [1 / (Σx1i² Σx2i² − (Σx1i x2i)²)] [ Σx2i²      −Σx1i x2i ] [ Σx1i yi ]
                                        [ −Σx1i x2i   Σx1i²    ] [ Σx2i yi ]

Assume that x2 = λx1, where λ is a nonzero constant. Substituting this into the above equation gives:

β̂ = [1 / (λ²(Σx1i²)² − λ²(Σx1i²)²)] [ λ²Σx1i²   −λΣx1i² ] [  Σx1i yi ]
                                      [ −λΣx1i²    Σx1i²  ] [ λΣx1i yi ]

   = (1/0) [ λ²Σx1i² Σx1i yi − λ²Σx1i² Σx1i yi ]
           [ −λΣx1i² Σx1i yi + λΣx1i² Σx1i yi  ]
so that

[ β̂1 ]          [ 0 ]   [ 0/0 ]
[ β̂2 ]  = (1/0) [ 0 ] = [ 0/0 ]

Thus, in the case of perfect multicollinearity, the least-squares estimates are indeterminate.
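This result is easy to verify numerically: with perfectly collinear columns, the matrix X′X is singular, so the normal equations have no unique solution. The following is a minimal sketch assuming NumPy, using the illustrative X1 and X2 = 2X1 values from the earlier table.

```python
import numpy as np

x1 = np.array([10., 20., 30., 40., 50.])
x2 = 2 * x1                              # perfect collinearity: x2 = 2*x1

# Deviation (mean-centred) form, as in the derivation above
X = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
XtX = X.T @ X

print(np.linalg.det(XtX))                # 0 (up to rounding): X'X is singular

try:
    np.linalg.inv(XtX)                   # inverting a singular matrix fails
except np.linalg.LinAlgError as err:
    print("X'X cannot be inverted:", err)
```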
2. The variances and covariances of the least-squares estimates are infinite if there is a perfect
linear relationship between the predictor variables.
The variance-covariance matrix of the least-squares estimator β̂ is given by:
Var(β̂) = σ²(X′X)⁻¹

        = [σ² / (Σx1i² Σx2i² − (Σx1i x2i)²)] [ Σx2i²      −Σx1i x2i ]
                                              [ −Σx1i x2i   Σx1i²    ]

Substituting x2 = λx1 as before gives

        = [σ² / (λ²(Σx1i²)² − λ²(Σx1i²)²)] [ λ²Σx1i²   −λΣx1i² ]
                                            [ −λΣx1i²    Σx1i²  ]

        = (σ²/0) [ λ²Σx1i²   −λΣx1i² ]
                 [ −λΣx1i²    Σx1i²  ]

so that

[ var(β̂1)       cov(β̂1, β̂2) ]   [ ∞  ∞ ]
[ cov(β̂2, β̂1)   var(β̂2)     ] = [ ∞  ∞ ]

which shows that, in the case of perfect multicollinearity, the variances and covariances of the OLS
estimates are infinite.
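The same formula explains why near (rather than perfect) collinearity produces very large, though finite, variances: as x2 approaches λx1 the determinant in the denominator approaches zero. The following is a small sketch of this behaviour, assuming NumPy and taking σ² = 1 purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
sigma2 = 1.0                                     # assumed error variance, for illustration

for noise in (1.0, 0.1, 0.01, 0.001):
    x2 = 2 * x1 + noise * rng.normal(size=n)     # x2 -> 2*x1 as the noise shrinks
    X = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)   # Var(beta-hat) = sigma^2 (X'X)^-1
    print(noise, np.diag(cov_beta))              # the variances blow up as noise -> 0
```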
Consequences of multicollinearity
In the case of near or high multicollinearity, one is likely to have the following consequences:
1. Although the estimators are still BLUE (best linear unbiased estimators), they have large variances
and covariances.
2. The confidence intervals tend to be much wider, leading to acceptance of the null hypothesis
because of the large variances.
3. The t ratios tend to become insignificant.
4. R², the overall measure of goodness of fit, can nevertheless be very high.
5. Small changes in the data can produce large changes in the estimates and their standard errors.
Methods of detection of multicollinearity
The following are common methods of detecting multicollinearity:
1. High R² but few significant t ratios
If R² is high, say in excess of 0.8, the F test will in most cases reject the hypothesis that the
partial slope coefficients are simultaneously equal to zero, but the individual t tests will show
that none, or very few, of the partial slope coefficients are statistically different from zero.
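This symptom is easy to reproduce. The sketch below assumes NumPy and the statsmodels package; the simulated data and variable names are purely illustrative. With two nearly collinear regressors, the overall R² and F statistic are large while the individual t ratios are weak.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)            # nearly collinear with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.rsquared)     # high overall fit
print(fit.fvalue)       # large F statistic (joint significance)
print(fit.tvalues)      # individual t ratios for x1 and x2 are small/unstable
```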
2. High pair-wise correlations among regressors
Another suggested rule of thumb is that if the pair-wise or zero-order correlation coefficient
between two regressors is high, say in excess of 0.8, then multicollinearity is a serious
problem.
3. Examination of the correlation matrix
A very simple method to detect multicollinearity is the inspection of the off-diagonal elements
rij of X′X in correlation form (i.e., the simple correlation matrix of the regressors). If the predictor
variables xi and xj are nearly linearly dependent, then |rij| will be close to 1. Examining the simple
correlations rij between the predictor variables is helpful in detecting
near-linear dependency between pairs of predictor variables only; it is not sufficient for detecting
more complex near-linear dependencies involving several regressors.
4. Auxiliary regressions
Since multicollinearity arises because one or more of the regressors are exact or approximate
linear combinations of the other regressors, one way of finding out which X variable is related
to the other X variables is to regress each X on the remaining X variables and compute the
corresponding R². Each of these regressions is called an auxiliary regression. Consider the
following Fi statistic:

Fi = [R²i / (k − 2)] / [(1 − R²i) / (n − k + 1)]

which follows the F distribution with k − 2 and n − k + 1 df. Here n stands for the sample size,
k stands for the number of explanatory variables including the intercept term, and R²i is the
coefficient of determination in the auxiliary regression of variable Xi on the remaining X variables.
If the computed F exceeds the critical F at the chosen level of significance, it is taken to mean
that the particular X is collinear with other X's; if it does not exceed the critical F, we say that
it is not collinear with other X's, in which case we may retain that variable in the model.
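A sketch of this procedure, assuming NumPy only (the column names, helper function and simulated data are illustrative): each regressor is regressed on the others by least squares, and the auxiliary R² is converted into the Fi statistic above.

```python
import numpy as np

def auxiliary_r2(X, i):
    """R^2 from regressing column i of X on the remaining columns (with an intercept)."""
    y = X[:, i]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, i, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(2)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

k = X.shape[1] + 1                       # explanatory variables, including the intercept
for i in range(X.shape[1]):
    r2 = auxiliary_r2(X, i)
    F = (r2 / (k - 2)) / ((1 - r2) / (n - k + 1))
    print(f"x{i + 1}: auxiliary R^2 = {r2:.3f}, F = {F:.1f}")
```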
5. Eigenvalues and condition index:
The condition number k is defined as

k = maximum eigenvalue of X′X / minimum eigenvalue of X′X

and the condition index is defined as

CI = √k

If k is between 100 and 1000 there is moderate to strong multicollinearity, and if it exceeds
1000 there is severe multicollinearity. Alternatively, if the CI is between 10 and 30, there is
moderate to strong multicollinearity, and if it exceeds 30 there is severe multicollinearity.
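A minimal sketch of the computation, assuming NumPy. In practice the regressors are usually scaled (for example standardised, as here, or scaled to unit length) before the eigenvalues are taken; the scaling shown is one common choice rather than the only one.

```python
import numpy as np

def condition_measures(X):
    """Condition number k and condition index sqrt(k) of the scaled X'X matrix."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise each column
    eig = np.linalg.eigvalsh(Xs.T @ Xs)         # eigenvalues of the symmetric X'X
    k = eig.max() / eig.min()
    return k, np.sqrt(k)

rng = np.random.default_rng(3)
x1 = rng.normal(size=80)
x2 = x1 + 0.05 * rng.normal(size=80)            # nearly collinear pair
x3 = rng.normal(size=80)

k, ci = condition_measures(np.column_stack([x1, x2, x3]))
print(k, ci)                                    # large values signal multicollinearity
```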
6. Variance inflation factor
The variance inflation factor of the regressor Xj is defined as VIFj = 1/(1 − R²j), where R²j is the
coefficient of determination obtained when Xj is regressed on the remaining regressors. It is an
indicator of multicollinearity: the larger the value of VIFj, the more 'troublesome' or collinear the
variable Xj. If the VIF of a variable exceeds 10, which will happen if R²j exceeds 0.90, that variable
is said to be highly collinear.
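A compact way to compute the VIFs is sketched below (NumPy assumed; the data are illustrative), using the fact that VIFj equals the jth diagonal element of the inverse of the regressors' correlation matrix.

```python
import numpy as np

def vif(X):
    """VIF for each column of X, via the inverse of the regressors' correlation matrix."""
    R = np.corrcoef(X, rowvar=False)     # correlation matrix of the regressors
    return np.diag(np.linalg.inv(R))     # diagonal entries are the VIFs

rng = np.random.default_rng(6)
x1 = rng.normal(size=80)
x2 = x1 + 0.05 * rng.normal(size=80)     # nearly collinear with x1
x3 = rng.normal(size=80)

print(vif(np.column_stack([x1, x2, x3])))  # values above 10 flag highly collinear regressors
```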
Remedial measures for multicollinearity
Several methods have been proposed for dealing with the problem caused by multicollinearity. Some
of these methods are given below.
1. Collecting additional data:
Collecting additional data has been suggested as the best method of combating multicollinearity.
The additional data should be collected in a manner designed to break up the multicollinearity in
the existing data. Unfortunately, collecting additional data is not always possible, because of
economic constraints or because the process being studied is no longer available for sampling.
Finally, note that collecting additional data is not a viable solution to the multicollinearity
problem when the multicollinearity is due to constraints on the model or in the population.
2. Model Respecification
Multicollinearity is often caused by the choice of model, such as when two highly correlated
regressors are used in the regression equation. In these situations some respecification of the
regression equation may lessen the impact of multicollinearity. There are two common methods of
model respecification:
i. Redefine regressors
If x1, x2, and x3 are nearly linearly dependent, it may be possible to find some function such as
x = (x1 + x2)/x3 or x = x1x2x3 that preserves the information content in the original regressors but
reduces the ill-conditioning.
ii. Variable elimination method
Another widely used approach is variable elimination. That is, if x1, x2, and x3 are nearly linearly
dependent, eliminating one regressor (say x3) may be helpful in combating multicollinearity. Variable
elimination is often a highly effective technique. However, it may not provide a satisfactory
solution if the regressors dropped from the model have significant explanatory power relative
to the response y.
3. Transformation of variables
Suppose we have time series data on consumption expenditure, income, and wealth. One
reason for high multicollinearity between income and wealth in such data is that over time both
variables tend to move in the same direction. One way of minimizing this dependence is to
proceed as follows.
If the relation

Yt = β0 + β1X1t + β2X2t + ut    -------- (A)

holds at time t, it must also hold at time t−1. Therefore we have

Yt−1 = β0 + β1X1,t−1 + β2X2,t−1 + ut−1    -------- (B)

Subtracting (B) from (A) gives

Yt − Yt−1 = β1(X1t − X1,t−1) + β2(X2t − X2,t−1) + vt    -------- (C)

where vt = ut − ut−1.
Equation (C) is known as the first difference form because we run the regression not on the original
variables but on the differences of successive values of the variables. The first difference regression
model often reduces the severity of multicollinearity because, although the levels of X1 and X2 may
be highly correlated, there is no a priori reason to believe that their differences will also be highly
correlated.
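The following is a small sketch of the first-difference transformation, assuming NumPy; the two trending series are simulated purely for illustration (for example, income and wealth over time).

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(100)
x1 = 100 + 2.0 * t + rng.normal(scale=2, size=100)    # income: strong upward trend
x2 = 500 + 10.0 * t + rng.normal(scale=10, size=100)  # wealth: same trend

print(np.corrcoef(x1, x2)[0, 1])                      # close to 1 in levels

dx1, dx2 = np.diff(x1), np.diff(x2)                   # first differences X_t - X_{t-1}
print(np.corrcoef(dx1, dx2)[0, 1])                    # much lower correlation
```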
