Muhammad Ali
Lecturer in Statistics
GPGC Mardan.
M.sc(Peshawar)
M.phil(AIOU Islamabad)
Multicollinearity
Definition of Multicollinearity
One of the assumptions of the classical linear regression model (CLRM) is that there is no
exact linear relationship among the regressors (independent variables) included in the regression
model. However, in most practical situations the predictor variables are nearly perfectly
related. When there is a near-linear dependency among the predictor variables, the
problem of multicollinearity is said to exist. The term multicollinearity is used to denote
the presence of a linear relationship among some or all of the predictor variables of a regression
model.
Example on Multicollinearity
Consider an economic model designed to explain the weekly sales of a specific item by a
supermarket. The factors that affect the sales of the item include, among other things, the price of the
product, the prices of competitive and complementary goods, and the extent of marketing or
advertising efforts devoted to the product. When the experimental design is determined by the
uncontrolled operation of the market, one group of variables that may move together in a systematic
way is the prices of the product and its competitors. In this situation, it would not be a surprise
to find in a sample of data that when one price goes up all the prices go up, and when one
price falls all of them fall.
These systematic, collinear relationships between the prices are the kind that create a potential problem,
and this is how the problem of multicollinearity arises; it has serious effects on the
regression, as discussed below.
Nature of Multicollinearity
Consider the general linear regression model in matrix form:
Y = Xβ + ε

where Y is an (n × 1) vector of responses, X is an (n × (p+1)) matrix of predictors, β is a ((p+1) × 1)
vector of regression coefficients, and ε is an (n × 1) vector of disturbances such that
ε ~ NID(0, σ²I).
Let the jth column of the matrix X be denoted by xj. Exact or perfect collinearity exists when there is an
exact linear relationship among the explanatory variables (the columns of X), that is, when one or more
relations of the form

c1x1 + c2x2 + c3x3 + ... + cpxp = 0

exist, where the constants cj are not all zero. For example, if x2 = 2x3 + x4, then the variables x2, x3
and x4 are exactly linearly related.
Now consider the case where the X variables are intercorrelated, but not perfectly:
c1x1 + c2x2 + c3x3 + ... + cpxp + ε = 0
where ε is a stochastic error term.
To see the difference between exact and non-exact multicollinearity, consider the following numerical
example.
X1      X2      X2*
10      20      22
20      40      40
30      60      67
40      80      89
50      100     102
It is clear from the above table that X2 = 2X1, so there is perfect collinearity between X1 and X2.
The variable X2* is created from X2 by simply adding the random numbers 2, 0, 7, 9, and 2 taken
from a random number table. There is now no longer perfect multicollinearity between X1
and X2*; however, the two variables are highly correlated.
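The contrast between exact and near collinearity can be checked numerically. The following is a minimal sketch, assuming NumPy is available; it rebuilds the small table above and compares the pairwise correlations.

```python
import numpy as np

# Data from the table above
x1 = np.array([10, 20, 30, 40, 50], dtype=float)
x2 = 2 * x1                                  # exact collinearity: X2 = 2*X1
x2_star = x2 + np.array([2, 0, 7, 9, 2])     # X2* = X2 plus small random numbers

print(np.corrcoef(x1, x2)[0, 1])       # exactly 1.0 -> perfect collinearity
print(np.corrcoef(x1, x2_star)[0, 1])  # slightly below 1 -> near (imperfect) collinearity
```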
Note that multicollinearity is a question of degree and not of kind. The meaningful distinction is not
between the presence and the absence of multicollinearity, but between its various degrees.
Since multicollinearity refers to the condition of the explanatory variables, which are assumed to be
nonstochastic, it is a feature of the sample and not of the population.
Sources of multicollinearity
There are four primary sources of multicollinearity:
1. The data collection method employed
2. Constraints on the model or in the population
3. Model specification
4. An over-defined model
1. The data collection method
The data collection method can lead to multicollinearity problems when the researcher
samples only a subset of the region of the regressors.
2. Constraints on the model or in the population
Constraints on the model or in the population being sampled can cause multicollinearity. For example,
suppose that an electric utility is investigating the effect of family income (x1) and house size (x2) on
residential electricity consumption. Here a physical constraint in the population has caused
the phenomenon, namely that families with higher incomes generally have larger homes than families
with lower incomes. When physical constraints such as this are present, multicollinearity will exist
regardless of the sampling method employed.
3. Model Specification
We often encounter situations in which two or more regressors are nearly linearly dependent, and
retaining all of these regressors may contribute to multicollinearity.
4. An over-defined model
An over-defined model has more regressor variables than observations. Such models are
sometimes encountered in medical and behavioral research, where there may be only a small
number of subjects (sample units) available and information is collected on a large number of
regressors for each subject. The usual approach to dealing with multicollinearity in this context
is to eliminate some of the regressor variables from consideration.
Effects of Multicollinearity
The presence of multicollinearity has a number of potentially serious effects on the least-
squares estimates of the regression coefficients. Some of these effects may be easily
demonstrated.
1. The least-squares estimates are indeterminate in the case of perfect multicollinearity:
Consider a model with only two predictor variables, say X1 and X2. The regression model
can then be written as:

Yi = β0 + β1X1i + β2X2i + εi    - - - - - - (i)
Summing equation (i) over the n observations and dividing by n gives:

(1/n)ΣYi = β0 + β1(1/n)ΣX1i + β2(1/n)ΣX2i + (1/n)Σεi

Ȳ = β0 + β1X̄1 + β2X̄2 + ε̄    - - - - - - (ii)
Subtracting equation (ii) from equation (i):

Yi − Ȳ = β1(X1i − X̄1) + β2(X2i − X̄2) + (εi − ε̄)

yi = β1x1i + β2x2i + εi    - - - - - - (iii)
Equation (iii) is the deviation form.
The least-squares normal equations are given by:

(X′X)β̂ = X′Y

so that

β̂ = (X′X)⁻¹X′Y

In the case of only two explanatory variables in deviation form, X = [x1  x2], that is,

X = [ x11  x21 ]
    [ x12  x22 ]
    [  ...     ]
    [ x1n  x2n ]

so that

X′X = [ Σx1i²      Σx1i x2i ]
      [ Σx1i x2i   Σx2i²    ]

To obtain the β̂ vector we need (X′X)⁻¹.
Now

(X′X)⁻¹ = [1 / (Σx1i² Σx2i² − (Σx1i x2i)²)] [ Σx2i²      −Σx1i x2i ]
                                             [ −Σx1i x2i   Σx1i²    ]

and therefore

β̂ = (X′X)⁻¹X′Y
   = [1 / (Σx1i² Σx2i² − (Σx1i x2i)²)] [ Σx2i²      −Σx1i x2i ] [ Σx1i yi ]
                                        [ −Σx1i x2i   Σx1i²    ] [ Σx2i yi ]

Assume that x2 = λx1, where λ is a nonzero constant. Substituting this into the above equation gives:

β̂ = [1 / (λ²(Σx1i²)² − λ²(Σx1i²)²)] [ λ²Σx1i²   −λΣx1i² ] [  Σx1i yi ]
                                      [ −λΣx1i²    Σx1i²  ] [ λΣx1i yi ]

   = (1/0) [ λ²Σx1i² Σx1i yi − λ²Σx1i² Σx1i yi ]
           [ −λΣx1i² Σx1i yi + λΣx1i² Σx1i yi  ]
so that

[ β̂1 ]          [ 0 ]   [ 0/0 ]
[ β̂2 ]  = (1/0) [ 0 ] = [ 0/0 ]

Thus, in the case of perfect multicollinearity, the least-squares estimates are indeterminate.
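This result is easy to verify numerically: with perfectly collinear columns, the matrix X′X is singular, so the normal equations have no unique solution. The following is a minimal sketch assuming NumPy, using the illustrative X1 and X2 = 2X1 values from the earlier table.

```python
import numpy as np

x1 = np.array([10., 20., 30., 40., 50.])
x2 = 2 * x1                              # perfect collinearity: x2 = 2*x1

# Deviation (mean-centred) form, as in the derivation above
X = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
XtX = X.T @ X

print(np.linalg.det(XtX))                # 0 (up to rounding): X'X is singular

try:
    np.linalg.inv(XtX)                   # inverting a singular matrix fails
except np.linalg.LinAlgError as err:
    print("X'X cannot be inverted:", err)
```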
2. The variances and covariances of the least-squares estimates are infinite if there is a perfect
linear relationship between the predictor variables.
The variance-covariance matrix of the least-squares estimator β̂ is given by:
Var(β̂) = σ²(X′X)⁻¹

        = [σ² / (Σx1i² Σx2i² − (Σx1i x2i)²)] [ Σx2i²      −Σx1i x2i ]
                                              [ −Σx1i x2i   Σx1i²    ]

Substituting x2 = λx1 as before gives

        = [σ² / (λ²(Σx1i²)² − λ²(Σx1i²)²)] [ λ²Σx1i²   −λΣx1i² ]
                                            [ −λΣx1i²    Σx1i²  ]

        = (σ²/0) [ λ²Σx1i²   −λΣx1i² ]
                 [ −λΣx1i²    Σx1i²  ]

so that

[ var(β̂1)       cov(β̂1, β̂2) ]   [ ∞  ∞ ]
[ cov(β̂2, β̂1)   var(β̂2)     ] = [ ∞  ∞ ]

which shows that, in the case of perfect multicollinearity, the variances and covariances of the OLS
estimates are infinite.
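The same formula explains why near (rather than perfect) collinearity produces very large, though finite, variances: as x2 approaches λx1 the determinant in the denominator approaches zero. The following is a small sketch of this behaviour, assuming NumPy and taking σ² = 1 purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
sigma2 = 1.0                                     # assumed error variance, for illustration

for noise in (1.0, 0.1, 0.01, 0.001):
    x2 = 2 * x1 + noise * rng.normal(size=n)     # x2 -> 2*x1 as the noise shrinks
    X = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)   # Var(beta-hat) = sigma^2 (X'X)^-1
    print(noise, np.diag(cov_beta))              # the variances blow up as noise -> 0
```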
Consequences of multicollinearity
In the case of near or high multicollinearity, one is likely to have the following consequences:
1. Although the estimators are still BLUE (best linear unbiased estimators), they have large variances
and covariances.
2. The confidence intervals tend to be much wider, leading to acceptance of the null hypothesis
because of the large variances.
3. The t ratios tend to become insignificant.
4. R², the overall measure of goodness of fit, can nevertheless be very high.
5. Small changes in the data can produce large changes in the estimates and their standard errors.
Methods of detection of multicollinearity
The following are common methods of detecting multicollinearity:
1. High R² but few significant t ratios
If R² is high, say in excess of 0.8, the F test will in most cases reject the hypothesis that the
partial slope coefficients are simultaneously equal to zero, but the individual t tests will show
that none, or very few, of the partial slope coefficients are statistically different from zero.
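This symptom is easy to reproduce. The sketch below assumes NumPy and the statsmodels package; the simulated data and variable names are purely illustrative. With two nearly collinear regressors, the overall R² and F statistic are large while the individual t ratios are weak.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)            # nearly collinear with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.rsquared)     # high overall fit
print(fit.fvalue)       # large F statistic (joint significance)
print(fit.tvalues)      # individual t ratios for x1 and x2 are small/unstable
```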
2. High pair-wise correlations among regressors
Another suggested rule of thumb is that if the pair-wise or zero-order correlation coefficient
between two regressors is high, say in excess of 0.8, then multicollinearity is a serious
problem.
3. Examination of the correlation matrix
A very simple method to detect multicollinearity is the inspection of the off-diagonal elements
rij of X′X in correlation form (i.e., the simple correlation matrix of the regressors). If the predictor
variables xi and xj are nearly linearly dependent, then |rij| will be close to 1. Examining the simple
correlations rij between the predictor variables is helpful in detecting
near-linear dependency between pairs of predictor variables only; it is not sufficient for detecting
more complex near-linear dependencies involving several regressors.
4. Auxiliary regressions
Since multicollinearity arises because one or more of the regressors are exact or approximate
linear combinations of the other regressors, one way of finding out which X variable is related
to the other X variables is to regress each X on the remaining X variables and compute the
corresponding R². Each of these regressions is called an auxiliary regression. Consider the
following Fi statistic:

Fi = [R²i / (k − 2)] / [(1 − R²i) / (n − k + 1)]

which follows the F distribution with k − 2 and n − k + 1 df. Here n stands for the sample size,
k stands for the number of explanatory variables including the intercept term, and R²i is the
coefficient of determination in the auxiliary regression of variable Xi on the remaining X variables.
If the computed F exceeds the critical F at the chosen level of significance, it is taken to mean
that the particular X is collinear with other X's; if it does not exceed the critical F, we say that
it is not collinear with other X's, in which case we may retain that variable in the model.
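A sketch of this procedure, assuming NumPy only (the column names, helper function and simulated data are illustrative): each regressor is regressed on the others by least squares, and the auxiliary R² is converted into the Fi statistic above.

```python
import numpy as np

def auxiliary_r2(X, i):
    """R^2 from regressing column i of X on the remaining columns (with an intercept)."""
    y = X[:, i]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, i, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(2)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

k = X.shape[1] + 1                       # explanatory variables, including the intercept
for i in range(X.shape[1]):
    r2 = auxiliary_r2(X, i)
    F = (r2 / (k - 2)) / ((1 - r2) / (n - k + 1))
    print(f"x{i + 1}: auxiliary R^2 = {r2:.3f}, F = {F:.1f}")
```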
5. Eigenvalues and condition index:
The condition number k is defined as

k = maximum eigenvalue of X′X / minimum eigenvalue of X′X

and the condition index is defined as

CI = √k

If k is between 100 and 1000 there is moderate to strong multicollinearity, and if it exceeds
1000 there is severe multicollinearity. Alternatively, if the CI is between 10 and 30, there is
moderate to strong multicollinearity, and if it exceeds 30 there is severe multicollinearity.
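A minimal sketch of the computation, assuming NumPy. In practice the regressors are usually scaled (for example standardised, as here, or scaled to unit length) before the eigenvalues are taken; the scaling shown is one common choice rather than the only one.

```python
import numpy as np

def condition_measures(X):
    """Condition number k and condition index sqrt(k) of the scaled X'X matrix."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise each column
    eig = np.linalg.eigvalsh(Xs.T @ Xs)         # eigenvalues of the symmetric X'X
    k = eig.max() / eig.min()
    return k, np.sqrt(k)

rng = np.random.default_rng(3)
x1 = rng.normal(size=80)
x2 = x1 + 0.05 * rng.normal(size=80)            # nearly collinear pair
x3 = rng.normal(size=80)

k, ci = condition_measures(np.column_stack([x1, x2, x3]))
print(k, ci)                                    # large values signal multicollinearity
```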
6. Variance inflation factor
The variance inflation factor of the regressor Xj is defined as VIFj = 1/(1 − R²j), where R²j is the
coefficient of determination obtained when Xj is regressed on the remaining regressors. It is an
indicator of multicollinearity: the larger the value of VIFj, the more 'troublesome' or collinear the
variable Xj. If the VIF of a variable exceeds 10, which will happen if R²j exceeds 0.90, that variable
is said to be highly collinear.
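A compact way to compute the VIFs is sketched below (NumPy assumed; the data are illustrative), using the fact that VIFj equals the jth diagonal element of the inverse of the regressors' correlation matrix.

```python
import numpy as np

def vif(X):
    """VIF for each column of X, via the inverse of the regressors' correlation matrix."""
    R = np.corrcoef(X, rowvar=False)     # correlation matrix of the regressors
    return np.diag(np.linalg.inv(R))     # diagonal entries are the VIFs

rng = np.random.default_rng(6)
x1 = rng.normal(size=80)
x2 = x1 + 0.05 * rng.normal(size=80)     # nearly collinear with x1
x3 = rng.normal(size=80)

print(vif(np.column_stack([x1, x2, x3])))  # values above 10 flag highly collinear regressors
```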
Remedial measures for multicollinearity
Several methods have been proposed for dealing with the problem caused by multicollinearity. Some
of these methods are given below.
1. Collecting additional data:
Collecting additional data has been suggested as the best method of combating multicollinearity.
The additional data should be collected in a manner designed to break up the multicollinearity in
the existing data. Unfortunately, collecting additional data is not always possible, because of
economic constraints or because the process being studied is no longer available for sampling.
Finally, note that collecting additional data is not a viable solution to the multicollinearity
problem when the multicollinearity is due to constraints on the model or in the population.
2. Model Respecification
Multicollinearity is often caused by the choice of model, such as when two highly correlated
regressors are used in the regression equation. In these situations some respecification of the
regression equation may lessen the impact of multicollinearity. There are two common methods of
model respecification:
i. Redefine regressors
If x1, x2, and x3 are nearly linearly dependent, it may be possible to find some function such as
x = (x1 + x2)/x3 or x = x1x2x3 that preserves the information content in the original regressors but
reduces the ill-conditioning.
ii. Variable elimination method
Another widely used approach is variable elimination. That is, if x1, x2, and x3 are nearly linearly
dependent, eliminating one regressor (say x3) may be helpful in combating multicollinearity. Variable
elimination is often a highly effective technique. However, it may not provide a satisfactory
solution if the regressors dropped from the model have significant explanatory power relative
to the response y.
3. Transformation of variables
Suppose we have time series data on consumption expenditure, income, and wealth. One
reason for high multicollinearity between income and wealth in such data is that over time both
variables tend to move in the same direction. One way of minimizing this dependence is to
proceed as follows.
If the relation

Yt = β0 + β1X1t + β2X2t + ut    -------- (A)

holds at time t, it must also hold at time t−1. Therefore we have

Yt−1 = β0 + β1X1,t−1 + β2X2,t−1 + ut−1    -------- (B)

Subtracting (B) from (A) gives

Yt − Yt−1 = β1(X1t − X1,t−1) + β2(X2t − X2,t−1) + vt    -------- (C)

where vt = ut − ut−1.
Equation (C) is known as the first difference form because we run the regression not on the original
variables but on the differences of successive values of the variables. The first difference regression
model often reduces the severity of multicollinearity because, although the levels of X1 and X2 may
be highly correlated, there is no a priori reason to believe that their differences will also be highly
correlated.
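The following is a small sketch of the first-difference transformation, assuming NumPy; the two trending series are simulated purely for illustration (for example, income and wealth over time).

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(100)
x1 = 100 + 2.0 * t + rng.normal(scale=2, size=100)    # income: strong upward trend
x2 = 500 + 10.0 * t + rng.normal(scale=10, size=100)  # wealth: same trend

print(np.corrcoef(x1, x2)[0, 1])                      # close to 1 in levels

dx1, dx2 = np.diff(x1), np.diff(x2)                   # first differences X_t - X_{t-1}
print(np.corrcoef(dx1, dx2)[0, 1])                    # much lower correlation
```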
