Regression Analysis

Week no 2 - 19th to 23rd Sept, 2011

Course Map
Introduction to Quantitative Analysis, Ch1, RSH (1 Week)

Regression Models, Ch4 (1 Week)

Decision Analysis, Ch3, RSH (2 Weeks)

Linear Programming Models: Graphical & Computer Methods, Ch7, RSH (2
Weeks)

Linear Programming Modeling Applications: With Computer Analyses in Excel,
Ch8, RSH (2 Weeks)

Simulation Modeling, Ch15, RSH (2 Weeks)

Forecasting, Ch5, RSH. (2 Weeks)

Waiting Lines and Queuing Theory Models, Ch14, RSH. (2 Weeks)

regression analysis
A very valuable tool for today’s manager.
Regression Analysis is used to:

Understand the relationship between variables.

Predict the value of one variable based on
another variable.

A regression model has:

a dependent, or response, variable - Y axis

an independent, or predictor, variable - X axis

How to perform
Regression analysis

regression analysis
Triple A Construction Company renovates old homes in Albany. It has found that its dollar volume of renovation work depends on the Albany area payroll.

Local Payroll        Triple A Sales
($100,000,000's)     ($100,000's)
3                    6
4                    8
6                    9
4                    5
2                    4.5
5                    9.5

Scatter plot
[Figure: scatter plot of Triple A sales ($100,000's, Y axis) against local payroll ($100,000,000's, X axis).]

regression analysis model
Regression: Understand & Predict
Create a Scatter Plot
Perform Regression Analysis

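\( Y = \beta_0 + \beta_1 X + \varepsilon \)
Where,
Y = dependent variable (response)
X = independent variable (predictor)
\( \beta_0 \) = intercept (value of Y when X = 0)
\( \beta_1 \) = slope
\( \varepsilon \) = some random error that cannot be predicted

Sample data are used to estimate the true values for the intercept and slope.
\( \hat{Y} = b_0 + b_1 X \)
Where,
\( \hat{Y} \) = predicted value of Y
\( b_0 \) = estimate of \( \beta_0 \), based on sample data
\( b_1 \) = estimate of \( \beta_1 \), based on sample data

The difference between the actual value of Y and the predicted value (using sample data) is known as the error.
Error = (actual value) – (predicted value)
\( e = Y - \hat{Y} \)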

Calculating the required parameters:

Sales (Y)   Payroll (X)   \( (X - \bar{X})^2 \)   \( (X - \bar{X})(Y - \bar{Y}) \)
6           3             1                       1
8           4             0                       0
9           6             4                       4
5           4             0                       0
4.5         2             4                       5
9.5         5             1                       2.5
Summations for each column:
42          24            10                      12.5

\( \bar{Y} = 42/6 = 7 \)
\( \bar{X} = 24/6 = 4 \)

\( b_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{12.5}{10} = 1.25 \)
\( b_0 = \bar{Y} - b_1 \bar{X} = 7 - (1.25)(4) = 2 \)

So,
\( \hat{Y} = 2 + 1.25 X \)
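The hand calculation above can be reproduced in a few lines. This is a minimal sketch in plain Python, not part of the original slides; the function name `fit_line` is an assumption chosen for illustration.

```python
# Triple A Construction data from the slide.
payroll = [3, 4, 6, 4, 2, 5]        # X, in $100,000,000's
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y, in $100,000's


def fit_line(x, y):
    """Return (b0, b1) for the least-squares line Y-hat = b0 + b1 * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared X deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(payroll, sales)
print(f"Y-hat = {b0:g} + {b1:g} X")  # prints "Y-hat = 2 + 1.25 X"
```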

Measuring the Fit of the linear Regression Model
To understand how well X predicts Y, we evaluate:

Variability in the Y variable
SSR –> Regression variability that is explained by the relationship between Y and X
+
SSE –> Unexplained variability, due to factors other than the regression
------------------------------------
SST –> Total variability about the mean

Correlation Coefficient
r – Strength of the relationship b/w X & Y variables

Coefficient of Determination
R Sq – Proportion of explained variation

Standard Error
St Deviation of error around the Regression Line

Residual Analysis
Validation of Model

Test for Linearity
Significance of the Regression Model i.e. Linear Regression Model

Variability
[Figure: Triple A sales against local payroll ($100,000,000's) with the regression line y = 1.25x + 2 (R² = 0.6944). For one data point, the plot marks SST, SSE and SSR (explained variability) relative to the mean of Y.]

Variability
Errors (deviations) may be positive or negative. Summing the errors would be misleading, so we square the terms prior to summing.

- Sum of Squares Total (SST) measures the total variability in Y.
  \( SST = \sum (Y - \bar{Y})^2 \)
- Sum of the Squared Error (SSE) is less than the SST because the regression line reduces the variability.
  \( SSE = \sum e^2 = \sum (Y - \hat{Y})^2 \)
- Sum of Squares due to Regression (SSR) indicates how much of the total variability is explained by the regression model.
  \( SSR = \sum (\hat{Y} - \bar{Y})^2 \)

For Triple A Construction:
\( SST = \sum (Y - \bar{Y})^2 = 22.5 \)
\( SSE = \sum e^2 = \sum (Y - \hat{Y})^2 = 6.875 \)
\( SSR = \sum (\hat{Y} - \bar{Y})^2 = 15.625 \)

Note:
SST = SSR + SSE
(total variability = explained variability + unexplained variability)
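The three sums of squares can be checked directly from the data. The following is a small sketch, assuming the fitted line with intercept 2 and slope 1.25 from the earlier slide; the variable names are illustrative only.

```python
payroll = [3, 4, 6, 4, 2, 5]        # X, in $100,000,000's
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y, in $100,000's
b0, b1 = 2.0, 1.25                  # fitted line from the slides

y_bar = sum(sales) / len(sales)
predicted = [b0 + b1 * x for x in payroll]

# Total, unexplained and explained variability.
sst = sum((y - y_bar) ** 2 for y in sales)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)

print(sst, sse, ssr)  # prints "22.5 6.875 15.625"
```

The printed values agree with the slide, and SSR + SSE adds up to SST.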

Coefficient of Determination
The coefficient of determination (\( r^2 \)) is the proportion of the variability in Y that is explained by the regression equation.

\( r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \)

SST, SSR and SSE by themselves provide little direct interpretation. \( r^2 \) measures the usefulness of the regression.

For Triple A Construction:
\( r^2 = \frac{15.625}{22.5} = 0.6944 \)

About 69% of the variability in sales is explained by the regression based on payroll.

Note: \( 0 \le r^2 \le 1 \)
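As a quick check, both forms of the formula give the same value. This sketch simply plugs in the Triple A sums of squares quoted on the slides.

```python
# Sums of squares for Triple A Construction, as quoted on the slides.
sst, sse, ssr = 22.5, 6.875, 15.625

r2_from_ssr = ssr / sst          # explained share of the variability
r2_from_sse = 1 - sse / sst      # one minus the unexplained share

print(round(r2_from_ssr, 4), round(r2_from_sse, 4))  # prints "0.6944 0.6944"
```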

Correlation Coefficient
The correlation coefficient (r) measures the strength of the linear relationship.

\( r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}} \)

Shown as Multiple R in the output of the Excel file.

[Figure: possible scatter diagrams for values of r.]

For Triple A Construction, r = 0.8333

Note: \( -1 \le r \le 1 \)
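The computational formula for r can be evaluated step by step. This is a minimal sketch in plain Python using the Triple A data; nothing beyond the standard library is assumed.

```python
import math

payroll = [3, 4, 6, 4, 2, 5]        # X
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y

n = len(payroll)
sum_x = sum(payroll)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(payroll, sales))
sum_x2 = sum(x * x for x in payroll)
sum_y2 = sum(y * y for y in sales)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 4))  # prints "0.8333"
```

Squaring r gives 0.6944, the coefficient of determination for the same data.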

Standard error
The mean squared error (MSE) is the estimate of the error variance of the regression equation.

\( s^2 = MSE = \frac{SSE}{n - k - 1} \)

Where,
n = number of observations in the sample
k = number of independent variables

The standard error of the estimate, \( s = \sqrt{MSE} \), is the standard deviation of the error around the regression line. Just like a standard deviation (which is measured around the mean), it measures the variation of Y around the regression line, in the same units as Y.

For Triple A Construction, \( s^2 = 6.875 / 4 = 1.7188 \) and \( s = 1.31 \), which means an error of roughly ±1.31 x 100,000 USD of sales in prediction.
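The distinction between the error variance (MSE) and the standard error (its square root) is easy to see numerically. A short sketch, using the SSE value quoted on the slides:

```python
import math

sse = 6.875   # sum of squared errors for Triple A Construction
n = 6         # number of observations
k = 1         # number of independent variables

mse = sse / (n - k - 1)      # estimate of the error variance, s squared
s = math.sqrt(mse)           # standard error, in the same units as Y

print(mse, round(s, 2))
```

Here `mse` is 1.71875 (the 1.7188 used later in the F test) and `s` is about 1.31.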

Test for linearity
An F-test is used to statistically test the null hypothesis that there is no linear relationship between the X and Y variables (i.e. \( \beta_1 = 0 \)). If the significance level for the F test is low, we reject \( H_0 \) and conclude there is a linear relationship.

\( F = \frac{MSR}{MSE} \), where \( MSR = \frac{SSR}{k} \)

The p value is the observed significance level of the test. Alpha is the chosen level of significance (alpha = 1 - confidence level). If p < alpha, reject the null hypothesis that there is no linear relationship between X & Y.

For Triple A Construction:
\( MSR = \frac{15.625}{1} = 15.625 \)
\( F = \frac{15.625}{1.7188} = 9.0909 \)

The significance level for F = 9.0909 is 0.0394, indicating we reject \( H_0 \) and conclude a linear relationship exists between sales and payroll.
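The F statistic and the decision rule can be sketched as follows. The p value of 0.0394 is taken from the slide rather than computed, and alpha = 0.05 is an assumed choice of significance level.

```python
ssr = 15.625  # sum of squares due to regression
sse = 6.875   # sum of squared errors
n = 6         # observations
k = 1         # independent variables

msr = ssr / k
mse = sse / (n - k - 1)
f_stat = msr / mse

p_value = 0.0394   # significance of F as reported on the slide (not computed here)
alpha = 0.05       # assumed level of significance

reject_h0 = p_value < alpha
print(round(f_stat, 4), reject_h0)  # prints "9.0909 True"
```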
k

Computer Software for
Regression
In Excel, use Tools/
Data Analysis. This
is an ‘add-in’ option.

Regression output in Excel:

Multiple R is the correlation coefficient.

The adjusted R Sq takes into account the number of independent variables in the model.

Standard Error is the standard deviation of the error around the regression line, in the same units as Y (here about ±1.3 x 100,000 USD of sales error in prediction).

p Value < Alpha (0.05 or 0.1) means we reject the hypothesis of no linear relationship between X & Y.

Residual Analysis: to verify that the regression assumptions are correct.

Assumptions of the Regression Model
We make certain assumptions about the errors in a regression model which allow for statistical testing.

Assumptions:
- Errors are independent.
- Errors are normally distributed.
- Errors have a mean of zero.
- Errors have a constant variance.

A plot of the errors (real value minus predicted value of Y), also called residuals in Excel, may highlight problems with the model.

PITFALLS:
Prediction beyond the range of X values in the sample can be misleading, including interpretation of the intercept (X=0).
A linear regression model may not be the best model, even in the presence of a significant F test.

Constant variance
Triple A Construction
Assumption: errors have constant variance.
Plot the residuals against the X values.
The pattern should be random!

[Figure: residual plot against X showing non-constant variation in the error, a violation of the assumption.]
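The residuals to be plotted against X can be computed as below. This is a sketch only: it lists the residuals for the Triple A data using the fitted line from the slides, and leaves the actual plotting to Excel or any charting tool.

```python
payroll = [3, 4, 6, 4, 2, 5]        # X
sales = [6, 8, 9, 5, 4.5, 9.5]      # Y
b0, b1 = 2.0, 1.25                  # fitted line from the slides

# Residual = actual value minus predicted value.
residuals = [y - (b0 + b1 * x) for x, y in zip(payroll, sales)]

# Pair each residual with its X value, ordered by X, ready for plotting.
for x, e in sorted(zip(payroll, residuals)):
    print(x, e)

mean_residual = sum(residuals) / len(residuals)
print("mean residual:", mean_residual)
```

For a least-squares line with an intercept the residuals average to zero, which is what the last line shows.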

Normal distribution
Histogram of residuals: should look like a bell curve.

It is not possible to see the bell curve with just 6 observations. More samples are needed.

zero mean
Errors have zero mean.
[Figure: residual plot against X, centred on 0.]

independent errors
If samples are collected over a period of time and not at the same time, then plot the residuals w.r.t. time to see if any pattern (autocorrelation) exists.

If there is substantial autocorrelation, the validity of the regression model becomes doubtful. Autocorrelation can also be checked using the Durbin–Watson statistic.

Example: The manager of a package delivery store wants to predict weekly sales based on the number of customers making purchases for a period of 100 days. Data is collected over a period of time, so check for an autocorrelation (pattern) effect.

[Figure: residuals plotted against time show a cyclical pattern, which is a violation.]
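The slides name the Durbin–Watson statistic without giving its formula. The sketch below uses the standard definition (sum of squared successive differences of the residuals divided by the sum of squared residuals). Purely for illustration, it treats the six Triple A residuals as if they had been collected in time order, which the slides do not claim.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    diff_sq = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    total_sq = sum(e ** 2 for e in residuals)
    return diff_sq / total_sq


# Triple A residuals, ASSUMED to be in time order for this illustration only.
residuals = [0.25, 1.0, -0.5, -2.0, 0.0, 1.25]
dw = durbin_watson(residuals)
print(round(dw, 4))
```

The statistic ranges from 0 to 4. Values near 2 suggest little autocorrelation, while values towards 0 or 4 suggest positive or negative autocorrelation respectively.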

Residual analysis for
validating assumptions
[Figure: nonlinear residual plot, a violation.]

multiple regression
Multiple regression models are similar to simple linear regression models except they include more than one X variable.

\( \hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n \)
Where,
\( b_1, \dots, b_n \) = slopes
\( X_1, \dots, X_n \) = independent variables

Wilson Realty wants to develop a model to determine the suggested listing price for a house based on size and age.

Price    Sq. Feet   Age   Condition
35000    1926       30    Good
47000    2069       40    Excellent
49900    1720       30    Excellent
55000    1396       15    Good
58900    1706       32    Mint
60000    1847       38    Mint
67000    1950       27    Mint
70000    2323       30    Excellent
78500    2285       26    Mint
79000    3752       35    Good
87500    2300       18    Good
93000    2525       17    Good
95000    3800       40    Excellent
97000    1740       12    Mint

multiple regression

Wilson Realty has found a linear relationship between price and size and age. The coefficient for size indicates each additional square foot increases the value by $21.91, while each additional year in age decreases the value by $1449.34.

\( \hat{Y} = 60815.45 + 21.91(\text{size}) - 1449.34(\text{age}) \)

Notes on the Excel output:
67% of the variation in sales price is explained by size and age.
\( H_0 \): No linear relationship is rejected.
\( H_0: \beta_1 = 0 \) is rejected.
\( H_0: \beta_2 = 0 \) is rejected.

For a 1900 square foot house that is 10 years old, the following prediction can be made:
\( \$87,951 = 60815.45 + 21.91(1900) - 1449.34(10) \)
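The prediction is a direct substitution into the fitted equation. A minimal sketch, with the hypothetical helper name `predict_price` and the coefficients quoted on the slide:

```python
def predict_price(size_sqft, age_years):
    """Suggested listing price from the Wilson Realty size-and-age model."""
    return 60815.45 + 21.91 * size_sqft - 1449.34 * age_years


price = predict_price(1900, 10)
print(round(price))  # prints "87951"
```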

dummy variables
Binary (or dummy) variables are special variables that are created for qualitative data.

- A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
- The number of dummy variables must equal one less than the number of categories of the qualitative variable.

Return to Wilson Realty, and let’s evaluate how to use property condition in the regression model. There are three categories: Mint, Excellent, and Good.

\( X_3 \) = 1 if the house is in excellent condition
      = 0 otherwise
\( X_4 \) = 1 if the house is in mint condition
      = 0 otherwise

Note: If both \( X_3 \) and \( X_4 \) = 0 then the house is in good condition.

dummy variables
As more variables are added to the model, the \( r^2 \) usually increases.

\( \hat{Y} = 48329.23 + 28.21(\text{size}) - 1981.41(\text{age}) + 23684.62(\text{if mint}) + 16581.32(\text{if excellent}) \)
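The coding of the three conditions into two dummy variables, and its use in the fitted equation, can be sketched as follows. The function names are hypothetical; the coefficients are those on the slide, and the 1900 square foot, 10-year-old house is reused from the earlier example.

```python
def condition_dummies(condition):
    """Return (x3, x4): x3 = 1 for Excellent, x4 = 1 for Mint, both 0 for Good."""
    x3 = 1 if condition == "Excellent" else 0
    x4 = 1 if condition == "Mint" else 0
    return x3, x4


def predict_price(size_sqft, age_years, condition):
    """Listing price from the Wilson Realty model with condition dummies."""
    x3, x4 = condition_dummies(condition)
    return (48329.23 + 28.21 * size_sqft - 1981.41 * age_years
            + 16581.32 * x3 + 23684.62 * x4)


for condition in ("Good", "Excellent", "Mint"):
    print(condition, round(predict_price(1900, 10, condition), 2))
```

Good condition is the baseline category, so the two dummy coefficients read as the price premium of an excellent or mint house over an otherwise identical good one.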

adjusted r-Square
The best model is a statistically significant model with a high \( r^2 \) and few variables.

- As more variables are added to the model, the \( r^2 \) usually increases.
- The adjusted \( r^2 \) takes into account the number of independent variables in the model.

Note: When variables are added to the model, the value of \( r^2 \) can never decrease; however, the adjusted \( r^2 \) may decrease.
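The slides do not state the adjustment formula. The sketch below uses the standard definition, adjusted r² = 1 - (1 - r²)(n - 1)/(n - k - 1), applied to the Triple A example (n = 6 observations, k = 1 independent variable).

```python
def adjusted_r2(r2, n, k):
    """Adjusted r-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


r2 = 15.625 / 22.5            # Triple A coefficient of determination
adj = adjusted_r2(r2, n=6, k=1)
print(round(r2, 4), round(adj, 4))
```

Because the penalty factor (n - 1)/(n - k - 1) grows with k, adding a variable that barely raises r² can lower the adjusted value.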

multicollinearity
Collinearity or multicollinearity exists when an independent variable is correlated with another independent variable. Duplication of information occurs.

- Collinearity and multicollinearity create problems in the coefficients.
- The overall model prediction is still good; however, individual interpretation of the variables is questionable.

When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not.

A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.
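One simple way to look for this kind of duplication is to compute the correlation coefficient between two independent variables. The sketch below does so for size and age in the Wilson Realty data; the helper `correlation` is illustrative, and the slides give no cut-off for how much correlation is too much.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Independent variables from the Wilson Realty table.
size = [1926, 2069, 1720, 1396, 1706, 1847, 1950,
        2323, 2285, 3752, 2300, 2525, 3800, 1740]
age = [30, 40, 30, 15, 32, 38, 27, 30, 26, 35, 18, 17, 40, 12]

r_size_age = correlation(size, age)
print(round(r_size_age, 4))
```

A value close to +1 or -1 would signal that the two variables carry largely the same information.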

non-linear regression
Engineers at Colonel Motors want to use regression analysis to improve fuel efficiency. They are
studying the impact of weight on miles per gallon (MPG).

Linear regression model:

\( \widehat{MPG} = 47.8 - 8.2(\text{weight}) \)

F significance = .0003
\( r^2 \) = .7446

Nonlinear (transformed variable) regression model:

\( \widehat{MPG} = 79.8 - 30.2(\text{weight}) + 3.4(\text{weight})^2 \)

F significance = .0002
\( r^2 \) = .8478

We should not try to interpret the coefficients of the variables due to the correlation between (weight) and (weight squared). Let \( X_1 \) = weight and \( X_2 \) = weight squared. Normally we would interpret the coefficient for \( X_1 \) as the change in Y that results from a 1-unit change in \( X_1 \), while holding all other variables constant.
Obviously, holding one variable constant while changing the other is impossible in this example, since if \( X_1 \) changes, then \( X_2 \) must change also.
This is an example of a problem that exists when multicollinearity is present.
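The two fitted equations can be compared by evaluating both at the same weight. In this sketch the weight of 2.0 is a hypothetical input chosen for illustration; the slides do not state the units of weight.

```python
def mpg_linear(weight):
    """Linear model from the slide."""
    return 47.8 - 8.2 * weight


def mpg_quadratic(weight):
    """Transformed-variable model: weight squared enters as a second X variable."""
    return 79.8 - 30.2 * weight + 3.4 * weight ** 2


weight = 2.0  # hypothetical value, units as in the source data (not stated)
print(round(mpg_linear(weight), 1), round(mpg_quadratic(weight), 1))
```

Note that the second model is still fitted by ordinary linear regression. Only the input is transformed, which is why weight and weight squared cannot be varied independently.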

chapter assignments
on LMS
