Simple Linear Regression
• Our objective is to study the relationship
between two variables X and Y.
• One way is by means of regression.
• Regression analysis is the process of
estimating a functional relationship between
X and Y. A regression equation is often used
to predict a value of Y for a given value of X.
• Another way to study the relationship between
two variables is correlation. It involves
measuring the direction and the strength of
the linear relationship.
First-Order Linear Model =
Simple Linear Regression Model
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
where
y = dependent variable
x = independent variable
$\beta_0$ = y-intercept
$\beta_1$ = slope of the line
$\varepsilon$ = error variable
Simple Linear Model
This model is
– Simple: only one X
– Linear in the parameters: no parameter appears as an exponent or is multiplied or divided by another parameter
– Linear in the predictor variable (X): X appears only to the first power.
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
Examples
• Multiple Linear Regression: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
• Polynomial Linear Regression: $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$
• Linear Regression: $\log_{10}(Y_i) = \beta_0 + \beta_1 X_i + \beta_2 \exp(X_i) + \varepsilon_i$
• Nonlinear Regression: $Y_i = \beta_0 / (1 + \beta_1 \exp(\beta_2 X_i)) + \varepsilon_i$
"Linear" or "nonlinear" refers to the parameters.
Deterministic Component of Model
[Figure: plot of the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ for x from 0 to 20; $\hat{\beta}_1$ (slope) = ∆y/∆x and $\hat{\beta}_0$ = y-intercept]
Mathematical vs Statistical Relation
[Figure: scatterplot with the fitted line $\hat{y} = -5.3562 + 3.3988x$ for x from 0 to 20]
Error
• The scatterplot shows that the points are not
on a line, and so, in addition to the
relationship, we also describe the error:
• The Y’s are the response (or dependent) variable, the x’s are the predictors (or independent variables), and the ε’s are the errors. We assume that the errors are normal, mutually independent, and have variance $\sigma^2$.
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$
Least Squares:
Minimize $\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
• The quantities $R_i = y_i - \hat{y}_i$, where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, are called the residuals. If we assume a normal error, they should look normal.
Error: $Y_i - E(Y_i)$, unknown. Residual: $R_i = y_i - \hat{y}_i$, estimated, i.e., known.
[Figure: minimizing the error]
• The Simple Linear Regression Model: $y = \beta_0 + \beta_1 x + \varepsilon$
• The Least Squares Regression Line: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where
$\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_x}$,  $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}$
$SS_x = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n}$
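These formulas map directly onto a short computation. A minimal Python sketch (the function name and plain-list inputs are illustrative, not from the slides):

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    b1 = ss_xy / ss_x                  # slope = SS_xy / SS_x
    b0 = sum_y / n - b1 * sum_x / n    # intercept = y-bar - b1 * x-bar
    return b0, b1
```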
What form does the error take?
• Each observation may be decomposed into two parts: $y = \hat{y} + (y - \hat{y})$
• The first part is used to determine the fit, and the second to estimate the error.
• We estimate the standard deviation of the error by:
$SSE = \sum (Y_i - \hat{Y}_i)^2 = S_{yy} - \dfrac{S_{xy}^2}{S_{xx}}$
Estimate of $\sigma^2$
• We estimate $\sigma^2$ by
$s^2 = MSE = \dfrac{SSE}{n-2}$
Example
• An educational economist wants to
establish the relationship between an
individual’s income and education. He takes
a random sample of 10 individuals and asks
for their income (in $1000s) and education
(in years). The results are shown below.
Find the least squares regression line.

Education: 11 12 11 15 8 10 11 12 17 11
Income: 25 33 22 41 18 28 32 24 53 26
Dependent and Independent
Variables
• The dependent variable is the one that we
want to forecast or analyze.
• The independent variable is hypothesized
to affect the dependent variable.
• In this example, we wish to analyze income, and we choose the individual’s education as the variable that most affects income. Hence, y is income and x is education.
First Step:
$\sum x_i = 118$, $\sum x_i^2 = 1450$, $\sum y_i = 302$, $\sum y_i^2 = 10072$, $\sum x_i y_i = 3779$
Sum of Squares:
$SS_{xy} = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n} = 3779 - \dfrac{(118)(302)}{10} = 215.4$
$SS_x = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = 1450 - \dfrac{(118)^2}{10} = 57.6$
Therefore,
$\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_x} = \dfrac{215.4}{57.6} = 3.74$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \dfrac{302}{10} - 3.74\left(\dfrac{118}{10}\right) = -13.93$
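These hand calculations can be reproduced in a few lines of Python using the data from the example (variable names are illustrative):

```python
# Education (years) and income ($1000s) for the 10 sampled individuals
education = [11, 12, 11, 15, 8, 10, 11, 12, 17, 11]
income = [25, 33, 22, 41, 18, 28, 32, 24, 53, 26]

n = len(education)
ss_xy = sum(x * y for x, y in zip(education, income)) - sum(education) * sum(income) / n
ss_x = sum(x ** 2 for x in education) - sum(education) ** 2 / n
b1 = ss_xy / ss_x                                   # 215.4 / 57.6 = 3.74
b0 = sum(income) / n - b1 * sum(education) / n      # 30.2 - 3.74(11.8) = -13.93
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```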
The Least Squares Regression Line
• The least squares regression line is $\hat{y} = -13.93 + 3.74x$
• Interpretation of coefficients:
* The sample slope $\hat{\beta}_1 = 3.74$ tells us that, on average, for each additional year of education an individual’s income rises by $3.74 thousand.
• The y-intercept is $\hat{\beta}_0 = -13.93$. This value is the expected (or average) income for an individual who has 0 years of education (which is meaningless here).
Example
• Car dealers across North America use the red book to determine a car’s selling price on the basis of important features. One of these is the car’s current odometer reading.
• To examine this issue, 100 three-year-old cars in mint condition were randomly selected. Their selling price and odometer reading were observed.
Portion of the data file
Odometer Price
37388 5318
44758 5061
45833 5008
30862 5795
….. …
34212 5283
33190 5259
39196 5356
36392 5133
Example (Minitab Output)
Regression Analysis
The regression equation is
Price = 6533 - 0.0312 Odometer
Predictor Coef StDev T P
Constant 6533.38 84.51 77.31 0.000(SIGNIFICANT)
Odometer -0.031158 0.002309 -13.49 0.000(SIGNIFICANT)
S = 151.6 R-Sq = 65.0% R-Sq(adj) = 64.7%
Analysis of Variance
Source DF SS MS F P
Regression 1 4183528 4183528 182.11 0.000
Error 98 2251362 22973
Total 99 6434890
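The output above is from Minitab. A comparable fit could be sketched in Python with statsmodels, assuming the 100 observations sit in a CSV file with columns named Odometer and Price (the file name is hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file holding the 100 (odometer, price) observations
cars = pd.read_csv("cars.csv")            # columns: Odometer, Price

fit = smf.ols("Price ~ Odometer", data=cars).fit()
print(fit.summary())                      # coefficients, t-tests, R-Sq, ANOVA table
```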
Example
• The least squares regression line is $\hat{y} = 6533.38 - 0.031158x$
[Figure: scatterplot of Price (5,000 to 6,000) against Odometer (20,000 to 50,000 miles) with the fitted line]
Interpretation of the coefficients
• $\hat{\beta}_1 = -0.031158$ means that for each additional mile on the odometer, the price decreases by an average of 3.1158 cents.
• $\hat{\beta}_0 = 6533.38$ means that when x = 0 (new car), the selling price is $6,533.38. But x = 0 is not in the range of x, so we cannot interpret the value of y at x = 0 for this problem.
• $R^2 = 65.0\%$ means that 65% of the variation in y can be explained by x. The higher the value of $R^2$, the better the model fits the data.
R² and R² adjusted
• R² measures the degree of linear association
between X and Y.
• So, an R² close to 0 does not necessarily indicate
that X and Y are unrelated (relation can be
nonlinear)
• Also, a high R² does not necessarily indicate that
the estimated regression line is a good fit.
• As more and more X’s are added to the model, R²
always increases. R²adj accounts for the number of
parameters in the model.
Scatter Plot
[Figure: "Odometer vs. Price" line fit plot; Price (4,500 to 6,000) against Odometer (19,000 to 49,000)]
Testing the slope
• Are X and Y linearly related?
$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$
• Test Statistic: $t = \dfrac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$, where $s_{\hat{\beta}_1} = \dfrac{s}{\sqrt{SS_x}}$
Testing the slope (continued)
• The Rejection Region: Reject $H_0$ if $t < -t_{\alpha/2,\,n-2}$ or $t > t_{\alpha/2,\,n-2}$.
• If we are testing that high x values lead to high y values, $H_A: \beta_1 > 0$. Then the rejection region is $t > t_{\alpha,\,n-2}$.
• If we are testing that high x values lead to low y values (or low x values lead to high y values), $H_A: \beta_1 < 0$. Then the rejection region is $t < -t_{\alpha,\,n-2}$.
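A sketch of the two-sided version of this test in Python, computing the statistic from the formulas above (scipy supplies the t distribution; the helper name is illustrative):

```python
import math
from scipy import stats

def slope_t_test(x, y):
    """Two-sided t-test of H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))            # standard error of estimate
    se_b1 = s / math.sqrt(ss_x)             # standard deviation of beta1-hat
    t = b1 / se_b1
    p_value = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p_value
```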
Assessing the model
Example:
• Excel output
• Minitab output
Coefficients Standard Error t Stat P-value
Intercept 6533.4 84.512322 77.307 1E-89
Odometer -0.031 0.0023089 -13.49 4E-24
Predictor Coef StDev T P
Constant 6533.38 84.51 77.31 0.000
Odometer -0.031158 0.002309 -13.49 0.000
Coefficient of Determination
$R^2 = 1 - \dfrac{SSE}{SS_y}$
For the data in the odometer example, we obtain:
$R^2 = 1 - \dfrac{SSE}{SS_y} = 1 - \dfrac{2{,}251{,}363}{6{,}434{,}890} = 1 - 0.3499 = 0.6501$
$R^2_{\text{adj}} = 1 - \left(\dfrac{n-1}{n-p}\right)\dfrac{SSE}{SS_y}$
where p is the number of parameters in the model (p = 2 for simple linear regression).
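A small sketch of both quantities, following the formulas above (p counts the parameters, so p = 2 for simple regression; the helper name is illustrative):

```python
def r_squared(y, y_hat, p=2):
    """R-squared and adjusted R-squared from observed y and fitted y_hat."""
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - sse / ss_y
    r2_adj = 1 - (n - 1) / (n - p) * sse / ss_y
    return r2, r2_adj
```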
Using the Regression Equation
• Suppose we would like to predict the selling price for a car with 40,000 miles on the odometer:
$\hat{y} = 6{,}533 - 0.0312x = 6{,}533 - 0.0312(40{,}000) = \$5{,}285$
Prediction and Confidence Intervals
• Prediction Interval of y for x = x_g (the interval for predicting a particular value of y at a given x):
$\hat{y} \pm t_{\alpha/2,\, n-2}\, s_e \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{SS_x}}$
• Confidence Interval for E(y | x = x_g) (the interval for estimating the expected value of y at a given x):
$\hat{y} \pm t_{\alpha/2,\, n-2}\, s_e \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{SS_x}}$
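Both intervals use the same summary quantities the slides work with (y-hat at x_g, s_e, n, x-bar, SS_x). A hedged Python sketch, checked against the odometer numbers worked by hand below:

```python
import math
from scipy import stats

def interval_estimates(y_hat, s_e, n, x_bar, ss_x, x_g, alpha=0.05):
    """Return (prediction interval, confidence interval) for y at x = x_g."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    core = 1 / n + (x_g - x_bar) ** 2 / ss_x
    half_pi = t_crit * s_e * math.sqrt(1 + core)   # prediction-interval half-width
    half_ci = t_crit * s_e * math.sqrt(core)       # confidence-interval half-width
    return (y_hat - half_pi, y_hat + half_pi), (y_hat - half_ci, y_hat + half_ci)

# Odometer example: y-hat = 5285, s = 151.6, n = 100, x-bar = 36,009, SS_x = 4,309,340,160
print(interval_estimates(5285, 151.6, 100, 36009, 4309340160, 40000))
```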
Solving by Hand (Prediction Interval)
• From previous calculations we have the following estimates:
$\hat{y} = 5285$, $s = 151.6$, $SS_x = 4{,}309{,}340{,}160$, $\bar{x} = 36{,}009$
• Thus a 95% prediction interval for x = 40,000 is:
$5{,}285 \pm 1.984(151.6)\sqrt{1 + \dfrac{1}{100} + \dfrac{(40{,}000 - 36{,}009)^2}{4{,}309{,}340{,}160}} = 5{,}285 \pm 303$
• The prediction is that the selling price of the car will fall between $4,982 and $5,588.
Solving by Hand (Confidence Interval)
• A 95% confidence interval for E(y | x = 40,000) is:
$5{,}285 \pm 1.984(151.6)\sqrt{\dfrac{1}{100} + \dfrac{(40{,}000 - 36{,}009)^2}{4{,}309{,}340{,}160}} = 5{,}285 \pm 35$
• The mean selling price of the car will fall between $5,250 and $5,320.
Prediction and Confidence Intervals’ Graph
[Figure: predicted Price (4,800 to 6,300) against Odometer (20,000 to 50,000), with the prediction-interval and confidence-interval bands]
Notes
• No matter how strong the statistical relation between X and Y is, no cause-and-effect pattern is necessarily implied by the regression model. Ex: Although a positive and significant relationship is observed between vocabulary (X) and writing speed (Y), this does not imply that an increase in X causes an increase in Y. Other variables, such as age, may affect both X and Y: older children have a larger vocabulary and faster writing speed.
Regression Diagnostics
Residual Analysis:
Non-normality
Heteroscedasticity (non-constant variance)
Non-independence of the errors
Outlier
Influential observations
Standardized Residuals
• The standardized residuals are calculated as
standardized residual = $\dfrac{r_i}{s_{r_i}}$, where $r_i = y_i - \hat{y}_i$.
• The standard deviation of the i-th residual is
$s_{r_i} = s\sqrt{1 - h_i}$, where $h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{SS_x}$
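A plain-Python sketch of these quantities (the helper name is illustrative; x and y are the observed data):

```python
import math

def standardized_residuals(x, y):
    """Residuals divided by s * sqrt(1 - h_i), where h_i is the leverage of point i."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    leverages = [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]
    return [r / (s * math.sqrt(1 - h)) for r, h in zip(residuals, leverages)]
```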
Non-normality:
• The errors should be normally distributed. To
check the normality of errors, we use histogram
of the residuals or normal probability plot of
residuals or tests such as Shapiro-Wilk test.
• Dealing with non-normality:
– Transformation on Y
– Other types of regression (e.g., Poisson or
Logistic …)
– Nonparametric methods (e.g., nonparametric regression, i.e., smoothing)
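For instance, a normality check on the residuals (histogram plus Shapiro-Wilk test) might be sketched as follows; the residuals are assumed to be already computed, e.g. with the helper above:

```python
from scipy import stats
import matplotlib.pyplot as plt

def check_normality(residuals, alpha=0.05):
    """Histogram and Shapiro-Wilk test of the residuals."""
    plt.hist(residuals, bins=10)
    plt.xlabel("Residual")
    plt.title("Histogram of residuals")
    plt.show()

    w_stat, p_value = stats.shapiro(residuals)
    print(f"Shapiro-Wilk W = {w_stat:.3f}, p-value = {p_value:.3f}")
    if p_value < alpha:
        print("Evidence against normality of the errors at this alpha.")
```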
Non-constant variance:
• The error variance $\sigma^2$ should be constant.
• To diagnose non-constant variance, one method is to plot the residuals against the predicted values of y (or against x). If the points are distributed evenly around the expected value of the errors, which is 0, the error variance is constant. Or use a formal test such as the Breusch-Pagan test.
Dealing with non-constant variance
• Transform Y
• Re-specify the model (e.g., missing important X’s?)
• Use Weighted Least Squares instead of Ordinary Least Squares:
$\min \sum_{i=1}^{n} \dfrac{\varepsilon_i^2}{\mathrm{Var}(\varepsilon_i)}$
Non-independence of error
variable:
• The values of the error should be independent. When the data are a time series, the errors are often correlated (i.e., autocorrelated or serially correlated). To detect autocorrelation we plot the residuals against the time periods. If there is no pattern, the errors are independent. Or use a more formal test such as Durbin-Watson.
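A quick sketch of both checks, assuming the residuals are stored in time order:

```python
import matplotlib.pyplot as plt

def check_autocorrelation(residuals):
    """Plot residuals in time order and return the Durbin-Watson statistic."""
    plt.plot(range(1, len(residuals) + 1), residuals, marker="o")
    plt.xlabel("Time period")
    plt.ylabel("Residual")
    plt.show()

    # Durbin-Watson: values near 2 suggest no first-order autocorrelation
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(r ** 2 for r in residuals)
    return num / den
```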
Outlier:
• An outlier is an observation that is unusually small or large. Two possibilities that produce an outlier are:
1. An error in recording the data → detect the error and correct it. Or the point should not have been included in the data (it belongs to another population) → discard the point from the sample.
2. The observation is unusually small or large although it belongs to the sample and there is no recording error → do NOT remove it.
Influential Observations
[Figure: scatter plot containing one influential observation, and the same scatter plot without the influential observation]
Influential Observations
• Detection:
Cook’s Distance, DFFITS, DFBETAS (Neter, J.,
Kutner, M.H., Nachtsheim, C.J., and Wasserman, W., (1996)
Applied Linear Statistical Models, 4th edition, Irwin, pp. 378-384)
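With a statsmodels fit (such as the one sketched earlier for the odometer data), these measures come from the influence object. A brief sketch, assuming `fit` is a fitted OLS results object; the 4/n cutoff is only a common rule of thumb, not from the slides:

```python
# Assumes `fit` is a fitted statsmodels OLS results object
influence = fit.get_influence()

cooks_d, _ = influence.cooks_distance    # Cook's Distance for each observation
dffits, _ = influence.dffits             # DFFITS
dfbetas = influence.dfbetas              # DFBETAS, one column per coefficient

# Flag observations with a relatively large Cook's Distance
suspects = [i for i, d in enumerate(cooks_d) if d > 4 / len(cooks_d)]
print("Potentially influential observations:", suspects)
```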
Multicollinearity
• A common issue in multiple regression is
multicollinearity. This exists when some or
all of the predictors in the model are highly
correlated. In such cases, the estimated
coefficient of any variable depends on
which other variables are in the model.
Also, standard errors of the coefficients
are very high…
Multicollinearity
• Look at the correlation coefficients among the X’s: if Cor > 0.8, suspect multicollinearity
• Look at the Variance Inflation Factors (VIF): VIF > 10 is usually a sign of multicollinearity
• If there is multicollinearity:
– Use transformation on X’s, e.g. centering, standardization.
Ex: Cor(X,X²)=0.991; after standardization Cor=0!
– Remove the X that causes multicollinearity
– Factor analysis
– Ridge regression
– …
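A sketch of a VIF check with statsmodels, assuming `X` is a pandas DataFrame holding the predictor columns (the function name is illustrative):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factor for each predictor column of X."""
    Xc = sm.add_constant(X)    # add the intercept column so VIFs refer to the usual model
    vifs = {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != "const"}
    return pd.Series(vifs, name="VIF")
```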
Exercise
• In baseball, the fans are always interested
in determining which factors lead to
successful teams. The table below lists the
team batting average and the team
winning percentage for the 14 league
teams at the end of a recent season.
Team Batting Average   Winning %
0.254 0.414
0.269 0.519
0.255 0.500
0.262 0.537
0.254 0.352
0.247 0.519
0.264 0.506
0.271 0.512
0.280 0.586
0.256 0.438
0.248 0.519
0.255 0.512
0.270 0.525
0.257 0.562
y = winning % and x = team batting average
a) LS Regression Line
$\sum x_i = 3.642$, $\sum x_i^2 = 0.949$, $\sum y_i = 7.001$, $\sum y_i^2 = 3.549$, $\sum x_i y_i = 1.824562$
$SS_{xy} = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n} = 1.824562 - \dfrac{(3.642)(7.001)}{14} = 0.0033$
$SS_x = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = 0.948622 - \dfrac{(3.642)^2}{14} = 0.00118$
• The least squares regression line:
$\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_x} = \dfrac{0.003302}{0.001182} \approx 2.794$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 0.5 - (2.794)(0.26) \approx -0.226$
$\hat{y} = -0.226 + 2.794x$
• The meaning of $\hat{\beta}_1 \approx 2.794$ is that for each additional 0.001 in team batting average, the winning percentage increases on average by about 0.0028 (roughly 0.28 percentage points).
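These values can be verified in a few lines of Python with the data from the exercise table:

```python
batting_avg = [0.254, 0.269, 0.255, 0.262, 0.254, 0.247, 0.264,
               0.271, 0.280, 0.256, 0.248, 0.255, 0.270, 0.257]
winning_pct = [0.414, 0.519, 0.500, 0.537, 0.352, 0.519, 0.506,
               0.512, 0.586, 0.438, 0.519, 0.512, 0.525, 0.562]

n = len(batting_avg)
ss_xy = sum(x * y for x, y in zip(batting_avg, winning_pct)) - sum(batting_avg) * sum(winning_pct) / n
ss_x = sum(x ** 2 for x in batting_avg) - sum(batting_avg) ** 2 / n
b1 = ss_xy / ss_x                                         # about 2.79
b0 = sum(winning_pct) / n - b1 * sum(batting_avg) / n     # about -0.23
print(f"winning proportion = {b0:.3f} + {b1:.3f} * batting average")
```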
b) Standard Error of Estimate
$SSE = S_{yy} - \dfrac{S_{xy}^2}{S_{xx}} = \left(\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}\right) - \dfrac{S_{xy}^2}{S_{xx}} = \left(3.548785 - \dfrac{(7.001)^2}{14}\right) - \dfrac{(0.003302)^2}{0.001182} = 0.03856$
So,
$s^2 = \dfrac{SSE}{n-2} = \dfrac{0.03856}{14-2} = 0.00321$ and $s = 0.0567$
• Since s = 0.0567 is relatively small, we conclude that the regression line fits the data quite well.
c) Do the data provide sufficient evidence at the 5% significance level to conclude that a higher team batting average leads to a higher winning percentage?
$H_0: \beta_1 = 0$
$H_A: \beta_1 > 0$
Test statistic: $t = \dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = 1.69$ (p-value = .058)
Conclusion: Do not reject $H_0$ at $\alpha = 0.05$. There is not sufficient evidence to conclude that a higher team batting average leads to a higher winning percentage.
d) Coefficient of Determination
$R^2 = \dfrac{SS_{xy}^2}{SS_x\, SS_y} = 1 - \dfrac{SSE}{SS_y} = 1 - \dfrac{0.03856}{0.04778} = 0.1925$
About 19.25% of the variation in the winning percentage can be explained by the batting average.
e) Predict with 90% confidence the winning percentage of a team whose batting average is 0.275.
$\hat{y} = -0.226 + 2.794(0.275) \approx 0.542$
90% PI for y:
$\hat{y} \pm t_{\alpha/2,\, n-2}\, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{SS_x}} = 0.542 \pm (1.782)(0.0567)\sqrt{1 + \dfrac{1}{14} + \dfrac{(0.275 - 0.2601)^2}{0.001182}} = 0.542 \pm 0.113 = (0.429, 0.655)$
• The prediction is that the winning percentage of the team will fall between 42.9% and 65.5%.