University of Gondar
College of Medicine and Health Sciences
Department of Epidemiology and Biostatistics
Linear Regression
Lemma Derseh (BSc., MPH)
Scatter Plots and Correlation
 Before trying to fit any model, it is better to look at the scatter plot of the data
 A scatter plot (or scatter diagram) is used to show the relationship between two variables
 If a scatter plot shows some sort of linear relationship, we can use correlation analysis to measure the strength of the linear relationship between the two variables
o Correlation is concerned only with the strength and direction of the linear relationship
o The two variables are treated symmetrically; as a result, no causal effect is implied
Scatter Plot Examples
[Figure: four scatter plots of y against x, illustrating linear relationships and curvilinear relationships]
Scatter Plot Examples
[Figure: four scatter plots of y against x, illustrating strong relationships and weak relationships]
Scatter Plot Examples
[Figure: two scatter plots of y against x, illustrating no relationship at all]
Correlation Coefficient
 The population correlation coefficient ρ (rho) measures the strength of the association between the variables
 The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations
Features of ρ and r
 Unit free
 Range between -1 and 1
 The closer to -1, the stronger the negative linear relationship
 The closer to 1, the stronger the positive linear relationship
 The closer to 0, the weaker the linear relationship
Examples of Approximate r Values
[Figure: five scatter plots of y against x, illustrating r = -1, r = -0.6, r = 0, r = +0.3, and r = +1]
Calculating the Correlation Coefficient
Sample correlation coefficient:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$$

or the algebraic equivalent:

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

where:
r = sample correlation coefficient
n = sample size
x = value of the ‘independent’ variable
y = value of the ‘dependent’ variable
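As a quick check of the algebraic form, here is a minimal Python sketch (standard library only) that computes r from the raw sums, using the child height/weight data of the example that follows:

```python
# Minimal sketch: sample correlation coefficient from raw sums, using the
# child data of the next slide (x = height in cm, y = weight in kg).
import math

x = [35, 49, 27, 33, 60, 21, 45, 51]
y = [8, 9, 7, 6, 13, 7, 11, 12]

n = len(x)
sum_x, sum_y = sum(x), sum(y)              # 321, 73
sum_xy = sum(a * b for a, b in zip(x, y))  # 3142
sum_x2 = sum(a * a for a in x)             # 14111
sum_y2 = sum(b * b for b in y)             # 713

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(round(r, 3))                         # 0.886
```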
Example
Child height (cm), x   Child weight (kg), y   xy    x²     y²
35                     8                      280   1225   64
49                     9                      441   2401   81
27                     7                      189   729    49
33                     6                      198   1089   36
60                     13                     780   3600   169
21                     7                      147   441    49
45                     11                     495   2025   121
51                     12                     612   2601   144
Σx = 321               Σy = 73                Σxy = 3142   Σx² = 14111   Σy² = 713
Calculation Example
[Figure: scatter plot of the child data, weight on the horizontal axis (0 to 14 kg) and height on the vertical axis (0 to 70 cm)]

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{8(3142) - (321)(73)}{\sqrt{\left[8(14111) - (321)^2\right]\left[8(713) - (73)^2\right]}} = 0.886$$

r = 0.886 → relatively strong positive linear association between x and y
SPSS Correlation Output
Analyze / Correlate / Bivariate / Pearson / OK
Correlation between child height and weight

Correlations
                                     Child weight   Child height
Child weight   Pearson Correlation   1              0.886
               Sig. (2-tailed)                      0.003
               N                     8              8
Child height   Pearson Correlation   0.886          1
               Sig. (2-tailed)       0.003
               N                     8              8
Significance Test for Correlation
 Hypotheses
H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)
 Test statistic (with n – 2 degrees of freedom)

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

Here the degrees of freedom are taken to be n − 2 because any two points can always be joined exactly by a straight line, so two observations alone carry no information about the scatter around the line.
Example:
Is there evidence of a linear relationship between child height and weight at the 0.05 level of significance?
H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)
α = 0.05, df = 8 - 2 = 6

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.886}{\sqrt{\dfrac{1 - 0.886^2}{8 - 2}}} = 4.68$$
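A minimal sketch of the same test in Python, assuming SciPy is available; the two-tailed p-value it produces matches the Sig. (2-tailed) of 0.003 in the SPSS output above.

```python
# Minimal sketch: t-test for H0: rho = 0, assuming SciPy is installed.
from scipy import stats

r, n = 0.886, 8
t = r / ((1 - r ** 2) / (n - 2)) ** 0.5   # 4.68, with df = n - 2 = 6
p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-tailed p-value ≈ 0.003
print(round(t, 2), round(p, 3))
```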
Introduction to Regression Analysis
 Regression analysis is used to:
Predict the value of a dependent variable based on the value of at least one independent variable
Explain the impact of changes in an independent variable on the dependent variable
 Dependent variable: the variable we wish to explain. In linear regression it is always a continuous variable
 Independent variable: the variable used to explain the dependent variable. In linear regression it can have any measurement scale.
Simple Linear Regression Model
 Only one independent variable, x
 Relationship between x and y is described by a linear function
 Changes in y are assumed to be caused by changes in x

Population Linear Regression
The population regression model:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where:
y = dependent variable
x = independent variable
β0 = population y intercept
β1 = population slope coefficient
ε = random error term (residual)
β0 + β1x is the linear component; ε is the random error component.
Linear Regression Assumptions
 The relationship between the two variables, x and y, is Linear
 Independent observations
 Error values are Normally distributed for any given value of x
 The probability distribution of the errors has Equal variance
 Fixed independent variables (not random = non-stochastic = given values = deterministic); the only randomness in the values of Y comes from the error term ε
 No autocorrelation of the errors (has some similarities with the 2nd)
 No outlier distortion
Assumptions viewed pictorially
LINE (Linear, Independent, Normal and Equal variance) assumptions

$$\mu_{y|x} = \alpha + \beta x, \qquad y \sim N\!\left(\mu_{y|x},\ \sigma^2_{y|x}\right)$$

[Figure: identical normal distributions of errors for each value of X, all centered on the regression line]
Population Linear Regression
[Figure: scatter plot of y against x for the model y = β0 + β1x + ε, marking the intercept β0, the slope β1, and, for a given xi, the observed value of y, the predicted value of y, and the random error εi for that x value]
Estimated Regression Model
The sample regression line provides an estimate of the population regression line:

$$\hat{y}_i = b_0 + b_1 x$$

where:
ŷ = estimated (or predicted) y value
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
x = independent variable
The individual random error terms ei have a mean of zero
Least Squares Criterion
 b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \big(y - (b_0 + b_1 x)\big)^2$$
The Least Squares Equation
 After some application of calculus (taking the derivatives and equating them to zero), we find the following:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}$$

and

$$b_0 = \bar{y} - b_1 \bar{x}$$
Interpretation of the Slope and the Intercept
 b0 is the estimated average value of y when the value of x is zero (provided that x = 0 is inside the data range considered).
 Otherwise, it shows the portion of the variability of the dependent variable left unexplained by the independent variables considered
 b1 is the estimated change in the average value of y as a result of a one-unit change in x
Example: Simple Linear Regression
 A researcher wishes to examine the relationship between the average daily diet taken by a cohort of 20 sample children and the weight they gained in one month (both measured in kg). The content of the food is the same for all of them.
 Dependent variable (y) = weight gained in one month, measured in kilograms
 Independent variable (x) = average weight of diet taken per day by a child, measured in kilograms
Sample Data for the Child Weight Model
Weight gained (y)   Diet (x)   |   Weight gained (y)   Diet (x)
0.4                 0.65       |   0.86                1.1
0.46                0.66       |   0.89                1.12
0.55                0.63       |   0.91                1.20
0.56                0.73       |   0.93                1.32
0.65                0.78       |   0.96                1.33
0.67                0.76       |   0.98                1.35
0.78                0.72       |   1.02                1.42
0.79                0.84       |   1.04                1.1
0.80                0.87       |   1.08                1.5
0.83                0.97       |   1.11                1.3
Estimation using the computational formula
From the data we have:
n = 20, Σx = 20.35, Σy = 16.27, Σxy = 17.58, Σx² = 22.30

$$b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{17.58 - (20.35)(16.27)/20}{22.30 - 414.12/20} = 0.643$$

$$b_0 = \bar{y} - b_1 \bar{x} = 0.8135 - 0.643 \times 1.0175 = 0.160$$
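The same estimates can be reproduced from the raw data; here is a minimal sketch assuming NumPy is available (variable names are mine):

```python
# Minimal sketch: least-squares estimates for the diet/weight-gain data,
# via the computational formula, checked against numpy.polyfit.
import numpy as np

x = np.array([0.65, 0.66, 0.63, 0.73, 0.78, 0.76, 0.72, 0.84, 0.87, 0.97,
              1.10, 1.12, 1.20, 1.32, 1.33, 1.35, 1.42, 1.10, 1.50, 1.30])  # diet (kg)
y = np.array([0.40, 0.46, 0.55, 0.56, 0.65, 0.67, 0.78, 0.79, 0.80, 0.83,
              0.86, 0.89, 0.91, 0.93, 0.96, 0.98, 1.02, 1.04, 1.08, 1.11])  # gain (kg)

n = len(x)
b1 = (np.sum(x * y) - x.sum() * y.sum() / n) / (np.sum(x ** 2) - x.sum() ** 2 / n)
b0 = y.mean() - b1 * x.mean()
print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")        # b1 = 0.643, b0 = 0.160

b1_check, b0_check = np.polyfit(x, y, deg=1)  # same estimates from NumPy's fitter
```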
Regression Using SPSS
Analyze / Regression / Linear…

Coefficients
Model        B       Std. Error   Beta   t       Sig.
(Constant)   0.160   .077                2.065   .054
foodweight   0.643   .073         .900   8.772   .000
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient)

Weight gained = 0.16 + 0.643(food weight)
Interpretation of the Intercept, b0
 Here, no child took 0 kilograms of food per day, so for foods within the range of sizes observed, 0.16 kg is the portion of the weight gained that is not explained by food.
 Whereas b1 = 0.643 tells us that the weight gained by a child increases by 0.643 kg, on average, for each additional kilogram of food taken per day
Weight gained = 0.16 + 0.643(food weight)
Explained and Unexplained Variation
Total variation is made up of two parts:

$$SST = SSR + SSE$$

where SST is the total sum of squares, SSR is the sum of squares due to regression, and SSE is the sum of squares due to error:

$$SST = \sum (y - \bar{y})^2 \qquad SSR = \sum (\hat{y} - \bar{y})^2 \qquad SSE = \sum (y - \hat{y})^2$$

where:
ȳ = average value of the dependent variable
y = observed values of the dependent variable
ŷ = estimated value of y for the given x value
Explained and Unexplained …
[Figure: for an observation (xi, yi), the deviations behind SST = Σ(yi − ȳ)², SSE = Σ(yi − ŷi)², and SSR = Σ(ŷi − ȳ)² are shown as vertical distances from the observed point to ȳ, from the observed point to the regression line, and from the regression line to ȳ]
Coefficient of Determination, R2
 The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
 The coefficient of determination is also called R-squared and is denoted as R2

$$R^2 = \frac{SSR}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}, \qquad 0 \le R^2 \le 1$$
Coefficient of Determination, R2
In the single independent variable case, the coefficient of determination is

$$R^2 = r^2$$

where:
R2 = coefficient of determination
r = simple correlation coefficient
Coefficient of Determination, R2 cont…
 The F-test tests the statistical significance of the regression of the dependent variable on the independent variable: H0: β = 0
 However, the reliability of the regression equation is very commonly measured by the correlation coefficient R.
 Equivalently, one can check the statistical significance of R or R2 using the F-test and reach exactly the same F-value as in the test of the model coefficients
SPSS output
Model summary
Model   R       R Square   Adjusted R Square
1       0.900   0.810      0.800

ANOVA
Model        Sum of Squares   df   Mean Square   F        Sig.
Regression   0.658            1    0.658         76.948   .000
Residual     0.154            18   0.009
Total        0.812            19

$$R^2 = \frac{SSR}{SST} = \frac{0.658}{0.812} = 0.81$$

81% of the variation in children’s weight increment is explained by variation in the weight of food they took
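A minimal sketch that reproduces this ANOVA decomposition from the fitted model; it reuses the x, y, b0, b1 variables defined in the earlier NumPy sketch:

```python
# Minimal sketch: SST = SSR + SSE and R² for the diet/weight-gain fit
# (x, y, b0, b1 as defined in the earlier sketch).
y_hat = b0 + b1 * x                      # fitted values
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares        ≈ 0.812
ssr = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares   ≈ 0.658
sse = np.sum((y - y_hat) ** 2)           # error sum of squares        ≈ 0.154
print(f"R2 = {ssr / sst:.2f}")           # 0.81
```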
SPSS output
Model summary
R       R Square   Adjusted R Square   Std. Error of the Estimate
0.900   0.810      0.800               0.09248

(The standard error of the estimate is the standard deviation of the errors: the root of the ‘mean square error’, SƐ)

Coefficients
Model        B       Std. Error   Beta    t       Sig.
(Constant)   0.160   0.077                2.065   .054
foodweight   0.643   0.073        0.900   8.772   .000

ANOVA
Model        Sum of Squares   df   Mean Square   F        Sig.
Regression   0.658            1    0.658         76.948   0.000
Residual     0.154            18   0.009
Total        0.812            19
Inference about the Slope: t-Test
 t test for a population slope
Is there a linear relationship between x and y?
 Null and alternative hypotheses
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
 Test statistic

$$t = \frac{b_1 - \beta_1}{s_{b_1}}, \qquad d.f. = n - 2$$

Where:
b1 = sample regression slope (coefficient)
β1 = hypothesized slope, usually 0
sb1 = estimator of the standard error of the slope
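The slide does not spell out the standard error of the slope; under the usual formula s_b1 = √(MSE / Σ(x − x̄)²), a sketch continuing from the earlier NumPy variables reproduces the SPSS values:

```python
# Minimal sketch: standard error of the slope and its t statistic
# (x, b1, sse as defined in the earlier sketches).
mse = sse / (len(x) - 2)                            # mean square error ≈ 0.009
s_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))   # ≈ 0.073
t = (b1 - 0) / s_b1                                 # ≈ 8.772, df = 18
print(f"s_b1 = {s_b1:.3f}, t = {t:.3f}")
```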
Inference about the Slope: t-Test
Estimated regression equation:
Weight gained = 0.16 + 0.643(food)
The slope of this model is 0.643
Does the weight of food taken per day affect children’s weight? We have to test this statistically
Inference about the Slope: t-Test Example

$$t = \frac{b_1 - 0}{s_{b_1}}$$

Coefficients
Model         B       Std. Error   Beta    t       Sig.
(Constant)    0.160   0.077                2.065   .054
Food weight   0.643   0.073        0.900   8.772   0.000

The calculated t-statistic is 8.772, which is greater than the tabulated value t(0.025, 18) = 2.101
Decision: Reject H0
Conclusion: There is sufficient evidence that food weight taken per day affects children’s weight
Confidence Interval Estimation

Coefficients
Model         B       Std. Error   Beta    t       Sig.    95% CI for B: Lower   Upper
(Constant)    0.160   0.077                2.065   .054    -0.003                0.322
Food weight   0.643   0.073        0.900   8.772   0.000   0.489                 0.796

Confidence interval estimate of the slope:

$$b_1 \pm t_{\alpha/2}\, s_{b_1}, \qquad df = n - 2 = 18, \quad t_{(0.025,\,18)} = 2.101$$

The 95% confidence interval for the slope is (0.489, 0.796).
Note also that this 95% confidence interval does not include 0.
Look at the relationship between all of these figures.
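A minimal sketch of the interval, assuming SciPy for the t critical value; the small difference in the lower bound comes from using the rounded b1 and s_b1 (SPSS works with the unrounded values):

```python
# Minimal sketch: 95% confidence interval for the slope from the reported estimates.
from scipy import stats

b1, s_b1, df = 0.643, 0.073, 18
t_crit = stats.t.ppf(0.975, df)                  # ≈ 2.101
lower, upper = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(f"({lower:.3f}, {upper:.3f})")             # ≈ (0.490, 0.796)
```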
Multiple Linear Regression
Multiple Linear Regression (MLR) is a statistical method for estimating the relationship between a dependent variable and two or more independent (or predictor) variables.
Function: Ypred = a + b1X1 + b2X2 + … + bnXn
Multiple Linear Regression
Simply, MLR is a method for studying the relationship between a dependent variable and two or more independent variables.
Purposes:
Prediction
Explanation
Theory building
Variations
[Diagram: the total variation in Y split into the variation predictable from the combination of independent variables and the unpredictable variation]
Assumptions of the Linear Regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model (no omitted variables)*
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual variance)
7. No multicollinearity*
8. No autocorrelation of the errors
9. No outlier distortion
(Most of them, except the 4th and 7th, were already mentioned among the simple linear regression model assumptions; a quick residual check for the 5th and 6th is sketched below)
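As referenced in the note above, here is a minimal, hypothetical diagnostic sketch (not part of the original slides) for checking residual normality and equal variance, reusing the x, y, b0, b1 variables of the simple regression example:

```python
# Minimal sketch: residual diagnostics for assumptions 5 and 6
# (x, y, b0, b1 as defined in the earlier sketches; SciPy assumed).
from scipy import stats

resid = y - (b0 + b1 * x)                 # residuals of the fitted line
w, p = stats.shapiro(resid)               # Shapiro-Wilk test of normality
print(f"Shapiro-Wilk p = {p:.3f}")        # a large p gives no evidence against normality
# Equal variance: plot resid against the fitted values b0 + b1 * x and
# look for a fan shape (widening spread) as the fitted values grow.
```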
Multiple Coefficient of Determination, R2
o In multiple regression, the corresponding correlation coefficient is called the multiple correlation coefficient
 Since there is more than one independent variable, the multiple correlation coefficient R is the correlation between the observed y and predicted y values, whereas r (simple correlation) is the correlation between x and y
 Unlike the situation for simple correlation, 0 ≤ R ≤ 1, because it would be impossible to have a negative correlation between the observed and the least-squares predicted values
 The square of the multiple correlation coefficient is of course the corresponding coefficient of determination
Intercorrelation or Collinearity
 If the two independent variables are uncorrelated, we can uniquely partition the amount of variance in Y due to X1 and X2, and bias is avoided.
 Small intercorrelations between the independent variables will not greatly bias the b coefficients.
 However, large intercorrelations will bias the b coefficients, and for this reason other mathematical procedures are needed
Multiple regression
Example: regress the percentage of body fat on age and sex

%fat   age    sex
9.5    23.0   0.0
27.9   23.0   1.0
7.8    27.0   0.0
17.8   27.0   0.0
31.4   39.0   1.0
25.9   41.0   1.0
27.4   45.0   0.0
25.2   49.0   1.0
31.1   50.0   1.0
34.7   53.0   1.0
42.0   53.0   1.0
42.0   54.0   1.0
29.1   54.0   1.0
32.5   56.0   1.0
30.3   57.0   1.0
21.0   57.0   1.0
33.0   58.0   1.0
33.8   58.0   1.0
41.1   60.0   1.0
34.5   61.0   1.0

SPSS result on the next slide!
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .729a   .532       .506                6.5656                       .532              20.440     1     18    .000
2       .794b   .631       .587                5.9986                       .099              4.564      1     17    .047
a. Predictors: (Constant), sex; b. Predictors: (Constant), sex, age

ANOVA
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   881.128          1    881.128       20.440   .000a
   Residual     775.932          18   43.107
   Total        1657.060         19
2  Regression   1045.346         2    522.673       14.525   .000b
   Residual     611.714          17   35.983
   Total        1657.060         19
a. Predictors: (Constant), sex; b. Predictors: (Constant), sex, age; c. Dependent Variable: %age of body fat
Coefficients
Model            B        Std. Error   Beta   t       Sig.   95% CI Lower   95% CI Upper
1  (Constant)    15.625   3.283               4.760   .000   8.728          22.522
   sex           16.594   3.670        .729   4.521   .000   8.883          24.305
2  (Constant)    6.209    5.331               1.165   .260   -5.039         17.457
   sex           10.130   4.517        .445   2.243   .039   .600           19.659
   age           .309     .145         .424   2.136   .047   .004           .614
a. Dependent Variable: %age of body fat relative to body
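These coefficients can be reproduced with ordinary least squares; here is a minimal NumPy sketch (variable names mine) that also verifies the earlier point that the multiple R is the correlation between the observed and predicted y values:

```python
# Minimal sketch: refit %fat on sex and age with ordinary least squares.
import numpy as np

fat = np.array([9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1, 34.7,
                42.0, 42.0, 29.1, 32.5, 30.3, 21.0, 33.0, 33.8, 41.1, 34.5])
age = np.array([23, 23, 27, 27, 39, 41, 45, 49, 50, 53,
                53, 54, 54, 56, 57, 57, 58, 58, 60, 61], dtype=float)
sex = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 1,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(age), sex, age])   # intercept, sex, age
coef, *_ = np.linalg.lstsq(X, fat, rcond=None)
print(coef.round(3))                  # ≈ [ 6.209 10.130  0.309], as in Model 2

R = np.corrcoef(fat, X @ coef)[0, 1]  # multiple R = corr(observed, predicted)
print(round(R, 3))                    # ≈ 0.794, matching the Model Summary
```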