Linear Regression and Correlation
Scatter Diagrams
A scatter plot, also referred to as a
scatter diagram, is a graph that may be
used to represent the relationship
between two variables.
Dependent and Independent
Variables
A dependent variable is the variable to be
predicted or explained in a regression
model. This variable is assumed to be
functionally related to the independent
variable.
An independent variable is the variable
related to the dependent variable in a
regression equation. The independent
variable is used in a regression model to
estimate the value of the dependent
variable.
Two Variable Relationships
[Figure: five scatter plots of Y against X. Panels: (a) Linear, (b) Linear, (c) Curvilinear, (d) Curvilinear, (e) No Relationship]
Correlation
The correlation coefficient is a quantitative
measure of the strength of the linear
relationship between two variables. The
correlation ranges from -1 to +1. A
correlation of ±1 indicates a perfect linear
relationship, whereas a correlation of 0
indicates no linear relationship.
SAMPLE CORRELATION COEFFICIENT
$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} $$
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
SAMPLE CORRELATION COEFFICIENT
or the algebraic equivalent:
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
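A short derivation, given here as a sketch, shows why the two forms of r agree. It assumes only the usual definitions of the sample means, x̄ = Σx/n and ȳ = Σy/n, which the slides use but do not write out.

```latex
\begin{aligned}
\sum (x-\bar{x})(y-\bar{y})
  &= \sum xy - \bar{x}\sum y - \bar{y}\sum x + n\,\bar{x}\,\bar{y}
   = \sum xy - \frac{\sum x \sum y}{n} \\
\sum (x-\bar{x})^2 &= \sum x^2 - \frac{\left(\sum x\right)^2}{n},
\qquad
\sum (y-\bar{y})^2 = \sum y^2 - \frac{\left(\sum y\right)^2}{n} \\
r &= \frac{\sum xy - \frac{\sum x \sum y}{n}}
          {\sqrt{\left[\sum x^2 - \frac{(\sum x)^2}{n}\right]
                 \left[\sum y^2 - \frac{(\sum y)^2}{n}\right]}}
   = \frac{n\sum xy - \sum x \sum y}
          {\sqrt{\left[n\sum x^2 - (\sum x)^2\right]
                 \left[n\sum y^2 - (\sum y)^2\right]}}
\end{aligned}
```

The last step multiplies numerator and denominator by n; inside the square root this contributes one factor of n to each bracket.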
Grade   mathscore
    y       x       yx       y²       x²
   65      39    2,535    4,225    1,521
   78      43    3,354    6,084    1,849
   52      21    1,092    2,704      441
   82      64    5,248    6,724    4,096
   92      57    5,244    8,464    3,249
   89      47    4,183    7,921    2,209
   73      28    2,044    5,329      784
   98      75    7,350    9,604    5,625
   56      34    1,904    3,136    1,156
   75      52    3,900    5,625    2,704
Σ 760     460   36,854   59,816   23,634
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
$$ r = \frac{10(36{,}854) - 460(760)}{\sqrt{\left[10(23{,}634) - (460)^2\right]\left[10(59{,}816) - (760)^2\right]}} = 0.84 $$
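The calculation above can be checked with a few lines of Python. This is a minimal sketch that applies the algebraic form of r to the ten (math score, grade) pairs from the table; the function name `sample_correlation` and the variable names are illustrative choices, not part of the original slides.

```python
import math

# Data from the table: x = math score, y = final grade (10 students).
mathscore = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grade = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]


def sample_correlation(x, y):
    """Sample correlation coefficient r, using the algebraic (sums) form."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = sample_correlation(mathscore, grade)
print(round(r, 4))  # 0.8398, which rounds to the 0.84 shown above
```

Working from the column sums mirrors the hand calculation; the deviation form of the formula would give the same value.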
Excel Correlation Output
Correlation between Math Score and Grade
              Grade          Math Score
Grade         1
Math Score    0.839785887    1
TEST STATISTIC FOR CORRELATION
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}, \qquad df = n - 2 $$
where:
t = Number of standard deviations r is from 0
r = Simple correlation coefficient
n = Sample size
Correlation Significance Test
$$ H_0: \rho = 0 \quad (\text{no correlation}) $$
$$ H_A: \rho \neq 0 $$
$$ \alpha = 0.05 $$
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.8398}{\sqrt{\dfrac{1 - 0.7052}{8}}} = 4.37 $$
[Figure: t distribution centered at 0, with rejection regions of area α/2 = 0.025 in each tail, below -t.025 = -2.306 and above t.025 = 2.306]
Since t = 4.37 > 2.306, reject H0; there is a significant
linear relationship.
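The test can be sketched in Python as follows. The values r = 0.8398, n = 10 and the critical value 2.306 are taken from the slides; the function name `correlation_t_statistic` is a hypothetical helper introduced for illustration. The critical value is hard-coded rather than looked up, so a different sample size or significance level would need a different table value.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r = 0.8398          # sample correlation from the worked example
n = 10              # sample size
t_critical = 2.306  # two-tailed critical value from the slides (df = 8, alpha = 0.05)

t = correlation_t_statistic(r, n)

# Two-tailed test: reject H0 when |t| exceeds the critical value.
reject_h0 = abs(t) > t_critical

# t comes out near 4.375; the slides report 4.37 after rounding r^2 to 0.7052.
print(t, reject_h0)
```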
Simple Linear Regression
Analysis
Simple linear regression analysis
analyzes the linear relationship that
exists between a dependent variable
and a single independent variable.
SIMPLE LINEAR REGRESSION MODEL
(POPULATION MODEL)
$$ y = \beta_0 + \beta_1 x + \varepsilon $$
where:
y = Value of the dependent variable
x = Value of the independent variable
β0 = Population's y-intercept
β1 = Slope of the population regression line
ε = Error term, or residual
The simple linear regression model has four
assumptions:
• Individual values of the error terms, εi, are
statistically independent of one another.
• The distribution of all possible values of ε is
normal.
• The distributions of possible εi values have equal
variances for all values of x.
• The means of the dependent variable, y, for all specified
values of the independent variable can be
connected by a straight line called the population
regression model.
REGRESSION COEFFICIENTS
In the simple regression model, there
are two coefficients: the intercept and
the slope.
The interpretation of the regression slope
coefficient is that it gives the average change
in the dependent variable for a unit change in
the independent variable. The slope
coefficient may be positive or negative,
depending on the relationship between the
two variables.
The least squares criterion is used
for determining a regression line
that minimizes the sum of squared
residuals.
A residual is the difference between
the actual value of the dependent
variable and the value predicted by
the regression model:
$$ y - \hat{y} $$
ESTIMATED REGRESSION MODEL
(SAMPLE MODEL)
$$ \hat{y}_i = b_0 + b_1 x $$
where:
ŷ = Estimated, or predicted, y value
b0 = Unbiased estimate of the regression intercept
b1 = Unbiased estimate of the regression slope
x = Value of the independent variable
LEAST SQUARES EQUATIONS
$$ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$
algebraic equivalent:
$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} $$
and
$$ b_0 = \bar{y} - b_1 \bar{x} $$
$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{36{,}854 - \dfrac{460(760)}{10}}{23{,}634 - \dfrac{(460)^2}{10}} = 0.76556 $$
$$ b_0 = \bar{y} - b_1 \bar{x} = 76 - 0.76556(46) = 40.78 $$
The least squares regression line is:
$$ \hat{y} = 40.78 + 0.766(x) $$
Interpretation of Results: Example
For the fitted line ŷ = 40.78 + 0.766(x), the slope
of 0.766 means that for each increase of
one unit in X, we predict the average of Y to
increase by an estimated 0.766 units.
The equation estimates that for each increase of 1
point on the math achievement test, the expected
final calculus grade increases by
0.766 points.
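The least squares estimates and the slope interpretation can be reproduced with a short Python sketch. It applies the algebraic form of the least squares equations to the table data; `least_squares` and `predict` are hypothetical helper names, and the math scores 50 and 51 used at the end are arbitrary illustration values, not data from the slides.

```python
# Data from the table: x = math score, y = final grade (10 students).
mathscore = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grade = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]


def least_squares(x, y):
    """Return (b0, b1) for the line y_hat = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    # Slope: algebraic form of the least squares equation.
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: b0 = y_bar - b1 * x_bar.
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares(mathscore, grade)
print(round(b0, 2), round(b1, 5))  # 40.78 0.76556


def predict(x_value):
    """Predicted grade for a given math score, using the fitted line."""
    return b0 + b1 * x_value


# Raising the math score by one point changes the prediction by exactly b1.
difference = predict(51) - predict(50)
```

Here `difference` equals the slope, which is the "0.766 points per additional math point" reading of the coefficient.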
[Figure: Linear Regression. Scatter plot of grade (vertical axis, 60 to 90) against mathscor (horizontal axis, 20 to 70) with the fitted line grade = 40.78 + 0.77 * mathscor]
The coefficient of determination is the
portion of the total variation in the
dependent variable that is explained by its
relationship with the independent variable.
The coefficient of determination is also
called R-squared and is denoted as R².
COEFFICIENT OF DETERMINATION (R²)
$$ R^2 = \frac{SSR}{TSS} $$
COEFFICIENT OF DETERMINATION
SINGLE INDEPENDENT VARIABLE CASE
$$ R^2 = r^2 $$
where:
R² = Coefficient of determination
r = Simple correlation coefficient
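The two expressions for the coefficient of determination can be compared numerically. The sketch below assumes the standard definitions, which the slides do not spell out: TSS = Σ(y − ȳ)² (total sum of squares), SSR = Σ(ŷ − ȳ)² (sum of squares explained by the regression), and SSE = Σ(y − ŷ)² (sum of squared residuals). Variable names are illustrative.

```python
import math

# Data from the table: x = math score, y = final grade (10 students).
mathscore = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
grade = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

n = len(grade)
x_bar = sum(mathscore) / n
y_bar = sum(grade) / n

# Sums of squared deviations and cross-products.
s_xx = sum((x - x_bar) ** 2 for x in mathscore)
s_yy = sum((y - y_bar) ** 2 for y in grade)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(mathscore, grade))

# Least squares line and fitted values.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in mathscore]

tss = s_yy                                               # total variation in y
ssr = sum((y_hat - y_bar) ** 2 for y_hat in fitted)      # explained variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(grade, fitted))  # residual variation

r_squared = ssr / tss
r = s_xy / math.sqrt(s_xx * s_yy)

# With a single independent variable, R^2 and r^2 should coincide (about 0.705).
print(r_squared, r ** 2)
```

The decomposition TSS = SSR + SSE also holds for this fit, which is what lets R² be read as the explained share of the total variation.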
Coefficients of Determination (r²) and Correlation (r)
[Figure: four scatter plots of Y against X, each with the fitted line Ŷi = b0 + b1Xi. Panels: r² = 1, r = +1; r² = 1, r = -1; r² = .81, r = +0.9; r² = 0, r = 0]
Simple Regression Steps
1. Develop a scatter plot of y and x. You are
looking for a linear relationship between
the two variables.
2. Calculate the least squares regression line
for the sample data.
3. Calculate the correlation coefficient and the
simple coefficient of determination, R².
4. Conduct one of the significance tests.

8. Correlation and Linear Regression.pptx
