Correlation and Regression

Presented by:
Shubham Mehta
Regression and
Correlation

A researcher believes that there is a linear
relationship between the BMI (kg/m²) of pregnant
mothers and the birth weight (BW, in kg) of
their newborns.
The following data set provides information on
15 pregnant mothers who were contacted for
a study:
Example

BMI (kg/m²)  Birth weight (kg)
20 2.7
30 2.9
50 3.4
45 3.0
10 2.2
30 3.1
40 3.3
25 2.3
50 3.5
20 2.5
10 1.5
55 3.8
60 3.7
50 3.1
35 2.8

Scatter diagram is a graphical method to
display the relationship between two
variables.
Scatter diagram plots pairs of bivariate
observations (x, y) on the X-Y plane
Y is called the dependent variable
X is called an independent variable
Scatter Diagram

Scatter diagram of BMI and Birth weight

Scatter diagrams are important for initial
exploration of the relationship between two
quantitative variables
In the above example, we may wish to
summarize this relationship by a straight line
drawn through the scatter of points
Is there a linear relationship between
BMI and BW?

Although we could fit a line "by eye" e.g. using a
transparent ruler, this would be a subjective
approach and therefore unsatisfactory.
An objective, and therefore better, way of
determining the position of a straight line is to use
the method of least squares.
Using this method, we choose a line such that the
sum of squares of vertical distances of all points
from the line is minimized.
Simple Linear Regression

These vertical distances, i.e., the distance between
y values and their corresponding estimated values
on the line are called residuals
The line which fits the best is called the regression line
or, sometimes, the least-squares line
The line always passes through the point defined by
the mean of Y and the mean of X.
Least-squares or regression line
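
As a minimal sketch of the least-squares idea described above, the following Python computes the intercept and slope in closed form from the example table and checks two properties stated on the slides: the line passes through the point of means, and the residuals balance out. The function name `least_squares` is an illustrative choice, and the printed coefficients depend on the table as transcribed here, so they need not match the coefficients quoted elsewhere in the deck.

```python
# BMI (kg/m^2) and birth weight (kg), transcribed from the example table.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]


def least_squares(x, y):
    """Return (intercept, slope) minimising the sum of squared vertical distances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = ss_xy / ss_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = least_squares(bmi, bw)

# Residual = observed y minus the estimated value on the line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(bmi, bw)]

mean_bmi = sum(bmi) / len(bmi)
mean_bw = sum(bw) / len(bw)

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
print("line evaluated at mean BMI:", intercept + slope * mean_bmi)
print("mean birth weight:         ", mean_bw)
print("sum of residuals:          ", sum(residuals))
```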

The method of least-squares is available in
most of the statistical packages (and also on
some scientific calculators) and is usually
referred to as linear regression
Y is also known as the outcome variable
(dependent variable)
X is also called the predictor (independent
variable)
Linear Regression Model

Linear regression assumes that:
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same
(homogeneity of variances)
4. The observations are independent
Assumptions

Estimated Regression Line
$\hat{y} = \hat{\alpha} + \hat{\beta}x = 1.775351 + 0.0330817x$
$\hat{\alpha} = 1.775351$ is called the y-intercept
$\hat{\beta} = 0.0330817$ is called the slope

This equation allows you to estimate the BW of
other newborns when the mother's BMI is given.
E.g., for a mother who has BMI = 40, i.e. x = 40,
we predict BW to be
$\hat{y} = \hat{\alpha} + \hat{\beta}x = 1.775351 + 0.0330817(40) \approx 3.099$ kg
Application of Regression Line
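
The prediction step can be sketched in a few lines of Python. The constants are the intercept and slope quoted for the estimated regression line; the function name `predict_bw` is an illustrative choice. A prediction is most trustworthy inside the range of BMI values that were actually observed.

```python
ALPHA_HAT = 1.775351   # y-intercept quoted for the estimated regression line
BETA_HAT = 0.0330817   # slope quoted for the estimated regression line


def predict_bw(bmi):
    """Estimated birth weight (kg) for a mother with the given BMI (kg/m^2)."""
    return ALPHA_HAT + BETA_HAT * bmi


print(round(predict_bw(40), 3))
```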

R is a measure of strength of the linear
association between two variables, x and y.
Most statistical packages and some hand
calculators can calculate R
For the data in our example, R=0.94
R has some unique characteristics
Correlation Coefficient, R

Correlation
measures and describes the strength and
direction of a relationship
bivariate techniques require two variable scores
from the same individuals (dependent and
independent variables)
multivariate techniques apply when there are two or more independent
variables (e.g. the effect of advertising and prices on
sales)

cov(X,Y) > 0 X and Y are positively correlated
cov(X,Y) < 0 X and Y are inversely correlated
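cov(X,Y) = 0 X and Y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not imply independence)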
Interpreting Covariance
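
A small Python sketch of how the sign of the covariance behaves. The data are made-up illustrative values, not from the study. The third case is worth noting: y is completely determined by x (y = x²), yet the covariance is zero, because covariance only picks up the linear part of a relationship.

```python
def covariance(x, y):
    """Sample covariance with the usual n - 1 denominator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


x = [-2, -1, 0, 1, 2]              # hypothetical values, symmetric around zero
rising = [2 * v + 1 for v in x]    # y increases as x increases
falling = [-3 * v for v in x]      # y decreases as x increases
squared = [v * v for v in x]       # y depends on x, but not linearly

print("rising: ", covariance(x, rising))
print("falling:", covariance(x, falling))
print("squared:", covariance(x, squared))
```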

Correlation coefficient
Pearson's Correlation Coefficient is standardized
covariance (unitless):
\[ r = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}} \]

Measures the relative strength of the linear relationship
between two variables
Unit-less
Ranges between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker any linear relationship
Correlation

The Difference
In correlation, the two variables are treated as
equals. In regression, one variable is considered
the independent (= predictor) variable X and the other
the dependent (= outcome) variable Y.

Remember this:
Y = mX + B?
m: slope
B: y-intercept (the value of Y when X = 0)
A slope of 2 means that every 1-unit change in X yields a 2-unit change in
Y.
What is “Linear”?

If you know something about X, this
knowledge helps you predict something about
Y.
(Sound familiar?…sound like conditional
probabilities?)
Prediction

Regression equation…
Expected value of y at a given level of x:
$E(y_i \mid x_i) = \alpha + \beta x_i$
Predicted value for an individual:
$y_i = \alpha + \beta x_i + \text{random error}_i$
The part $\alpha + \beta x_i$ is fixed (exactly on the line); the random error follows a normal distribution
Random error is often denoted by $e_i$
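
To make the split between the fixed part and the random error concrete, here is a small simulation sketch. It reuses the intercept and slope quoted earlier in the deck; the error standard deviation `SIGMA` is a hypothetical value chosen only for illustration, and the function names are assumptions. Averaging many simulated individuals at the same x should come out close to the expected value on the line.

```python
import random

ALPHA = 1.775351   # intercept quoted earlier in the slides
BETA = 0.0330817   # slope quoted earlier in the slides
SIGMA = 0.3        # hypothetical error standard deviation (not from the slides)


def expected_bw(x):
    """E(y | x): the fixed part, exactly on the line."""
    return ALPHA + BETA * x


def simulate_bw(x, rng):
    """One individual's y: the fixed part plus a normal random error e_i."""
    return expected_bw(x) + rng.gauss(0.0, SIGMA)


rng = random.Random(0)  # seeded so the run is repeatable
draws = [simulate_bw(40, rng) for _ in range(20000)]
average = sum(draws) / len(draws)

print("expected value on the line:", expected_bw(40))
print("average of simulated y:    ", average)
```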

[Figure: Scatter Plots of Data with Various Correlation Coefficients. Six panels of Y against X: r = −1, r = −0.6, r = 0, r = +0.3, r = +1, and r = 0.]

[Figure: Linear Correlation. Panels of Y against X contrasting linear relationships with curvilinear relationships.]

[Figure: Linear Correlation. Panels of Y against X contrasting strong relationships with weak relationships.]

[Figure: Linear Correlation. Panels of Y against X showing no relationship.]

Correlation Coefficient “r”
A measure of the strength and direction of a linear
relationship between two variables
The range of r is from –1 to 1.
If r is close to –1 there is a strong negative correlation.
If r is close to 0 there is no linear correlation.
If r is close to 1 there is a strong positive correlation.

R takes values between -1 and +1
R=0 represents no linear relationship between the
two variables
R>0 implies a direct linear relationship
R<0 implies an inverse linear relationship
The closer R comes to either +1 or -1, the
stronger is the linear relationship
Correlation Coefficient, R

Though R measures how closely the two variables
approximate a straight line, it does not validly
measure the strength of a non-linear relationship
When the sample size, n, is small we also have to
be careful with the reliability of the correlation
Outliers could have a marked effect on R
Limitations of the correlation
coefficient
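
The outlier limitation can be illustrated with a short Python sketch. The data are hypothetical values invented for the demonstration: five points with a weak negative correlation, then the same points plus one extreme observation. The helper `pearson` uses the sums-of-squares form of the correlation coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation via sums of squares: SSxy / sqrt(SSx * SSy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / math.sqrt(ss_x * ss_y)


# Hypothetical small sample with a weak negative association.
x = [1, 2, 3, 4, 5]
y = [3, 1, 2, 3, 1]
r_without = pearson(x, y)

# The same sample with a single extreme point added.
r_with = pearson(x + [20], y + [20])

print("r without the outlier:", round(r_without, 3))
print("r with the outlier:   ", round(r_with, 3))
```

With only a handful of observations, one point is enough to flip the sign of r and make the association look strong.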

Introduction
Spearman's rank correlation coefficient, or Spearman's rho, is named after Charles
Spearman
Denoted by the Greek letter ρ (rho) or by rs; it is a non-parametric measure of statistical
dependence between two variables
Assesses how well the relationship between two variables can be described using a
monotonic function
A monotonic (or monotone) function in mathematics is one that preserves or reverses the
given order.
If there are no repeated data values, a perfect Spearman correlation of +1 or −1
occurs when each of the variables is a perfect monotone function of the other
Spearman Rho Correlation

 A correlation coefficient is a numerical measure or
index of the amount of association between two sets
of scores. It ranges in size from a maximum of +1.00
through 0.00 to -1.00
 The ‘+’ sign indicates a positive correlation (the
scores on one variable increase as the scores on the
other variable increase)
 The ‘-’ sign indicates a negative correlation (the
scores on one variable decrease as the scores on the
other variable increase)

Calculation
Often thought of as being the Pearson correlation coefficient
between the ranked variables
The n raw scores Xi, Yi are converted to ranks xi, yi, and the
differences di = xi − yi between the ranks of each
observation on the two variables are calculated
If there are no tied ranks, then ρ is given by this formula:
\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]

Interpretation
The sign of the Spearman correlation indicates the direction of
association between X (the independent variable) and Y (the
dependent variable)
If Y tends to increase when X increases, the Spearman correlation
coefficient is positive
If Y tends to decrease when X increases, the Spearman correlation
coefficient is negative
A Spearman correlation of zero indicates that there is no tendency for
Y to either increase or decrease when X increases

Interpretation cont…/
An alternative name for the Spearman rank correlation is the
"grade correlation”, in which the "rank" of an observation is replaced
by the "grade"
When X and Y are perfectly monotonically related, the
Spearman correlation coefficient becomes +1 (increasing) or −1 (decreasing)
A perfect monotone increasing relationship implies that for
any two pairs of data values Xi, Yi and Xj, Yj, that Xi − Xj
and Yi − Yj always have the same sign
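
A sketch contrasting Spearman and Pearson on a perfectly monotone but non-linear relationship, using made-up data (y = x³). Spearman is computed here as the Pearson correlation of the ranks, which assumes there are no tied values; the helper names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation via sums of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / math.sqrt(ss_x * ss_y)


def ranks(values):
    """Rank from 1 (smallest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]                 # hypothetical values
cubed = [v ** 3 for v in x]            # monotone increasing, not linear
negated = [-v for v in cubed]          # monotone decreasing

print("increasing: spearman =", spearman(x, cubed), " pearson =", round(pearson(x, cubed), 3))
print("decreasing: spearman =", spearman(x, negated))
```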

Example 1
Calculate the correlation between the IQ of
a person and the number of hours spent in
class per week
Find the value of the term d²i:
1. Sort the data by the first column (Xi). Create a new column xi and assign it
the ranked values 1, 2, 3, ..., n.
2. Sort the data by the second column (Yi). Create a fourth column yi and similarly
assign it the ranked values 1, 2, 3, ..., n.
3. Create a fifth column di to hold the differences between the two rank columns
(xi and yi).

IQ, Xi   Hours of class per week, Yi
106       7
 86       0
100      27
101      50
 99      28
103      29
 97      20
113      12
112       6
110      17

Example # 1 cont…/
4. Create one final column to hold the value
of column di squared.
IQ (Xi)  Hours of class per week (Yi)  rank xi  rank yi  di   d²i
 86       0                            1        1         0    0
 97      20                            2        6        -4   16
 99      28                            3        8        -5   25
100      27                            4        7        -3    9
101      50                            5       10        -5   25
103      29                            6        9        -3    9
106       7                            7        3         4   16
110      17                            8        5         3    9
112       6                            9        2         7   49
113      12                           10        4         6   36

Example # 1- Result
With d²i found, we can add them to find ∑ d²i = 194
The value of n is 10, so:
\[ \rho = 1 - \frac{6 \times 194}{10(10^2 - 1)} \]
ρ = −0.18
The low value shows that the correlation between IQ and hours
of class per week is very weak.
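
The ranking steps of Example 1 can be reproduced with a short Python sketch. It uses the IQ and hours data from the example and the no-ties formula for ρ; the helper name `ranks` is an illustrative choice and it assumes no repeated values, which holds for this data set.

```python
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]


def ranks(values):
    """Rank from 1 (smallest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


rank_x = ranks(iq)
rank_y = ranks(hours)

# Squared rank differences d_i^2 for each person.
d_squared = [(rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y)]

n = len(iq)
rho = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))

print("sum of d^2:", sum(d_squared))
print("rho:", round(rho, 2))
```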

Outliers.....
Outliers are dangerous
Here we have a spurious correlation of r = 0.68
without IBM, r = 0.48
without IBM & GE, r = 0.21

r is the correlation coefficient for the sample. The
correlation coefficient for the population is ρ (rho).
The sampling distribution for r is a t-distribution
with n – 2 d.f.
Standardized test statistic:
\[ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \]
For a two-tail test for significance:
H0: ρ = 0 (The correlation is not significant)
Ha: ρ ≠ 0 (The correlation is significant)
Hypothesis Test for Significance

Test of Significance
The correlation between the number of times absent and a
final grade is r = –0.975. There were seven pairs of data. Test
the significance of this correlation. Use α = 0.01.
1. Write the null and alternative hypothesis.
H0: ρ = 0 (The correlation is not significant)
Ha: ρ ≠ 0 (The correlation is significant)
2. State the level of significance.
α = 0.01
3. Identify the sampling distribution.
A t-distribution with 5 degrees of freedom
4. Find the critical value.
5. Find the rejection region.
6. Find the test statistic.
Critical values ±t0 = ±4.032 (d.f. = 5, p = 0.005 in each tail)
Rejection regions: t < –4.032 or t > 4.032

df \ p  0.40      0.25      0.10      0.05      0.025     0.01      0.005     0.0005
1       0.324920  1.000000  3.077684  6.313752  12.70620  31.82052  63.65674  636.6192
2       0.288675  0.816497  1.885618  2.919986  4.30265   6.96456   9.92484   31.5991
3       0.276671  0.764892  1.637744  2.353363  3.18245   4.54070   5.84091   12.9240
4       0.270722  0.740697  1.533206  2.131847  2.77645   3.74695   4.60409   8.6103
5       0.267181  0.726687  1.475884  2.015048  2.57058   3.36493   4.03214   6.8688

7. Make your decision.
t = –9.811 falls in the rejection region (t < –4.032). Reject the null
hypothesis.
8. Interpret your decision.
There is a significant negative correlation between the
number of times absent and final grades.
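
The arithmetic of this test can be checked with a short Python sketch. The test statistic uses t = r / sqrt((1 − r²)/(n − 2)); the critical value 4.032 is taken from the t-table in the example rather than computed, and the function name `t_statistic` is an illustrative choice.

```python
import math


def t_statistic(r, n):
    """Standardized test statistic for H0: rho = 0, with n - 2 d.f."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r = -0.975        # sample correlation between absences and final grade
n = 7             # number of data pairs
t_critical = 4.032  # from the t-table: d.f. = 5, alpha = 0.01, two-tailed

t = t_statistic(r, n)
reject_null = abs(t) > t_critical

print("t =", round(t, 3))
print("reject H0:", reject_null)
```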

The equation of a line may be written as y = mx + b
where m is the slope of the line and b is the y-intercept.
Regression indicates the degree to which the variation in one
variable, Y, is related to or can be explained by the variation in
another variable, X
Once you know there is a significant linear correlation, you can
write an equation describing the relationship between the x and y
variables. This equation is called the line of regression or least
squares line.
The line of regression is: $\hat{y} = mx + b$
The slope m is: $m = \dfrac{SS_{xy}}{SS_x} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
The y-intercept is: $b = \bar{y} - m\bar{x}$
The Line of Regression
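
For readers who want to see where the slope and intercept of the least-squares line come from, here is a brief derivation sketch in LaTeX. It assumes the usual setup of minimising the sum of squared residuals S(m, b) and uses the SS notation that appears in the correlation formulas of this deck.

```latex
% Sum of squared vertical distances from the line y = m x + b
S(m, b) = \sum_{i=1}^{n} \bigl( y_i - b - m x_i \bigr)^2

% Setting the partial derivative with respect to b to zero:
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} \bigl( y_i - b - m x_i \bigr) = 0
\quad\Longrightarrow\quad b = \bar{y} - m \bar{x}

% Setting the partial derivative with respect to m to zero and substituting b:
\frac{\partial S}{\partial m} = -2 \sum_{i=1}^{n} x_i \bigl( y_i - b - m x_i \bigr) = 0
\quad\Longrightarrow\quad
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{SS_{xy}}{SS_x}
```

The first condition is also the reason the fitted line always passes through the point of means.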

[Figure: Best fitting straight line. Revenue (y-axis, ticks 180 to 260) plotted against Ad $ (x-axis, ticks 1.5 to 3.0). Legend: a data point (xi, yi); a point on the line with the same x-value; a residual.]

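Calculating manually
\[ \hat{r} = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}} = \frac{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}}{\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \cdot \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}} \]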
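Simpler calculation formula…
\[ \hat{r} = \frac{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}}{\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \cdot \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_x\,SS_y}} \]
$SS_{xy}$ is the numerator of the covariance; $SS_x$ and $SS_y$ are the numerators of the variances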
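
A quick numerical check, sketched in Python, that the covariance form and the sums-of-squares form give the same r because the n − 1 factors cancel. It uses the BMI and birth-weight table from the example; the value printed comes from the table as transcribed here and need not match the R quoted earlier in the deck.

```python
import math

# BMI (kg/m^2) and birth weight (kg), transcribed from the example table.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]

n = len(bmi)
mean_x = sum(bmi) / n
mean_y = sum(bw) / n

# Numerators of the covariance and of the two variances.
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bmi, bw))
ss_x = sum((x - mean_x) ** 2 for x in bmi)
ss_y = sum((y - mean_y) ** 2 for y in bw)

# Form 1: covariance divided by the square root of the product of variances.
cov_xy = ss_xy / (n - 1)
var_x = ss_x / (n - 1)
var_y = ss_y / (n - 1)
r_from_cov = cov_xy / math.sqrt(var_x * var_y)

# Form 2: the simpler formula, with the n - 1 factors cancelled.
r_from_ss = ss_xy / math.sqrt(ss_x * ss_y)

print("r from covariance and variances:", r_from_cov)
print("r from sums of squares:         ", r_from_ss)
```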

Distribution of the correlation
coefficient:
\[ SE(\hat{r}) = \sqrt{\frac{1 - r^2}{n - 2}} \]
Under the null hypothesis ρ = 0, the statistic $\hat{r} / SE(\hat{r})$ follows a
t-distribution with n − 2 degrees of freedom
(since you have to estimate the standard
error).
*Note: like a proportion, the variance of the correlation coefficient depends
on the correlation coefficient itself, so substitute in the estimated r

R² is another important measure of linear
association between x and y (0 ≤ R² ≤ 1)
R² measures the proportion of the total variation in
y which is explained by x
For example, r² = 0.8751 indicates that 87.51% of
the variation in BW is explained by the
independent variable x (BMI).
Coefficient of Determination

The correlation coefficient of number of times absent and
final grade is r = –0.975. The coefficient of determination
is r² = (–0.975)² = 0.9506.
Interpretation: About 95% of the variation in final grades can
be explained by the number of times a student is absent. The
other 5% is unexplained and can be due to sampling error or
other variables such as intelligence, amount of time studied,
etc.
Strength of the Association
The coefficient of determination, r², measures the strength of
the association and is the ratio of explained variation in y to
the total variation in y.
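
The "ratio of explained variation to total variation" reading can be checked numerically. This Python sketch fits the least-squares line to the BMI and birth-weight table from the first example, computes 1 − SSres/SStot, and compares it with the square of r. Variable names are illustrative, and the value obtained from the table as transcribed here need not equal the r² quoted in the slides.

```python
import math

# BMI (kg/m^2) and birth weight (kg), transcribed from the example table.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]

n = len(bmi)
mean_x = sum(bmi) / n
mean_y = sum(bw) / n
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bmi, bw))
ss_x = sum((x - mean_x) ** 2 for x in bmi)
ss_total = sum((y - mean_y) ** 2 for y in bw)   # total variation in y

# Least-squares line.
slope = ss_xy / ss_x
intercept = mean_y - slope * mean_x

# Unexplained variation: squared residuals around the fitted line.
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(bmi, bw))

r_squared_from_fit = 1 - ss_residual / ss_total
r = ss_xy / math.sqrt(ss_x * ss_total)

print("explained share of variation:", r_squared_from_fit)
print("r squared:                   ", r ** 2)
```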

The standard error of Y given X (written Sy/x) is the average
variability around the regression line at any given
value of X. It is assumed to be equal at all values
of X.
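
One common way to estimate this quantity is the square root of the residual sum of squares divided by n − 2. The sketch below applies that estimator to the BMI and birth-weight table from the first example; the choice of estimator and the variable name `s_yx` are assumptions, since the slides do not give a formula.

```python
import math

# BMI (kg/m^2) and birth weight (kg), transcribed from the example table.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]

n = len(bmi)
mean_x = sum(bmi) / n
mean_y = sum(bw) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(bmi, bw))
         / sum((x - mean_x) ** 2 for x in bmi))
intercept = mean_y - slope * mean_x

# Residual sum of squares around the fitted line.
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(bmi, bw))

# Two parameters (intercept and slope) were estimated, hence n - 2.
s_yx = math.sqrt(ss_residual / (n - 2))

print("standard error of Y given X (kg):", round(s_yx, 3))
```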

Correlation Coefficient, R, measures the
strength of bivariate association
The regression line is a prediction
equation that estimates the values of y for
any given x
Difference between Correlation and
Regression
