Presented by:
Shubham Mehta
Regression and
Correlation
A researcher believes that there is a linear
relationship between the BMI (kg/m²) of pregnant
mothers and the birth weight (BW, in kg) of
their newborns.
The following data set provides information on
15 pregnant mothers who were contacted for
a study:
Example
BMI (kg/m²)  Birth weight (kg)
20 2.7
30 2.9
50 3.4
45 3.0
10 2.2
30 3.1
40 3.3
25 2.3
50 3.5
20 2.5
10 1.5
55 3.8
60 3.7
50 3.1
35 2.8
Scatter diagram is a graphical method to
display the relationship between two
variables.
Scatter diagram plots pairs of bivariate
observations (x, y) on the X-Y plane
Y is called the dependent variable
X is called an independent variable
Scatter Diagram
Scatter diagram of BMI and Birth weight
Scatter diagrams are important for initial
exploration of the relationship between two
quantitative variables
In the above example, we may wish to
summarize this relationship by a straight line
drawn through the scatter of points
Is there a linear relationship between
BMI and BW?
Although we could fit a line "by eye" e.g. using a
transparent ruler, this would be a subjective
approach and therefore unsatisfactory.
An objective, and therefore better, way of
determining the position of a straight line is to use
the method of least squares.
Using this method, we choose a line such that the
sum of squares of vertical distances of all points
from the line is minimized.
Simple Linear Regression
These vertical distances, i.e., the distance between
y values and their corresponding estimated values
on the line are called residuals
The line which fits the best is called the regression line
or, sometimes, the least-squares line
The line always passes through the point defined by
the mean of Y and the mean of X.
Least-squares or regression line
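As a minimal sketch of the least-squares method described above, the Python below fits a line to the 15 BMI and birth-weight pairs from the example. The function name `least_squares_line` is an illustrative assumption, and the data are taken as transcribed here, so the fitted coefficients may differ somewhat from values quoted elsewhere in these slides.

```python
# BMI (kg/m^2) and birth weight (kg) for the 15 mothers, as listed in the example.
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]


def least_squares_line(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = ss_xy / ss_x
    intercept = mean_y - slope * mean_x  # forces the line through (mean_x, mean_y)
    return intercept, slope


intercept, slope = least_squares_line(bmi, bw)
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
```

Because the intercept is computed as mean(y) minus slope times mean(x), the fitted line passes through the point of means, as stated above.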
The method of least-squares is available in
most of the statistical packages (and also on
some scientific calculators) and is usually
referred to as linear regression
Y is also known as an outcome variable
(dependent variable)
X is also called a predictor (independent
variable)
Linear Regression Model
Linear regression assumes that:
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same
(homogeneity of variances)
4. The observations are independent
Assumptions
Estimated Regression Line
$\hat{y} = \hat{\alpha} + \hat{\beta}x = 1.775351 + 0.0330187x$
$\hat{\alpha} = 1.775351$ is called the y-intercept
$\hat{\beta} = 0.0330187$ is called the slope
This equation allows you to estimate BW of
other newborns when the BMI is given.
e.g., for a mother who has BMI=40, i.e. X =
40 we predict BW to be
$\hat{y} = \hat{\alpha} + \hat{\beta}x = 1.775351 + 0.0330187(40) = 3.096$
Application of Regression Line
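The prediction step can be sketched in a few lines of Python. The constants are the intercept and slope used in the worked prediction above; the function name `predict_bw` is a hypothetical helper, not part of any library.

```python
ALPHA_HAT = 1.775351   # intercept quoted in the worked prediction
BETA_HAT = 0.0330187   # slope quoted in the worked prediction


def predict_bw(bmi):
    """Predicted birth weight (kg) for a mother with the given BMI (kg/m^2)."""
    return ALPHA_HAT + BETA_HAT * bmi


print(round(predict_bw(40), 3))
```

The example data cover BMI values from 10 to 60, so predictions outside that range would be extrapolations and should be treated with caution.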
R is a measure of strength of the linear
association between two variables, x and y.
Most statistical packages and some hand
calculators can calculate R
For the data in our example, R=0.94
R has some unique characteristics
Correlation Coefficient, R
Correlation
measures and describes the strength and
direction of the relationship
Bivariate techniques require two variable scores
from the same individuals (dependent and
independent variables)
Multivariate techniques are used when there are two or more independent
variables (e.g. the effect of advertising and prices on
sales)
cov(X,Y) > 0: X and Y are positively correlated
cov(X,Y) < 0: X and Y are inversely correlated
cov(X,Y) = 0: X and Y are uncorrelated (no linear association; this alone does not imply independence)
Interpreting Covariance
Correlation coefficient
Pearson's correlation coefficient is standardized
covariance (unitless):
$$r = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}}$$
Measures the relative strength of the linear relationship
between two variables
Unit-less
Ranges between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker any linear relationship
Correlation
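To illustrate why the correlation coefficient is described as unitless, the sketch below computes the covariance and Pearson's r for the BMI example twice, once with birth weight in kilograms and once in grams. The helper names `covariance` and `pearson_r` are assumptions made for this illustration.

```python
from math import sqrt

bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw_kg = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]
bw_g = [1000 * w for w in bw_kg]  # same weights expressed in grams


def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson_r(x, y):
    """Covariance standardized by the two standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


print("cov (kg):", covariance(bmi, bw_kg), " cov (g):", covariance(bmi, bw_g))
print("r   (kg):", pearson_r(bmi, bw_kg), " r   (g):", pearson_r(bmi, bw_g))
```

The covariance scales with the units of measurement, while r stays the same, which is what makes r comparable across data sets.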
The Difference
In correlation, the two variables are treated as
equals. In regression, one variable is considered
independent (=predictor) variable (X) and the other
the dependent (=outcome) variable Y.
Remember this:
Y = mX + B?
m: slope
B: intercept (the value of Y when X = 0)
A slope of 2 means that every 1-unit change in X yields a 2-unit change in
Y.
What is “Linear”?
If you know something about X, this
knowledge helps you predict something about
Y.
(Sound familiar?…sound like conditional
probabilities?)
Prediction
Regression equation…
Expected value of y at a given level of x:
$$E(y_i \mid x_i) = \alpha + \beta x_i$$
Predicted value for an individual:
$$y_i = \alpha + \beta x_i + \text{random error}_i$$
The term $\alpha + \beta x_i$ is fixed (exactly on the line); the random error follows a normal distribution
Random error is often denoted by $e_i$
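The model for an individual observation can be sketched as a small simulation: a fixed part on the line plus a normally distributed random error. The function name `simulate_bw` and the error standard deviation of 0.25 kg are hypothetical choices for illustration; the intercept and slope are the values quoted in the worked prediction earlier.

```python
import random

ALPHA = 1.775351   # intercept quoted in the slides
BETA = 0.0330187   # slope quoted in the slides


def simulate_bw(bmi_values, alpha, beta, sigma, seed=0):
    """Draw y_i = alpha + beta * x_i + e_i, with e_i ~ Normal(0, sigma)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in bmi_values]


# sigma = 0.25 kg is an assumed value, not taken from the slides.
for x, y in zip([20, 40, 60], simulate_bw([20, 40, 60], ALPHA, BETA, sigma=0.25)):
    print(f"BMI {x}: on the line {ALPHA + BETA * x:.3f}, simulated {y:.3f}")
```

Setting `sigma` to 0 removes the random error, so every simulated value falls exactly on the line.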
Scatter Plots of Data with Various
Correlation Coefficients
[Figure: scatter plots of Y against X with r = -1, r = -0.6, r = 0, r = +0.3 and r = +1, plus a further panel with r = 0]
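Linear Correlation
[Figure: scatter plots of Y against X contrasting linear relationships with curvilinear relationships]
Linear Correlation
[Figure: scatter plots of Y against X contrasting strong relationships with weak relationships]
Linear Correlation
[Figure: scatter plots of Y against X showing no relationship]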
Correlation Coefficient “r”
A measure of the strength and direction of a linear
relationship between two variables
The range of r is from –1 to 1.
If r is close to –1, there is a strong negative correlation.
If r is close to 0, there is no linear correlation.
If r is close to 1, there is a strong positive correlation.
R takes values between -1 and +1
R = 0 represents no linear relationship between the
two variables
R > 0 implies a direct linear relationship
R < 0 implies an inverse linear relationship
The closer R comes to either +1 or -1, the
stronger is the linear relationship
Correlation Coefficient, R
Though R measures how closely the two variables
approximate a straight line, it does not validly
measure the strength of a non-linear relationship
When the sample size, n, is small, we also have to
be careful about the reliability of the correlation
Outliers can have a marked effect on R
Limitations of the correlation
coefficient
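The first limitation can be checked with a small constructed example (not data from the slides): y is determined exactly by x through y = x², yet Pearson's r is zero because the relationship is not linear. The helper `pearson_r` is an assumed name.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / sqrt(ss_x * ss_y)


x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]  # perfect, but curvilinear, dependence on x

print(pearson_r(x, y))
```

A scatter diagram would reveal the curve immediately, which is one reason to plot the data before relying on r.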
Introduction
Spearman's rank correlation coefficient, or Spearman's rho, is named after Charles
Spearman
It is denoted by the Greek letter ρ (rho) or by $r_s$, and is a non-parametric measure of statistical
dependence between two variables
It assesses how well the relationship between two variables can be described using a
monotonic function
A monotonic (or monotone) function in mathematics is one that preserves (or reverses) the
given order.
If there are no repeated data values, a perfect Spearman correlation of +1 or −1
occurs when each of the variables is a perfect monotone function of the other
Spearman Rho Correlation
Spearman Rho Correlation
A correlation coefficient is a numerical measure or
index of the amount of association between two sets
of scores. It ranges from a maximum of +1.00
through 0.00 to a minimum of -1.00
The ‘+’ sign indicates a positive correlation (the
scores on one variable increase as the scores on the
other variable increase)
The ‘-’ sign indicates a negative correlation (as the
scores on one variable increase, the scores on the
other variable decrease)
Spearman Rho Correlation
Calculation
Often thought of as being the Pearson correlation coefficient
between the ranked variables
The n raw scores Xi, Yi are converted to ranks xi, yi, and the
differences di = xi − yi between the ranks of each
observation on the two variables are calculated
If there are no tied ranks, then ρ is given by this formula:
$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
Interpretation
The sign of the Spearman correlation indicates the direction of
association between X (the independent variable) and Y (the
dependent variable)
If Y tends to increase when X increases, the Spearman correlation
coefficient is positive
If Y tends to decrease when X increases, the Spearman correlation
coefficient is negative
A Spearman correlation of zero indicates that there is no tendency for
Y to either increase or decrease when X increases
Spearman Rho Correlation
Interpretation cont…/
An alternative name for the Spearman rank correlation is the
"grade correlation", in which the "rank" of an observation is replaced
by the "grade"
When X and Y are perfectly monotonically related, the
Spearman correlation coefficient becomes +1 (increasing) or −1 (decreasing)
A perfect monotone increasing relationship implies that for
any two pairs of data values Xi, Yi and Xj, Yj, the differences Xi − Xj
and Yi − Yj always have the same sign
Spearman Rho Correlation
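The view of Spearman's rho as the Pearson correlation of the ranks can be sketched as follows, using a constructed monotone but non-linear example (y = x³) rather than data from the slides. The helper names are assumptions, and `ranks` assumes there are no tied values.

```python
from math import sqrt


def ranks(values):
    """Rank from 1 (smallest) to n; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / sqrt(ss_x * ss_y)


def spearman_rho(x, y):
    """Spearman's rho computed as Pearson's r on the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]          # monotone increasing, not linear
y_decreasing = [-v for v in y]   # monotone decreasing

print("Pearson r:", pearson_r(x, y))
print("Spearman rho:", spearman_rho(x, y), spearman_rho(x, y_decreasing))
```

Pearson's r stays below 1 because the points do not lie on a straight line, while Spearman's rho reaches +1 or −1 because only the ordering matters.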
Spearman Rho Correlation
Example 1
Calculate the correlation between the IQ of
a person and the number of hours spent in
class per week
Find the value of the term d²i:
1. Sort the data by the first column
(Xi). Create a new column xi and assign it
the ranked values 1, 2, 3, ..., n.
2. Sort the data by the second column
(Yi). Create a fourth column yi and similarly
assign it the ranked values 1, 2, 3, ..., n.
3. Create a fifth column di to hold the
differences between the two rank columns
(xi and yi).
IQ (Xi)  Hours of class per week (Yi)
106 7
86 0
100 27
101 50
99 28
103 29
97 20
113 12
112 6
110 17
Spearman Rho Correlation
Example # 1 cont…/
4. Create one final column to hold the value
of column di squared.
IQ (Xi)  Hours of class per week (Yi)  rank xi  rank yi  di  d²i
86       0                             1        1        0   0
97       20                            2        6        -4  16
99       28                            3        8        -5  25
100      27                            4        7        -3  9
101      50                            5        10       -5  25
103      29                            6        9        -3  9
106      7                             7        3        4   16
110      17                            8        5        3   9
112      6                             9        2        7   49
113      12                            10       4        6   36
Example # 1- Result
With d²i found, we can add them to find ∑ d²i = 194
The value of n is 10, so:
$$\rho = 1 - \frac{6 \times 194}{10(10^2 - 1)} \approx -0.18$$
The low value shows that the correlation between IQ and hours
of class per week is very weak
Spearman Rho Correlation
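The four steps of Example 1 can be sketched in Python using the rank-difference formula. The variable names are illustrative assumptions, and the `ranks` helper assumes no tied values, which holds for this data set.

```python
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]


def ranks(values):
    """Rank from 1 (smallest) to n; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


rank_x = ranks(iq)      # steps 1 and 2: rank each column
rank_y = ranks(hours)
d = [rx - ry for rx, ry in zip(rank_x, rank_y)]  # step 3: rank differences
d_squared = [di ** 2 for di in d]                # step 4: squared differences

n = len(iq)
sum_d2 = sum(d_squared)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print("sum of d^2 =", sum_d2)
print("rho =", round(rho, 2))
```

Sorting is not needed for the calculation itself; ranking each column in its original order gives the same set of rank pairs as the sorted table.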
Outliers.....
Outliers are dangerous
Here we have a spurious correlation of r = 0.68
Without IBM, r = 0.48
Without IBM & GE, r = 0.21
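The effect of a single extreme point can be reproduced with a small synthetic data set. The numbers below are invented purely for illustration and are not the data behind the figures quoted above; `pearson_r` is an assumed helper name.

```python
from math import sqrt


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / sqrt(ss_x * ss_y)


# Synthetic points with almost no linear association.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 3, 6, 2, 7, 4, 5, 4]

# The same points plus one hypothetical extreme observation at (30, 30).
x_with_outlier = x + [30]
y_with_outlier = y + [30]

r_without = pearson_r(x, y)
r_with = pearson_r(x_with_outlier, y_with_outlier)
print(f"without outlier: r = {r_without:.2f}")
print(f"with outlier:    r = {r_with:.2f}")
```

One point far from the rest is enough to turn a negligible correlation into an apparently strong one, so it is worth recomputing r with suspect points removed.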
r is the correlation coefficient for the sample. The
correlation coefficient for the population is ρ (rho).
The sampling distribution for r is a t-distribution
with n – 2 d.f.
Standardized test statistic:
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
For a two-tailed test for significance:
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \neq 0$ (The correlation is significant)
Hypothesis Test for Significance
Test of Significance
The correlation between the number of times absent and a
final grade is r = –0.975. There were seven pairs of data. Test
the significance of this correlation. Use α = 0.01.
1. Write the null and alternative hypotheses.
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \neq 0$ (The correlation is significant)
2. State the level of significance.
α = 0.01
3. Identify the sampling distribution.
A t-distribution with 5 degrees of freedom
4. Find the critical value.
Critical values ± t0 = ± 4.032
5. Find the rejection region.
Rejection regions: t < –4.032 or t > 4.032
6. Find the test statistic.
$$t = \frac{-0.975}{\sqrt{\dfrac{1 - (-0.975)^2}{7 - 2}}} = -9.811$$
df \ p  0.40      0.25      0.10      0.05      0.025     0.01      0.005     0.0005
1       0.324920  1.000000  3.077684  6.313752  12.70620  31.82052  63.65674  636.6192
2       0.288675  0.816497  1.885618  2.919986  4.30265   6.96456   9.92484   31.5991
3       0.276671  0.764892  1.637744  2.353363  3.18245   4.54070   5.84091   12.9240
4       0.270722  0.740697  1.533206  2.131847  2.77645   3.74695   4.60409   8.6103
5       0.267181  0.726687  1.475884  2.015048  2.57058   3.36493   4.03214   6.8688
t = –9.811 falls in the rejection region (t < –4.032 or t > +4.032).
7. Make your decision.
Reject the null hypothesis.
8. Interpret your decision.
There is a significant negative correlation between the
number of times absent and final grades.
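The test statistic and decision in this example can be sketched in Python. The function name `correlation_t_statistic` is a hypothetical helper; the critical value 4.032 is read from the t-table for 5 degrees of freedom rather than computed.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """Standardized test statistic t = r / sqrt((1 - r^2) / (n - 2))."""
    return r / sqrt((1 - r ** 2) / (n - 2))


r = -0.975          # sample correlation from the example
n = 7               # number of data pairs
T_CRITICAL = 4.032  # two-tailed, alpha = 0.01, df = n - 2 = 5 (from the table)

t = correlation_t_statistic(r, n)
reject_null = abs(t) > T_CRITICAL

print(f"t = {t:.3f}, reject H0: {reject_null}")
```

For other sample sizes or significance levels the critical value has to be looked up again, since it depends on both n − 2 and α.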
The equation of a line may be written as y = mx + b
where m is the slope of the line and b is the y-intercept.
The line of regression is:
$$\hat{y} = mx + b$$
The slope m is:
$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$
The y-intercept is:
$$b = \bar{y} - m\bar{x}$$
Regression indicates the degree to which the variation in one
variable, Y, is related to or can be explained by the variation in
another variable, X
Once you know there is a significant linear correlation, you can
write an equation describing the relationship between the x and y
variables. This equation is called the line of regression or least
squares line.
The Line of Regression
[Figure: revenue plotted against advertising spend (Ad $), showing a data point (xi, yi), the point on the line with the same x-value, and the residual between them]
Best fitting straight line
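The revenue and advertising values behind the figure are not listed in these slides, so the sketch below reuses the BMI and birth-weight data from the first example to check two properties of the best-fitting line: the residuals sum to zero, and moving the line away from the fitted intercept and slope increases the sum of squared residuals. The helper names `fit_line` and `sum_squared_residuals` are assumptions.

```python
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]


def fit_line(x, y):
    """Least-squares intercept and slope."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope


def sum_squared_residuals(intercept, slope, x, y):
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


a_hat, b_hat = fit_line(bmi, bw)
residuals = [y - (a_hat + b_hat * x) for x, y in zip(bmi, bw)]
best = sum_squared_residuals(a_hat, b_hat, bmi, bw)

# Nudge the intercept and slope: every alternative line fits worse.
alternatives = [sum_squared_residuals(a_hat + da, b_hat + db, bmi, bw)
                for da in (-0.1, 0.1) for db in (-0.005, 0.005)]

print("sum of residuals:", round(sum(residuals), 10))
print("least-squares SSE:", round(best, 4))
print("SSE of nudged lines:", [round(v, 4) for v in alternatives])
```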
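Calculating manually
$$\hat{r} = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)\,\operatorname{var}(y)}} = \frac{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}\,\sqrt{\dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}}}$$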
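Simpler calculation formula…
$$\hat{r} = \frac{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}\,\sqrt{\dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1}}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{SS_{xy}}{\sqrt{SS_x SS_y}}$$
$SS_{xy}$ is the numerator of the covariance; $SS_x$ and $SS_y$ are the numerators of the variances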
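As a quick numerical check that the n − 1 terms cancel, the sketch below computes r for the BMI example both ways: from the covariance and variances, and from the sums of squares directly. The function names are illustrative assumptions.

```python
from math import sqrt

bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]


def sums_of_squares(x, y):
    """Return (SS_xy, SS_x, SS_y) about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return ss_xy, ss_x, ss_y


ss_xy, ss_x, ss_y = sums_of_squares(bmi, bw)
n = len(bmi)

# Long form: covariance divided by the product of standard deviations.
r_from_covariance = (ss_xy / (n - 1)) / (sqrt(ss_x / (n - 1)) * sqrt(ss_y / (n - 1)))
# Short form: the n - 1 factors cancel.
r_from_sums = ss_xy / sqrt(ss_x * ss_y)

print(r_from_covariance, r_from_sums)
```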
Distribution of the correlation
coefficient:
$$SE(\hat{r}) = \sqrt{\frac{1 - r^2}{n - 2}}$$
*Note - like a proportion, the variance of the correlation coefficient depends
on the correlation coefficient itself: substitute in the estimated r
The standardized sample correlation coefficient, $\hat{r} / SE(\hat{r})$, follows a
t-distribution with n-2 degrees of freedom
(since you have to estimate the standard
error).
R² is another important measure of linear
association between x and y (0 ≤ R² ≤ 1)
R² measures the proportion of the total variation in
y which is explained by x
For example, r² = 0.8751 indicates that 87.51% of
the variation in BW is explained by the
independent variable x (BMI).
Coefficient of Determination
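The reading of R² as the explained share of variation can be checked numerically. The sketch below fits the line to the BMI data as transcribed in the first example, computes R² as 1 − SSE/SST, and compares it with the square of Pearson's r. Names are illustrative assumptions, and because the data are taken as transcribed, the printed value may not match the figure quoted above.

```python
from math import sqrt

bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]

n = len(bmi)
mean_x, mean_y = sum(bmi) / n, sum(bw) / n
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bmi, bw))
ss_x = sum((x - mean_x) ** 2 for x in bmi)
ss_total = sum((y - mean_y) ** 2 for y in bw)   # total variation in y

slope = ss_xy / ss_x
intercept = mean_y - slope * mean_x
ss_error = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(bmi, bw))

r = ss_xy / sqrt(ss_x * ss_total)
r_squared_from_fit = 1 - ss_error / ss_total    # explained / total variation

print("r^2 from r:  ", round(r ** 2, 4))
print("r^2 from fit:", round(r_squared_from_fit, 4))
```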
The correlation coefficient of number of times absent and
final grade is r = –0.975. The coefficient of determination
is r² = (–0.975)² = 0.9506.
Interpretation: About 95% of the variation in final grades can
be explained by the number of times a student is absent. The
other 5% is unexplained and can be due to sampling error or
other variables such as intelligence, amount of time studied,
etc.
Strength of the Association
The coefficient of determination, r², measures the strength of
the association and is the ratio of explained variation in y to
the total variation in y.
The standard error of Y given X is the average
variability around the regression line at any given
value of X. It is assumed to be equal at all values
of X.
[Figure: regression line with the same spread, Sy/x, marked at several values of X]
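One common way to estimate Sy/x is the square root of the residual sum of squares divided by n − 2. The slides do not state a formula, so this choice is an assumption; the sketch applies it to the BMI data as transcribed in the first example, with illustrative variable names.

```python
from math import sqrt

bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]

n = len(bmi)
mean_x, mean_y = sum(bmi) / n, sum(bw) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(bmi, bw))
         / sum((x - mean_x) ** 2 for x in bmi))
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(bmi, bw)]

# Assumed estimator: n - 2 because two parameters (intercept, slope) were fitted.
s_yx = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(f"S_y/x = {s_yx:.3f} kg")
```

The result is in the units of Y (here kg), so it can be read as a typical distance of an observed birth weight from the fitted line.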
Correlation Coefficient, R, measures the
strength of bivariate association
The regression line is a prediction
equation that estimates the values of y for
any given x
Difference between Correlation and
Regression
Thank You



Editor's Notes

  • #28: Give several examples r = -0.97, r = 0.02 and ask for the strength of the correlation. For values like 0.63 a hypothesis test is necessary to determine whether it is strong or not.
  • #40: Another way to determine whether the correlation is significant is to compare the value of r with the values in the table. If |r| is greater than the value in the table, you can assume the correlation is significant. Notice the standardized statistic represents the difference between the hypothesized value (zero) and the test value divided by the standard error.
  • #41: You lose one degree of freedom for each variable. This accounts for the n-2 degrees of freedom. Since there are 7 ordered pairs, the sampling distribution for r has 5 d.f.
  • #42: Detailed calculations are shown. Depending on your calculator you can use parentheses and take fewer steps.
  • #43: Remind students that the null hypothesis states the correlation coefficient is 0. To find a significant correlation you must reject the null hypothesis.
  • #44: Once the correlation coefficient has been calculated, no new results need to be used to find m and b. Note that the regression line always passes through the point (x-bar, y-bar).
  • #45: The value of d can be positive, negative or 0. Discuss the circumstances for each. The sum of the values of d will be 0 for the regression line. Squaring d eliminates negative values. Criteria for the Best Fit Line: The sum of the squares of the distances will be minimized.
  • #50: The proof that the coefficient of determination is equal to the square of the correlation coefficient is beyond the scope of the text.