CORRELATION & REGRESSION
ANALYSIS Using SPSS
Dr Parag Shah | M.Sc., M.Phil., Ph.D. ( Statistics)
www.paragstatistics.wordpress.com
Correlation
Correlation analysis is used to study the strength of the
relationship between two or more quantitative
variables. Correlation shows the degree of linear
dependence between the two variables.
Correlation doesn’t imply causation.
If two variables are not related by a cause-and-effect
relationship but still show correlation, such
correlation is called spurious or nonsense
correlation.
Correlation
Correlation can be positive, negative or zero,
depending on how the two variables change together.
If the two variables change in the same
direction, the correlation is positive.
If the two variables change in opposite
directions, the correlation is negative.
If a change in one variable does not affect the
other variable, the correlation is zero.
Correlation
Coefficient
Correlation coefficient (r) is the measure of extent
of correlation between two variables.
There are several types of correlation coefficient
but the most popular is Karl Pearson’s correlation
coefficient.
Testing
Correlation
Coefficient
Null Hypothesis H0: 𝜌 = 0
[There is no significant linear correlation between two variables]
Alternative Hypothesis H1: 𝜌≠ 0
[There is significant linear correlation between two variables]
Test statistic: t = r√(n − 2) / √(1 − r²)
The test statistic t follows Student’s t distribution with n − 2
degrees of freedom.
Case Study
The body temperature (in °F) of 100 adults was measured along with
their gender, age, and heart rate. Data: body_temp.xlsx.
Obtain the correlation coefficient between body temperature and heart rate
and check its significance.
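A minimal sketch of the SPSS syntax that the Analyze > Correlate > Bivariate dialog would paste for this analysis; the variable names temperature and heart_rate are assumptions, since the actual column names in body_temp.xlsx are not shown here.

* Pearson correlation between body temperature and heart rate.
* Adjust the variable names to match those in body_temp.xlsx.
CORRELATIONS
  /VARIABLES=temperature heart_rate
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.

The Sig. (2-tailed) value in the resulting correlation table is the p value used for the test above.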
Null & Alternative
Hypothesis
Null Hypothesis H0: 𝜌 = 0
[There is no significant linear correlation between body
temperature and heart rate]
Alternative Hypothesis H1: 𝜌≠ 0
[There is significant linear correlation between body temperature
and heart rate]
Test Statistics t and p value
The correlation coefficient (r) between heart rate
and temperature is 0.448.
Here p value = 0.000 < 0.05, so the null hypothesis is rejected.
Thus, there is a significant linear correlation between heart rate
and temperature.
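As a quick check of the reported p value, plugging r = 0.448 and n = 100 into the test statistic gives t = 0.448 √98 / √(1 − 0.448²) ≈ 4.96 on 98 degrees of freedom, which is indeed significant at far below the 0.05 level.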
Regression
Regression analysis is a set of statistical processes
for estimating the relationships between
a dependent variable (often called the 'outcome' or
'response' variable) and one or more independent
variables (often called 'predictors', 'covariates',
'explanatory variables' or 'features’).
Regression
Analysis
Regression analysis helps you understand how the
dependent variable changes when one of the
independent variables varies, and allows you to
determine mathematically which of those
variables really has an impact.
Regression analysis includes several variations,
such as linear, multiple linear, and nonlinear. The
most common models are simple linear and
multiple linear.
Types of Regression
Dependent variable | Independent variable(s) | Type of Regression | Relationship between variables
One (Scale) | One (Scale) | Simple Linear | Linear
One (Scale) | Two or more (Continuous / Categorical) | Multiple Linear | Linear
One (Categorical – binary) | Two or more (Continuous / Categorical) | Logistic | Need not be linear
One (Categorical) | Two or more (Continuous / Categorical) | Multinomial Logistic | Need not be linear
Simple
Regression
The simple linear regression model is used to predict one
response (dependent) variable based on one predictor
(independent) variable.
The linear regression model can be stated as follows
𝑦𝑖 = 𝛽0 + 𝛽1𝑥𝑖 + 𝑒𝑖 , 𝑖 = 1, 2, · · · , n.
where
• 𝑦𝑖 is the value of the response variable,
• 𝑥𝑖 is the value of the predictor variable,
• 𝛽0 , 𝛽1are the parameters (regression coefficients),
• 𝑒𝑖 is random error term with E(𝑒𝑖 ) = 0 and V (𝑒𝑖 ) = 𝜎2.
Graphical representation
[Scatter plot of Y against X with the fitted line y = 𝛽0 + 𝛽1x + ε: the intercept is 𝛽0, the slope is 𝛽1, and for each value Xi the random error εi is the vertical distance between the observed value of Y and the predicted value of Y.]
Assumptions of
Simple
Regression
The four important assumptions for a simple linear
regression model are:
• The regression model is Linear in the parameters.
• The errors are Independently distributed.
• The errors are Normally distributed.
• The errors have Equal variances, i.e. V(𝑒𝑖) = 𝜎² (Homoscedasticity).
Method
The best line of fit can be obtained by the method of
least squares. It calculates the best line of fit for the
observed data by minimizing the sum of squares of the
vertical deviations from each data point to the line,
i.e., Σ(yi − ŷi)².
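For reference (a standard result not shown on the slide), minimizing this sum of squares gives the closed-form least-squares estimates
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²  and  b0 = ȳ − b1 x̄,
so the fitted line always passes through the point (x̄, ȳ).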
Total variation is made up of two parts:
SST = SSR + SSE
(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)
SST = Σ(Yi − Ȳ)²
SSR = Σ(Ŷi − Ȳ)²
SSE = Σ(Yi − Ŷi)²
where:
Ȳ = Mean value of the dependent variable
Yi = Observed value of the dependent variable
Ŷi = Predicted value of Y for the given Xi value
• SST = total sum of squares (Total Variation)
• Measures the variation of the Yi values around their mean 𝑌
• SSR = regression sum of squares (Explained Variation)
• Variation attributable to the relationship between X and Y
• SSE = error sum of squares (Unexplained Variation)
• Variation in Y attributable to factors other than X
Measures of Variations
[Diagram: for a point (Xi, Yi), the total deviation SST = Σ(Yi − Ȳ)² splits into SSE = Σ(Yi − Ŷi)², the deviation of the observed value from the fitted line, and SSR = Σ(Ŷi − Ȳ)², the deviation of the fitted value from the mean Ȳ.]
The Coefficient of determination is the portion of the total variation in the
dependent variable that is explained by variation in the independent variable.
The coefficient of determination is denoted as R².
R² = SSR / SST = Regression Sum of Squares / Total Sum of Squares
Note: 0 ≤ R² ≤ 1
Coefficient of Determination
The Adjusted R-squared is a modified
version of R-squared that adjusts for the
number of predictors in a
regression model.
Adjusted R Square
R-squared increases every time you add an
independent variable to the model. Adjusted R-
squared value increases only when the new term
improves the model fit more than expected by
chance alone. The adjusted R-squared value
actually decreases when the term doesn’t
improve the model fit by a sufficient amount.
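For n observations and p predictors, the usual formula is
Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1),
which is how the penalty for adding extra terms enters.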
Multiple
Regression
The multiple linear regression model is used to predict a
response (dependent) variable based on two or more
predictor (independent) variables.
The multiple linear regression model can be stated as follows
𝑦𝑖 = 𝛽0 + 𝛽1𝑥𝑖1 + 𝛽2𝑥𝑖2 + ⋯ … … + 𝛽𝑝𝑥𝑖𝑝 + 𝑒𝑖 , 𝑖 = 1,2, · · , n.
where
• 𝑦𝑖 is the 𝑖th value of the response variable,
• 𝑥𝑖𝑗 is the 𝑖th observation of the 𝑗th predictor variable,
• 𝛽0, 𝛽1, 𝛽2, …, 𝛽𝑝 are the parameters (regression coefficients),
• 𝑒𝑖 is the random error term with E(𝑒𝑖) = 0 and V(𝑒𝑖) = 𝜎².
Case Study 1
The body temperature (in °F) of 100 adults was measured along with
their gender, age, and heart rate. The data is stored in the body_temp.xlsx file.
Build a linear regression model for body temperature using heart rate as a
predictor.
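A sketch of the SPSS syntax that the Analyze > Regression > Linear dialog would paste for this model, including the Durbin-Watson statistic and the residual plots used later when checking assumptions; the variable names temperature and heart_rate are assumptions.

* Simple linear regression of body temperature on heart rate.
* Adjust the variable names to match those in body_temp.xlsx.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT temperature
  /METHOD=ENTER heart_rate
  /SCATTERPLOT=(*ZRESID, *ZPRED)
  /RESIDUALS DURBIN NORMPROB(ZRESID).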
Regression
Multiple R = Correlation Coefficient = 0.45
R Square = Coefficient of Determination = 0.20
R Square = 0.20 shows that 20% of the variation in temperature is explained by Heart Rate.
Model Summary
p value = 0.000 < 0.05.
So, there is enough evidence that the fitted regression model is significant.
The regression model predicts the dependent variable – Temperature,
significantly well.
ANOVA
H0: 𝛽1=0 [Regression coefficient for Heart Rate is
not significant]
H1: 𝛽1≠ 0 [Regression coefficient for Heart Rate is
significant]
p value of the regression coefficient of Heart Rate = 0.000
< 0.05, so H0 is rejected.
Thus, the regression coefficient of Heart Rate is
significant.
Regression Coefficients
Regression Model:
Temperature = 92.391 + 0.081 Heart Rate
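For example, under this fitted model a (hypothetical) heart rate of 75 beats per minute gives a predicted body temperature of 92.391 + 0.081 × 75 ≈ 98.47 °F.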
Checking
Assumptions
• The regression model is Linear in the parameters.
• The errors are Independently distributed.
• The errors are Normally distributed.
• The errors have Equal variances, that is V(𝑒𝑖) = 𝜎² (Homoscedasticity).
Linearity Assumption
Assumption - Errors are Independently distributed
The value of Durbin-Watson is
1.804, which is close to 2.
So, the assumption that errors
are independently distributed is
met.
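As a rule of thumb, the Durbin-Watson statistic ranges from 0 to 4: values near 2 indicate no first-order autocorrelation in the errors, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.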
Normality & Homoscedasticity Assumptions
Normality Assumptions
Points are very close to the
diagonal line, so the
normality assumption for the errors is met.
Homoscedastic Assumptions
The data do not show an obvious
pattern: points are distributed roughly equally
above and below zero on the
X axis, and to the left and right of zero
on the Y axis.
So the homoscedasticity assumption is met.
Case Study 2
The data were collected on a simple random sample of 20
patients with hypertension. The dataset is in arterialBp.csv.
The variables are
Y = mean arterial blood pressure (mm Hg)
X1 = age (years), X2 = weight (kg)
X3 = body surface area (sq. m)
X4 = duration of hypertension (years)
X5 = basal pulse (beats/min), X6 = measure of stress
Fit an appropriate regression equation.
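A sketch of the SPSS syntax for this model, again as the Analyze > Regression > Linear dialog would paste it; the variable names (bp, age, weight, bsa, hyper, pulse, stress) are assumptions based on the variable descriptions above, so adjust them to match arterialBp.csv. The TOL keyword requests the tolerance and VIF values used later for the multicollinearity check.

* Multiple linear regression of mean arterial blood pressure on all six predictors.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
  /DEPENDENT bp
  /METHOD=ENTER age weight bsa hyper pulse stress
  /SCATTERPLOT=(*ZRESID, *ZPRED)
  /RESIDUALS DURBIN NORMPROB(ZRESID).

Rerunning with /METHOD=ENTER age weight bsa gives the reduced model reported later.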
Regression
Multiple R = Correlation Coefficient = 0.997
R Square = Coefficient of Determination = 0.995
R Square = 0.995 shows that 99.5% of the variation in blood pressure is explained by age,
weight, bsa, duration of hypertension, pulse and stress.
Model Summary
p value = 0.000 < 0.05.
So, there is enough evidence that the fitted regression model is significant.
The regression model predicts the dependent variable – blood pressure,
significantly well.
ANOVA
Regression Coefficients
Running the regression again after removing the insignificant variables:
hyper, pulse and stress
Multiple R = Correlation Coefficient = 0.997
R Square = Coefficient of Determination = 0.993
R Square = 0.993 shows that 99.3% of the variation in blood pressure is explained by age,
weight and bsa.
Model Summary
p value = 0.000 < 0.05.
So, there is enough evidence that the fitted regression model is significant.
The regression model predicts the dependent variable – blood pressure,
significantly well.
ANOVA
Regression Coefficients
Regression Model:
Bp = -13.401 + 0.718 * Age + 0.896 * weight + 4.553 * bsa
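As an illustration with hypothetical patient values (not taken from the data), a 50-year-old weighing 90 kg with a body surface area of 2.0 sq. m has a predicted mean arterial pressure of −13.401 + 0.718 × 50 + 0.896 × 90 + 4.553 × 2.0 ≈ 112.2 mm Hg.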
Checking
Assumptions
• The regression model is Linear in the parameters.
• The errors are Independently distributed.
• The errors are Normally distributed.
• The errors have Equal variances, that is V(𝑒𝑖) = 𝜎² (Homoscedasticity).
• There is no Multicollinearity
(no significant correlation between the independent variables).
Linearity Assumptions
Normality & Homoscedasticity Assumptions
Normality Assumptions
Points are very close to the
diagonal line, so the
normality assumption for the errors is met.
Homoscedastic Assumptions
The data do not show an obvious
pattern: points are distributed roughly equally
above and below zero on the
X axis, and to the left and right of zero
on the Y axis.
So the homoscedasticity assumption is met.
Assumption - Errors are Independently distributed
The value of Durbin-Watson is
1.537, which is close to 2.
So, the assumption that errors
are independently distributed
is met.
Multicollinearity Assumptions
The Variance Inflation Factor (VIF) for all variables lies between 1 and 10, so there is no
multicollinearity, i.e. the independent variables do not have significant correlation between
them.
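For each predictor, VIF_j = 1 / (1 − Rj²), where Rj² is obtained by regressing that predictor on the remaining predictors; a VIF of 1 means no collinearity at all, and values above about 10 are commonly taken to indicate problematic multicollinearity.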
THANK YOU
Dr Parag Shah | M.Sc., M.Phil., Ph.D. ( Statistics)
www.paragstatistics.wordpress.com