BIOSTATISTICS
LECT6: CORRELATION AND REGRESSION ANALYSIS
DR. ECEM YEĞİN
Correlation and Regression Analysis
These methods allow us to understand the relationships
between variables, determine the strength and direction of
those relationships, and even estimate the value of one
variable from another.
They play a critical role in many areas, especially in medical
research, such as determining risk factors, evaluating the
effectiveness of diagnostic tests, predicting treatment
outcomes, and understanding disease etiology.
Correlation and Regression Analysis
We can discover whether there is a relationship between two or
more variables, and if there is a relationship, the direction and
strength of the relationship, with "correlation analysis".
The analysis that examines how one variable changes when the
other changes by a certain unit is "regression analysis".
Correlation Coefficient (r):
• The most commonly used statistical value to measure the strength
and direction of a linear relationship between two variables is the
Pearson correlation coefficient (r).
• This coefficient takes values between -1 and +1.
• r = +1: Perfect positive correlation. Example: Under ideal
conditions, as the dose of a drug increases, the blood level
increases at the same rate.
• r = -1: Perfect negative correlation. Example: As the dose of a
drug increases, the pain score decreases at the same rate.
• r = 0: No correlation. There is no linear relationship between the
variables.
Correlation Coefficient (r):
• 0 < |r| < 1: A weak, moderate, or strong linear relationship.
The interpretation of values in this range may vary depending on
the domain and context being studied, but as a general guide:
• |r| < 0.3: Weak correlation
• 0.3 ≤ |r| < 0.7: Moderate correlation
• |r| ≥ 0.7: Strong correlation
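The rule-of-thumb bands above can be sketched as a small helper function. This is a minimal sketch; the thresholds (0.3 and 0.7) are the general guide from the slide, not a universal standard, and field-specific conventions may differ.

```python
def correlation_strength(r: float) -> str:
    """Classify a Pearson correlation coefficient by its absolute value,
    using the rule-of-thumb bands from the slide (0.3 and 0.7)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    a = abs(r)
    if a < 0.3:
        return "weak"
    elif a < 0.7:
        return "moderate"
    return "strong"

print(correlation_strength(0.25))   # weak
print(correlation_strength(-0.55))  # moderate
print(correlation_strength(0.85))   # strong
```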
Correlation Coefficient (r):
[Scatter plots: r = -1 (perfect negative relationship), r = 0 (no relationship), r = +1 (perfect positive relationship)]
Scatter plots provide general information about the relationship
between two variables. However, in order to comment on the
amount of relationship, we need to calculate the correlation
coefficient.
NOTES:
Important Points:
• Correlation Does Not Mean Causality! A strong correlation between two variables does
not mean that one causes the other. There may be a third factor (confounding variable) or
the relationship may be completely coincidental. Classic example: A positive correlation can
be observed between ice cream sales and drownings in the summer months, but it cannot
be concluded that eating ice cream causes drowning. Both events are associated with warm
weather and increased water activity.
• Measures Linear Relationship: The Pearson correlation coefficient only measures linear
relationships between variables.
• Suitable for Continuous Variables: Correlation analysis is generally used for continuous
(measurable, numerical) variables. Different methods such as the Chi-Square test are used
to examine relationships between categorical variables.
EXAMPLE
• A research team wanted to study the relationship between
children's height and shoe size. The height (cm) and shoe
size of 5 randomly selected children were recorded as
follows:
Child No. Height (cm) Shoe Size
1 110 30
2 115 32
3 120 33
4 125 35
5 130 36
Answer:
• Now let's calculate the Pearson correlation coefficient (r)
between these two variables.
• Step 1: Calculate the mean of each variable.
• Average Height (x̄): (110 + 115 + 120 + 125 + 130) / 5
= 600 / 5 = 120 cm
• Average Shoe Size (ȳ): (30 + 32 + 33 + 35 + 36) / 5
= 166 / 5 = 33.2
Answer
• Step 2: Calculate the difference of each data point from the
mean.
Child No. Height (x) x−x̄ Shoe Size (y) y−ȳ
1 110 -10 30 -3.2
2 115 -5 32 -1.2
3 120 0 33 -0.2
4 125 5 35 1.8
5 130 10 36 2.8
Answer
• Step 3: Calculate the product terms and squares.
Child No. x−x̄ y−ȳ (x−x̄)(y−ȳ) (x−x̄)² (y−ȳ)²
1 -10 -3.2 32 100 10.24
2 -5 -1.2 6 25 1.44
3 0 -0.2 0 0 0.04
4 5 1.8 9 25 3.24
5 10 2.8 28 100 7.84
Total 75 250 22.8
Answer
• Step 4: Calculate the Pearson correlation coefficient (r).
• r = Σ(x−x̄)(y−ȳ) / √(Σ(x−x̄)² · Σ(y−ȳ)²)
= 75 / √(250 × 22.8) = 75 / √5700 ≈ 75 / 75.50 ≈ 0.993
Answer
• Step 5: Comment; The correlation coefficient (r) we obtained is
approximately 0.993. Since this value is very close to +1, it shows
that there is a very strong and positive linear relationship between
children's height and shoe size.
• We can say that as height increases, shoe size also tends to
increase. This simple example shows how correlation measures the
direction and strength of a linear relationship between two
continuous variables.
• NOTE: This strong correlation does not mean that height directly
"causes" shoe size, but it can suggest that these two variables are
related to the growth process.
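The five steps of the worked example above can be reproduced in a few lines of Python. This sketch uses only the data from the slide and the deviation-from-the-mean formula for r:

```python
import math

# Height (cm) and shoe size of the 5 children from the example.
heights = [110, 115, 120, 125, 130]
shoe_sizes = [30, 32, 33, 35, 36]

n = len(heights)
mean_x = sum(heights) / n      # 120 cm
mean_y = sum(shoe_sizes) / n   # 33.2

# Sum of products of deviations and sums of squared deviations.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, shoe_sizes))  # 75
sxx = sum((x - mean_x) ** 2 for x in heights)                                # 250
syy = sum((y - mean_y) ** 2 for y in shoe_sizes)                             # 22.8

# Pearson correlation coefficient.
r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.993
```

The intermediate sums (75, 250, 22.8) match the totals in the Step 3 table.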
Regression Analysis:
It aims to predict the value of a dependent variable (output
variable or response variable) using one or more independent
variables (predictor variables) and to express this relationship
with a mathematical model.
It also helps us understand how much change a unit change in
the independent variables causes in the dependent variable.
[Scatter plots: (+) directional linear relationship, (−) directional linear relationship, nonlinear relationship, no relationship]
Simple Linear Regression:
• The most basic type of regression. It examines the linear
relationship between a single continuous independent variable and
a single continuous dependent variable. This relationship is
expressed mathematically as a straight line equation:
y = a + bx
• y = Value of the dependent variable
• a = Intercept of the regression line (constant value)
• b = Slope of the regression line
• x = Value of the independent variable
Example: Does blood sugar increase as BMI increases?
Multiple Linear Regression:
• It is used to examine the effect of more than one continuous
independent variable on the dependent variable.
• The model is expressed as follows:
y = b0 + b1x1 + b2x2 + ... + bpxp
• x1, x2, ..., xp represent the independent variables
• b1, b2, ..., bp represent the coefficients.
Example: Estimating HbA1c level based on age, BMI and physical
activity
Assumptions of Regression Analysis:
• In order for the results of regression analysis to be reliable, some basic
assumptions must be met:
• Linearity: The relationship between the independent variables and the
dependent variable is expected to be linear.
• Independence of Residuals: The residuals must be independent of each
other (there should be no autocorrelation). This assumption is especially
important in time series data.
• Homoscedasticity of Residuals: The variance of the residuals must be
constant across all values of the independent variables. Heteroscedasticity
(non-constant variance) can undermine the reliability of the model.
• Normal Distribution of Residuals: The residuals are assumed to have a
normal distribution. This assumption is especially important for hypothesis
testing and confidence intervals.
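The assumptions above are usually checked on the residuals after fitting. The sketch below is illustrative only, with made-up data: it fits a simple least-squares line and runs two crude checks (residuals summing to zero, comparable residual spread across the x-range). In practice one would use dedicated diagnostics (e.g. Durbin-Watson for autocorrelation, Breusch-Pagan for heteroscedasticity, Shapiro-Wilk for normality) from a statistics library.

```python
def fit_simple_ols(xs, ys):
    """Least-squares slope b and intercept a for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Made-up illustrative data (roughly y = 2x with small noise).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

a, b = fit_simple_ols(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# OLS residuals always sum to (numerically) zero; this is a sanity
# check on the fit, not an assumption test.
print(abs(sum(residuals)) < 1e-9)  # True

# Crude homoscedasticity check: residual spread in the lower half of
# the x-range should be comparable to the upper half.
half = len(residuals) // 2
spread_low = sum(r ** 2 for r in residuals[:half]) / half
spread_high = sum(r ** 2 for r in residuals[half:]) / (len(residuals) - half)
print(spread_low, spread_high)
```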
Example
• A researcher is studying the relationship between sleep duration
(hours) and a student's performance on an exam (score). The sleep
durations and exam scores of 3 randomly selected students are
recorded as follows: The researcher wants to perform simple linear
regression analysis to predict exam scores based on sleep
duration.
Student No. Sleep Duration (hours) (x) Exam Score (y)
1 6 60
2 7 70
3 8 80
Example
• Step 1: Calculate Averages.
• Average Sleep Duration (x̄): (6 + 7 + 8) / 3 = 21 / 3 = 7 hours
• Average Exam Score (ȳ): (60 + 70 + 80) / 3 = 210 / 3 = 70 points
Example
• Step 2: Calculate the Slope (b) and Y-intercept (a).
Student No. xi yi xi−x̄ yi−ȳ (xi−x̄)(yi−ȳ) (xi−x̄)²
1 6 60 -1 -10 10 1
2 7 70 0 0 0 0
3 8 80 1 10 10 1
Total 20 2
• Slope: b = Σ(xi−x̄)(yi−ȳ) / Σ(xi−x̄)² = 20 / 2 = 10
• Y-intercept: a = ȳ − b·x̄ = 70 − 10 × 7 = 0
Example
• Step 3: Write the Regression Equation.
• The predicted test score (ŷ) can be modeled with sleep duration
(x) as follows:
ŷ = a + bx = 0 + 10x = 10x
Example
• Step 4: Interpret the Equation.
• Y-intercept (a = 0): Theoretically, if a student sleeps 0 hours, the test
score would be expected to be 0.
• Slope (b = 10): For every 1-hour increase in sleep time, the student's
test score would be expected to increase by 10 points, on average.
Example
• Step 5: Make a Prediction (Example).
• If a student sleeps for 7.5 hours, we can predict their test
score: ŷ = 10 × 7.5 = 75 points
• According to our simple model, a student who sleeps for 7.5 hours
is expected to score 75 on the test.
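Steps 1–5 above can be reproduced directly with the three (sleep duration, score) pairs from the table, using the least-squares formulas for the slope and intercept:

```python
# Sleep duration (hours) and exam scores of the 3 students.
xs = [6, 7, 8]
ys = [60, 70, 80]

n = len(xs)
mean_x = sum(xs) / n   # 7 hours
mean_y = sum(ys) / n   # 70 points

# Slope b = Σ(xi−x̄)(yi−ȳ) / Σ(xi−x̄)², intercept a = ȳ − b·x̄.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)   # 20 / 2 = 10
a = mean_y - b * mean_x                  # 70 − 10 × 7 = 0

print(a, b)         # 0.0 10.0
print(a + b * 7.5)  # predicted score for 7.5 h of sleep: 75.0
```

With three perfectly collinear points the fit is exact, which is why the intercept comes out to exactly 0 and the slope to exactly 10.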
Applications of Correlation and Regression in Medical
Research
• Identifying Risk Factors: For example, quantifying how much smoking increases
the risk of lung cancer.
• Evaluating Diagnostic Tests: Examining the correlation between the results of
a new diagnostic test and the results of a gold standard test and evaluating how
reliable the new test is.
• Predicting Treatment Effectiveness: Modeling the relationship between
patients' baseline characteristics (age, disease severity, etc.) and treatment
outcomes (recovery time, risk of complications, etc.) using regression analysis
and determining the factors that affect treatment success.
• Examining Drug Dose-Response Relationships: Evaluating the effect of
different doses of a drug on patient response using regression analysis and
helping determine the optimal dose.
• Epidemiological Studies: Analyzing the correlation of environmental factors or
lifestyle habits with disease incidence and the strength of this relationship.
CONCLUSION
• Correlation and regression analysis are powerful and widely used
tools in medical research to understand relationships between
variables, determine the strength and direction of these
relationships, and predict future values.
• However, applying these methods correctly, checking their
assumptions, and carefully interpreting their results are critical to
obtaining clinically meaningful and reliable results, especially
keeping in mind that correlation does not imply causation.
Thank you.
Dr. Ecem YEĞİN