Simple linear regression
It is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables:
• One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
• The other variable, denoted y, is regarded as the response, outcome, or dependent variable.
• We will examine the relationship between quantitative variables x and y via a mathematical equation.
• The model has a deterministic component and a statistical (random) component.
[Figure: house cost vs. house size, in two panels. Most lots sell for $25,000, and building a house costs about $75 per square foot, giving the deterministic model House cost = 25000 + 75(Size). Since costs behave unpredictably, a random component is added: House cost = 25000 + 75(Size) + e.]
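To make the example concrete, here is a minimal Python sketch of the house-cost model. The $25,000 lot price and $75 per square foot come from the figure; the normal noise and its $5,000 standard deviation are assumptions for illustration:

    import random

    def house_cost(size_sqft, noise_sd=5000):
        # Deterministic part: $25,000 lot price plus $75 per square foot,
        # plus a random component (noise level is an illustrative assumption).
        deterministic = 25000 + 75 * size_sqft
        return deterministic + random.gauss(0, noise_sd)

    print(house_cost(2000))  # about 25000 + 75*2000 = 175000, plus noise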
Simple linear regression
• The simplest deterministic mathematical relationship between two variables x and y is a linear relationship: y = β0 + β1x (the true regression line).
• The objective is to develop an equivalent linear probabilistic model.
• If the two (random) variables are probabilistically related, then for a fixed value of x, there is
uncertainty in the value of the second variable.
• So, we assume y = β0 + β1x + ε, where ε is a random variable.
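A short simulation makes this concrete. The sketch below draws n observations from y = β0 + β1x + ε; the parameter values and the normal distribution for ε are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    beta0, beta1, sigma = 2.0, 0.5, 1.0   # illustrative "true" parameters
    n = 50

    x = rng.uniform(0, 10, size=n)        # fixed/observed predictor values
    eps = rng.normal(0, sigma, size=n)    # random deviations epsilon_i
    y = beta0 + beta1 * x + eps           # responses scattered about the line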
Simple linear regression
• The points (x1, y1), …, (xn, yn) resulting from n independent observations will then be scattered about the true regression line:
[Figure: observed points scattered about the true regression line y = β0 + β1x.]
Simple linear regression
Estimating model parameters:
• The values of β0, β1 and ε will almost never be known to an investigator.
• Instead, sample data consists of n observed pairs (x1, y1), … , (xn, yn), from which
the model parameters and the true regression line itself can be estimated.
• Here Yi = β0 + β1xi + εi for i = 1, 2, …, n, and the n deviations ε1, ε2, …, εn are independent random variables.
• The aim is to find the best-fit line: the line for which the sum of the squared vertical distances (deviations) from the observed points is as small as it can be.
Simple linear regression
The sum of squared vertical deviations from the points (x1, y1), …, (xn, yn) to the line is

f(b0, b1) = Σ [yi − (b0 + b1xi)]²,  summed over i = 1, …, n.

The minimizing values are

b1 = SSxy / SSxx and b0 = ȳ − b1x̄,

where SSxy = Σ xiyi − (Σ xi)(Σ yi)/n and SSxx = Σ xi² − (Σ xi)²/n = (n − 1)sx².
The point estimates of β0 and β1, denoted by b0 and b1, are called the least squares estimates; they are found by setting the partial derivatives of f(b0, b1) with respect to b0 and b1 to zero.
The predicted values are obtained from the fitted line: ŷ = b0 + b1x.
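These formulas translate directly into code. Below is a minimal NumPy sketch, reusing the illustrative simulated data from the earlier sketch, so the estimates should land near β0 = 2.0 and β1 = 0.5:

    import numpy as np

    def least_squares(x, y):
        # b1 = SSxy / SSxx and b0 = ybar - b1 * xbar, as derived above
        n = len(x)
        ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
        ss_xx = np.sum(x * x) - np.sum(x) ** 2 / n
        b1 = ss_xy / ss_xx
        b0 = np.mean(y) - b1 * np.mean(x)
        return b0, b1

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)
    y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=50)

    b0_hat, b1_hat = least_squares(x, y)
    y_hat = b0_hat + b1_hat * x           # predicted values
    print(b0_hat, b1_hat)

As a sanity check, np.polyfit(x, y, 1) computes the same fit, returning the slope and intercept in that order.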
Simple linear regression
Linear regression, while a powerful tool, has certain limitations that should be considered:
• Linearity: Assumes a linear relationship between the dependent and independent variables. If the relationship is non-linear, the model may not accurately capture the underlying pattern.
• Independence: Assumes that the errors are independent of each other. If there is autocorrelation in the errors, the model's estimates may be biased and inefficient.
• Homoscedasticity: Assumes that the variance of the errors is constant across all levels of the independent variable. If the variance is not constant (heteroscedasticity), the model's estimates may be inefficient (see the residual-plot sketch after this list).
• Normality: Assumes that the errors are normally distributed. If the errors are not normally distributed, the model's inferences may be invalid.
• Sensitivity to Outliers: Linear regression can be sensitive to outliers, which can have a significant impact on the model's estimates. Outliers can distort the relationship between the variables and lead to biased results.
• Limited Flexibility: Linear regression can only model linear relationships. If the relationship between the variables is complex or non-linear, linear regression may not be able to adequately capture the pattern.
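Several of these assumptions can be checked informally by plotting the residuals yi − ŷi against the fitted values. A minimal sketch, assuming NumPy and Matplotlib are available and using simulated data as before:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=50)
    y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=50)

    b1_hat, b0_hat = np.polyfit(x, y, 1)   # slope first, then intercept
    fitted = b0_hat + b1_hat * x
    residuals = y - fitted

    # A funnel shape suggests heteroscedasticity; systematic curvature
    # suggests the true relationship is not linear.
    plt.scatter(fitted, residuals)
    plt.axhline(0, color="gray", linewidth=1)
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()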