Regression
What is Regression?
• Regression analysis is a statistical method that helps us analyze and
understand the relationship between two or more variables of interest.
• It helps us understand which factors are important, which factors can be
ignored, and how they influence each other. In other words, it analyzes
the specific relationships between the independent variables and the
dependent variable.
• In regression, we normally have one dependent variable and one or more
independent variables. The goal is to forecast the value of the dependent
variable (Y) from the values of the independent variables (X1, X2, …, Xk).
• We try to "regress" the value of the dependent variable Y with the help of
the independent variables.
Types of Regression Approaches
• There are many types of regression approaches; we will study some of
them here:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression (SVR)
• Decision Tree Regression
• Random Forest Regression
Simple linear regression
• In statistics, simple linear regression is a linear regression model with
a single explanatory variable.
• It concerns two-dimensional sample points with one independent
variable and one dependent variable (conventionally, the x and y
coordinates in a Cartesian coordinate system).
• It finds a linear function (a non-vertical straight line) that, as
accurately as possible, predicts the dependent variable values as a
function of the independent variable.
• Simply put, simple linear regression is used to estimate the
relationship between two quantitative variables.
What can simple linear regression be used for?
• You can use simple linear regression when you want to know:
• How strong the relationship is between two variables (e.g. the relationship
between rainfall and soil erosion).
• The value of the dependent variable at a certain value of the independent
variable (e.g. the amount of soil erosion at a certain level of rainfall).
Model for simple linear regression
• Consider the equation of a line given as $y = \alpha + \beta x$,
• where $y$ is the dependent variable, $x$ is the independent variable, $\alpha$ is
the y-intercept and $\beta$ is the slope of the line.
• We need to find $\alpha$ and $\beta$ to estimate $y$ using $x$, such that the error $\varepsilon$
between the predicted value of $y$ and the original value of $y$ is minimized.
• The estimated line is written as $\hat{Y} = b_0 + b_1 X$.
The Model
• The model has a deterministic and a probabilistic component.
• [Figure: house cost vs. house size; the deterministic component is the line
$\hat{Y} = b_0 + b_1 X$, and most lots sell for $25,000.]
• However, house costs vary even among houses of the same size! Since cost
behaves unpredictably, we add a random component.
• The first-order linear model:

$Y = b_0 + b_1 X + e$

where
Y = dependent variable
X = independent variable
$b_0$ = Y-intercept
$b_1$ = slope of the line ($b_1$ = rise/run)
e = error variable

• $b_0$ and $b_1$ are unknown population parameters, and are therefore estimated
from the data.
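To make the deterministic and probabilistic components concrete, here is a minimal sketch that simulates data from a first-order linear model; the parameter values and the normal error distribution are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

b0, b1 = 2.0, 0.5                # assumed population intercept and slope
n = 50

X = rng.uniform(0.0, 10.0, n)    # independent variable
e = rng.normal(0.0, 1.0, n)      # error: mean 0, constant standard deviation
Y = b0 + b1 * X + e              # deterministic part + random component
```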
Estimating the Coefficients
• The estimates are determined by
• drawing a sample from the population of interest,
• calculating sample statistics,
• producing a straight line that cuts into the data.
[Figure: scatter of sample data points in the X–Y plane.]
Question: What should be considered a good line?
The Estimated Coefficients
To calculate the estimates of the line coefficients that minimize the
differences between the data points and the line, use the formulas:

$b_1 = \dfrac{\mathrm{cov}(X,Y)}{s_X^2} = \dfrac{s_{XY}}{s_X^2}$

$b_0 = \bar{Y} - b_1 \bar{X}$

The regression equation that estimates the equation of the first-order
linear model is:

$\hat{Y} = b_0 + b_1 X$
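A minimal sketch of these formulas in Python with NumPy, assuming x and y are one-dimensional arrays of sample data:

```python
import numpy as np

def fit_line(x, y):
    """Estimate b0 and b1 with the least-squares formulas above."""
    s_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance cov(X, Y)
    s_x2 = np.var(x, ddof=1)            # sample variance s_X^2
    b1 = s_xy / s_x2
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# example: recover the coefficients of the simulated model sketched earlier
# b0_hat, b1_hat = fit_line(X, Y)
```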
Working concept of simple linear regression
• The ordinary least squares (OLS) method is
usually used to implement simple linear
regression.
• A good line is one that minimizes the sum
of squared differences between the
points and the line.
• The accuracy of each predicted value is
measured by its squared residual (vertical
distance between the point of the data
set and the fitted line), and the goal is to
make the sum of these squared
deviations as small as possible.
• Example 17.2 (Xm17-02)
• A car dealer wants to find
the relationship between
the odometer reading and
the selling price of used cars.
• A random sample of 100 cars is
selected, and the data
recorded.
• Find the regression line.
Car  Odometer  Price
1    37388     14636
2    44758     14122
3    45833     14016
4    30862     15590
5    31705     15568
6    34010     14718
.    .         .
(Odometer reading is the independent variable X; selling price is the
dependent variable Y.)
The Simple Linear Regression Line
• Solution
– Solving by hand: Calculate a number of statistics

$\bar{X} = 36{,}009.45$; $\bar{Y} = 14{,}822.823$; where n = 100.

$s_X^2 = \dfrac{\sum (X_i - \bar{X})^2}{n-1} = 43{,}528{,}690$

$\mathrm{cov}(X,Y) = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} = -2{,}712{,}511$

$b_1 = \dfrac{\mathrm{cov}(X,Y)}{s_X^2} = \dfrac{-2{,}712{,}511}{43{,}528{,}690} = -0.06232$

$b_0 = \bar{Y} - b_1\bar{X} = 14{,}822.82 - (-0.06232)(36{,}009.45) = 17{,}067$

$\hat{Y} = b_0 + b_1 X = 17{,}067 - 0.0623X$
This is the slope of the line.
For each additional mile on the odometer,
the price decreases by an average of $0.0623
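As a quick check, the coefficients can be reproduced in Python from the summary statistics above (all numbers are taken from this example):

```python
# summary statistics from the slide (n = 100 used cars)
x_bar, y_bar = 36_009.45, 14_822.823
s_x2 = 43_528_690            # sample variance of the odometer readings
cov_xy = -2_712_511          # sample covariance of odometer and price

b1 = cov_xy / s_x2           # -> approx. -0.06232
b0 = y_bar - b1 * x_bar      # -> approx. 17,067
print(b0, b1)
```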
[Figure: Odometer line fit plot, Price (13,000–16,000) vs. Odometer, with
the fitted line $\hat{Y} = 17{,}067 - 0.0623X$.]
Interpreting the Linear Regression Equation
• The intercept is $b_0 = \$17{,}067$.
• Do not interpret the intercept as the "price of cars that have not been
driven": there are no data near X = 0, so the line should not be
extrapolated there.
Error Variable: Required Conditions for better
performance of simple linear regression
• The error e is a critical part of the regression model.
• Four requirements involving the distribution of e must
be satisfied.
• The probability distribution of e is normal.
• The mean of e is zero: E(e) = 0.
• The standard deviation of e is se for all values of X.
• The set of errors associated with different values of Y are all
independent.
The Normality of e
From the first three assumptions we have: Y is normally distributed
with mean $E(Y) = b_0 + b_1 X$ and a constant standard deviation $s_e$.
[Figure: normal curves centered at $E(Y|X_1) = b_0 + b_1 X_1$,
$E(Y|X_2) = b_0 + b_1 X_2$, and $E(Y|X_3) = b_0 + b_1 X_3$.]
The mean value changes with X, but the standard deviation remains constant.
Assessing the Model
• The least squares method produces a regression line whether or not
there is a linear relationship between X and Y.
• Consequently, it is important to assess how well the linear model fits the
data.
• Several methods are used to assess the model. All are based on the sum of
squares for errors, SSE.
• RMSE
• Coefficient of variation of RMSE
• Normalized MBE (Mean difference between actual values and model prediction)
• Coefficient of determination
• Corrected coefficient of determination
• Durbin Watson statistics
• T-test
• $RMSE = \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - y_{pred,i})^2 \right)^{1/2}$
• Coefficient of variation of RMSE: $CV(RMSE) = RMSE / Z_{avg}$, where $Z_{avg}$
is the average of the original values.
How to determine overfitting and underfitting
• Durbin–Watson statistic, DW ∈ (0, 4):
• DW ≈ 2: well fitted; DW < 2: underfitted (positive residual
autocorrelation); DW > 2: overfitted (negative residual autocorrelation).
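A short sketch of computing the statistic with statsmodels; the residuals here are randomly generated stand-ins for the residuals of a fitted model:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
residuals = rng.normal(size=100)   # stand-in for y - y_pred from a fitted model

dw = durbin_watson(residuals)      # close to 2 for uncorrelated residuals
print(dw)
```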
Sum of Squares for Errors
• This is the sum of squared differences between the points and
the regression line.
• It can serve as a measure of how well the line fits the
data. SSE is defined by

$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

• A shortcut formula:

$SSE = (n-1)\left[ s_Y^2 - \dfrac{\mathrm{cov}(X,Y)^2}{s_X^2} \right]$
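A sketch that computes SSE both from the definition and with the shortcut formula, using the four sample points from the comparison below; for a least-squares line the two results agree:

```python
import numpy as np

# the four sample points used in the comparison below
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 1.5, 3.2])

# least-squares coefficients (formulas from "The Estimated Coefficients")
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse_direct = np.sum((y - y_hat) ** 2)            # from the definition
n = len(x)
sse_shortcut = (n - 1) * (np.var(y, ddof=1)      # shortcut formula
                          - np.cov(x, y, ddof=1)[0, 1] ** 2 / np.var(x, ddof=1))
print(sse_direct, sse_shortcut)                  # the two values agree
```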
Let us compare two lines through the points (1, 2), (2, 4), (3, 1.5), (4, 3.2).
[Figure: the two candidate lines plotted over the four points.]
• Line 1 (Y = X): sum of squared differences
= (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
• Line 2 (the horizontal line Y = 2.5): sum of squared differences
= (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99
• The smaller the sum of squared differences, the better the fit of the
line to the data.
Standard Error of Estimate
• The mean error is equal to zero.
• If $s_e$ is small, the errors tend to be close to zero (close to
the mean error); then the model fits the data well.
• Therefore, we can use $s_e$ as a measure of the
suitability of using a linear model.
• An estimator of $s_e$ is given by

$s_e = \sqrt{\dfrac{SSE}{n-2}}$
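Continuing in Python, a small helper for the standard error of estimate; y and y_hat are assumed to be arrays of observed and fitted values:

```python
import numpy as np

def standard_error(y, y_hat):
    """s_e = sqrt(SSE / (n - 2)) for simple linear regression."""
    sse = np.sum((y - y_hat) ** 2)
    return np.sqrt(sse / (len(y) - 2))
```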
Example: Data that doesn't meet the assumptions
• You think there is a linear relationship between meat consumption
and the incidence of cancer in the U.S.
• However, you find that much more data have been collected at high
rates of meat consumption than at low rates,
• with the result that there is much more variation in the estimate of
cancer rates at the low range than at the high range.
• Because the data violate the assumption of homoscedasticity, simple
linear regression is not appropriate here.
Implementing simple linear regression in Python
1. Import the packages and classes.
2. Import the data.
3. Visualize the data.
4. Handle missing values and clean the data.
5. Split the data into training and test sets.
6. Build the regression model and train it.
7. Check the results of model fitting, using plots, to know whether the
model is satisfactory.
8. Make predictions using unseen data.
9. Evaluate the model.
Importing packages and data
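The original code listing is not preserved in this extract; a minimal sketch, assuming a hypothetical Salary_Data.csv file (the file and column names are placeholders, not from the slides):

```python
import pandas as pd

# hypothetical file name; the original dataset is not shown in the slides
df = pd.read_csv("Salary_Data.csv")
print(df.head())
```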
Visualize the data
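Continuing the sketch, a simple scatter plot of the raw data (column names are assumed):

```python
import matplotlib.pyplot as plt

plt.scatter(df["YearsExperience"], df["Salary"])  # assumed column names
plt.xlabel("Years of experience")
plt.ylabel("Salary")
plt.show()
```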
Handle missing values and clean the data
• Missing data are present.
• Data cleaning is required, as salary cannot be negative.
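A minimal cleaning sketch under the two observations above (column name assumed):

```python
# drop rows with missing values, then remove impossible negative salaries
df = df.dropna()
df = df[df["Salary"] >= 0]
```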
Visualizing the processed data
Split the data into training and test sets
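A sketch using scikit-learn's train_test_split; the 80/20 split and random_state are illustrative choices:

```python
from sklearn.model_selection import train_test_split

X = df[["YearsExperience"]]   # 2-D feature matrix, column name assumed
y = df["Salary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```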
Build the regression model and train it.
• Import the linear regression class from the linear model module.
• Make an instance of the linear regression class.
• Then train the model using the training data.
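A sketch of these three steps with scikit-learn:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()             # instance of the linear regression class
model.fit(X_train, y_train)            # train the model on the training data
print(model.intercept_, model.coef_)   # estimated b0 and b1
```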
Check the results of model fitting to know
whether the model is satisfactory using plots.
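One way to inspect the fit visually, continuing the sketch:

```python
import matplotlib.pyplot as plt

plt.scatter(X_train, y_train, label="training data")
plt.plot(X_train, model.predict(X_train), color="red", label="fitted line")
plt.legend()
plt.show()
```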
Make predictions using unseen data.
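Continuing the sketch:

```python
y_pred = model.predict(X_test)   # predictions for data the model has not seen
```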
Evaluating the model
• RMSE
• R^2
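A sketch of both metrics with scikit-learn; y_test and y_pred come from the previous steps:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE = {rmse:.2f}, R^2 = {r2:.3f}")
```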
Durbin Watson statistical test
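Applying the Durbin–Watson statistic from statsmodels (see the earlier discussion of its thresholds) to the test-set residuals, continuing the sketch:

```python
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(y_test - y_pred)   # residuals on the test set
print(dw)                             # values near 2 suggest a reasonable fit
```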
Another Example
Import dataset and visualize
Data cleaning
Visualize the data
Splitting the data
Build model and train it
Predicting the output for unseen data
