Regression
What is Regression?
• Regression analysis is a statistical method that helps us analyze and
understand the relationship between two or more variables of interest.
• It helps us understand which factors are important, which factors can be
ignored, and how they influence each other. In other words, it analyzes
the specific relationships between the independent variables and the
dependent variable.
• In regression, we normally have one dependent variable and one or more
independent variables. The goal is to forecast the value of the dependent
variable (Y) from the values of the independent variables (X1, X2, …, Xk).
• We try to "regress" the value of the dependent variable Y with the help of
the independent variables.
Types of Regression Approaches
• There are many types of regression approaches; we will study some of
them here:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression (SVR)
• Decision Tree Regression
• Random Forest Regression
Simple linear regression
• In statistics, simple linear regression is a linear regression model with
a single explanatory variable.
• It concerns two-dimensional sample points with one independent
variable and one dependent variable (conventionally, the x and y
coordinates in a Cartesian coordinate system).
• It finds a linear function (a non-vertical straight line) that, as
accurately as possible, predicts the dependent variable values as a
function of the independent variable.
• Simply put, simple linear regression is used to estimate the
relationship between two quantitative variables.
What can simple linear regression be used for?
• You can use simple linear regression when you want to know:
• How strong the relationship is between two variables (e.g. the relationship
between rainfall and soil erosion).
• The value of the dependent variable at a certain value of the independent
variable (e.g. the amount of soil erosion at a certain level of rainfall).
Model for simple linear regression
• Consider the equation of a line given as $y = \alpha + \beta x$,
• where $y$ is the dependent variable, $x$ is the independent variable, $\alpha$ is
the y-intercept and $\beta$ is the slope of the line.
• We need to find $\alpha$ and $\beta$ to estimate $y$ using $x$, such that the error $\varepsilon$
between the predicted value of $y$ and the original value of $y$ is minimized.
• The estimated line is written as $\hat{Y} = b_0 + b_1 X$.
The Model
• The model has a deterministic and a probabilistic component.
• [Figure: house cost vs. house size; the deterministic component is the line
$\hat{Y} = b_0 + b_1 X$, and most lots sell for $25,000.]
• However, house costs vary even among houses of the same size! Since cost
behaves unpredictably, we add a random component.
• The first-order linear model:

$Y = b_0 + b_1 X + e$

where
Y = dependent variable
X = independent variable
$b_0$ = Y-intercept
$b_1$ = slope of the line ($b_1$ = rise/run)
e = error variable

• $b_0$ and $b_1$ are unknown population parameters, and are therefore estimated
from the data.
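To make the deterministic and probabilistic components concrete, here is a minimal sketch that simulates data from a first-order linear model; the parameter values and the normal error distribution are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

b0, b1 = 2.0, 0.5                # assumed population intercept and slope
n = 50

X = rng.uniform(0.0, 10.0, n)    # independent variable
e = rng.normal(0.0, 1.0, n)      # error: mean 0, constant standard deviation
Y = b0 + b1 * X + e              # deterministic part + random component
```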
Estimating the Coefficients
• The estimates are determined by
• drawing a sample from the population of interest,
• calculating sample statistics,
• producing a straight line that cuts into the data.
[Figure: scatter of sample data points in the X–Y plane.]
Question: What should be considered a good line?
The Estimated Coefficients
To calculate the estimates of the line coefficients that minimize the
differences between the data points and the line, use the formulas:

$b_1 = \dfrac{\mathrm{cov}(X,Y)}{s_X^2} = \dfrac{s_{XY}}{s_X^2}$

$b_0 = \bar{Y} - b_1 \bar{X}$

The regression equation that estimates the equation of the first-order
linear model is:

$\hat{Y} = b_0 + b_1 X$
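A minimal sketch of these formulas in Python with NumPy, assuming x and y are one-dimensional arrays of sample data:

```python
import numpy as np

def fit_line(x, y):
    """Estimate b0 and b1 with the least-squares formulas above."""
    s_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance cov(X, Y)
    s_x2 = np.var(x, ddof=1)            # sample variance s_X^2
    b1 = s_xy / s_x2
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# example: recover the coefficients of the simulated model sketched earlier
# b0_hat, b1_hat = fit_line(X, Y)
```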
Working concept of simple linear regression
• The ordinary least squares (OLS) method is
usually used to implement simple linear
regression.
• A good line is one that minimizes the sum
of squared differences between the
points and the line.
• The accuracy of each predicted value is
measured by its squared residual (vertical
distance between the point of the data
set and the fitted line), and the goal is to
make the sum of these squared
deviations as small as possible.
• Example 17.2 (Xm17-02)
• A car dealer wants to find
the relationship between
the odometer reading and
the selling price of used cars.
• A random sample of 100 cars is
selected, and the data
recorded.
• Find the regression line.
Car  Odometer  Price
1    37388     14636
2    44758     14122
3    45833     14016
4    30862     15590
5    31705     15568
6    34010     14718
.    .         .
(Odometer reading is the independent variable X; selling price is the
dependent variable Y.)
The Simple Linear Regression Line
• Solution
– Solving by hand: Calculate a number of statistics

$\bar{X} = 36{,}009.45$; $\bar{Y} = 14{,}822.823$; where n = 100.

$s_X^2 = \dfrac{\sum (X_i - \bar{X})^2}{n-1} = 43{,}528{,}690$

$\mathrm{cov}(X,Y) = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} = -2{,}712{,}511$

$b_1 = \dfrac{\mathrm{cov}(X,Y)}{s_X^2} = \dfrac{-2{,}712{,}511}{43{,}528{,}690} = -0.06232$

$b_0 = \bar{Y} - b_1\bar{X} = 14{,}822.82 - (-0.06232)(36{,}009.45) = 17{,}067$

$\hat{Y} = b_0 + b_1 X = 17{,}067 - 0.0623X$
This is the slope of the line.
For each additional mile on the odometer,
the price decreases by an average of $0.0623
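As a quick check, the coefficients can be reproduced in Python from the summary statistics above (all numbers are taken from this example):

```python
# summary statistics from the slide (n = 100 used cars)
x_bar, y_bar = 36_009.45, 14_822.823
s_x2 = 43_528_690            # sample variance of the odometer readings
cov_xy = -2_712_511          # sample covariance of odometer and price

b1 = cov_xy / s_x2           # -> approx. -0.06232
b0 = y_bar - b1 * x_bar      # -> approx. 17,067
print(b0, b1)
```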
[Figure: Odometer line fit plot, Price (13,000–16,000) vs. Odometer, with
the fitted line $\hat{Y} = 17{,}067 - 0.0623X$.]
Interpreting the Linear Regression Equation
• The intercept is $b_0 = \$17{,}067$.
• Do not interpret the intercept as the "price of cars that have not been
driven": there are no data near X = 0, so the line should not be
extrapolated there.
Error Variable: Required Conditions for better
performance of simple linear regression
• The error e is a critical part of the regression model.
• Four requirements involving the distribution of e must
be satisfied.
• The probability distribution of e is normal.
• The mean of e is zero: E(e) = 0.
• The standard deviation of e is se for all values of X.
• The set of errors associated with different values of Y are all
independent.
The Normality of e
From the first three assumptions we have: Y is normally distributed
with mean $E(Y) = b_0 + b_1 X$ and a constant standard deviation $s_e$.
[Figure: normal curves centered at $E(Y|X_1) = b_0 + b_1 X_1$,
$E(Y|X_2) = b_0 + b_1 X_2$, and $E(Y|X_3) = b_0 + b_1 X_3$.]
The mean value changes with X, but the standard deviation remains constant.
Assessing the Model
• The least squares method produces a regression line whether or not
there is a linear relationship between X and Y.
• Consequently, it is important to assess how well the linear model fits the
data.
• Several methods are used to assess the model. All are based on the sum of
squares for errors, SSE.
• RMSE
• Coefficient of variation of RMSE
• Normalized MBE (Mean difference between actual values and model prediction)
• Coefficient of determination
• Corrected coefficient of determination
• Durbin Watson statistics
• T-test
• $RMSE = \left( \frac{1}{n} \sum_{i=1}^{n} (y_i - y_{pred,i})^2 \right)^{1/2}$
• Coefficient of variation of RMSE: $CV(RMSE) = RMSE / Z_{avg}$, where $Z_{avg}$
is the average of the original values.
How to determine overfitting and underfitting
• Durbin–Watson statistic, DW ∈ (0, 4):
• DW ≈ 2: well fitted; DW < 2: underfitted (positive residual
autocorrelation); DW > 2: overfitted (negative residual autocorrelation).
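A short sketch of computing the statistic with statsmodels; the residuals here are randomly generated stand-ins for the residuals of a fitted model:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
residuals = rng.normal(size=100)   # stand-in for y - y_pred from a fitted model

dw = durbin_watson(residuals)      # close to 2 for uncorrelated residuals
print(dw)
```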
Sum of Squares for Errors
• This is the sum of squared differences between the points and
the regression line.
• It can serve as a measure of how well the line fits the
data. SSE is defined by

$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

• A shortcut formula:

$SSE = (n-1)\left[ s_Y^2 - \dfrac{\mathrm{cov}(X,Y)^2}{s_X^2} \right]$
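A sketch that computes SSE both from the definition and with the shortcut formula, using the four sample points from the comparison below; for a least-squares line the two results agree:

```python
import numpy as np

# the four sample points used in the comparison below
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 1.5, 3.2])

# least-squares coefficients (formulas from "The Estimated Coefficients")
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sse_direct = np.sum((y - y_hat) ** 2)            # from the definition
n = len(x)
sse_shortcut = (n - 1) * (np.var(y, ddof=1)      # shortcut formula
                          - np.cov(x, y, ddof=1)[0, 1] ** 2 / np.var(x, ddof=1))
print(sse_direct, sse_shortcut)                  # the two values agree
```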
Let us compare two lines through the points (1, 2), (2, 4), (3, 1.5), (4, 3.2).
[Figure: the two candidate lines plotted over the four points.]
• Line 1 (Y = X): sum of squared differences
= (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
• Line 2 (the horizontal line Y = 2.5): sum of squared differences
= (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99
• The smaller the sum of squared differences, the better the fit of the
line to the data.
Standard Error of Estimate
• The mean error is equal to zero.
• If $s_e$ is small, the errors tend to be close to zero (close to
the mean error); then the model fits the data well.
• Therefore, we can use $s_e$ as a measure of the
suitability of using a linear model.
• An estimator of $s_e$ is given by

$s_e = \sqrt{\dfrac{SSE}{n-2}}$
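Continuing in Python, a small helper for the standard error of estimate; y and y_hat are assumed to be arrays of observed and fitted values:

```python
import numpy as np

def standard_error(y, y_hat):
    """s_e = sqrt(SSE / (n - 2)) for simple linear regression."""
    sse = np.sum((y - y_hat) ** 2)
    return np.sqrt(sse / (len(y) - 2))
```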
Example: Data that doesn't meet the assumptions
• You think there is a linear relationship between meat consumption
and the incidence of cancer in the U.S.
• However, you find that much more data have been collected at high
rates of meat consumption than at low rates,
• with the result that there is much more variation in the estimate of
cancer rates at the low range than at the high range.
• Because the data violate the assumption of homoscedasticity, simple
linear regression is not appropriate here.
Implementing simple linear regression in Python
1. Import the packages and classes.
2. Import the data.
3. Visualize the data.
4. Handle missing values and clean the data.
5. Split the data into training and test sets.
6. Build the regression model and train it.
7. Check the results of model fitting, using plots, to know whether the
model is satisfactory.
8. Make predictions using unseen data.
9. Evaluate the model.
Importing packages and data
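The original code listing is not preserved in this extract; a minimal sketch, assuming a hypothetical Salary_Data.csv file (the file and column names are placeholders, not from the slides):

```python
import pandas as pd

# hypothetical file name; the original dataset is not shown in the slides
df = pd.read_csv("Salary_Data.csv")
print(df.head())
```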
Visualize the data
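Continuing the sketch, a simple scatter plot of the raw data (column names are assumed):

```python
import matplotlib.pyplot as plt

plt.scatter(df["YearsExperience"], df["Salary"])  # assumed column names
plt.xlabel("Years of experience")
plt.ylabel("Salary")
plt.show()
```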
Handle missing values and clean the data
• Missing data are present.
• Data cleaning is required, as salary cannot be negative.
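A minimal cleaning sketch under the two observations above (column name assumed):

```python
# drop rows with missing values, then remove impossible negative salaries
df = df.dropna()
df = df[df["Salary"] >= 0]
```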
Visualizing the processed data
Split the data into training and test sets
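A sketch using scikit-learn's train_test_split; the 80/20 split and random_state are illustrative choices:

```python
from sklearn.model_selection import train_test_split

X = df[["YearsExperience"]]   # 2-D feature matrix, column name assumed
y = df["Salary"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```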
Build the regression model and train it.
• Import the linear regression class from the linear model module.
• Make an instance of the linear regression class.
• Then train the model using the training data.
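A sketch of these three steps with scikit-learn:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()             # instance of the linear regression class
model.fit(X_train, y_train)            # train the model on the training data
print(model.intercept_, model.coef_)   # estimated b0 and b1
```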
Check the results of model fitting to know
whether the model is satisfactory using plots.
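One way to inspect the fit visually, continuing the sketch:

```python
import matplotlib.pyplot as plt

plt.scatter(X_train, y_train, label="training data")
plt.plot(X_train, model.predict(X_train), color="red", label="fitted line")
plt.legend()
plt.show()
```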
Make predictions using unseen data.
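Continuing the sketch:

```python
y_pred = model.predict(X_test)   # predictions for data the model has not seen
```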
Evaluating the model
• RMSE
• R^2
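A sketch of both metrics with scikit-learn; y_test and y_pred come from the previous steps:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE = {rmse:.2f}, R^2 = {r2:.3f}")
```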
Durbin Watson statistical test
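Applying the Durbin–Watson statistic from statsmodels (see the earlier discussion of its thresholds) to the test-set residuals, continuing the sketch:

```python
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(y_test - y_pred)   # residuals on the test set
print(dw)                             # values near 2 suggest a reasonable fit
```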
Another Example
Import dataset and visualize
Data cleaning
Visualize the data
Splitting the data
Build model and train it
Predicting the output for unseen data
