What is Regression?
•Regression analysis is a statistical method that helps us analyze and
understand the relationship between two or more variables of interest.
• It helps us understand which factors are important, which factors can be
ignored, and how they influence each other. In other words, it analyzes
the specific relationships between the independent variables and the
dependent variable.
• In regression, we normally have one dependent variable and one or more
independent variables, and we forecast the value of the dependent variable (Y)
from the values of the independent variables (X1, X2, …, Xk).
• We try to "regress" the value of the dependent variable Y with the help of
the independent variables.
Types of Regression Approaches
• There are many types of regression approaches; we will study some of
them here:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector for Regression (SVR)
• Decision Tree Regression
• Random Forest Regression
Simple Linear Regression
• In statistics, simple linear regression is a linear regression model with
a single explanatory variable.
• It concerns two-dimensional sample points with one independent
variable and one dependent variable (conventionally, the x and y
coordinates in a Cartesian coordinate system).
• It finds a linear function (a non-vertical straight line) that, as
accurately as possible, predicts the dependent variable values as a
function of the independent variable.
• Simply put, simple linear regression is used to estimate the
relationship between two quantitative variables.
What can simple linear regression be used for?
• You can use simple linear regression when you want to know:
• How strong the relationship is between two variables (e.g. the relationship
between rainfall and soil erosion).
• The value of the dependent variable at a certain value of the independent
variable (e.g. the amount of soil erosion at a certain level of rainfall).
Model for simple linear regression
• Consider the equation of a line given as y = α + βx + Ɛ,
where y is the dependent variable, x is the independent variable, α is
the y-intercept, and β is the slope of the line.
• We need to find α and β to estimate y using x, such that the error Ɛ
between the predicted value of y and the original value of y is minimized.
• The fitted (estimated) line is written as:
Ŷ = b0 + b1X
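To make the fitted line concrete, here is a minimal sketch of predicting Ŷ from X; the values of b0 and b1 below are made up for illustration, not estimated from any data:

```python
# Minimal sketch of the fitted line Y-hat = b0 + b1 * X.
# b0 and b1 are made-up illustrative values, not estimates from real data.
b0 = 2.0   # y-intercept estimate
b1 = 0.5   # slope estimate

def predict(x):
    """Predicted value of the dependent variable for a given x."""
    return b0 + b1 * x

print(predict(10))  # 2.0 + 0.5 * 10 = 7.0
```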
The Model
The model has a deterministic component and a probabilistic component.
Ŷ = b0 + b1X
[Figure: house cost versus house size, with the deterministic line Ŷ = b0 + b1X; most lots sell for $25,000.]
[Figure: house cost versus house size.]
However, house costs vary even among same-size houses! Since cost behaves
unpredictably, we add a random component.
• The first-order linear model:
Y = b0 + b1X + e
where
Y = dependent variable
X = independent variable
b0 = Y-intercept
b1 = slope of the line (rise/run)
e = error variable
b0 and b1 are unknown population parameters, and therefore are estimated
from the data.
[Figure: the line Y = b0 + b1X, with intercept b0 and slope b1 = rise/run.]
Estimating the Coefficients
•The estimates are determined by
• drawing a sample from the population of interest,
• calculating sample statistics,
• producing a straight line that cuts into the data.
[Figure: scatter of data points in the X–Y plane.]
Question: What should be considered a good line?
The Estimated Coefficients
To calculate the estimates of the line coefficients that minimize the
differences between the data points and the line, use the formulas:
b1 = cov(X,Y) / sX² = sXY / sX²
b0 = Ȳ − b1X̄
The regression equation that estimates the equation of the first-order
linear model is:
Ŷ = b0 + b1X
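These two formulas can be sketched in plain Python; the X and Y values below are toy data made up for illustration:

```python
# Toy data (made up) to illustrate b1 = cov(X, Y) / s_X^2 and b0 = Ybar - b1 * Xbar.
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(X)
xbar = sum(X) / n
ybar = sum(Y) / n

# Sample covariance and sample variance (both divided by n - 1, as in the formulas)
cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / (n - 1)
var_x = sum((x - xbar) ** 2 for x in X) / (n - 1)

b1 = cov_xy / var_x        # slope estimate
b0 = ybar - b1 * xbar      # intercept estimate

print(b1, b0)
```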
Working concept of simple linear regression
• The ordinary least squares (OLS) method is
usually used to implement simple linear
regression.
• A good line is one that minimizes the sum
of squared differences between the
points and the line.
• The accuracy of each predicted value is
measured by its squared residual (the vertical
distance between the data point and the
fitted line), and the goal is to
make the sum of these squared
deviations as small as possible.
The Simple Linear Regression Line
• Example 17.2 (Xm17-02)
• A car dealer wants to find
the relationship between
the odometer reading and
the selling price of used cars.
• A random sample of 100 cars is
selected, and the data
recorded.
• Find the regression line.
Car  Odometer  Price
1    37388     14636
2    44758     14122
3    45833     14016
4    30862     15590
5    31705     15568
6    34010     14718
.    .         .
Here Odometer is the independent variable X and Price is the dependent
variable Y.
• Solution
– Solving by hand: calculate a number of statistics, where n = 100:
X̄ = 36,009.45;
Ȳ = 14,822.823;
sX² = Σ(Xi − X̄)² / (n − 1) = 43,528,690
cov(X,Y) = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1) = −2,712,511
b1 = cov(X,Y) / sX² = −2,712,511 / 43,528,690 = −.06232
b0 = Ȳ − b1X̄ = 14,822.82 − (−.06232)(36,009.45) = 17,067
Ŷ = b0 + b1X = 17,067 − .0623X
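Re-running this arithmetic, using the statistics quoted above, confirms the coefficient values:

```python
# Reproducing the car-example arithmetic with the statistics given in the text.
cov_xy = -2_712_511      # cov(X, Y)
var_x = 43_528_690       # s_X squared
xbar = 36_009.45         # mean odometer reading
ybar = 14_822.823        # mean price

b1 = cov_xy / var_x      # slope
b0 = ybar - b1 * xbar    # intercept

print(round(b1, 5))  # -0.06232
print(round(b0))     # 17067
```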
Interpreting the Linear Regression Equation
Ŷ = 17,067 − .0623X
• This is the slope of the line: for each additional mile on the odometer,
the price decreases by an average of $0.0623.
• The intercept is b0 = $17,067. Do not interpret the intercept as the
"price of cars that have not been driven": there are no data near X = 0.
[Figure: odometer line fit plot, price (13,000–16,000) versus odometer,
with the fitted line Ŷ = 17,067 − .0623X.]
Error Variable: Required Conditions for good
performance of simple linear regression
• The error e is a critical part of the regression model.
• Four requirements involving the distribution of e must
be satisfied:
• The probability distribution of e is normal.
• The mean of e is zero: E(e) = 0.
• The standard deviation of e is se for all values of X (homoscedasticity).
• The set of errors associated with different values of Y are all
independent.
The Normality of e
From the first three assumptions we have: Y is normally distributed
with mean E(Y) = b0 + b1X and a constant standard deviation se.
The standard deviation remains constant, but the mean value changes with X.
[Figure: normal curves m1, m2, m3 centered at E(Y|X1) = b0 + b1X1,
E(Y|X2) = b0 + b1X2, and E(Y|X3) = b0 + b1X3.]
Assessing the Model
•The least squares method will produce a regression line whether or not
there is a linear relationship between X and Y.
• Consequently, it is important to assess how well the linear model fits the
data.
• Several methods are used to assess the model. All are based on the sum of
squares for errors, SSE:
• RMSE
• Coefficient of variation of RMSE
• Normalized MBE (mean difference between actual values and model predictions)
• Coefficient of determination
• Corrected coefficient of determination
• Durbin–Watson statistic
• t-test
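Several of these measures can be sketched directly from the residuals; the observed and predicted values below are toy numbers made up for illustration:

```python
# Sketch of fit-assessment metrics on toy data (all values made up).
y     = [3.0, 5.0, 7.0, 9.0]   # observed values
y_hat = [2.8, 5.3, 6.9, 9.2]   # model predictions

n = len(y)
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

sse = sum(r ** 2 for r in residuals)         # sum of squares for errors
rmse = (sse / n) ** 0.5                      # root mean squared error
ybar = sum(y) / n
sst = sum((yi - ybar) ** 2 for yi in y)      # total sum of squares
r_squared = 1 - sse / sst                    # coefficient of determination

print(sse, rmse, r_squared)
```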
• RMSE = ( (1/n) Σi=1..n (yi − ypred,i)² )^(1/2)
• The coefficient of variation of RMSE is CV(RMSE) = RMSE / Zavg,
where Zavg is the average of the original values.
How to determine overfitting and underfitting
• The Durbin–Watson statistic lies in (0, 4):
• DW ≈ 2 indicates a well-fitted model (no autocorrelation in the residuals);
DW < 2 suggests positive autocorrelation (underfitting); DW > 2, approaching 4,
suggests negative autocorrelation (overfitting).
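The statistic itself is easy to compute from the residuals: DW = Σ(eₜ − eₜ₋₁)² / Σeₜ². A sketch with made-up residuals:

```python
# Durbin-Watson statistic: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# The residuals below are made up; a value near 2 means little autocorrelation.
e = [0.5, -0.3, 0.2, -0.4, 0.1]

num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
den = sum(et ** 2 for et in e)
dw = num / den

print(dw)
```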
Sum of Squares for Errors
• This is the sum of squared differences between the points and
the regression line.
• It can serve as a measure of how well the line fits the
data. SSE is defined by:
SSE = Σi=1..n (Yi − Ŷi)²
– A shortcut formula:
SSE = (n − 1)(sY² − cov(X,Y)² / sX²)
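A quick numeric check (on made-up toy data) shows that the shortcut formula agrees with the direct definition:

```python
# Verify that SSE = (n-1) * (s_Y^2 - cov(X,Y)^2 / s_X^2) matches the direct
# sum of squared residuals of the least-squares line. Toy data, made up.
X = [1, 2, 3, 4]
Y = [2.0, 4.0, 1.5, 3.2]
n = len(X)
xbar = sum(X) / n
ybar = sum(Y) / n

var_x = sum((x - xbar) ** 2 for x in X) / (n - 1)
var_y = sum((y - ybar) ** 2 for y in Y) / (n - 1)
cov_xy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / (n - 1)

b1 = cov_xy / var_x
b0 = ybar - b1 * xbar

sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
sse_shortcut = (n - 1) * (var_y - cov_xy ** 2 / var_x)

print(sse_direct, sse_shortcut)
```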
Let us compare two lines for the points (1, 2), (2, 4), (3, 1.5), and (4, 3.2).
• For the line Y = X (predicted values 1, 2, 3, 4):
sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
• For the horizontal line Y = 2.5:
sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99
• The smaller the sum of
squared differences,
the better the fit of the
line to the data: here the horizontal line fits better.
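The two sums can be verified in a few lines, using the four points from the example:

```python
# Comparing two candidate lines on the four example points.
points = [(1, 2.0), (2, 4.0), (3, 1.5), (4, 3.2)]

# Line 1: Y = X, so the prediction at each x is x itself.
ssd_line = sum((y - x) ** 2 for x, y in points)

# Line 2: the horizontal line Y = 2.5.
ssd_flat = sum((y - 2.5) ** 2 for _, y in points)

print(ssd_line, ssd_flat)
```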
Standard Error of Estimate
• The mean error is equal to zero.
• If se is small, the errors tend to be close to zero (close to
the mean error); then the model fits the data well.
• Therefore, we can use se as a measure of the
suitability of using a linear model.
• An estimator of se is given by:
se = ( SSE / (n − 2) )^(1/2)
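As a quick sketch of the estimator (SSE and n below are made-up values):

```python
# Standard error of estimate: s_e = sqrt(SSE / (n - 2)).
# SSE and n are made-up values for illustration.
sse = 50.0   # sum of squared errors
n = 12       # sample size

s_e = (sse / (n - 2)) ** 0.5

print(round(s_e, 3))  # sqrt(5) rounded to 2.236
```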
Example: Data that doesn't meet the
assumptions
• You think there is a linear relationship between meat consumption
and the incidence of cancer in the U.S.
• However, you find that much more data has been collected at high
rates of meat consumption than at low rates,
• with the result that there is much more variation in the estimate of
cancer rates at the low range than at the high range.
• Because the data violate the assumption of homoscedasticity, they
don't work for simple linear regression.
Implementing simple linear regression in
Python
1. Import the packages and classes.
2. Import the data.
3. Visualize the data.
4. Handle missing values and clean the data.
5. Split the data into training and test sets.
6. Build the regression model and train it.
7. Check the results of model fitting to know whether the model is
satisfactory, using plots.
8. Make predictions using unseen data.
9. Evaluate the model.
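The steps above can be sketched end to end in plain Python, using synthetic data and the least-squares formulas written out by hand so that no external libraries are assumed:

```python
import random

random.seed(0)

# Steps 1-2 (stand-in): generate synthetic data y = 3 + 2x + noise
# instead of importing a real dataset.
data = [(x, 3 + 2 * x + random.gauss(0, 1)) for x in [i / 10 for i in range(100)]]

# Step 5: split into training and test sets (80/20).
random.shuffle(data)
train, test = data[:80], data[80:]

# Step 6: fit b0 and b1 on the training data with the least-squares formulas.
n = len(train)
xbar = sum(x for x, _ in train) / n
ybar = sum(y for _, y in train) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in train) / \
     sum((x - xbar) ** 2 for x, _ in train)
b0 = ybar - b1 * xbar

# Steps 8-9: predict on the unseen test set and evaluate with RMSE.
sq_errs = [(y - (b0 + b1 * x)) ** 2 for x, y in test]
rmse = (sum(sq_errs) / len(test)) ** 0.5

print(b0, b1, rmse)
```

With the noise standard deviation of 1 used here, the fitted coefficients land close to the true values 3 and 2, and the test RMSE is close to 1.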
Build the regression model and train it.
• Import the linear regression class from the linear model module.
• Make an instance of the linear regression class.
• Then train the model using the training data.
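A sketch of these three bullets using scikit-learn's LinearRegression (assuming scikit-learn is installed; the training data below is a made-up toy set where y = 2x):

```python
from sklearn.linear_model import LinearRegression  # the linear regression class

# Toy training data (made up): X must be 2-D, shape (n_samples, n_features).
X_train = [[1], [2], [3], [4], [5]]
y_train = [2, 4, 6, 8, 10]

model = LinearRegression()      # make an instance of the class
model.fit(X_train, y_train)     # train the model on the training data

print(model.coef_[0], model.intercept_)
```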
Check the results of model fitting to know
whether the model is satisfactory, using plots.
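Plots such as a scatter of residuals versus fitted values are the main tool here; one numeric companion check is that OLS residuals average to (almost exactly) zero by construction. A sketch with made-up data:

```python
# Numeric companion to the diagnostic plots: after an OLS fit, the residuals
# should average ~0 and show no trend in x. Toy data, made up for illustration.
X = [1, 2, 3, 4, 5]
Y = [2.2, 3.9, 6.1, 8.0, 9.8]
n = len(X)
xbar = sum(X) / n
ybar = sum(Y) / n

b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

residuals = [y - (b0 + b1 * x) for x, y in zip(X, Y)]
mean_residual = sum(residuals) / n   # ~0 by construction of least squares

print(mean_residual)
```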