me310_5_regression.pdf numerical method for engineering
ME 310
Numerical Methods
Least Squares Regression
These presentations are prepared by
Dr. Cuneyt Sert
Mechanical Engineering Department
Middle East Technical University
Ankara, Turkey
They cannot be used without the permission of the author.
About Curve Fitting
• Curve fitting is expressing a discrete set of data points as a continuous function.
• It is frequently used in engineering. For example, the empirical relations that we use in heat transfer
and fluid mechanics are functions fitted to experimental data.
• Regression: Mainly used with experimental data, which might have a significant amount of error
(noise). There is no need to find a function that passes through all discrete points.
[Figures: f(x) vs. x for Linear Regression and for Polynomial Regression]
• Interpolation: Used if the data is known to be very precise. Find a function (or a series of
functions) that passes through all discrete points.
[Figures: f(x) vs. x for Polynomial Interpolation (a single function) and for Spline Interpolation (four different functions)]
Least Squares Regression
(Read the statistics review from the book.)
• Fitting a straight line to a set of paired data points:
(x1, y1), (x2, y2), (x3, y3), . . . , (xn, yn)
[Figure: data points in the x-y plane with the line y = a0 + a1 x; ei is the vertical distance between yi and a0 + a1 xi]
a0 : y-intercept (unknown)
a1 : slope (unknown)
ei = yi - a0 - a1 xi : Error (deviation) for the ith data point
• Minimize the error (deviation) to get a best-fit line (to find a0 and a1). Several possibilities are:
• Minimize the sum of individual errors.
• Minimize the sum of absolute values of individual errors.
• Minimize the maximum error.
• Minimize the sum of squares of individual errors. This is the preferred strategy (check the
book to see why the others fail).
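Minimizing the Square of Individual Errors
Sum of squares of the residuals:
$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$$
• Determine the unknowns a0 and a1 by minimizing Sr.
• To do this set the derivatives of Sr wrt a0 and a1 to zero.
$$\frac{\partial S_r}{\partial a_0} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i) = 0 \quad\Rightarrow\quad n\,a_0 + \Big(\sum x_i\Big) a_1 = \sum y_i$$
$$\frac{\partial S_r}{\partial a_1} = -2 \sum_{i=1}^{n} \big[(y_i - a_0 - a_1 x_i)\,x_i\big] = 0 \quad\Rightarrow\quad \Big(\sum x_i\Big) a_0 + \Big(\sum x_i^2\Big) a_1 = \sum x_i y_i$$
or
$$\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \end{Bmatrix} = \begin{Bmatrix} \sum y_i \\ \sum x_i y_i \end{Bmatrix}$$
These are called the normal equations.
• Solve these for a0 and a1. The results are
$$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \big(\sum x_i\big)^2}, \qquad a_0 = \bar{y} - a_1 \bar{x}$$
where $\bar{y}$ and $\bar{x}$ are the means of y and x, respectively.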
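The closed-form results above translate directly into a few lines of code. The following is a minimal sketch in plain Python; the function name `fit_line` and the small test data set are illustrative assumptions, not part of the lecture.

```python
def fit_line(x, y):
    """Return (a0, a1) of the least-squares line y = a0 + a1*x."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    # Denominator of the slope formula; zero when all x values coincide.
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    a1 = (n * sum_xy - sum_x * sum_y) / denom   # slope
    a0 = sum_y / n - a1 * sum_x / n             # intercept: y_bar - a1 * x_bar
    return a0, a1


# Points lying exactly on y = 1 + 2x should be recovered exactly.
a0, a1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a0, a1)  # 1.0 2.0
```

The guard on the denominator reflects the fact that the normal equations become singular when every x value is the same.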
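Example 24:
Use least-squares regression to fit a straight line to
x 1 3 5 7 10 12 13 16 18 20
y 4 5 6 5 8 7 6 9 12 11
$$n = 10, \quad \sum x_i = 105, \quad \sum y_i = 73, \quad \bar{x} = 10.5, \quad \bar{y} = 7.3, \quad \sum x_i^2 = 1477, \quad \sum x_i y_i = 906$$
$$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \big(\sum x_i\big)^2} = \frac{10 \cdot 906 - 105 \cdot 73}{10 \cdot 1477 - 105^2} = 0.3725$$
$$a_0 = \bar{y} - a_1 \bar{x} = 7.3 - 0.3725 \cdot 10.5 = 3.3888$$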
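The hand calculation of Example 24 can be checked with a short script. This is only a sketch in plain Python; the variable names are chosen here for illustration.

```python
# Data of Example 24
x = [1, 3, 5, 7, 10, 12, 13, 16, 18, 20]
y = [4, 5, 6, 5, 8, 7, 6, 9, 12, 11]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Slope and intercept from the normal equations
a1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a0 = sum_y / n - a1 * sum_x / n

print(n, sum_x, sum_y, sum_x2, sum_xy)   # 10 105 73 1477 906
print(f"a1 = {a1:.4f}, a0 = {a0:.4f}")   # a1 = 0.3725, a0 = 3.3888
```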
Exercise 24: It is always a good idea to plot the data points and the regression line to see
how well the line represents the points. You can do this with Excel. Excel will calculate a0
and a1 for you.
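Error of Linear Regression (How good is the best line?)
Spread of data around the mean:
$$S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad s_y = \sqrt{\frac{S_t}{n-1}} \quad \text{(std. deviation)}$$
Spread of data around the regression line:
$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2, \qquad s_{y/x} = \sqrt{\frac{S_r}{n-2}} \quad \text{(std. error of estimate)}$$
[Figures: y vs. x showing the spread of data around the mean, and the spread of data around the regression line a0 + a1x]
• The improvement obtained by using a regression line instead of the mean gives a measure of
how good the regression fit is.
$$r^2 = \frac{S_t - S_r}{S_t}$$
$r^2$ : coefficient of determination
$r$ : correlation coefficient
$$r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \big(\sum x_i\big)^2}\;\sqrt{n \sum y_i^2 - \big(\sum y_i\big)^2}}$$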
How to interpret the correlation coefficient?
• Two extreme cases are
• Sr = 0 → r = 1 describes a perfect fit (straight line passing through all points).
• Sr = St → r = 0 describes a case with no improvement.
• Usually an r value close to 1 represents a good fit. But be careful and always plot the data points
and the regression line together to see what is going on.
Example 24 (cont’d): Calculate the correlation coefficient.
$$n = 10, \quad \sum x_i = 105, \quad \sum y_i = 73, \quad \bar{x} = 10.5, \quad \bar{y} = 7.3, \quad \sum x_i^2 = 1477, \quad \sum x_i y_i = 906$$
$$S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 64.1$$
$$S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2 = 12.14$$
$$r^2 = \frac{S_t - S_r}{S_t} = 0.8107, \qquad r = 0.9$$
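The quantities St, Sr, the standard error of the estimate and the correlation coefficient for Example 24 can be reproduced as follows. This is a sketch in plain Python with illustrative variable names; it evaluates r both from St and Sr and from the sum formula, which should agree.

```python
import math

# Data of Example 24
x = [1, 3, 5, 7, 10, 12, 13, 16, 18, 20]
y = [4, 5, 6, 5, 8, 7, 6, 9, 12, 11]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Regression line
a1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a0 = sum_y / n - a1 * sum_x / n

# Spread around the mean and around the regression line
y_mean = sum_y / n
St = sum((yi - y_mean) ** 2 for yi in y)
Sr = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))

s_y = math.sqrt(St / (n - 1))     # standard deviation
s_yx = math.sqrt(Sr / (n - 2))    # standard error of the estimate
r2 = (St - Sr) / St               # coefficient of determination

# Correlation coefficient from the sum formula
r = (n * sum_xy - sum_x * sum_y) / (
    math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
)

print(f"St = {St:.2f}, Sr = {Sr:.2f}, r2 = {r2:.4f}, r = {r:.2f}")
print(f"s_y = {s_y:.3f}, s_y/x = {s_yx:.3f}")
```

A value of s_y/x smaller than s_y indicates that the line describes the data better than the mean alone does.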
Example 24 (cont’d): Reverse x and y. Find the linear regression line and calculate r.
x = -5.3869 + 2.1763 y
St = 374.5, Sr = 70.91 (different than before).
r^2 = 0.8107, r = 0.9 (same as before).
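The effect of swapping the roles of x and y can be checked numerically. The sketch below uses plain Python; the helper names `fit_line` and `r_squared` are assumptions made for this illustration.

```python
def fit_line(u, v):
    """Least-squares line v = a0 + a1*u; returns (a0, a1)."""
    n = len(u)
    su, sv = sum(u), sum(v)
    suv = sum(ui * vi for ui, vi in zip(u, v))
    su2 = sum(ui * ui for ui in u)
    a1 = (n * suv - su * sv) / (n * su2 - su ** 2)
    a0 = sv / n - a1 * su / n
    return a0, a1


def r_squared(u, v, a0, a1):
    """Coefficient of determination of the line v = a0 + a1*u."""
    v_mean = sum(v) / len(v)
    St = sum((vi - v_mean) ** 2 for vi in v)
    Sr = sum((vi - a0 - a1 * ui) ** 2 for ui, vi in zip(u, v))
    return (St - Sr) / St


x = [1, 3, 5, 7, 10, 12, 13, 16, 18, 20]
y = [4, 5, 6, 5, 8, 7, 6, 9, 12, 11]

a0_yx, a1_yx = fit_line(x, y)   # y as a function of x
a0_xy, a1_xy = fit_line(y, x)   # x as a function of y

r2_yx = r_squared(x, y, a0_yx, a1_yx)
r2_xy = r_squared(y, x, a0_xy, a1_xy)

print(f"y = {a0_yx:.4f} + {a1_yx:.4f} x, r2 = {r2_yx:.4f}")
print(f"x = {a0_xy:.4f} + {a1_xy:.4f} y, r2 = {r2_xy:.4f}")
```

The two lines are different, but the coefficient of determination is the same because the formula for r is symmetric in x and y.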
Exercise 25: When working with experimental data we usually take the variable that is
controlled by us in a precise way as x. The measured or calculated quantities are y. See
Midterm II of Fall 2003 for an example.
Linearization of Nonlinear Behavior
• Linear regression is useful to represent a linear relationship.
• If the relation is nonlinear, either another technique can be used or the data can be transformed so
that linear regression can still be used. The latter technique is frequently used to fit the following
nonlinear equations to a set of data.
• Exponential equation ( y = A1 e^(B1 x) )
• Power equation ( y = A2 x^B2 )
• Saturation-growth rate equation ( y = A3 x / (B3 + x) )
Linearization of Nonlinear Behavior (cont’d)
(1) Exponential Equation (y = A1 e^(B1 x))
Linearization: y = A1 e^(B1 x) becomes ln y = ln A1 + B1 x, or ln y = a0 + a1 x
[Figures: y vs. x for the exponential curve, and ln y vs. x for the linearized form]
Example 25: Fit an exponential model to the following data set.
x 0.4 0.8 1.2 1.6 2.0 2.3
y 750 1000 1400 2000 2700 3750
• Create the following table.
x 0.4 0.8 1.2 1.6 2.0 2.3
ln y 6.62 6.91 7.24 7.60 7.90 8.23
• Fit a straight line to this new data set. Be careful with the notation. You can define z = ln y.
• Calculate a0 = 6.25 and a1 = 0.841. The straight line is ln y = 6.25 + 0.841 x.
• Switch back to the original equation. A1 = e^a0 = 518, B1 = a1 = 0.841.
• Therefore the exponential equation is y = 518 e^(0.841 x). Check this solution with a couple of data
points. For example y(1.2) = 518 e^(0.841 (1.2)) = 1421 or y(2.3) = 518 e^(0.841 (2.3)) = 3584. OK.
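The steps of Example 25 can be sketched in plain Python as below. The helper `fit_line` is an illustrative name. Because the script carries full precision instead of the rounded table values, A1 comes out slightly different from the 518 obtained by hand (about 519.5).

```python
import math


def fit_line(u, v):
    """Least-squares line v = a0 + a1*u; returns (a0, a1)."""
    n = len(u)
    su, sv = sum(u), sum(v)
    suv = sum(ui * vi for ui, vi in zip(u, v))
    su2 = sum(ui * ui for ui in u)
    a1 = (n * suv - su * sv) / (n * su2 - su ** 2)
    a0 = sv / n - a1 * su / n
    return a0, a1


x = [0.4, 0.8, 1.2, 1.6, 2.0, 2.3]
y = [750, 1000, 1400, 2000, 2700, 3750]

# Linearize: z = ln y = ln A1 + B1 x
z = [math.log(yi) for yi in y]
a0, a1 = fit_line(x, z)

# Switch back to the original equation y = A1 * exp(B1 * x)
A1 = math.exp(a0)
B1 = a1

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")
print(f"A1 = {A1:.1f}, B1 = {B1:.3f}")
print(f"y(1.2) = {A1 * math.exp(B1 * 1.2):.0f}")
```

Note that this minimizes the squared error in ln y, not in y itself, so the result is not identical to a direct nonlinear fit.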
Linearization of Nonlinear Behavior (cont’d)
(2) Power Equation (y = A2 x^B2)
Linearization: y = A2 x^B2 becomes log y = log A2 + B2 log x, or log y = a0 + a1 log x
[Figures: y vs. x for the power curve, and log y vs. log x for the linearized form]
Example 26: Fit a power equation to the following data set.
x 2.5 3.5 5 6 7.5 10 12.5 15 17.5 20
y 7 5.5 3.9 3.6 3.1 2.8 2.6 2.4 2.3 2.3
• Create the following table.
log x 0.398 0.544 0.699 0.778 0.875 1.000 1.097 1.176 1.243 1.301
log y 0.845 0.740 0.591 0.556 0.491 0.447 0.415 0.380 0.362 0.362
• Fit a straight line to this new data set. Be careful with the notation.
• Calculate a0 = 1.002 and a1 = -0.53. The straight line is log y = 1.002 - 0.53 log x.
• Switch back to the original equation. A2 = 10^a0 = 10.05, B2 = a1 = -0.53.
• Therefore the power equation is y = 10.05 x^(-0.53). Check this solution with a couple of data points. For
example y(5) = 10.05 * 5^(-0.53) = 4.28 or y(15) = 10.05 * 15^(-0.53) = 2.39. OK.
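Example 26 can be reproduced with the same approach, this time taking base-10 logarithms of both x and y. This is a sketch in plain Python; `fit_line` is an illustrative helper name.

```python
import math


def fit_line(u, v):
    """Least-squares line v = a0 + a1*u; returns (a0, a1)."""
    n = len(u)
    su, sv = sum(u), sum(v)
    suv = sum(ui * vi for ui, vi in zip(u, v))
    su2 = sum(ui * ui for ui in u)
    a1 = (n * suv - su * sv) / (n * su2 - su ** 2)
    a0 = sv / n - a1 * su / n
    return a0, a1


x = [2.5, 3.5, 5, 6, 7.5, 10, 12.5, 15, 17.5, 20]
y = [7, 5.5, 3.9, 3.6, 3.1, 2.8, 2.6, 2.4, 2.3, 2.3]

# Linearize: log y = log A2 + B2 log x
log_x = [math.log10(v) for v in x]
log_y = [math.log10(v) for v in y]
a0, a1 = fit_line(log_x, log_y)

# Switch back to the original equation y = A2 * x**B2
A2 = 10 ** a0
B2 = a1

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")
print(f"A2 = {A2:.2f}, B2 = {B2:.2f}")
print(f"y(5) = {A2 * 5 ** B2:.2f}, y(15) = {A2 * 15 ** B2:.2f}")
```

Natural logarithms would work equally well, as long as A2 is recovered with the matching inverse (exp instead of a power of 10).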
Linearization of Nonlinear Behavior (cont’d)
(3) Saturation-growth rate Equation (y = A3 x / (B3 + x))
Linearization: y = A3 x / (B3 + x) becomes 1/y = 1/A3 + (B3/A3) (1/x), or 1/y = a0 + a1 (1/x)
[Figures: y vs. x for the saturation-growth rate curve, and 1/y vs. 1/x for the linearized form]
Example 27: Fit a saturation-growth-rate equation to the following data set.
x 0.75 2 2.5 4 6 8 8.5
y 0.8 1.3 1.2 1.6 1.7 1.8 1.7
• Create the following table.
1 / x 1.333 0.5 0.4 0.25 0.1667 0.125 0.118
1 / y 1.25 0.769 0.833 0.625 0.588 0.556 0.588
• Fit a straight line to this new data set. Be careful with the notation.
• Calculate a0 = 0.512 and a1 = 0.562. The straight line is 1/y = 0.512 + 0.562 (1/x).
• Switch back to the original equation. A3 = 1/a0 = 1.953, B3 = a1 A3 = 1.097.
• Therefore the saturation-growth rate equation is y = 1.953 x / (1.097 + x). Check this solution with a
couple of data points. For example y(2) = 1.953 * 2 / (1.097 + 2) = 1.26. OK.
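Example 27 follows the same pattern with reciprocals of both variables. The sketch below uses plain Python; `fit_line` is an illustrative helper name.

```python
def fit_line(u, v):
    """Least-squares line v = a0 + a1*u; returns (a0, a1)."""
    n = len(u)
    su, sv = sum(u), sum(v)
    suv = sum(ui * vi for ui, vi in zip(u, v))
    su2 = sum(ui * ui for ui in u)
    a1 = (n * suv - su * sv) / (n * su2 - su ** 2)
    a0 = sv / n - a1 * su / n
    return a0, a1


x = [0.75, 2, 2.5, 4, 6, 8, 8.5]
y = [0.8, 1.3, 1.2, 1.6, 1.7, 1.8, 1.7]

# Linearize: 1/y = 1/A3 + (B3/A3) * (1/x)
inv_x = [1 / v for v in x]
inv_y = [1 / v for v in y]
a0, a1 = fit_line(inv_x, inv_y)

# Switch back to the original equation y = A3 * x / (B3 + x)
A3 = 1 / a0
B3 = a1 * A3

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")
print(f"A3 = {A3:.3f}, B3 = {B3:.3f}")
print(f"y(2) = {A3 * 2 / (B3 + 2):.2f}")
```

The reciprocal transform requires all x and y values to be nonzero, which holds for this data set.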
Polynomial Regression (Extension of Linear Least Squares)
• Used to find a best-fit curve for a nonlinear behavior.
• This is not the nonlinear regression described at page 468 of the book. That section is omitted.
Example for a second order polynomial regression
[Figure: data points in the x-y plane with the parabola y = a0 + a1 x + a2 x^2; ei is the vertical distance between yi and a0 + a1 xi + a2 xi^2]
ei = yi - a0 - a1 xi - a2 xi^2 : Error (deviation) for the ith data point
Sum of squares of the residuals:
$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2)^2$$
• Minimize this sum to get the normal equations
$$\frac{\partial S_r}{\partial a_0} = 0, \qquad \frac{\partial S_r}{\partial a_1} = 0, \qquad \frac{\partial S_r}{\partial a_2} = 0$$
• Solve these equations with one of the techniques that we learned to get a0, a1 and a2.
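Polynomial Regression Example
• Find the least-squares parabola that fits to the following data set.
x 0 1 2 3 4 5
y 2.1 7.7 13.6 27.2 40.9 61.1
• Normal equations to find a least-squares parabola are
$$\begin{bmatrix} n & \sum x_i & \sum x_i^2 \\ \sum x_i & \sum x_i^2 & \sum x_i^3 \\ \sum x_i^2 & \sum x_i^3 & \sum x_i^4 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} = \begin{Bmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{Bmatrix}$$
$$n = 6, \quad \sum x_i = 15, \quad \sum y_i = 152.6, \quad \sum x_i^2 = 55, \quad \sum x_i y_i = 585.6, \quad \sum x_i^3 = 225, \quad \sum x_i^2 y_i = 2488.8, \quad \sum x_i^4 = 979$$
$$a_0 = 2.479, \quad a_1 = 2.359, \quad a_2 = 1.861$$
$$y = 2.479 + 2.359\,x + 1.861\,x^2$$
$$r^2 = \frac{S_t - S_r}{S_t} = \frac{2513.4 - 3.75}{2513.4} = 0.999, \qquad r = 0.999$$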
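The parabola example can be checked by building the 3x3 normal equations and solving them. The sketch below uses plain Python with a small Gaussian elimination routine written for this illustration; the name `solve` is an assumption and any linear solver covered in the course could be used instead.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        # Bring the largest remaining pivot into position
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (M[r][n] - s) / M[r][r]
    return sol


x = [0, 1, 2, 3, 4, 5]
y = [2.1, 7.7, 13.6, 27.2, 40.9, 61.1]
n = len(x)

# Sums of powers of x, and of powers of x times y
sx = [sum(xi ** k for xi in x) for k in range(5)]               # sx[0] == n
sxy = [sum(xi ** k * yi for xi, yi in zip(x, y)) for k in range(3)]

A = [[sx[i + j] for j in range(3)] for i in range(3)]
a0, a1, a2 = solve(A, sxy)

y_mean = sum(y) / n
St = sum((yi - y_mean) ** 2 for yi in y)
Sr = sum((yi - a0 - a1 * xi - a2 * xi ** 2) ** 2 for xi, yi in zip(x, y))
r2 = (St - Sr) / St

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}, a2 = {a2:.3f}")
print(f"St = {St:.2f}, Sr = {Sr:.2f}, r2 = {r2:.4f}")
```

Entry (i, j) of the coefficient matrix is the sum of x^(i+j), which is why a single list of power sums is enough to fill it.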
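Multiple Linear Regression
• y = y(x1, x2)
• Individual errors are ei = yi - a0 - a1 x1i - a2 x2i
• Sum of squares of the residuals is
$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1i} - a_2 x_{2i})^2$$
• Minimize this sum to get the normal equations
$$\frac{\partial S_r}{\partial a_0} = 0, \qquad \frac{\partial S_r}{\partial a_1} = 0, \qquad \frac{\partial S_r}{\partial a_2} = 0$$
$$\begin{bmatrix} n & \sum x_{1i} & \sum x_{2i} \\ \sum x_{1i} & \sum x_{1i}^2 & \sum x_{1i} x_{2i} \\ \sum x_{2i} & \sum x_{1i} x_{2i} & \sum x_{2i}^2 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} = \begin{Bmatrix} \sum y_i \\ \sum x_{1i} y_i \\ \sum x_{2i} y_i \end{Bmatrix}$$
• Solve these equations to get a0, a1 and a2.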
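Example 28:
• Use multiple linear regression to fit the following data set. Here x and y are the independent variables and z is the dependent variable.
x 0 1 1 2 2 3 3 4 4
y 0 1 2 1 2 1 2 1 2
z 15 18 12.8 25.7 20.6 35 29.8 45.5 40.3
$$n = 9, \quad \sum x_i = 20, \quad \sum y_i = 12, \quad \sum z_i = 242.7, \quad \sum x_i^2 = 60, \quad \sum y_i^2 = 20, \quad \sum x_i y_i = 30, \quad \sum x_i z_i = 661, \quad \sum y_i z_i = 331.2$$
$$\begin{bmatrix} 9 & 20 & 12 \\ 20 & 60 & 30 \\ 12 & 30 & 20 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} = \begin{Bmatrix} 242.7 \\ 661 \\ 331.2 \end{Bmatrix}$$
$$a_0 = 14.40, \quad a_1 = 9.03, \quad a_2 = -5.62$$
$$z = 14.4 + 9.03\,x - 5.62\,y$$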
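The normal equations of Example 28 can be assembled and solved as in the sketch below. It uses plain Python with a small Gaussian elimination helper written for this illustration; the name `solve` is an assumption, not something defined in the lecture.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (M[r][n] - s) / M[r][r]
    return sol


# Data of Example 28: z depends on x and y
x = [0, 1, 1, 2, 2, 3, 3, 4, 4]
y = [0, 1, 2, 1, 2, 1, 2, 1, 2]
z = [15, 18, 12.8, 25.7, 20.6, 35, 29.8, 45.5, 40.3]
n = len(x)


def dot(u, v):
    """Sum of element-wise products of two equally long lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Normal equations for z = a0 + a1*x + a2*y
A = [
    [n,      sum(x),    sum(y)],
    [sum(x), dot(x, x), dot(x, y)],
    [sum(y), dot(x, y), dot(y, y)],
]
b = [sum(z), dot(x, z), dot(y, z)]

a0, a1, a2 = solve(A, b)
print(f"z = {a0:.2f} + {a1:.2f} x + {a2:.2f} y")
```

The negative coefficient of y agrees with the data: at a fixed x, z drops when y goes from 1 to 2.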
Exercise 26: Calculate the standard error of the estimate (sy/x) and the correlation coefficient (r).