Introduction to Statistical Machine Learning

Christfried Webers

Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University

Canberra
February – June 2011


(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")



Part V

Linear Regression 1

Linear Basis Function Models
Maximum Likelihood and Least Squares
Geometry of Least Squares
Sequential Learning
Regularized Least Squares
Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition
Regression
Given a training data set of N observations {x_n} and target values t_n.
Goal: Learn to predict the value of one or more target values t given a new value of the input x.
Example: Polynomial curve fitting (see Introduction).

[Figure: training targets t plotted against inputs x; the task is to predict the unknown target (marked "?") at a new input value.]
Supervised Learning
Training Phase

    Training data x and training targets t are fed into a model with adjustable parameter w;
    training fixes the most appropriate parameter w*.

Test Phase

    Test data x is fed into the model with the fixed parameter w*, producing the test target t.
Linear Basis Function Models
Linear combination of fixed nonlinear basis functions

    y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x)

parameter w = (w_0, . . . , w_{M−1})^T
basis functions φ = (φ_0, . . . , φ_{M−1})^T
convention φ_0(x) = 1
w_0 is the bias parameter
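As an illustration of y(x, w) = w^T φ(x), here is a minimal NumPy sketch (not part of the original slides; the polynomial basis and the example weights are arbitrary choices) that builds φ(x) with the convention φ_0(x) = 1 and evaluates the model:

```python
import numpy as np

def phi_poly(x, M):
    """Polynomial basis: phi_j(x) = x**j for j = 0, ..., M-1, so phi_0(x) = 1."""
    return np.array([x**j for j in range(M)])

M = 4
w = np.array([0.5, -1.0, 2.0, 0.3])   # parameters w = (w_0, ..., w_{M-1})^T; w_0 is the bias
x = 0.7
y = w @ phi_poly(x, M)                # y(x, w) = w^T phi(x)
print(y)
```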
Polynomial Basis Functions

Scalar input variable x

    φ_j(x) = x^j

Limitation: Polynomials are global functions of the input variable x.
Extension: Split the input space into regions and fit a different polynomial to each region (spline functions).

[Figure: polynomial basis functions x^j plotted on the interval [−1, 1].]
'Gaussian' Basis Functions

Scalar input variable x

    φ_j(x) = exp( −(x − µ_j)² / (2s²) )

Not a probability distribution.
No normalisation required; taken care of by the model parameters w.

[Figure: Gaussian basis functions with centres µ_j spread over the interval [−1, 1].]
Sigmoidal Basis Functions

Scalar input variable x

    φ_j(x) = σ( (x − µ_j) / s )

where σ(a) is the logistic sigmoid function defined by

    σ(a) = 1 / (1 + exp(−a))

σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(a) − 1.

[Figure: sigmoidal basis functions with centres µ_j spread over the interval [−1, 1].]
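To make these concrete, a small sketch (assuming NumPy; the centres µ_j and the scale s below are illustrative choices) of the Gaussian and sigmoidal basis functions from the last two slides:

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)); not normalised."""
    return np.exp(-(x - mu)**2 / (2 * s**2))

def sigmoid_basis(x, mu, s):
    """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid sigma(a) = 1/(1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

mu = np.linspace(-1, 1, 5)   # basis function centres (illustrative)
s = 0.4                      # common width / slope parameter (illustrative)
x = 0.3
print(gaussian_basis(x, mu, s))
print(sigmoid_basis(x, mu, s))
```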
Other Basis Functions

Fourier basis: each basis function represents a specific frequency and has infinite spatial extent.
Wavelets: localised in both space and frequency (and also mutually orthogonal, to simplify their application).
Splines: polynomials restricted to regions of the input space.
Maximum Likelihood and Least Squares

No special assumption about the basis functions φ_j(x). In the simplest case, one can think of φ_j(x) = x.
Assume the target t is given by

    t = y(x, w) + ε          (deterministic part plus noise)

where ε is a zero-mean Gaussian random variable with precision (inverse variance) β.
Thus

    p(t | x, w, β) = N(t | y(x, w), β^{−1})
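A quick sketch of this generative view (assuming NumPy; the sinusoidal deterministic part and the value of β are illustrative, not from the slides): targets are drawn as t_n = y(x_n, w) + ε with ε ~ N(0, β^{−1}).

```python
import numpy as np

rng = np.random.default_rng(0)

beta = 25.0                                   # noise precision (illustrative); variance = 1/beta
N = 20
x = rng.uniform(0, 1, size=N)                 # inputs x_n
y_det = np.sin(2 * np.pi * x)                 # deterministic part y(x, w) (assumed for illustration)
t = y_det + rng.normal(0.0, beta**-0.5, N)    # t = y(x, w) + epsilon, epsilon ~ N(0, 1/beta)
```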
Maximum Likelihood and Least Squares

Likelihood of one target t given the data x

    p(t | x, w, β) = N(t | y(x, w), β^{−1})

Set of inputs X with corresponding target values t.
Assume the data are independent and identically distributed (i.i.d.), meaning the data are drawn independently from the same distribution. The likelihood of the targets t is then

    p(t | X, w, β) = Π_{n=1}^{N} N(t_n | y(x_n, w), β^{−1})
                   = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1})

From now on we drop the conditioning variable X from the notation, as with supervised learning we do not seek to model the distribution of the input data.
Maximum Likelihood and Least Squares

Consider the logarithm of the likelihood p(t | w, β) (the logarithm is a monotonic function!)

    ln p(t | w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β^{−1})
                   = Σ_{n=1}^{N} ln [ (β/2π)^{1/2} exp( −(β/2) (t_n − w^T φ(x_n))² ) ]
                   = (N/2) ln β − (N/2) ln(2π) − β E_D(w)

where the sum-of-squares error function is

    E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}².

    arg max_w ln p(t | w, β)  →  arg min_w E_D(w)
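A numerical sanity check of this identity (assuming NumPy and SciPy; the design matrix and parameters are arbitrary synthetic values): the sum of Gaussian log densities equals (N/2) ln β − (N/2) ln(2π) − β E_D(w).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, M, beta = 50, 4, 9.0
Phi = rng.normal(size=(N, M))                  # some design matrix (rows are phi(x_n)^T)
w = rng.normal(size=M)
t = Phi @ w + rng.normal(0, beta**-0.5, N)     # targets drawn from the model

E_D = 0.5 * np.sum((t - Phi @ w)**2)           # sum-of-squares error E_D(w)
lhs = norm.logpdf(t, loc=Phi @ w, scale=beta**-0.5).sum()
rhs = N/2 * np.log(beta) - N/2 * np.log(2*np.pi) - beta * E_D
print(np.isclose(lhs, rhs))                    # True
```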
Maximum Likelihood and Least Squares

Rewrite the error function

    E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}² = (1/2) (t − Φw)^T (t − Φw)

where t = (t_1, . . . , t_N)^T, and

        ⎡ φ_0(x_1)   φ_1(x_1)   . . .   φ_{M−1}(x_1) ⎤
    Φ = ⎢ φ_0(x_2)   φ_1(x_2)   . . .   φ_{M−1}(x_2) ⎥
        ⎢    ⋮           ⋮        ⋱          ⋮        ⎥
        ⎣ φ_0(x_N)   φ_1(x_N)   . . .   φ_{M−1}(x_N) ⎦
Maximum Likelihood and Least Squares

The log likelihood is now

    ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w)
                   = (N/2) ln β − (N/2) ln(2π) − (β/2) (t − Φw)^T (t − Φw)

Find critical points of ln p(t | w, β).
The directional derivative in direction ξ

    D ln p(t | w, β)(ξ) = β ξ^T (Φ^T t − Φ^T Φ w)

shall be zero in all directions ξ. Therefore

    0 = Φ^T t − Φ^T Φ w,

which results in

    w_ML = (Φ^T Φ)^{−1} Φ^T t = Φ† t

where Φ† is the Moore-Penrose pseudo-inverse of the matrix Φ.
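A sketch of this solution in NumPy (synthetic data; the polynomial design matrix is an illustrative choice). In practice np.linalg.pinv or np.linalg.lstsq is preferable to explicitly inverting Φ^TΦ:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 30, 4
x = rng.uniform(-1, 1, N)
Phi = np.vander(x, M, increasing=True)          # design matrix with phi_j(x) = x^j, phi_0(x) = 1
t = np.sin(np.pi * x) + rng.normal(0, 0.1, N)   # noisy targets

w_ml = np.linalg.pinv(Phi) @ t                  # w_ML = Phi^+ t (Moore-Penrose pseudo-inverse)
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.allclose(w_ml, w_lstsq))               # True: both solve the least-squares problem
```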
Maximum Likelihood and Least Squares

Found a critical point w_ML for ln p(t | w, β). Is it a maximum, a minimum, or a saddle point?
Calculate the second directional derivative

    D² ln p(t | w, β)(ξ, ξ) = −β ξ^T Φ^T Φ ξ
                            = −β (Φξ)^T (Φξ)
                            = −β ‖Φξ‖² ≤ 0

We found a maximum.
(Is it enough to check the second directional derivative in two directions ξ which are the same? Yes, because for any bilinear function g(ξ, η), symmetric in its arguments,

    (1/2) [g(ξ, ξ) + g(η, η) − g(ξ − η, ξ − η)]
  = (1/2) [g(ξ, ξ) + g(η, η) − g(ξ, ξ) − g(η, η) + g(ξ, η) + g(η, ξ)]
  = (1/2) [g(ξ, η) + g(η, ξ)] = g(ξ, η),

so a symmetric bilinear form is completely determined by its values g(ξ, ξ) on the diagonal.)
Maximum Likelihood and Least Squares

The log likelihood is now

    ln p(t | w_ML, β) = (N/2) ln β − (N/2) ln(2π) − (β/2) (t − Φ w_ML)^T (t − Φ w_ML)

Find the critical points of ln p(t | w_ML, β) with respect to β

    ∂ ln p(t | w_ML, β) / ∂β = 0

which results in

    1/β_ML = (1/N) (t − Φ w_ML)^T (t − Φ w_ML)

Note: We can first find the maximum likelihood solution for w, as it does not depend on β. Then we can use w_ML to find the maximum likelihood solution for β.
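A short sketch (assuming NumPy; synthetic data with a known noise level) of the two-step procedure: first w_ML from the pseudo-inverse, then β_ML from the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
x = rng.uniform(-1, 1, N)
Phi = np.vander(x, 4, increasing=True)           # polynomial design matrix
w_true = np.array([0.5, -1.0, 2.0, 0.3])         # illustrative "true" parameters
t = Phi @ w_true + rng.normal(0, 0.2, N)         # noise std 0.2, i.e. true beta = 25

w_ml = np.linalg.pinv(Phi) @ t                   # step 1: w_ML does not depend on beta
residual = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residual**2)             # step 2: 1/beta_ML = (1/N)(t - Phi w_ML)^T (t - Phi w_ML)
print(beta_ml)                                   # close to the true precision 25
```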
Geometry of Least Squares

Target vector t = (t_1, . . . , t_N)^T ∈ R^N.
Basis vectors ϕ_j = (Φ_{1j}, . . . , Φ_{Nj})^T = (φ_j(x_1), . . . , φ_j(x_N))^T ∈ R^N span a subspace S.
Find w such that y = (y(x_1, w), . . . , y(x_N, w))^T ∈ S is closest to t.

[Figure: the target vector t in R^N and the subspace S spanned by the basis vectors ϕ_1, ϕ_2; the least-squares prediction y is the point of S closest to t, its orthogonal projection onto S.]
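This geometric picture can be checked numerically: y = Φ w_ML is the orthogonal projection of t onto the subspace S spanned by the columns ϕ_j, so the residual t − y is orthogonal to every ϕ_j. A small sketch (assuming NumPy, arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 10, 3
Phi = rng.normal(size=(N, M))                # columns are the basis vectors phi_j in R^N
t = rng.normal(size=N)                       # target vector t in R^N

w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]
y = Phi @ w_ml                               # the point in S closest to t
print(np.allclose(Phi.T @ (t - y), 0.0))     # True: the residual is orthogonal to S
```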
Sequential Learning - Stochastic Gradient Descent

For large data sets, calculating the maximum likelihood parameters w_ML and β_ML may be costly.
For real-time applications, the data may never all be available in memory.
Use a sequential algorithm (online algorithm).
If the error function is a sum over data points E = Σ_n E_n, then
  1  initialise w^(0) to some starting value
  2  update the parameter vector at iteration τ + 1 by

         w^(τ+1) = w^(τ) − η ∇E_n,

     where E_n is the error function after presenting the nth data point, and η is the learning rate.
Sequential Learning - Stochastic Gradient Descent

For the sum-of-squares error function, stochastic gradient descent results in

    w^(τ+1) = w^(τ) + η (t_n − w^(τ)T φ(x_n)) φ(x_n)

The value of the learning rate must be chosen carefully. Too large a learning rate may prevent the algorithm from converging; too small a learning rate makes learning follow the data only very slowly.
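A minimal sketch of this update rule (assuming NumPy; the synthetic data, the number of passes, and the learning rate η = 0.05 are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 100, 4
x = rng.uniform(-1, 1, N)
Phi = np.vander(x, M, increasing=True)
t = Phi @ np.array([0.5, -1.0, 2.0, 0.3]) + rng.normal(0, 0.1, N)

eta = 0.05                        # learning rate (must be chosen carefully)
w = np.zeros(M)                   # w^(0): starting value
for epoch in range(50):           # several passes over the data
    for n in rng.permutation(N):  # present the data points in random order
        phi_n = Phi[n]
        # w^(tau+1) = w^(tau) + eta (t_n - w^(tau)^T phi(x_n)) phi(x_n)
        w = w + eta * (t[n] - w @ phi_n) * phi_n
print(w)                          # roughly approaches the least-squares solution
```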
Regularized Least Squares

Add regularisation in order to prevent overfitting

    E_D(w) + λ E_W(w)

with regularisation coefficient λ.
Simple quadratic regulariser

    E_W(w) = (1/2) w^T w

Solution minimising the regularised error function

    w = (λI + Φ^T Φ)^{−1} Φ^T t
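A sketch of the regularised solution (assuming NumPy; λ and the synthetic data are illustrative). Solving the linear system (λI + Φ^TΦ) w = Φ^T t is preferable to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(6)
N, M = 25, 10
x = rng.uniform(-1, 1, N)
Phi = np.vander(x, M, increasing=True)
t = np.sin(np.pi * x) + rng.normal(0, 0.1, N)

lam = 1e-3                                     # regularisation coefficient lambda (illustrative)
A = lam * np.eye(M) + Phi.T @ Phi
w_reg = np.linalg.solve(A, Phi.T @ t)          # w = (lambda I + Phi^T Phi)^{-1} Phi^T t
print(w_reg)
```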
Regularized Least Squares

More general regulariser

    E_W(w) = (1/2) Σ_{j=1}^{M} |w_j|^q

q = 1 (lasso) leads to a sparse model if λ is large enough.

[Figure: contours of the regularisation term for q = 0.5, q = 1, q = 2 and q = 4.]
Comparison of Quadratic and Lasso Regulariser

Assume a sufficiently large regularisation coefficient λ.

    Quadratic regulariser:  (1/2) Σ_{j=1}^{M} w_j²
    Lasso regulariser:      (1/2) Σ_{j=1}^{M} |w_j|

[Figure: constraint regions in the (w_1, w_2) plane, a circle for the quadratic regulariser and an axis-aligned diamond for the lasso.]
Multiple Outputs

More than 1 target variable per data point.
y becomes a vector instead of a scalar. Each dimension can be treated with a different set of basis functions (and that may be necessary if the data in the different target dimensions represent very different types of information).
Here we restrict ourselves to the SAME basis functions

    y(x, w) = W^T φ(x)

where y is a K-dimensional column vector, W is an M × K matrix of model parameters, and φ(x) = (φ_0(x), . . . , φ_{M−1}(x))^T with φ_0(x) = 1, as before.
Define the target matrix T containing the target vector t_n^T in the nth row.
Multiple Outputs

Suppose the conditional distribution of the target vector is an isotropic Gaussian of the form

    p(t | x, W, β) = N(t | W^T φ(x), β^{−1} I).

The log likelihood is then

    ln p(T | X, W, β) = Σ_{n=1}^{N} ln N(t_n | W^T φ(x_n), β^{−1} I)
                      = (NK/2) ln(β/2π) − (β/2) Σ_{n=1}^{N} ‖t_n − W^T φ(x_n)‖²
Multiple Outputs

Maximisation with respect to W results in

    W_ML = (Φ^T Φ)^{−1} Φ^T T.

For each target variable t_k, we get

    w_k = (Φ^T Φ)^{−1} Φ^T t_k = Φ† t_k.

The solution decouples between the different target variables.
This also holds for a general Gaussian noise distribution with arbitrary covariance matrix.
Why? W defines the mean of the Gaussian noise distribution, and the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance.
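A sketch (assuming NumPy; K = 2 synthetic targets) showing that W_ML = Φ† T fits all target dimensions at once and coincides with fitting each column t_k separately:

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, K = 40, 4, 2
x = rng.uniform(-1, 1, N)
Phi = np.vander(x, M, increasing=True)
W_true = rng.normal(size=(M, K))                     # illustrative true parameters (M x K)
T = Phi @ W_true + rng.normal(0, 0.1, size=(N, K))   # target matrix T, t_n^T in the nth row

W_ml = np.linalg.pinv(Phi) @ T                       # W_ML = Phi^+ T
W_cols = np.column_stack([np.linalg.pinv(Phi) @ T[:, k] for k in range(K)])
print(np.allclose(W_ml, W_cols))                     # True: the solution decouples per target
```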
Loss Function for Regression

Over-fitting results from a large number of basis functions and a relatively small training set.
Regularisation can prevent overfitting, but how do we find the correct value for the regularisation constant λ?
The frequentist viewpoint of model complexity is the bias-variance trade-off.
Loss Function for Regression

Choose an estimator y(x) to estimate the target value t for each input x.
Choose a loss function L(t, y(x)) which measures the difference between the target t and the estimate y(x).
The expected loss is then

    E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt

Common choice: squared loss

    L(t, y(x)) = {y(x) − t}².

Expected loss for the squared loss function

    E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt.
Loss Function for Regression

Expected loss for the squared loss function

    E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt.

Minimise E[L] by choosing the regression function

    y(x) = ∫ t p(x, t) dt / p(x) = ∫ t p(t | x) dt = E_t[t | x]

(use calculus of variations to derive this result).
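A Monte Carlo sanity check of this result (assuming NumPy and a made-up joint distribution p(x, t) in which E[t | x] = 1 + x² is known in closed form): predicting the conditional mean yields a lower empirical squared loss than a competing estimator.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000
x = rng.uniform(0, 1, n)
t = rng.exponential(scale=1.0 + x**2)     # made-up p(t | x) with conditional mean E[t | x] = 1 + x^2

cond_mean = 1.0 + x**2                    # the regression function y(x) = E[t | x]
other = 0.9 * cond_mean                   # any other estimator of t

loss_mean = np.mean((cond_mean - t)**2)   # empirical expected squared loss of the conditional mean
loss_other = np.mean((other - t)**2)
print(loss_mean < loss_other)             # True
```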
Loss Function for Regression

The regression function which minimises the expected squared loss is given by the mean of the conditional distribution p(t | x).

[Figure: the regression function y(x); at a point x_0, the value y(x_0) is the mean of the conditional distribution p(t | x_0).]

Loss Function for Regression

Analyse the expected loss

    E[L] = \iint \{y(x) - t\}^2 \, p(x,t) \, dx \, dt.

Rewrite the squared loss

    \{y(x) - t\}^2 = \{y(x) - E[t \mid x] + E[t \mid x] - t\}^2
                   = \{y(x) - E[t \mid x]\}^2 + \{E[t \mid x] - t\}^2
                     + 2 \{y(x) - E[t \mid x]\} \{E[t \mid x] - t\}

Claim

    \iint \{y(x) - E[t \mid x]\} \{E[t \mid x] - t\} \, p(x,t) \, dx \, dt = 0.

Loss Function for Regression

Claim

    \iint \{y(x) - E[t \mid x]\} \{E[t \mid x] - t\} \, p(x,t) \, dx \, dt = 0.

Separate the functions depending on t from the function depending on x

    \int \{y(x) - E[t \mid x]\} \left( \int \{E[t \mid x] - t\} \, p(x,t) \, dt \right) dx

Calculate the integral over t

    \int \{E[t \mid x] - t\} \, p(x,t) \, dt
        = E[t \mid x] \, p(x) - p(x) \int t \, \frac{p(x,t)}{p(x)} \, dt
        = E[t \mid x] \, p(x) - p(x) \, E[t \mid x]
        = 0
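
The vanishing of the inner integral can also be checked symbolically for a concrete conditional density. The sketch below is an illustration, not part of the slides; it assumes sympy and a Gaussian p(t | x) with mean mu, so that E[t | x] = mu. The factor p(x) is omitted since it does not depend on t.

    import sympy as sp

    t, mu = sp.symbols('t mu', real=True)
    s = sp.Symbol('s', positive=True)   # standard deviation of the conditional density

    # Gaussian conditional density p(t | x) with mean mu and variance s**2.
    p_t_given_x = sp.exp(-(t - mu) ** 2 / (2 * s ** 2)) / sp.sqrt(2 * sp.pi * s ** 2)

    # Inner integral of the claim (up to the factor p(x)): integral of (E[t|x] - t) p(t|x) dt.
    inner = sp.integrate((mu - t) * p_t_given_x, (t, -sp.oo, sp.oo))
    print(sp.simplify(inner))   # 0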

Loss Function for Regression

The expected loss is now

    E[L] = \int \{y(x) - E[t \mid x]\}^2 \, p(x) \, dx + \int \mathrm{var}[t \mid x] \, p(x) \, dx

Minimise the first term by choosing an appropriate y(x).

The second term represents the intrinsic variability of the target data and can be regarded as noise. It is independent of the choice of y(x) and cannot be reduced by learning a better y(x).
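
Continuing the same illustrative sinusoidal example (an assumption, not from the slides), this decomposition can be checked by Monte Carlo: for any fixed predictor, the expected loss should equal its squared deviation from the conditional mean, averaged over p(x), plus the irreducible noise variance.

    import numpy as np

    rng = np.random.default_rng(1)
    N, noise_std = 500_000, 0.3
    x = rng.uniform(0.0, 1.0, size=N)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=N)

    cond_mean = np.sin(2 * np.pi * x)      # E[t | x] for this synthetic distribution
    y = 0.5 * np.sin(2 * np.pi * x)        # some fixed, deliberately suboptimal predictor

    expected_loss = np.mean((y - t) ** 2)          # E[L]
    reducible = np.mean((y - cond_mean) ** 2)      # first term of the decomposition
    noise = noise_std ** 2                         # second term, the intrinsic noise

    print(expected_loss, reducible + noise)        # the two values agree up to Monte Carlo error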

The Bias-Variance Decomposition

Consider now the dependency on the data set D. The prediction function is now y(x; D).

Consider again the squared loss, for which the optimal prediction is given by the conditional expectation h(x)

    h(x) = E[t \mid x] = \int t \, p(t \mid x) \, dt.

BUT: we cannot know h(x) exactly, as we would need an infinite amount of training data to learn it accurately.

Evaluate the performance of a learning algorithm by taking the expectation E_D[L] over all data sets D.

The Bias-Variance Decomposition

Taking the expectation over all data sets D

    E_D[E[L]] = \int E_D\big[\{y(x; D) - h(x)\}^2\big] \, p(x) \, dx + \iint \{h(x) - t\}^2 \, p(x,t) \, dx \, dt

Again, add and subtract the expectation E_D[y(x; D)]

    \{y(x; D) - h(x)\}^2 = \{\, y(x; D) - E_D[y(x; D)] + E_D[y(x; D)] - h(x) \,\}^2

and show that the mixed term vanishes under the expectation E_D[. . .].
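
One way to fill in the step left to the reader: expanding the square gives a cross term in which only y(x; D) depends on the data set, so taking the expectation over D,

    E_D\big[\, 2 \{y(x; D) - E_D[y(x; D)]\} \{E_D[y(x; D)] - h(x)\} \,\big]
        = 2 \{E_D[y(x; D)] - h(x)\} \big( E_D[y(x; D)] - E_D[y(x; D)] \big) = 0,

so the cross term contributes nothing to E_D[{y(x; D) - h(x)}^2].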

The Bias-Variance Decomposition

Expected loss E_D[L] over all data sets D

    expected loss = (bias)^2 + variance + noise,

where

    (bias)^2 = \int \{E_D[y(x; D)] - h(x)\}^2 \, p(x) \, dx

    variance = \int E_D\big[\{y(x; D) - E_D[y(x; D)]\}^2\big] \, p(x) \, dx

    noise = \iint \{h(x) - t\}^2 \, p(x,t) \, dx \, dt.

squared bias : how much does the average prediction over all data sets differ from the desired regression function?

variance : how much do solutions for individual data sets vary around their average?
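
These three quantities can be estimated by simulation. The following sketch (illustrative assumptions throughout: a sinusoidal h(x), Gaussian noise, 24 Gaussian basis functions plus a bias term, and regularised least squares as introduced earlier) fits the same model to 100 independently drawn data sets and estimates (bias)^2, variance, and noise; this is essentially the experiment behind the figures that follow.

    import numpy as np

    rng = np.random.default_rng(0)

    def h(x):
        # The "true" regression function used to generate the data (an assumption).
        return np.sin(2 * np.pi * x)

    def design_matrix(x, centres, s=0.1):
        # Gaussian basis functions plus a constant bias column.
        phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
        return np.hstack([np.ones((x.shape[0], 1)), phi])

    def fit_ridge(x, t, centres, lam):
        # Regularised least squares: w = (lam*I + Phi^T Phi)^(-1) Phi^T t.
        Phi = design_matrix(x, centres)
        A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
        return np.linalg.solve(A, Phi.T @ t)

    L, N, noise_std = 100, 25, 0.3          # number of data sets, points per set, noise level
    centres = np.linspace(0.0, 1.0, 24)
    x_test = np.linspace(0.0, 1.0, 200)
    Phi_test = design_matrix(x_test, centres)

    lam = np.exp(-0.31)                     # one value of the regularisation coefficient
    preds = np.empty((L, x_test.size))
    for l in range(L):
        x_train = rng.uniform(0.0, 1.0, N)
        t_train = h(x_train) + rng.normal(0.0, noise_std, N)
        preds[l] = Phi_test @ fit_ridge(x_train, t_train, centres, lam)

    y_bar = preds.mean(axis=0)                    # average prediction E_D[y(x; D)]
    bias_sq = np.mean((y_bar - h(x_test)) ** 2)   # integral over p(x) approximated on the grid
    variance = np.mean(preds.var(axis=0))
    noise = noise_std ** 2

    print(f"(bias)^2 = {bias_sq:.4f}  variance = {variance:.4f}  noise = {noise:.4f}")

Sweeping lam over a range of values and plotting the three estimates against ln λ reproduces the behaviour summarised in the curves near the end of this section.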

The Bias-Variance Decomposition

Dependence of bias and variance on the model complexity.

[Figure, ln λ = 2.6. Left: result of fitting the model to 100 data sets, only 25 shown. Right: average of the 100 fits in red; the sinusoidal function from which the data were created in green.]

The Bias-Variance Decomposition

Dependence of bias and variance on the model complexity.

[Figure, ln λ = −0.31. Left: result of fitting the model to 100 data sets, only 25 shown. Right: average of the 100 fits in red; the sinusoidal function from which the data were created in green.]

The Bias-Variance Decomposition

Dependence of bias and variance on the model complexity.

[Figure, ln λ = −2.4. Left: result of fitting the model to 100 data sets, only 25 shown. Right: average of the 100 fits in red; the sinusoidal function from which the data were created in green.]

The Bias-Variance Decomposition

Squared bias, variance, their sum, and the test error.

The minimum of (bias)^2 + variance occurs close to the value of ln λ that gives the minimum test error.

[Figure: (bias)^2, variance, (bias)^2 + variance, and the test error plotted against ln λ, for ln λ from −3 to 2.]

The Bias-Variance Decomposition

Tradeoff between bias and variance:

    simple models have high bias and low variance
    complex models have low bias and high variance

The sum of bias and variance has a minimum at some model complexity.

The bias-variance decomposition needs many data sets, which are not always available.
