Machine learning (9)

CS229 Lecture notes
Andrew Ng
Part IX
The EM algorithm
In the previous set of notes, we talked about the EM algorithm as applied to
fitting a mixture of Gaussians. In this set of notes, we give a broader view
of the EM algorithm, and show how it can be applied to a large family of
estimation problems with latent variables. We begin our discussion with a
very useful result called Jensen’s inequality
1 Jensen’s inequality
Let f be a function whose domain is the set of real numbers. Recall that
f is a convex function if f′′(x) ≥ 0 (for all x ∈ R). In the case of f taking
vector-valued inputs, this is generalized to the condition that its hessian H
is positive semi-definite (H ≥ 0). If f′′(x) > 0 for all x, then we say f is
strictly convex (in the vector-valued case, the corresponding statement is
that H must be positive definite, written H > 0). Jensen’s inequality can
then be stated as follows:
Theorem. Let f be a convex function, and let X be a random variable.
Then:
E[f(X)] ≥ f(EX).
Moreover, if f is strictly convex, then E[f(X)] = f(EX) holds true if and
only if X = E[X] with probability 1 (i.e., if X is a constant).
Recall our convention of occasionally dropping the parentheses when writ-
ing expectations, so in the theorem above, f(EX) = f(E[X]).
For an interpretation of the theorem, consider the figure below.
1

2
a E[X] b
f(a)
E[f(X)]
f(b)
f(EX)
f
Here, f is a convex function shown by the solid line. Also, X is a random
variable that has a 0.5 chance of taking the value a, and a 0.5 chance of
taking the value b (indicated on the x-axis). Thus, the expected value of X
is given by the midpoint between a and b.
We also see the values f(a), f(b) and f(E[X]) indicated on the y-axis.
Moreover, the value E[f(X)] is now the midpoint on the y-axis between f(a)
and f(b). From our example, we see that because f is convex, it must be the
case that E[f(X)] ≥ f(EX).
Incidentally, quite a lot of people have trouble remembering which way
the inequality goes, and remembering a picture like this is a good way to
quickly figure out the answer.
Remark. Recall that f is [strictly] concave if and only if −f is [strictly]
convex (i.e., f′′(x) ≤ 0 or H ≤ 0). Jensen’s inequality also holds for concave
functions f, but with the direction of all the inequalities reversed (E[f(X)] ≤
f(EX), etc.).
2 The EM algorithm
Suppose we have an estimation problem in which we have a training set
{x(1), . . . , x(m)} consisting of m independent examples. We wish to fit the
parameters of a model p(x, z) to the data, where the likelihood is given by
ℓ(θ) =
m
Xi=1
log p(x; θ)
=
m
Xi=1
logXz
p(x, z; θ).

3
But, explicitly finding the maximum likelihood estimates of the parameters θ
may be hard. Here, the z(i)’s are the latent random variables; and it is often
the case that if the z(i)’s were observed, then maximum likelihood estimation
would be easy.
In such a setting, the EM algorithm gives an efficient method for max-
imum likelihood estimation. Maximizing ℓ(θ) explicitly might be difficult,
and our strategy will be to instead repeatedly construct a lower-bound on ℓ
(E-step), and then optimize that lower-bound (M-step).
For each i, let Qi be some distribution over the z’s (Pz Qi(z) = 1, Qi(z) ≥
0). Consider the following:1
Xi
log p(x(i); θ) = Xi
log(i)
z
Xp(x(i), z(i); θ) (1)
= Xi
log(i)
z
XQi(z(i))
p(x(i), z(i); θ)
Qi(z(i))
(2)
≥ Xi (i)
z
XQi(z(i)) log
p(x(i), z(i); θ)
Qi(z(i))
(3)
The last step of this derivation used Jensen’s inequality. Specifically, f(x) =
log x is a concave function, since f′′(x) = −1/x2 < 0 over its domain x ∈ R+.
Also, the term
Xz
(i)
Qi(z(i)) p(x(i), z(i); θ)
Qi(z(i))
in the summation is just an expectation of the quantity p(x(i), z(i); θ)/Qi(z(i)) with respect to z(i) drawn according to the distribution given by Qi. By
Jensen’s inequality, we have
f Ez(i)∼Qi p(x(i), z(i); θ)
Qi(z(i)) ≥ Ez(i)∼Qi f p(x(i), z(i); θ)
Qi(z(i)) ,
where the “z(i) ∼ Qi” subscripts above indicate that the expectations are
with respect to z(i) drawn from Qi. This allowed us to go from Equation (2)
to Equation (3).
Now, for any set of distributions Qi, the formula (3) gives a lower-bound
on ℓ(θ). There’re many possible choices for the Qi’s. Which should we
choose? Well, if we have some current guess θ of the parameters, it seems
1If z were continuous, then Qi would be a density, and the summations over z in our
discussion are replaced with integrals over z.

4
natural to try to make the lower-bound tight at that value of θ. I.e., we’ll
make the inequality above hold with equality at our particular value of θ.
(We’ll see later how this enables us to prove that ℓ(θ) increases monotonically
with successsive iterations of EM.)
To make the bound tight for a particular value of θ, we need for the step
involving Jensen’s inequality in our derivation above to hold with equality.
For this to be true, we know it is sufficient that that the expectation be taken
over a “constant”-valued random variable. I.e., we require that
p(x(i), z(i); θ)
Qi(z(i))
= c
for some constant c that does not depend on z(i). This is easily accomplished
by choosing
Qi(z(i)) ∝ p(x(i), z(i); θ).
Actually, since we know Pz Qi(z(i)) = 1 (because it is a distribution), this
further tells us that
Qi(z(i)) =
p(x(i), z(i); θ)
Pz p(x(i), z; θ)
=
p(x(i), z(i); θ)
p(x(i); θ)
= p(z(i)|x(i); θ)
Thus, we simply set the Qi’s to be the posterior distribution of the z(i)’s
given x(i) and the setting of the parameters θ.
Now, for this choice of the Qi’s, Equation (3) gives a lower-bound on the
loglikelihood ℓ that we’re trying to maximize. This is the E-step. In the
M-step of the algorithm, we then maximize our formula in Equation (3) with
respect to the parameters to obtain a new setting of the θ’s. Repeatedly
carrying out these two steps gives us the EM algorithm, which is as follows:
Repeat until convergence {
(E-step) For each i, set
Qi(z(i)) := p(z(i)|x(i); θ).
(M-step) Set
Xi Xz
θ := argmax
(i)
Qi(z(i)) log
p(x(i), z(i); θ)
Qi(z(i))
.

5
}
How we we know if this algorithm will converge? Well, suppose θ(t)
and θ(t+1) are the parameters from two successive iterations of EM. We will
now prove that ℓ(θ(t)) ≤ ℓ(θ(t+1)), which shows EM always monotonically
improves the log-likelihood. The key to showing this result lies in our choice
of the Qi’s. Specifically, on the iteration of EM in which the parameters had
started out as θ(t), we would have chosen Q(t)
i (z(i)) := p(z(i)|x(i); θ(t)). We
saw earlier that this choice ensures that Jensen’s inequality, as applied to get
Equation (3), holds with equality, and hence
z
ℓ(θ(t)) =Xi (i)
XQ(t)
i (z(i)) log
p(x(i), z(i); θ(t))
Q(t)
i (z(i))
.
The parameters θ(t+1) are then obtained by maximizing the right hand side
of the equation above. Thus,
z
ℓ(θ(t+1)) ≥ Xi (i)
XQ(t)
i (z(i)) log
p(x(i), z(i); θ(t+1))
Q(t)
i (z(i))
(4)
≥ Xi (i)
z
XQ(t)
i (z(i)) log
p(x(i), z(i); θ(t))
Q(t)
i (z(i))
(5)
= ℓ(θ(t)) (6)
This first inequality comes from the fact that
ℓ(θ) ≥Xi Xz(i)
Qi(z(i)) log
p(x(i), z(i); θ)
Qi(z(i))
holds for any values of Qi and θ, and in particular holds for Qi = Q(t)
i ,
θ = θ(t+1). To get Equation (5), we used the fact that θ(t+1) is chosen
explicitly to be
Xi Xz
argmax
(i)
Qi(z(i)) log
p(x(i), z(i); θ)
Qi(z(i))
,
and thus this formula evaluated at θ(t+1) must be equal to or larger than the
same formula evaluated at θ(t). Finally, the step used to get (6) was shown
earlier, and follows from Q(t)
i having been chosen to make Jensen’s inequality
hold with equality at θ(t).

6
Hence, EM causes the likelihood to converge monotonically. In our de-
scription of the EM algorithm, we said we’d run it until convergence. Given
the result that we just showed, one reasonable convergence test would be
to check if the increase in ℓ(θ) between successive iterations is smaller than
some tolerance parameter, and to declare convergence if EM is improving
ℓ(θ) too slowly.
Remark. If we define
z
J(Q, θ) =Xi (i)
XQi(z(i)) log
p(x(i), z(i); θ)
Qi(z(i))
,
then we know ℓ(θ) ≥ J(Q, θ) from our previous derivation. The EM can also
be viewed a coordinate ascent on J, in which the E-step maximizes it with
respect to Q (check this yourself), and the M-step maximizes it with respect
to θ.
3 Mixture of Gaussians revisited
Armed with our general definition of the EM algorithm, let’s go back to our
old example of fitting the parameters φ, μ and in a mixture of Gaussians.
For the sake of brevity, we carry out the derivations for the M-step updates
only for φ and μj , and leave the updates for j as an exercise for the reader.
The E-step is easy. Following our algorithm derivation above, we simply
calculate
w(i)
j = Qi(z(i) = j) = P(z(i) = j|x(i); φ, μ,).
Here, “Qi(z(i) = j)” denotes the probability of z(i) taking the value j under
the distribution Qi.
Next, in the M-step, we need to maximize, with respect to our parameters
φ, μ,, the quantity
m
Xi=1(i)
z
XQi(z(i)) log
p(x(i), z(i); φ, μ,)
Qi(z(i))
=
m
Xi=1
k
Xj=1
Qi(z(i) = j) log
p(x(i)|z(i) = j; μ,)p(z(i) = j; φ)
Qi(z(i) = j)
=
m
Xi=1
k
Xj=1
w(i)
j log
1
2 (x(i) − μj)T−1
(2)n/2|j |1/2 exp −1
j (x(i) − μj) · φj
w(i)
j

7
Let’s maximize this with respect to μl. If we take the derivative with respect
to μl, we find
∇μl
m
Xi=1
k
Xj=1
w(i)
j log
1
exp −1
(2)n/2|j |1/2 2 (x(i) − μj)T−1
j (x(i) − μj) · φj
w(i)
j
= −∇μl
m
Xi=1
k
Xj=1
w(i)
j
1
2
(x(i) − μj)T−1
j (x(i) − μj)
=
1
2
m
Xi=1
w(i)
l −1
l ∇μl2μT
l x(i) − μT
l −1
l μl
=
m
Xi=1
w(i)
l −1
l x(i) − −1
l μl
Setting this to zero and solving for μl therefore yields the update rule
i=1 w(i)
μl := Pm
l x(i)
i=1 w(i)
Pm
l
,
which was what we had in the previous set of notes.
Let’s do one more example, and derive the M-step update for the param-
eters φj . Grouping together only the terms that depend on φj , we find that
we need to maximize
m
Xi=1
k
Xj=1
w(i)
j log φj .
However, there is an additional constraint that the φj ’s sum to 1, since they
represent the probabilities φj = p(z(i) = j; φ). To deal with the constraint
that Pk
φj = 1, we construct the Lagrangian
j=1 L(φ) =
m
Xi=1
k
Xj=1
w(i)
j log φj + β(
k
Xj=1
φj − 1),
where β is the Lagrange multiplier.2 Taking derivatives, we find
∂
∂φj
L(φ) =
m
Xi=1
w(i)
j
φj
+ 1
2We don’t need to worry about the constraint that j ≥ 0, because as we’ll shortly see,
the solution we’ll find from this derivation will automatically satisfy that anyway.

8
Setting this to zero and solving, we get
i=1 w(i)
φj = Pm
j
−β
i=1 w(i)
I.e., φj ∝ Pm
j . Using the constraint that Pj φj = 1, we easily find
j=1 w(i)
that −β = Pm
i=1Pk
i=1 1 = m. (This used the fact that w(i)
j = Pm
j =
Qi(z(i) = j), and since probabilities sum to 1, Pj w(i)
j = 1.) We therefore
have our M-step updates for the parameters φj :
φj :=
1
m
m
Xi=1
w(i)
j .
The derivation for the M-step updates to j are also entirely straightfor-
ward.

Machine learning (9)

More Related Content

What's hot (19)

Viewers also liked (11)

Similar to Machine learning (9) (20)

More from NYversity (20)

Recently uploaded (20)

Machine learning (9)