EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA)

Duc-Hieu Tran
tdh.net [at] gmail.com
Nanyang Technological University

July 27, 2010
Outline

   The parameter estimation problem
   EM algorithm
   Probabilistic Latent Semantic Analysis
   References
  Introduction

   Given the prior probabilities P(ωi) and the class-conditional densities p(x|ωi),
   we obtain the optimal classifier:
           P(ωj|x) ∝ p(x|ωj) P(ωj)
           decide ωi if P(ωi|x) > P(ωj|x), ∀j ≠ i
   In practice, p(x|ωi) is unknown and must be estimated from training samples
   (e.g., assume p(x|ωi) ∼ N(µi, Σi)).
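   To make the decision rule concrete, here is a minimal Python sketch (my own illustration, not from the slides) that assumes Gaussian class-conditionals as in the example, estimates µi and Σi from labeled training samples, and classifies a new point by the largest p(x|ωi) P(ωi). The function and variable names are my own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_bayes(X, y):
    """Estimate priors P(w_i) and Gaussian class-conditionals p(x|w_i) from labeled data."""
    classes = np.unique(y)
    priors = {c: float(np.mean(y == c)) for c in classes}
    params = {c: (X[y == c].mean(axis=0), np.cov(X[y == c].T)) for c in classes}
    return priors, params

def classify(x, priors, params):
    """Decide w_i if P(w_i|x) > P(w_j|x) for all j != i, i.e. maximize p(x|w_i) P(w_i)."""
    scores = {c: multivariate_normal.pdf(x, mean=mu, cov=cov) * priors[c]
              for c, (mu, cov) in params.items()}
    return max(scores, key=scores.get)
```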
  Frequentist vs. Bayesian schools

   Frequentist
           parameters – quantities whose values are fixed but unknown.
           the best estimate of their values – the one that maximizes the
           probability of obtaining the observed samples.
   Bayesian
           parameters – random variables having some known prior distribution.
           observation of the samples converts this to a posterior density,
           revising our opinion about the true values of the parameters.
  Examples

           training samples: S = {(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})}
           frequentist: maximum likelihood

               \max_\theta \prod_i p(y^{(i)} \mid x^{(i)}; \theta)

           Bayesian: P(\theta) – prior, e.g., P(\theta) \sim \mathcal{N}(0, I)

               P(\theta \mid S) \propto \left[ \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}, \theta) \right] P(\theta)

               \theta_{\mathrm{MAP}} = \arg\max_\theta P(\theta \mid S)
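   A small numerical illustration of the two estimates (not part of the original slides): estimating the mean of a Gaussian with known variance, once by maximum likelihood and once by MAP under an assumed N(0, 1) prior on the mean. The closed-form MAP estimate shrinks the sample mean toward the prior mean; the variable names and the specific prior are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # observed samples, known sigma = 1

# Frequentist: maximum likelihood -> sample mean
mu_mle = x.mean()

# Bayesian: prior mu ~ N(0, tau^2); the posterior is Gaussian, MAP = posterior mean
sigma2, tau2 = 1.0, 1.0
n = len(x)
mu_map = (n / sigma2) / (n / sigma2 + 1.0 / tau2) * x.mean()   # shrinks toward the prior mean 0

print(f"MLE estimate: {mu_mle:.3f},  MAP estimate: {mu_map:.3f}")
```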
EM algorithm


  An estimation problem

           training set of m independent samples: {x^{(1)}, x^{(2)}, \ldots, x^{(m)}}
           goal: fit the parameters of a model p(x, z) to the data
           the log-likelihood:

               \ell(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta)
                            = \sum_{i=1}^{m} \log \sum_{z} p(x^{(i)}, z; \theta)

           explicitly maximizing ℓ(θ) might be difficult.
           z – latent random variable
           if z^{(i)} were observed, maximum likelihood estimation would be easy.
           strategy: repeatedly construct a lower bound on ℓ (E-step) and
           optimize that lower bound (M-step).
  EM algorithm (1)

           digression: Jensen's inequality.
           f – convex function; E[f(X)] ≥ f(E[X])
           for each i, Q_i – a distribution over z: \sum_z Q_i(z) = 1, Q_i(z) ≥ 0

               \ell(\theta) = \sum_i \log p(x^{(i)}; \theta)
                            = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)
                            = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}          (1)
                            \ge \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}        (2)

           where (2) follows by applying Jensen's inequality to the concave function log
           (details in the appendix).
  EM algorithm (2)

           for any set of distributions Q_i, formula (2) gives a lower bound on ℓ(θ)
           how to choose Q_i?
           strategy: make the inequality hold with equality at our particular
           value of θ.
           require:
               \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c
           c – a constant that does not depend on z^{(i)}
           choose: Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta)
           since \sum_z Q_i(z^{(i)}) = 1, we get

               Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)}
                            = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)}
                            = p(z^{(i)} \mid x^{(i)}; \theta)
  EM algorithm (3)

           Q_i – posterior distribution of z^{(i)} given x^{(i)} and the parameters θ

   EM algorithm: repeat until convergence
           E-step: for each i
               Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)
           M-step:
               \theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}

   The algorithm will converge, since ℓ(θ^{(t)}) ≤ ℓ(θ^{(t+1)})
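   A minimal sketch (my own, not from the slides) of these two steps for a concrete model: a two-component 1-D Gaussian mixture, where the posterior Q_i is the component responsibility and the M-step has a closed form. No safeguards against degenerate components are included; all names are mine.

```python
import numpy as np

def norm_pdf(x, mu, sigma):
    """Gaussian density, used as the component density p(x | z)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture; Q_i is the responsibility p(z | x_i)."""
    pi = 0.5
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: Q_i(z) := p(z | x_i; theta)
        p1 = pi * norm_pdf(x, mu[0], sigma[0])
        p2 = (1.0 - pi) * norm_pdf(x, mu[1], sigma[1])
        r = p1 / (p1 + p2)                           # responsibility of component 1
        # M-step: closed-form maximization of the lower bound
        pi = r.mean()
        mu = np.array([np.sum(r * x) / np.sum(r),
                       np.sum((1.0 - r) * x) / np.sum(1.0 - r)])
        sigma = np.array([np.sqrt(np.sum(r * (x - mu[0]) ** 2) / np.sum(r)),
                          np.sqrt(np.sum((1.0 - r) * (x - mu[1]) ** 2) / np.sum(1.0 - r))])
    return pi, mu, sigma
```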
  EM algorithm (4)

   Digression: coordinate ascent algorithm.

               \max_\alpha W(\alpha_1, \ldots, \alpha_m)

           loop until convergence:
               for i \in 1, \ldots, m:
                   \alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \ldots, \hat{\alpha}_i, \ldots, \alpha_m)

   EM algorithm as coordinate ascent:

               J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}

           ℓ(θ) ≥ J(Q, θ)
           EM algorithm can be viewed as coordinate ascent on J
           E-step: maximize w.r.t. Q
           M-step: maximize w.r.t. θ
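   For illustration only (not from the slides): generic coordinate ascent on a concave quadratic, maximizing over one coordinate at a time while the others are held fixed — the same alternating pattern EM follows with Q and θ. The objective, matrix, and function names are assumptions of mine.

```python
import numpy as np

def coordinate_ascent(coord_argmax, alpha, n_sweeps=50):
    """Repeatedly maximize W over one coordinate while the others stay fixed."""
    for _ in range(n_sweeps):
        for i in range(len(alpha)):
            alpha[i] = coord_argmax(alpha, i)        # argmax of W over alpha_i alone
    return alpha

# Example: W(a) = -(a - b)^T A (a - b) with A positive definite (concave in a).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def argmax_coord(alpha, i):
    # Setting dW/dalpha_i = 0 with the other coordinates fixed gives a closed form.
    others = A[i] @ (alpha - b) - A[i, i] * (alpha[i] - b[i])
    return b[i] - others / A[i, i]

print(coordinate_ascent(argmax_coord, np.zeros(2)))   # converges to b = [1, -2]
```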
Probabilistic Latent Semantic Analysis


  Probabilistic Latent Semantic Analysis (1)

           set of documents D = {d_1, \ldots, d_N}
           set of words W = {w_1, \ldots, w_M}
           set of unobserved classes Z = {z_1, \ldots, z_K}
           conditional independence assumption:

               P(d_i, w_j \mid z_k) = P(d_i \mid z_k) P(w_j \mid z_k)                      (3)

           so,

               P(w_j \mid d_i) = \sum_{k=1}^{K} P(z_k \mid d_i) P(w_j \mid z_k)            (4)

               P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k) P(z_k \mid d_i)

           (derivation in the appendix)
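   The generative story behind these equations (select a document, pick a latent class, generate a word; detailed in the appendix) is easy to mirror in code. This is an illustrative sketch under assumed parameter matrices; the function and array names are mine, not from the slides.

```python
import numpy as np

def sample_plsa(p_d, p_z_given_d, p_w_given_z, n_samples, seed=0):
    """Sample (document, word) index pairs from the pLSA generative model:
    d ~ P(d), z ~ P(z|d), w ~ P(w|z)."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n_samples):
        d = rng.choice(len(p_d), p=p_d)                          # select a document d_i
        z = rng.choice(p_z_given_d.shape[1], p=p_z_given_d[d])   # pick a latent class z_k
        w = rng.choice(p_w_given_z.shape[1], p=p_w_given_z[z])   # generate a word w_j
        pairs.append((d, w))
    return pairs
```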
  Probabilistic Latent Semantic Analysis (2)

           n(d_i, w_j) – number of occurrences of word w_j in document d_i
           Likelihood

               L = \prod_{i=1}^{N} \prod_{j=1}^{M} [P(d_i, w_j)]^{n(d_i, w_j)}
                 = \prod_{i=1}^{N} \prod_{j=1}^{M} \left[ P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k) P(z_k \mid d_i) \right]^{n(d_i, w_j)}

           log-likelihood ℓ = log(L)

               \ell = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \left[ \log P(d_i) + \log \sum_{k=1}^{K} P(w_j \mid z_k) P(z_k \mid d_i) \right]
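   The parameter-dependent part of this log-likelihood is straightforward to evaluate from a count matrix. A minimal numpy sketch (my own; the array names n_dw, p_w_given_z, p_z_given_d and the small smoothing constant are assumptions):

```python
import numpy as np

def plsa_log_likelihood(n_dw, p_w_given_z, p_z_given_d):
    """Parameter-dependent part of the pLSA log-likelihood
    (the sum_j n(d_i, w_j) log P(d_i) term does not involve P(w|z) or P(z|d)).

    n_dw        : (N, M) count matrix, n_dw[i, j] = n(d_i, w_j)
    p_w_given_z : (K, M) matrix, rows P(w | z_k)
    p_z_given_d : (N, K) matrix, rows P(z | d_i)
    """
    p_w_given_d = p_z_given_d @ p_w_given_z          # (N, M): sum_k P(z_k|d_i) P(w_j|z_k)
    return float(np.sum(n_dw * np.log(p_w_given_d + 1e-12)))
```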
  Probabilistic Latent Semantic Analysis (3)

           maximize ℓ w.r.t. P(w_j | z_k), P(z_k | d_i)
           equivalent to maximizing (the log P(d_i) term does not depend on these parameters):

               \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} P(w_j \mid z_k) P(z_k \mid d_i)
               = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} Q_k(z_k) \frac{P(w_j \mid z_k) P(z_k \mid d_i)}{Q_k(z_k)}
               \ge \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} Q_k(z_k) \log \frac{P(w_j \mid z_k) P(z_k \mid d_i)}{Q_k(z_k)}

           choose

               Q_k(z_k) = \frac{P(w_j \mid z_k) P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l) P(z_l \mid d_i)} = P(z_k \mid d_i, w_j)

           (details in the appendix)
  Probabilistic Latent Semantic Analysis (4)

           equivalent to maximizing (w.r.t. P(w_j | z_k), P(z_k | d_i))

               \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k \mid d_i, w_j) \log \frac{P(w_j \mid z_k) P(z_k \mid d_i)}{P(z_k \mid d_i, w_j)}

           and, since the term -\sum_k P(z_k \mid d_i, w_j) \log P(z_k \mid d_i, w_j) does not involve
           the parameters, equivalent to maximizing

               \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k \mid d_i, w_j) \log [P(w_j \mid z_k) P(z_k \mid d_i)]
  Probabilistic Latent Semantic Analysis (5)

   EM algorithm
           E-step: update

               P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k) P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l) P(z_l \mid d_i)}

           M-step: maximize w.r.t. P(w_j | z_k), P(z_k | d_i)

               \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k \mid d_i, w_j) \log [P(w_j \mid z_k) P(z_k \mid d_i)]

           subject to

               \sum_{j=1}^{M} P(w_j \mid z_k) = 1, \quad k \in \{1, \ldots, K\}
               \sum_{k=1}^{K} P(z_k \mid d_i) = 1, \quad i \in \{1, \ldots, N\}
  Probabilistic Latent Semantic Analysis (6)

   Solution of the maximization problem in the M-step:

               P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) P(z_k \mid d_i, w_j)}{\sum_{m=1}^{M} \sum_{n=1}^{N} n(d_n, w_m) P(z_k \mid d_n, w_m)}

               P(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k \mid d_i, w_j)}{n(d_i)}

   where n(d_i) = \sum_{j=1}^{M} n(d_i, w_j)

   (derivation via Lagrange multipliers in the appendix)
  Probabilistic Latent Semantic Analysis (7)

   All together
           E-step:

               P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k) P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l) P(z_l \mid d_i)}

           M-step:

               P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) P(z_k \mid d_i, w_j)}{\sum_{m=1}^{M} \sum_{n=1}^{N} n(d_n, w_m) P(z_k \mid d_n, w_m)}

               P(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k \mid d_i, w_j)}{n(d_i)}
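   Putting the two updates into code: a minimal numpy sketch of exactly these formulas (my own illustration, not from the slides). The array names n_dw, p_w_z, p_z_d, the random initialization, the small smoothing constants, and the fixed iteration count (no convergence check) are all assumptions.

```python
import numpy as np

def plsa_em(n_dw, K, n_iter=100, seed=0):
    """pLSA via EM on an (N, M) document-word count matrix n_dw.
    Returns P(w|z) as a (K, M) matrix and P(z|d) as an (N, K) matrix."""
    rng = np.random.default_rng(seed)
    N, M = n_dw.shape
    p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # P(w|z)
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)   # P(z|d)

    for _ in range(n_iter):
        # E-step: P(z_k | d_i, w_j) proportional to P(w_j|z_k) P(z_k|d_i), shape (N, M, K)
        p_z_dw = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        p_z_dw /= p_z_dw.sum(axis=2, keepdims=True) + 1e-12

        # M-step: re-estimate P(w|z) and P(z|d) from the expected counts
        expected = n_dw[:, :, None] * p_z_dw                # n(d_i,w_j) P(z_k|d_i,w_j)
        p_w_z = expected.sum(axis=0).T                      # (K, M): sum over documents
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=1)                        # (N, K): sum over words
        p_z_d /= n_dw.sum(axis=1, keepdims=True) + 1e-12    # divide by n(d_i)
    return p_w_z, p_z_d
```

   A typical call would pass a document-term count matrix and a chosen number of topics, e.g. `p_w_z, p_z_d = plsa_em(counts, K=10)`.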
References


   R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, Wiley-Interscience, 2001.
   T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, 2001, pp. 177–196.
   A. Ng, "Machine Learning (CS229)" course notes, Stanford University.
Appendix

   Generative model for word/document co-occurrence
           select a document d_i with probability (w.p.) P(d_i)
           pick a latent class z_k w.p. P(z_k | d_i)
           generate a word w_j w.p. P(w_j | z_k)

               P(d_i, w_j) = \sum_{k=1}^{K} P(d_i, w_j \mid z_k) P(z_k)
                           = \sum_{k=1}^{K} P(w_j \mid z_k) P(d_i \mid z_k) P(z_k)
                           = \sum_{k=1}^{K} P(w_j \mid z_k) P(z_k \mid d_i) P(d_i)
                           = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k) P(z_k \mid d_i)

   Since P(d_i, w_j) = P(w_j \mid d_i) P(d_i),

               \Longrightarrow P(w_j \mid d_i) = \sum_{k=1}^{K} P(z_k \mid d_i) P(w_j \mid z_k)
               P(w_j \mid d_i) = \sum_{k=1}^{K} P(z_k \mid d_i) P(w_j \mid z_k)

   Since \sum_{k=1}^{K} P(z_k \mid d_i) = 1, P(w_j \mid d_i) is a convex combination of the P(w_j \mid z_k),
   i.e., each document is modelled as a mixture of topics.
               P(z_k \mid d_i, w_j) = \frac{P(d_i, w_j \mid z_k) P(z_k)}{P(d_i, w_j)}                                        (5)
                                    = \frac{P(w_j \mid z_k) P(d_i \mid z_k) P(z_k)}{P(d_i, w_j)}                             (6)
                                    = \frac{P(w_j \mid z_k) P(z_k \mid d_i)}{P(w_j \mid d_i)}                                (7)
                                    = \frac{P(w_j \mid z_k) P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l) P(z_l \mid d_i)} (8)

   From (5) to (6) by the conditional independence assumption (3). From (6) to (7) using
   P(d_i \mid z_k) P(z_k) = P(z_k \mid d_i) P(d_i) and P(d_i, w_j) = P(w_j \mid d_i) P(d_i). From (7) to (8) by (4).
   Lagrange multipliers τ_k, ρ_i:

               H = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k \mid d_i, w_j) \log [P(w_j \mid z_k) P(z_k \mid d_i)]
                   + \sum_{k=1}^{K} \tau_k \Big( 1 - \sum_{j=1}^{M} P(w_j \mid z_k) \Big)
                   + \sum_{i=1}^{N} \rho_i \Big( 1 - \sum_{k=1}^{K} P(z_k \mid d_i) \Big)

               \frac{\partial H}{\partial P(w_j \mid z_k)} = \frac{\sum_{i=1}^{N} P(z_k \mid d_i, w_j) n(d_i, w_j)}{P(w_j \mid z_k)} - \tau_k = 0

               \frac{\partial H}{\partial P(z_k \mid d_i)} = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k \mid d_i, w_j)}{P(z_k \mid d_i)} - \rho_i = 0
   From \sum_{j=1}^{M} P(w_j \mid z_k) = 1:

               \tau_k = \sum_{j=1}^{M} \sum_{i=1}^{N} P(z_k \mid d_i, w_j) n(d_i, w_j)

   From \sum_{k=1}^{K} P(z_k \mid d_i, w_j) = 1:

               \rho_i = n(d_i)

   Substituting τ_k and ρ_i back gives the M-step formulas for P(w_j | z_k) and P(z_k | d_i).
   Applying Jensen's inequality

   f(x) = log(x), a concave function:

               f\left( E_{z^{(i)} \sim Q_i}\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right] \right)
               \ge E_{z^{(i)} \sim Q_i}\left[ f\left( \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right) \right]
