Latent Dirichlet Allocation
Soojung Hong
Feb 6. 2017
Contents
❏ Introduction : Text corpora modeling
❏ Background and Terminology
❏ Latent Dirichlet Allocation
❏ Comparison with other latent variable models
❏ Inference and Parameter Estimation
❏ Applications and Empirical Results
❏ Summary
Text corpora modeling
❏ Goal
Finding short descriptions of the members of a collection
(e.g. finding the topics of the documents in a corpus)
In particular, the descriptions should preserve the essential statistical relationships
❏ Application areas
classification, novelty detection, summarization and collaborative filtering
❏ Relevant approaches for text modeling
tf-idf scheme, LSI, pLSI and latent Dirichlet allocation (LDA)
Summary of other approaches
❏ tf-idf
Advantage : Reduces documents of arbitrary length to fixed-length lists of numbers
Disadvantage : Provides only a relatively small amount of reduction and reveals little
about inter- or intra-document statistical structure
❏ LSI
Advantage : Reduces the dimensionality and learns latent topics by performing a matrix
decomposition (SVD) on the term-document matrix
Disadvantage : Not clear why one should use LSI for a generative model of text
❏ pLSI
Advantage : A significant step forward in probabilistic modeling of text
Disadvantage : No probabilistic model at the level of documents; the number of parameters
grows linearly with the size of the corpus, which leads to overfitting; not clear how to
assign probability to a document outside the training set
Details of tf-idf
● Term Frequency - Inverse Document Frequency : tf-idf(t, d) = tf(t, d) × idf(t)
● Term Frequency : How frequently a term occurs in a given document
● Inverse Document Frequency : How rare the term is across the corpus, e.g. idf(t) = log(N / df(t)),
where N is the number of documents and df(t) is the number of documents containing t
http://www.joyofdata.de/blog/tf-idf-statistic-keyword-extraction/
Words with a high tf-idf score occur frequently within a given document
but rarely across the corpus, so they tend to carry more information
about that document (a small sketch of the computation follows below).
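For illustration only, a minimal Python sketch of the tf-idf computation on a toy corpus; the documents and the log-based idf variant are assumptions, not part of the original slides.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dogs and the cats make good pets",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)   # term frequency within one document
    idf = math.log(N / df[term])                    # inverse document frequency
    return tf * idf

# "mat" is rare in the corpus, "the" appears everywhere, so "mat" scores higher
print(tf_idf("mat", tokenized[0]))   # ~0.18
print(tf_idf("the", tokenized[0]))   # 0.0
```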
Details of LSI (also called LSA)
❏ Analysis of latent semantics in a corpus of text (by using Singular Value Decomposition)
❏ A collection of documents can be represented as a term-document matrix
❏ Similarity of doc-doc, term-term, term-doc can be measured by cosine similarity
❏ Polysemy and synonymy are hard to capture with simple term matching
❏ LSI transforms the original data into a different (lower-dimensional) space so that two
documents/words about the same concept are mapped close to each other
https://simonpaarlberg.com/post/latent-semantic-analyses/
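A minimal sketch of LSI with scikit-learn's truncated SVD; the toy corpus and the choice of two components are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (an assumption for illustration)
docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a chef cooked a tasty meal",
]
X = CountVectorizer().fit_transform(docs)   # document-term counts

# LSI: project documents into a low-dimensional latent "concept" space via truncated SVD
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = lsi.fit_transform(X)

# Document-document similarity in the latent space; the two vehicle documents
# should come out much more similar to each other than to the cooking one
print(cosine_similarity(doc_concepts))
```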
Details of pLSI
❏ pLSI models each word in a document as a sample from a mixture model
❏ The mixture components are multinomial random variables that can be viewed as
representations of “topics”. Thus each word is generated from a single topic
❏ Each document is represented as a list of mixing proportions for these mixture
components (“topics”)
d : the document index variable
c : a word's topic, drawn from the document's topic distribution P(c|d)
w : a word, drawn from the word distribution of this word's topic, P(w|c)
d and w : observable variables
c : a latent variable (sometimes denoted z)
Plate notation representing the pLSA model
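For illustration, a small sketch (with assumed toy sizes) of the pLSI generative story described above: each word's topic c is drawn from the document-specific P(c|d), and the word from P(w|c).

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, V, N = 4, 3, 8, 20   # documents, topics, vocabulary size, words per document (illustrative)

p_c_given_d = rng.dirichlet(np.ones(K), size=M)   # P(c|d): a topic distribution per training document
p_w_given_c = rng.dirichlet(np.ones(V), size=K)   # P(w|c): a word distribution per topic

docs = []
for d in range(M):
    words = []
    for _ in range(N):
        c = rng.choice(K, p=p_c_given_d[d])            # draw this word's topic from P(c|d)
        words.append(rng.choice(V, p=p_w_given_c[c]))  # draw the word from P(w|c)
    docs.append(words)
print(docs[0])
```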
Two assumptions
❏ Bag-of-words (Fundamental probabilistic assumption for dimensionality reduction)
The order of the words in a document can be neglected.
This is an assumption of exchangeability for the words in a document.
❏ Exchangeability (Documents are exchangeable)
The specific ordering of the documents in a corpus can be neglected
❏ Exchangeability is not equivalent to assuming that the random variables are
independent and identically distributed; rather, they are conditionally independent and
identically distributed with respect to an underlying latent parameter of a
probability distribution.
Notation and Terminology
❏ Latent variables aim to capture abstract notions such as topics
❏ A word is the basic unit of discrete data, defined as an item from a vocabulary indexed by {1, ..., V}.
Words are represented as unit-basis vectors that have a single component equal to one and all other
components equal to zero: the v-th word in the vocabulary is represented by a V-vector w
such that w^v = 1 and w^u = 0 for u ≠ v
❏ A document is a sequence of N words denoted by w = (w1, w2, ..., wN)
❏ A corpus is a collection of M documents denoted by D = {w1, w2, ..., wM}
Latent Dirichlet Allocation
❏ Generative probabilistic model of a corpus
The basic idea is that documents are represented as random mixtures over latent topics,
where each topic is characterized by a distribution over words.
❏ LDA generative process for each document w in a corpus D:
1. Choose N ∼ Poisson(ξ)
2. Choose θ ∼ Dir(α)
3. For each of the N words wn:
(a) Choose a topic zn ∼ Multinomial(θ).
(b) Choose a word wn from p(wn |zn,β), a multinomial probability conditioned on the topic zn.
❏ Several simplifying assumptions : [1] The dimensionality k of the Dirichlet distribution (and
thus the dimensionality of the topic variable z) is assumed to be known and fixed. [2] The word
probabilities are parameterized by a k × V matrix β, where βij = p(w^j = 1 | z^i = 1), which for
now we treat as a fixed quantity to be estimated. [3] The Poisson assumption is not critical to
anything that follows. [4] N is independent of all the other data-generating variables (θ and z).
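A minimal NumPy sketch of this generative process; the topic count, vocabulary size, hyperparameter values and the random topic-word matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
k, V = 4, 10                               # number of topics and vocabulary size (illustrative)
alpha = np.full(k, 0.5)                    # Dirichlet parameter (illustrative value)
beta = rng.dirichlet(np.ones(V), size=k)   # k x V topic-word probabilities (here random)

def generate_document(xi=20):
    N = rng.poisson(xi)                         # 1. choose document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                # 2. choose topic proportions theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)              # 3a. choose a topic z_n ~ Multinomial(theta)
        words.append(rng.choice(V, p=beta[z]))  # 3b. choose a word w_n ~ p(w_n | z_n, beta)
    return words

corpus = [generate_document() for _ in range(5)]
```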
LDA (continue)
● A k-dimensional Dirichlet random variable θ takes values in the (k−1)-simplex
(a k-vector θ lies in the (k−1)-simplex if θi ≥ 0 for all i and Σi θi = 1)
● The probability density on this simplex is
p(θ | α) = ( Γ(Σi αi) / Πi Γ(αi) ) θ1^(α1−1) ··· θk^(αk−1)
● The parameter α is a k-vector with components αi > 0, and Γ(x) is the gamma function.
● Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics
z, and a set of N words w is given by:
p(θ, z, w | α, β) = p(θ | α) Πn p(zn | θ) p(wn | zn, β)
LDA (continue)
● where p(zn | θ) is simply θi for the unique i such that zn^i = 1.
● Integrating over θ and summing over z, we obtain the marginal distribution of a document:
p(w | α, β) = ∫ p(θ | α) ( Πn Σzn p(zn | θ) p(wn | zn, β) ) dθ
● The probability of a corpus is the product of the marginal probabilities of its documents:
p(D | α, β) = Πd ∫ p(θd | α) ( Πn Σzdn p(zdn | θd) p(wdn | zdn, β) ) dθd
Graphical Model Representation of LDA
Three levels to the LDA representation
1. The parameters α and β are corpus-level parameters, assumed to be sampled once in the
process of generating a corpus
2. The variables θd are document-level variables, sampled once per document
3. The variables zdn and wdn are word-level variables, sampled once for each word in each
document
LDA and Exchangeability
● A finite set of random variables {z1, ..., zN} is said to be exchangeable if the joint
distribution is invariant to permutation: if π is a permutation of the integers from 1 to N,
then p(z1, ..., zN) = p(zπ(1), ..., zπ(N)).
● An infinite sequence of random variables is infinitely exchangeable if every finite
subsequence is exchangeable.
● De Finetti’s representation theorem states that “the joint distribution of an infinitely
exchangeable sequence of random variables is as if a random parameter were drawn from some
distribution and then the random variables in question were independent and identically distributed,
conditioned on that parameter”.
● By de Finetti’s theorem, the probability of a sequence of words and topics must have the
form p(w, z) = ∫ p(θ) ( Πn p(zn | θ) p(wn | zn) ) dθ, where θ is the random parameter of a
multinomial over topics.
Continuous mixture of unigrams
● The LDA model is more elaborate than the two-level models often studied in the classical
hierarchical Bayesian literature
● By marginalizing over the hidden topic variable z, LDA can be understood as a two-level model
● The word distribution is p(w | θ, β) = Σz p(w | z, β) p(z | θ)
Continuous mixture of unigrams (continue)
Given this word distribution, the generative process for a document w is:
1. Choose θ ~ Dir(α)
2. For each of the N words wn :
Choose a word wn from p(wn | θ, β)
This process defines the marginal distribution of a document as a continuous mixture
distribution:
p(w | α, β) = ∫ p(θ | α) ( Πn p(wn | θ, β) ) dθ
where p(θ | α) gives the mixture weights and the p(wn | θ, β) are the mixture components.
Interpretation of LDA
Example density on unigram distributions p(w | θ, β) under LDA for three words and four topics:
● The triangle embedded in the x-y plane is the 2-D simplex representing all possible
multinomial distributions over three words.
● Each of the vertices corresponds to a deterministic distribution that assigns probability one
to one of the words; the midpoint of an edge gives probability 0.5 to two of the words; the
centroid of the triangle is the uniform distribution over all three words.
● The locations of the multinomial distributions p(w|z) for each of the four topics are marked
on the simplex, and the surface shown on top of the simplex is an example of a density over
the (V−1)-simplex (multinomial distributions of words) given by LDA.
Comparison : LDA and other latent variable models
(a) Unigram model : The words of every document are drawn independently from a single
multinomial distribution.
(b) Mixture of unigrams : Augments the unigram model with a discrete random topic variable z.
Each document is generated by first choosing a topic z and then generating N words
independently from the conditional multinomial p(w|z).
(c) pLSI model : Posits that a document label d and a word wn are conditionally independent
given an unobserved topic z.
Drawback of pLSI
❏ pLSI does capture the possibility that a document may contain multiple topics
- p(z|d) serves as the mixture weights of the topics for a particular document d
- However, d is a dummy index into the list of documents in the training set, so d is a
multinomial random variable whose values are restricted to the training documents
❏ pLSI is not a well-defined generative model of documents
- There is no natural way to assign probability to a previously unseen document.
- The parameters for a k-topic pLSI model are k multinomial distributions of size V and
M mixtures over the k hidden topics. This gives kV + kM parameters and therefore
linear growth in M. The linear growth in parameters suggests that the model is prone
to overfitting.
LDA overcomes pLSI problem
❏ LDA treats the topic mixture weights as a k-parameter hidden random variable rather
than a large set of individual parameters explicitly linked to the training set
❏ LDA is a well-defined generative model and generalizes easily to new documents.
Furthermore, the k + kV parameters in a k-topic LDA model do not grow with the size of the
training corpus, so LDA does not suffer from the same overfitting issue.
(Graphical models of pLSI and LDA, for comparison)
Geometric Interpretation of latent space
Difference between LDA and other latent topic models (unigram, mixture of unigrams, pLSI)
● The unigram model finds a single point on the word simplex and posits that all words in the
corpus come from the corresponding distribution.
● The mixture of unigrams model posits that for each document, one of the k points on the word
simplex (that is, one of the corners of the topic simplex) is chosen randomly and all the
words of the document are drawn from the distribution corresponding to that point.
● The pLSI model posits that each word of a training document comes from a randomly chosen
topic. The topics are themselves drawn from a document-specific distribution over topics.
● LDA posits that each word of both observed and unseen documents is generated by a randomly
chosen topic, which is drawn from a distribution with a randomly chosen parameter.
http://parkcu.com/blog/
Inference and Parameter Estimation
● The key inferential problem we must solve in order to use LDA is computing the posterior
distribution of the hidden variables given a document. This distribution is intractable to
compute in general. The posterior can be written as
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
● To normalize this distribution we marginalize over the hidden variables, which gives the
marginal likelihood p(w | α, β) written in terms of the model parameters (*)
● Problem : The denominator contains the marginal density of the observations (the evidence).
This evidence integral is often unavailable in closed form or requires exponential time to
compute. Therefore, we need an approximate inference algorithm.
Inference and Parameter Estimation (continue)
● Goal : Find the best candidate approximation q, the member of a family of densities over the
latent variables that is closest in KL divergence to the exact conditional posterior; each
member of the family is a candidate approximation. Inference thus reduces to solving an
optimization problem.
● The problematic coupling in (*) between θ and β arises due to the edges between θ, z and
w. We therefore drop these edges and the w nodes, and endow the resulting
simplified graphical model with free variational parameters.
Graphical model of the variational distribution
used to approximate the posterior in LDA
● By dropping the edges between θ and β and the w nodes, and endowing the resulting
simplified graphical model with free variational parameters, we obtain a family of
distributions on the latent variables. This family is characterized by the following variational
distribution:
q(θ, z | γ, φ) = q(θ | γ) Πn q(zn | φn)
where the Dirichlet parameter γ and the multinomial parameters (φ1, ..., φN) are the free
variational parameters.
● The next step is to set up an optimization problem that determines the values of the
variational parameters γ and φ.
Parameter Estimation
● Given a corpus of documents D = {w1, w2, ..., wM}
● Find parameters α and β that maximize the (marginal) log likelihood of the data:
ℓ(α, β) = Σd log p(wd | α, β)
● The quantity p(w | α, β) cannot be computed tractably
● Variational inference provides a tractable lower bound on the log likelihood,
which we can maximize with respect to α and β
● We find approximate empirical Bayes estimates for the LDA model via an alternating
variational EM procedure that maximizes a lower bound with respect to the variational
parameters γ and φ, and then, for fixed values of the variational parameters, maximizes
the lower bound with respect to the model parameters α and β.
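A minimal NumPy/SciPy sketch of the per-document variational updates (the E-step of the procedure above) with α and β held fixed; the coordinate-ascent updates φni ∝ βi,wn exp(Ψ(γi)) and γi = αi + Σn φni follow the paper, while the iteration cap, tolerance and initialisation details are assumptions.

```python
import numpy as np
from scipy.special import digamma

def variational_e_step(word_ids, alpha, beta, n_iter=100, tol=1e-4):
    """Coordinate-ascent updates for one document's variational parameters (gamma, phi),
    with the model parameters alpha (shape k) and beta (shape k x V) held fixed."""
    k = beta.shape[0]
    N = len(word_ids)
    phi = np.full((N, k), 1.0 / k)                   # q(z_n): start uniform
    gamma = np.asarray(alpha, dtype=float) + N / k   # q(theta): initialised as alpha_i + N/k
    for _ in range(n_iter):
        old_gamma = gamma.copy()
        # phi_{n,i} proportional to beta_{i, w_n} * exp(digamma(gamma_i))
        log_phi = np.log(beta[:, word_ids].T + 1e-100) + digamma(gamma)
        log_phi -= log_phi.max(axis=1, keepdims=True)    # for numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_{n,i}
        gamma = alpha + phi.sum(axis=0)
        if np.abs(gamma - old_gamma).mean() < tol:
            break
    return gamma, phi
```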
Smoothing
● A large vocabulary size is characteristic of document corpora and often leads to problems
with sparsity
● Maximum likelihood estimates of the multinomial parameters β assign zero probability to
new words, and thus zero probability to new documents
● To avoid this problem, “smooth” the multinomial parameters β,
assigning positive probability to all vocabulary items whether or not they are observed in
the training set
⇒ The proposed solution is to apply variational inference methods to an extended
model that places Dirichlet smoothing on the multinomial parameters β
Graphical model representation of the smoothed LDA model.
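As an illustration of the effect of smoothing (not the paper's full variational treatment, which places an exchangeable Dirichlet prior on the rows of β), a tiny sketch of additive smoothing of a topic-word matrix; the value of eta is an assumption.

```python
import numpy as np

def smooth_topic_word(expected_counts, eta=0.01):
    """Additive (Dirichlet-style) smoothing of the topic-word matrix: every vocabulary item,
    observed in training or not, receives positive probability under every topic.
    expected_counts[i, j] is the expected count of word j under topic i."""
    smoothed = expected_counts + eta
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```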
Example Data
❏ Data : 16,000 documents from a subset of the TREC AP corpus
❏ Preparation : Remove a standard list of stop words, then use the variational EM algorithm to
find the Dirichlet and conditional multinomial parameters for a 100-topic LDA model
❏ The top words from some of the resulting multinomial distributions p(w|z) are examined;
(hopefully) these distributions capture some of the underlying topics in the corpus
Applications and Empirical Results : (1) Document Modeling
❏ Trained a number of latent variable models on two corpora
❏ Compared the generalization performance (perplexity) of each model
❏ The goal is density estimation: achieve high likelihood on a held-out test set
❏ The perplexity, used by convention in language modeling, is monotonically decreasing
in the likelihood of the test data
❏ A lower perplexity score indicates better generalization performance
❏ Formally, for a test set of M documents, the perplexity is
perplexity(D_test) = exp{ − Σd log p(wd) / Σd Nd }
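A small sketch of that computation, assuming the per-document held-out log-likelihoods log p(wd) have already been obtained from the model.

```python
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    """perplexity = exp( - sum_d log p(w_d) / sum_d N_d )
    log_likelihoods[d] is the model's log p(w_d) for test document d; doc_lengths[d] is N_d."""
    return float(np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths)))
```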
Perplexity Results
The mixture of unigrams model and pLSI both suffer from serious overfitting, while
LDA can assign probability to a new document without overfitting.
● Nematode (C. elegans) abstracts corpus : 5,225 abstracts, 28,414 unique terms
● TREC AP corpus : 16,333 newswire articles, 23,075 unique terms
Applications and Empirical Results : Document classification
❏ Goal : classify a document into two or more mutually exclusive classes. In particular, by
using one LDA module for each class, we obtain a generative model for classification
❏ A challenging aspect of the document classification problem is the choice of features.
Treating individual words as features yields a rich but very large feature set
❏ One way to reduce the feature set is to use an LDA model for dimensionality reduction. In
particular, LDA reduces any document to a fixed set of real-valued features - the posterior
Dirichlet parameters γ*(w) associated with the document.
❏ Two binary classification experiments using the Reuters-21578 dataset
❏ The dataset contains 8,000 documents and 15,818 words.
❏ The parameters of an LDA model were estimated on all the documents, without reference to
their true class labels.
❏ SVM comparison : one SVM trained on the low-dimensional representations provided by LDA
and another trained on all the word features (i.e. low-dimensional LDA features vs. all
word features, each followed by SVM classification)
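A minimal sketch of the same comparison with scikit-learn, as an illustration only: 20 Newsgroups stands in for Reuters-21578, scikit-learn's online variational LDA stands in for the paper's implementation, and the topic count and feature cap are arbitrary choices.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Binary classification task on a stand-in corpus
docs = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X_counts = CountVectorizer(stop_words="english", max_features=10000).fit_transform(docs.data)

# Dimensionality reduction: represent each document by its topic proportions
lda = LatentDirichletAllocation(n_components=50, random_state=0)
X_topics = lda.fit_transform(X_counts)          # shape: (n_docs, 50)

# SVM on all word features vs. SVM on the low-dimensional LDA features
print("word features:", cross_val_score(LinearSVC(), X_counts, docs.target, cv=5).mean())
print("LDA features: ", cross_val_score(LinearSVC(), X_topics, docs.target, cv=5).mean())
```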
Applications and Empirical Results : Collaborative filtering
❏ The final experiment uses the EachMovie collaborative filtering data
❏ A collection of users indicate their preferred movies. A user and the movies chosen
are analogous to a document and the words in the document (respectively)
❏ The collaborative filtering task is as follows :
1. Train a model on a fully observed set of users
2. For each unobserved user, we are shown all but one of the movies
preferred by that user and are asked to predict what the held-out
movie is. The different algorithms are evaluated according to the
likelihood they assign to the held-out movie.
3. Define the predictive perplexity on M test users to be :
predictive-perplexity(D_test) = exp{ − Σd log p(wd,Nd | wd,1:Nd−1) / M }
Result for collaborative filtering on the EachMovie data
(3,300 training users and 390 testing users; the plot compares the predictive perplexity of the
mixture of unigrams model and the LDA model)
Summary
❏ Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete data such as text corpora.
❏ LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a
finite mixture over an underlying set of topics.
❏ Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
❏ The paper presents efficient approximate inference techniques based on variational methods and an EM
algorithm for empirical Bayes parameter estimation.
❏ The paper reports results in document modeling, text classification and collaborative filtering,
compared to the mixture of unigrams model and the pLSI model.