Latent Dirichlet Allocation
Soojung Hong
Feb 6. 2017
Contents
❏ Introduction : Text corpora modeling
❏ Background and Terminology
❏ Latent Dirichlet Allocation
❏ Comparison with other latent variable models
❏ Inference and Parameter Estimation
❏ Applications and Empirical Results
❏ Summary
Text corpora modeling
❏ Goal
Finding short descriptions of the members of a collection
(e.g. finding the topics of the documents in a corpus)
In particular, the descriptions should preserve the essential statistical relationships
❏ Application areas
classification, novelty detection, summarization and collaborative filtering
❏ Relevant approaches for text modeling
tf-idf scheme, LSI, pLSI and latent Dirichlet allocation (LDA)
Summary of other approaches
❏ tf-idf
Advantage : Reduces documents of arbitrary length to fixed-length lists of numbers
Disadvantage : Provides only a relatively small amount of reduction and reveals little
about inter- or intra-document statistical structure
❏ LSI
Advantage : Reduces the dimensionality and learns latent topics by performing a matrix
decomposition (SVD) on the term-document matrix
Disadvantage : Not clear why one should use LSI for a generative model of text
❏ pLSI
Advantage : A significant step forward in probabilistic modeling of text
Disadvantage : No probabilistic model at the level of documents; the number of parameters
grows linearly with the size of the corpus, which leads to overfitting; not clear how to
assign probability to a document outside the training set
Details of tf-idf
● Term Frequency - Inverse Document Frequency : tf-idf(t, d) = tf(t, d) × idf(t)
● Term Frequency : How frequently a term occurs in a given document
● Inverse Document Frequency : How rare the term is across the corpus, e.g. idf(t) = log(N / df(t)),
where N is the number of documents and df(t) is the number of documents containing t
http://www.joyofdata.de/blog/tf-idf-statistic-keyword-extraction/
Words with a high tf-idf score occur frequently within a given document
but rarely across the corpus, so they tend to carry more information
about that document (a small sketch of the computation follows below).
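For illustration only, a minimal Python sketch of the tf-idf computation on a toy corpus; the documents and the log-based idf variant are assumptions, not part of the original slides.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dogs and the cats make good pets",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)   # term frequency within one document
    idf = math.log(N / df[term])                    # inverse document frequency
    return tf * idf

# "mat" is rare in the corpus, "the" appears everywhere, so "mat" scores higher
print(tf_idf("mat", tokenized[0]))   # ~0.18
print(tf_idf("the", tokenized[0]))   # 0.0
```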
Details of LSI (also called LSA)
❏ Analysis of latent semantics in a corpus of text (by using Singular Value Decomposition)
❏ A collection of documents can be represented as a term-document matrix
❏ Similarity of doc-doc, term-term, term-doc can be measured by cosine similarity
❏ Polysemy and synonymy are hard to capture with simple term matching
❏ LSI transforms the original data into a different (lower-dimensional) space so that two
documents/words about the same concept are mapped close to each other
https://simonpaarlberg.com/post/latent-semantic-analyses/
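A minimal sketch of LSI with scikit-learn's truncated SVD; the toy corpus and the choice of two components are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (an assumption for illustration)
docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a chef cooked a tasty meal",
]
X = CountVectorizer().fit_transform(docs)   # document-term counts

# LSI: project documents into a low-dimensional latent "concept" space via truncated SVD
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = lsi.fit_transform(X)

# Document-document similarity in the latent space; the two vehicle documents
# should come out much more similar to each other than to the cooking one
print(cosine_similarity(doc_concepts))
```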
Details of pLSI
❏ pLSI models each word in a document as a sample from a mixture model
❏ The mixture components are multinomial random variables that can be viewed as
representations of “topics”. Thus each word is generated from a single topic
❏ Each document is represented as a list of mixing proportions for these mixture
components (“topics”)
d : the document index variable
c : a word's topic, drawn from the document's topic distribution P(c|d)
w : a word, drawn from the word distribution of this word's topic, P(w|c)
d and w : observable variables
c : a latent variable (sometimes denoted z)
Plate notation representing the pLSA model
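For illustration, a small sketch (with assumed toy sizes) of the pLSI generative story described above: each word's topic c is drawn from the document-specific P(c|d), and the word from P(w|c).

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, V, N = 4, 3, 8, 20   # documents, topics, vocabulary size, words per document (illustrative)

p_c_given_d = rng.dirichlet(np.ones(K), size=M)   # P(c|d): a topic distribution per training document
p_w_given_c = rng.dirichlet(np.ones(V), size=K)   # P(w|c): a word distribution per topic

docs = []
for d in range(M):
    words = []
    for _ in range(N):
        c = rng.choice(K, p=p_c_given_d[d])            # draw this word's topic from P(c|d)
        words.append(rng.choice(V, p=p_w_given_c[c]))  # draw the word from P(w|c)
    docs.append(words)
print(docs[0])
```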
Two assumptions
❏ Bag-of-words (Fundamental probabilistic assumption for dimensionality reduction)
The order of the words in a document can be neglected.
This is an assumption of exchangeability for the words in a document.
❏ Exchangeability (Documents are exchangeable)
The specific ordering of the documents in a corpus can be neglected
❏ Exchangeability is not equivalent to assuming that the random variables are
independent and identically distributed; rather, they are conditionally independent and
identically distributed with respect to an underlying latent parameter of a
probability distribution.
Notation and Terminology
❏ Latent variables aim to capture abstract notions such as topics
❏ A word is the basic unit of discrete data, defined as an item from a vocabulary indexed by {1, ..., V}.
Words are represented as unit-basis vectors that have a single component equal to one and all other
components equal to zero: the v-th word in the vocabulary is represented by a V-vector w
such that w^v = 1 and w^u = 0 for u ≠ v
❏ A document is a sequence of N words denoted by w = (w1, w2, ..., wN)
❏ A corpus is a collection of M documents denoted by D = {w1, w2, ..., wM}
Latent Dirichlet Allocation
❏ Generative probabilistic model of a corpus
The basic idea is that documents are represented as random mixtures over latent topics,
where each topic is characterized by a distribution over words.
❏ LDA generative process for each document w in a corpus D:
1. Choose N ∼ Poisson(ξ)
2. Choose θ ∼ Dir(α)
3. For each of the N words wn:
(a) Choose a topic zn ∼ Multinomial(θ).
(b) Choose a word wn from p(wn |zn,β), a multinomial probability conditioned on the topic zn.
❏ Several simplifying assumptions : [1] The dimensionality k of the Dirichlet distribution (and
thus the dimensionality of the topic variable z) is assumed to be known and fixed. [2] The word
probabilities are parameterized by a k × V matrix β, where βij = p(w^j = 1 | z^i = 1), which for
now we treat as a fixed quantity to be estimated. [3] The Poisson assumption is not critical to
anything that follows. [4] N is independent of all the other data-generating variables (θ and z).
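A minimal NumPy sketch of this generative process; the topic count, vocabulary size, hyperparameter values and the random topic-word matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
k, V = 4, 10                               # number of topics and vocabulary size (illustrative)
alpha = np.full(k, 0.5)                    # Dirichlet parameter (illustrative value)
beta = rng.dirichlet(np.ones(V), size=k)   # k x V topic-word probabilities (here random)

def generate_document(xi=20):
    N = rng.poisson(xi)                         # 1. choose document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                # 2. choose topic proportions theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)              # 3a. choose a topic z_n ~ Multinomial(theta)
        words.append(rng.choice(V, p=beta[z]))  # 3b. choose a word w_n ~ p(w_n | z_n, beta)
    return words

corpus = [generate_document() for _ in range(5)]
```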
LDA (continue)
● A k-dimensional Dirichlet random variable θ takes values in the (k−1)-simplex
(a k-vector θ lies in the (k−1)-simplex if θi ≥ 0 for all i and Σi θi = 1)
● The probability density on this simplex is
p(θ | α) = ( Γ(Σi αi) / Πi Γ(αi) ) θ1^(α1−1) ··· θk^(αk−1)
● The parameter α is a k-vector with components αi > 0, and Γ(x) is the gamma function.
● Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics
z, and a set of N words w is given by:
p(θ, z, w | α, β) = p(θ | α) Πn p(zn | θ) p(wn | zn, β)
LDA (continue)
● where p(zn | θ) is simply θi for the unique i such that zn^i = 1.
● Integrating over θ and summing over z, we obtain the marginal distribution of a document:
p(w | α, β) = ∫ p(θ | α) ( Πn Σzn p(zn | θ) p(wn | zn, β) ) dθ
● The probability of a corpus is the product of the marginal probabilities of its documents:
p(D | α, β) = Πd ∫ p(θd | α) ( Πn Σzdn p(zdn | θd) p(wdn | zdn, β) ) dθd
Graphical Model Representation of LDA
Three levels to the LDA representation
1. The parameters α and β are corpus-level parameters, assumed to be sampled once in the
process of generating a corpus
2. The variables θd are document-level variables, sampled once per document
3. The variables zdn and wdn are word-level variables, sampled once for each word in each
document
LDA and Exchangeability
● A finite set of random variables {z1, ..., zN} is said to be exchangeable if the joint
distribution is invariant to permutation: if π is a permutation of the integers from 1 to N,
then p(z1, ..., zN) = p(zπ(1), ..., zπ(N)).
● An infinite sequence of random variables is infinitely exchangeable if every finite
subsequence is exchangeable.
● De Finetti’s representation theorem states that “the joint distribution of an infinitely
exchangeable sequence of random variables is as if a random parameter were drawn from some
distribution and then the random variables in question were independent and identically distributed,
conditioned on that parameter”.
● By de Finetti’s theorem, the probability of a sequence of words and topics must have the
form p(w, z) = ∫ p(θ) ( Πn p(zn | θ) p(wn | zn) ) dθ, where θ is the random parameter of a
multinomial over topics.
Continuous mixture of unigrams
● The LDA model is more elaborate than the two-level models often studied in the classical
hierarchical Bayesian literature
● By marginalizing over the hidden topic variable z, LDA can be understood as a two-level model
● The word distribution is p(w | θ, β) = Σz p(w | z, β) p(z | θ)
Continuous mixture of unigrams (continue)
Given this word distribution, the generative process for a document w is:
1. Choose θ ~ Dir(α)
2. For each of the N words wn :
Choose a word wn from p(wn | θ, β)
This process defines the marginal distribution of a document as a continuous mixture
distribution:
p(w | α, β) = ∫ p(θ | α) ( Πn p(wn | θ, β) ) dθ
where p(θ | α) gives the mixture weights and the p(wn | θ, β) are the mixture components.
Interpretation of LDA
Example density on unigram distributions p(w | θ, β) under LDA for three words and four topics:
● The triangle embedded in the x-y plane is the 2-D simplex representing all possible
multinomial distributions over three words.
● Each of the vertices corresponds to a deterministic distribution that assigns probability one
to one of the words; the midpoint of an edge gives probability 0.5 to two of the words; the
centroid of the triangle is the uniform distribution over all three words.
● The locations of the multinomial distributions p(w|z) for each of the four topics are marked
on the simplex, and the surface shown on top of the simplex is an example of a density over
the (V−1)-simplex (multinomial distributions of words) given by LDA.
Comparison : LDA and other latent variable models
(a) Unigram model : The words of every document are drawn independently from a single
multinomial distribution.
(b) Mixture of unigrams : Augments the unigram model with a discrete random topic variable z.
Each document is generated by first choosing a topic z and then generating N words
independently from the conditional multinomial p(w|z).
(c) pLSI model : Posits that a document label d and a word wn are conditionally independent
given an unobserved topic z.
Drawback of pLSI
❏ pLSI does capture the possibility that a document may contain multiple topics
- p(z|d) serves as the mixture weights of the topics for a particular document d
- However, d is a dummy index into the list of documents in the training set, so d is a
multinomial random variable whose values are restricted to the training documents
❏ pLSI is not a well-defined generative model of documents
- There is no natural way to assign probability to a previously unseen document.
- The parameters for a k-topic pLSI model are k multinomial distributions of size V and
M mixtures over the k hidden topics. This gives kV + kM parameters and therefore
linear growth in M. The linear growth in parameters suggests that the model is prone
to overfitting.
LDA overcomes pLSI problem
❏ LDA treats the topic mixture weights as a k-parameter hidden random variable rather
than a large set of individual parameters explicitly linked to the training set
❏ LDA is a well-defined generative model and generalizes easily to new documents.
Furthermore, the k + kV parameters in a k-topic LDA model do not grow with the size of the
training corpus, so LDA does not suffer from the same overfitting issue.
(Graphical models of pLSI and LDA, for comparison)
Geometric Interpretation of latent space
Difference between LDA and other latent topic models (unigram, mixture of unigrams, pLSI)
● The unigram model finds a single point on the word simplex and posits that all words in the
corpus come from the corresponding distribution.
● The mixture of unigrams model posits that for each document, one of the k points on the word
simplex (that is, one of the corners of the topic simplex) is chosen randomly and all the
words of the document are drawn from the distribution corresponding to that point.
● The pLSI model posits that each word of a training document comes from a randomly chosen
topic. The topics are themselves drawn from a document-specific distribution over topics.
● LDA posits that each word of both observed and unseen documents is generated by a randomly
chosen topic, which is drawn from a distribution with a randomly chosen parameter.
http://parkcu.com/blog/
Inference and Parameter Estimation
● The key inferential problem we must solve in order to use LDA is computing the posterior
distribution of the hidden variables given a document. This distribution is intractable to
compute in general. The posterior can be written as
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
● To normalize this distribution we marginalize over the hidden variables, which gives the
marginal likelihood p(w | α, β) written in terms of the model parameters (*)
● Problem : The denominator contains the marginal density of the observations (the evidence).
This evidence integral is often unavailable in closed form or requires exponential time to
compute. Therefore, we need an approximate inference algorithm.
Inference and Parameter Estimation (continue)
● Goal : Find the best candidate approximation q, the member of a family of densities over the
latent variables that is closest in KL divergence to the exact conditional posterior; each
member of the family is a candidate approximation. Inference thus reduces to solving an
optimization problem.
● The problematic coupling in (*) between θ and β arises due to the edges between θ, z and
w. We therefore drop these edges and the w nodes, and endow the resulting
simplified graphical model with free variational parameters.
Graphical model of the variational distribution
used to approximate the posterior in LDA
● By dropping the edges between θ and β and the w nodes, and endowing the resulting
simplified graphical model with free variational parameters, we obtain a family of
distributions on the latent variables. This family is characterized by the following variational
distribution:
q(θ, z | γ, φ) = q(θ | γ) Πn q(zn | φn)
where the Dirichlet parameter γ and the multinomial parameters (φ1, ..., φN) are the free
variational parameters.
● The next step is to set up an optimization problem that determines the values of the
variational parameters γ and φ.
Parameter Estimation
● Given a corpus of documents D = {w1, w2, ..., wM}
● Find parameters α and β that maximize the (marginal) log likelihood of the data:
ℓ(α, β) = Σd log p(wd | α, β)
● The quantity p(w | α, β) cannot be computed tractably
● Variational inference provides a tractable lower bound on the log likelihood,
which we can maximize with respect to α and β
● We find approximate empirical Bayes estimates for the LDA model via an alternating
variational EM procedure that maximizes a lower bound with respect to the variational
parameters γ and φ, and then, for fixed values of the variational parameters, maximizes
the lower bound with respect to the model parameters α and β.
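A minimal NumPy/SciPy sketch of the per-document variational updates (the E-step of the procedure above) with α and β held fixed; the coordinate-ascent updates φni ∝ βi,wn exp(Ψ(γi)) and γi = αi + Σn φni follow the paper, while the iteration cap, tolerance and initialisation details are assumptions.

```python
import numpy as np
from scipy.special import digamma

def variational_e_step(word_ids, alpha, beta, n_iter=100, tol=1e-4):
    """Coordinate-ascent updates for one document's variational parameters (gamma, phi),
    with the model parameters alpha (shape k) and beta (shape k x V) held fixed."""
    k = beta.shape[0]
    N = len(word_ids)
    phi = np.full((N, k), 1.0 / k)                   # q(z_n): start uniform
    gamma = np.asarray(alpha, dtype=float) + N / k   # q(theta): initialised as alpha_i + N/k
    for _ in range(n_iter):
        old_gamma = gamma.copy()
        # phi_{n,i} proportional to beta_{i, w_n} * exp(digamma(gamma_i))
        log_phi = np.log(beta[:, word_ids].T + 1e-100) + digamma(gamma)
        log_phi -= log_phi.max(axis=1, keepdims=True)    # for numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_{n,i}
        gamma = alpha + phi.sum(axis=0)
        if np.abs(gamma - old_gamma).mean() < tol:
            break
    return gamma, phi
```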
Smoothing
● A large vocabulary size is characteristic of document corpora and often leads to problems
with sparsity
● Maximum likelihood estimates of the multinomial parameters β assign zero probability to
new words, and thus zero probability to new documents
● To avoid this problem, “smooth” the multinomial parameters β,
assigning positive probability to all vocabulary items whether or not they are observed in
the training set
⇒ The proposed solution is to apply variational inference methods to an extended
model that places Dirichlet smoothing on the multinomial parameters β
Graphical model representation of the smoothed LDA model.
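As an illustration of the effect of smoothing (not the paper's full variational treatment, which places an exchangeable Dirichlet prior on the rows of β), a tiny sketch of additive smoothing of a topic-word matrix; the value of eta is an assumption.

```python
import numpy as np

def smooth_topic_word(expected_counts, eta=0.01):
    """Additive (Dirichlet-style) smoothing of the topic-word matrix: every vocabulary item,
    observed in training or not, receives positive probability under every topic.
    expected_counts[i, j] is the expected count of word j under topic i."""
    smoothed = expected_counts + eta
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```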
Example Data
❏ Data : 16,000 documents from a subset of the TREC AP corpus
❏ Preparation : Remove a standard list of stop words, then use the variational EM algorithm to
find the Dirichlet and conditional multinomial parameters for a 100-topic LDA model
❏ The top words from some of the resulting multinomial distributions p(w|z) are examined;
(hopefully) these distributions capture some of the underlying topics in the corpus
Applications and Empirical Results : (1) Document Modeling
❏ Trained a number of latent variable models on two corpora
❏ Compared the generalization performance (perplexity) of each model
❏ The goal is density estimation: achieve high likelihood on a held-out test set
❏ The perplexity, used by convention in language modeling, is monotonically decreasing
in the likelihood of the test data
❏ A lower perplexity score indicates better generalization performance
❏ Formally, for a test set of M documents, the perplexity is
perplexity(D_test) = exp{ − Σd log p(wd) / Σd Nd }
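A small sketch of that computation, assuming the per-document held-out log-likelihoods log p(wd) have already been obtained from the model.

```python
import numpy as np

def perplexity(log_likelihoods, doc_lengths):
    """perplexity = exp( - sum_d log p(w_d) / sum_d N_d )
    log_likelihoods[d] is the model's log p(w_d) for test document d; doc_lengths[d] is N_d."""
    return float(np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths)))
```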
Perplexity Results
The mixture of unigrams model and pLSI both suffer from serious overfitting, while
LDA can assign probability to a new document without overfitting.
● Nematode (C. elegans) abstracts corpus : 5,225 abstracts, 28,414 unique terms
● TREC AP corpus : 16,333 newswire articles, 23,075 unique terms
Applications and Empirical Results : Document classification
❏ Goal : classify a document into two or more mutually exclusive classes. In particular, by
using one LDA module for each class, we obtain a generative model for classification
❏ A challenging aspect of the document classification problem is the choice of features.
Treating individual words as features yields a rich but very large feature set
❏ One way to reduce the feature set is to use an LDA model for dimensionality reduction. In
particular, LDA reduces any document to a fixed set of real-valued features - the posterior
Dirichlet parameters γ*(w) associated with the document.
❏ Two binary classification experiments using the Reuters-21578 dataset
❏ The dataset contains 8,000 documents and 15,818 words.
❏ The parameters of an LDA model were estimated on all the documents, without reference to
their true class labels.
❏ SVM comparison : one SVM trained on the low-dimensional representations provided by LDA
and another trained on all the word features (i.e. low-dimensional LDA features vs. all
word features, each followed by SVM classification)
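A minimal sketch of the same comparison with scikit-learn, as an illustration only: 20 Newsgroups stands in for Reuters-21578, scikit-learn's online variational LDA stands in for the paper's implementation, and the topic count and feature cap are arbitrary choices.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Binary classification task on a stand-in corpus
docs = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X_counts = CountVectorizer(stop_words="english", max_features=10000).fit_transform(docs.data)

# Dimensionality reduction: represent each document by its topic proportions
lda = LatentDirichletAllocation(n_components=50, random_state=0)
X_topics = lda.fit_transform(X_counts)          # shape: (n_docs, 50)

# SVM on all word features vs. SVM on the low-dimensional LDA features
print("word features:", cross_val_score(LinearSVC(), X_counts, docs.target, cv=5).mean())
print("LDA features: ", cross_val_score(LinearSVC(), X_topics, docs.target, cv=5).mean())
```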
Applications and Empirical Results : Collaborative filtering
❏ The final experiment uses the EachMovie collaborative filtering data
❏ A collection of users indicate their preferred movies. A user and the movies chosen
are analogous to a document and the words in the document (respectively)
❏ The collaborative filtering task is as follows :
1. Train a model on a fully observed set of users
2. For each unobserved user, we are shown all but one of the movies
preferred by that user and are asked to predict what the held-out
movie is. The different algorithms are evaluated according to the
likelihood they assign to the held-out movie.
3. Define the predictive perplexity on M test users to be :
predictive-perplexity(D_test) = exp{ − Σd log p(wd,Nd | wd,1:Nd−1) / M }
Result for collaborative filtering on the EachMovie data
(3,300 training users and 390 testing users; the plot compares the predictive perplexity of the
mixture of unigrams model and the LDA model)
Summary
❏ Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete data such as text corpora.
❏ LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a
finite mixture over an underlying set of topics.
❏ Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
❏ The paper presents efficient approximate inference techniques based on variational methods and an EM
algorithm for empirical Bayes parameter estimation.
❏ The paper reports results in document modeling, text classification and collaborative filtering,
compared to the mixture of unigrams model and the pLSI model.