Paragraph Topic Classification
Eugene Nho
Graduate School of Business
Stanford University
Stanford, CA 94305
enho@stanford.edu
Edward Ng
Department of Electrical Engineering
Stanford University
Stanford, CA 94305
edjng@stanford.edu
1 Introduction
This project implements and compares various approaches for predicting the topics of paragraph-length texts.1
The motivation for the project is the widening information-processing gap facing modern decision makers: text is generated at an ever-increasing rate, while the human cognitive capacity to ingest it remains largely constant. Machine learning can help close this gap by automatically separating contextually relevant, useful text from the rest. We see paragraph topic classification as a first step toward that future.
We formally define the problem as constructing a model that, given a corpus with k topics, outputs a k-dimensional vector
for each paragraph-length input, indicating which topics the input is about. Note that this is a multilabel classification problem, in which an input can belong to multiple topics. Approximately 315,000 Wikipedia article summaries, categorized by
users, were used as data. We employed and compared the following approaches, spanning a wide range of complexity and
computational requirements: Naive Bayes, One-vs-Rest Support Vector Machine (OvR SVM) with GloVe vectors, Latent
Dirichlet Allocation (LDA) with OvR SVM, Convolutional Neural Networks (CNN), and Long Short Term Memory networks
(LSTM).
2 Related Work
2.1 Traditional Approaches
In topic classification with traditional machine learning, Naive Bayes has been preeminent because it is both accurate and quick to train in many contexts (Ting et al., 2011). Furthermore, many variations of Naive Bayes have been developed in recent years. For example, an algorithm developed by Wang and Manning (2012) uses Naive Bayes log-count ratios as features for an SVM model (known as NB-SVM) and uses bigrams rather than term frequency-inverse document frequency (tf-idf). This combined model (with a small modification, known as NB-LM) is considered among the state-of-the-art supervised learning algorithms for topic classification.
2.2 Topic Modeling
Topic modeling is another area of active research. Topic models are statistical models used to discover the topics that occur in documents. Latent Dirichlet Allocation (LDA) is the simplest topic model and provides the foundation for many later extensions, which relax some of LDA's assumptions and incorporate other types of metadata (Blei, 2012). Because LDA is unsupervised, labeled LDA was developed to deploy LDA in a supervised learning context (Ramage et al., 2009). While labeled LDA has performance comparable to SVM, it is not considered state-of-the-art.
1
This project was conducted for CS 229: Machine Learning at Stanford University during Autumn Quarter 2016, and shared the code
base with a project for CS 221: Artificial Intelligence: Principles and Techniques, where Sigtryggur Kjartansson was also an author.
2.3 Neural Networks
The recent surge of interest in neural networks has spilled over to topic classification. Convolutional neural networks (CNN), initially used for image classification, have been adapted to natural language processing (NLP) and are considered state-of-the-art for topic classification (Kim, 2014). Many variations of CNN have been developed since then, including one that uses individual characters as inputs (Zhang et al., 2015) and another that trains embeddings through parallel convolution layers with differing region sizes (Johnson and Zhang, 2014). However, the increased complexity of neural network topology requires significantly larger training datasets. For example, according to Zhang et al. (2015), character-level CNNs need datasets on the order of several million examples to outperform traditional approaches.
3 Dataset
We used Wikipedia article summaries as our data. Each article’s first summary paragraph served as our input, and its topic assignments as our labels. We scraped Wikipedia for a total of approximately 315,000 inputs, all belonging to one or more of the following five topics: mathematics, politics, computer science, film, and music.
We limited our scope to five topics so that we could test the efficacy of our models without requiring an unwieldy amount of data and, consequently, of computational resources. Our training and test sets contained 275,000 and 40,000 entries respectively, with an average word count of 93 per summary.
We used Wikipedia because it is one of the few available sources of topically labeled, multilabel data. We insisted on solving a multilabel problem because, in real-world applications, a piece of text is rarely about only one topic. An example entry is:
In mathematics, the minimum k-cut, is a combinatorial optimization problem that requires finding a set of
edges whose removal would partition the graph to k connected components. These edges are referred to
as k-cut. The goal is to find the minimum-weight k-cut. This partitioning can have applications in VLSI
design, data-mining, finite elements and communication in parallel computing.
(Labeled as mathematics, computer science)
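For reference, summaries like the one above can be pulled programmatically from the MediaWiki API. The sketch below is illustrative only, not our exact scraping pipeline; the endpoint and query parameters are standard MediaWiki ones, while the category name and the mapping from categories to our five topics are assumptions.

import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category, limit=50):
    # List article titles in a Wikipedia category (namespace 0 = articles only).
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": "Category:" + category, "cmlimit": limit,
              "cmnamespace": 0, "format": "json"}
    reply = requests.get(API, params=params).json()
    return [m["title"] for m in reply["query"]["categorymembers"]]

def lead_paragraph(title):
    # Fetch only the article's lead (summary) section as plain text.
    params = {"action": "query", "prop": "extracts", "exintro": 1,
              "explaintext": 1, "titles": title, "format": "json"}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

# e.g. label everything under an assumed category as "mathematics"
examples = [(lead_paragraph(t), {"mathematics"})
            for t in category_members("Combinatorial optimization")]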
4 Features
4.1 Term Frequency-Inverse Document Frequency
Tf-idf was used because it captures how important a word is to a document in a corpus: the frequency of a word in the document is offset by its frequency in the overall corpus, de-emphasizing common words like “is” and “the.” It is also a common baseline feature for information retrieval tasks.
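As a minimal illustration (assuming scikit-learn, which we used elsewhere in the project; the exact preprocessing options shown are not necessarily ours), the tf-idf design matrix can be built as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["In mathematics, the minimum k-cut is a combinatorial optimization problem.",
        "The film's score was composed for a full orchestra."]

vectorizer = TfidfVectorizer(stop_words="english")   # drops common words such as "is" and "the"
X = vectorizer.fit_transform(docs)                   # sparse (n_docs x vocab_size) tf-idf matrix
print(X.shape)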
4.2 Global Vectors for Word Representation (GloVe)
We used pre-trained GloVe vectors to feed a richer representation of the input text into our models. GloVe represents each word as a vector that encodes the word’s co-occurrence with other words over the global corpus. GloVe vectors are trained by minimizing the following cost function:
J(θ) = (1/2) Σ_{i,j=1}^{W} f(P_ij) (u_i^T v_j − log P_ij)^2

where P_ij is the probability of word i occurring in the context of word j, W is the size of the vocabulary, f is a weighting function, and u_i and v_j are the GloVe vector representations of words i and j respectively.
Our hypothesis for using GloVe vectors was that they would more effectively capture the underlying relationships between
words characterizing the same topic because related words are likely represented by similar vectors. We used pre-trained
GloVe vectors from Stanford’s NLP group (trained on Wikipedia 2014 corpus, 100-dimensions).
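A minimal sketch of loading the pre-trained vectors and embedding a document; the file name matches the public 100-dimensional Wikipedia 2014 release, and the plain mean pooling shown here is for illustration only (Section 5.2 describes the tf-idf-weighted collapse we actually used):

import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    # Each line of the GloVe file is a word followed by its 100 float components.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove()
words = "the minimum k-cut is a combinatorial optimization problem".split()
word_vectors = [glove[w] for w in words if w in glove]
doc_vector = np.mean(word_vectors, axis=0)   # 100-dimensional document representation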
5 Methods
Our overall approach to text classification was to start with a simple baseline model and gradually add more sophisticated features or model architectures to see if and how the results improved. Table 1 summarizes our approaches and rationale.
Table 1: Summary of Approaches and Rationale

Model Approach [features]   | Rationale
Naive Bayes [tf-idf]        | Common baseline model for text classification
OvR SVM [GloVe]             | One-vs-Rest supports multilabel learning; richer feature (GloVe)
LDA with OvR SVM [tf]       | To capture latent topics more effectively
CNN [GloVe]                 | Fast; could identify "shapes of ideas" in NLP context
LSTM [GloVe]                | Effective in learning from sequential experience (e.g. text)
5.1 Naive Bayes
We used Naive Bayes for our baseline implementation. Naive Bayes is often the go-to baseline given its speed and ease of training. We trained scikit-learn’s Naive Bayes for multiclass topic classification, where paragraphs that fit into multiple topics were assigned a single compound label. Since we worked with k = 5 topics, there were 2^5 = 32 possible compound labels to choose from. The input feature was a tf-idf matrix.
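A minimal sketch of this compound-label baseline using scikit-learn's MultinomialNB (the label encoding and preprocessing shown are illustrative assumptions rather than our exact configuration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

TOPICS = ["mathematics", "politics", "computer science", "film", "music"]

def compound_label(topic_set):
    # Encode a set of topics as one of the 2^5 = 32 compound labels (here, a bit string).
    return "".join("1" if t in topic_set else "0" for t in TOPICS)

texts = ["In mathematics, the minimum k-cut is a combinatorial optimization problem ...",
         "The film's score was composed for a full orchestra ..."]
labels = [compound_label({"mathematics", "computer science"}),
          compound_label({"film", "music"})]

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["A new algorithm for partitioning graphs ..."]))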
5.2 One-vs-Rest Support Vector Machine with GloVe
To test our hypothesis that a richer representation of the input text would improve the results, we used GloVe as our feature and fed it into a linear Support Vector Machine (SVM) classifier with the One-vs-Rest (OvR) strategy. OvR was chosen because it enables multilabel classification by constructing k SVMs, where k is the number of classes. Given m training examples (x^(1), y^(1)), ..., (x^(m), y^(m)), where x^(i) ∈ R^n and y^(i) ∈ {0, 1}^k is a vector indicating the classes x^(i) belongs to, the l-th SVM solves the following problem:
min_{w_l, b_l, ξ_l}  (1/2) w_l^T w_l + c Σ_{i=1}^{m} ξ_l^(i)

s.t.  w_l^T x^(i) + b_l ≥ 1 − ξ_l^(i)   if y_l^(i) = 1
      w_l^T x^(i) + b_l ≤ −1 + ξ_l^(i)  if y_l^(i) = 0

where ξ_l^(i) ≥ 0 for all i, w_l ∈ R^n, and c is a penalty parameter. In our implementation, x^(i) was constructed by first vectorizing each word of the i-th example into a 100-dimensional vector using the pre-trained GloVe weights, and then collapsing the matrix into a single vector using tf-idf as the weighting factor.
For a new example x, OvR computes, for each l = 1, ..., k,

y_l = sgn(w_l^T x + b_l)

The key difference from multiclass classification, which appears more commonly in the literature, is that OvR for multilabel classification applies the sgn operation k times, rather than taking a single argmax.
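A minimal sketch of this pipeline with scikit-learn's OneVsRestClassifier and LinearSVC; the collapse function implements the tf-idf-weighted sum of GloVe vectors described above, and glove, train_texts, test_texts, and the (n_docs x 5) indicator matrix Y_train are assumed to be prepared as in earlier sections:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def collapse(texts, glove, dim=100):
    # Collapse each document's GloVe matrix into one vector, weighting words by tf-idf.
    # (For brevity the vectorizer is refit per call; in practice a shared, fitted one would be reused.)
    tfidf = TfidfVectorizer()
    weights = tfidf.fit_transform(texts)              # (n_docs x vocab) tf-idf weights
    vocab = tfidf.get_feature_names_out()
    X = np.zeros((len(texts), dim), dtype=np.float32)
    for i in range(len(texts)):
        row = weights[i].tocoo()
        for j, w in zip(row.col, row.data):
            if vocab[j] in glove:
                X[i] += w * glove[vocab[j]]
    return X

# One LinearSVC per topic; at prediction time each decision function is thresholded
# independently (the sgn step above), so a document can receive several labels.
clf = OneVsRestClassifier(LinearSVC(C=1.0))
clf.fit(collapse(train_texts, glove), Y_train)
Y_pred = clf.predict(collapse(test_texts, glove))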
5.3 Latent Dirichlet Allocation with One-vs-Rest Support Vector Machine
LDA is a 3-level hierarchical Bayesian model often used for topic modeling in NLP. Fully explaining the model is outside the
scope of this paper, but one can get an intuition by observing how a single document w is modeled:
p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

where α and β are parameters, θ ~ Dirichlet(α) is a random variable representing the topic mixture, z_n is the topic of the n-th word of the document, and w_n is the n-th word. The intuition is that each document w is essentially modeled through the terms p(z_n | θ) p(w_n | z_n); in other words, as a random mixture over latent topics, where each topic is in turn characterized by a distribution over words (Blei et al., 2003). Our hypothesis for using LDA was that modeling topics explicitly would help us mine intricate, underlying relationships between words and latent topics in ways classification algorithms alone cannot.
Since LDA is an unsupervised learning model, we used its output—document topic matrix (number of inputs × number of
topics)—as our design matrix X that gets fed into the OvR SVM.
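A minimal sketch of this two-stage setup with scikit-learn (the number of latent topics and other hyperparameters shown are illustrative; train_texts, test_texts, and the binary indicator matrix Y_train are assumed to be prepared as in Section 3):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Per Table 1, LDA operates on raw term counts (tf), not tf-idf.
counts = CountVectorizer(stop_words="english")
TF_train = counts.fit_transform(train_texts)

lda = LatentDirichletAllocation(n_components=5, learning_method="online", random_state=0)
X_train = lda.fit_transform(TF_train)            # (n_docs x 5) document-topic matrix

clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, Y_train)
X_test = lda.transform(counts.transform(test_texts))
Y_pred = clf.predict(X_test)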
5.4 Convolutional Neural Network (CNN)
Convolutional Neural Networks (CNNs) apply layers of convolving filters to features. Traditionally, CNNs have been used for computer vision, but they are increasingly being applied to NLP problems. Instead of applying filters to image pixels, CNNs in NLP apply the filters to matrices representing paragraphs, typically consisting of rows of word embeddings such as word2vec or GloVe.
We will not define CNNs with full mathematical rigor in this paper, but the essence of CNNs can be glimpsed at the convolution layers. Given k-dimensional word embedding vectors x_i for i = 1, ..., n, where n is the number of words in a document, let the document matrix be x_{1:n} ∈ R^{n×k}, in which all n of the x_i’s in the document are stacked together (hence 1:n). Convolution occurs when a filter w ∈ R^{h×k} is applied to a window of h words to produce a new feature. A feature c_i is generated from each window of words x_{i:i+h−1} by calculating

c_i = f(w · x_{i:i+h−1} + b)

where f is a non-linear activation function and b ∈ R is a bias term. The filter w is applied to every possible window of words in the paragraph, producing the n − h + 1 features c_i above (Kim, 2014). Repeating this process with multiple convolution layers in a computer vision context enables the identification of edges and eventually the shapes of objects. The intuition in the NLP context is that, by applying these convolving filters and abstracting away from raw word embeddings, CNNs let us capture the "shapes of topics" in a similar way.
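As a concrete illustration of the computation above, one filter producing its n − h + 1 features can be written in a few lines of numpy (a sketch only; ReLU as the activation f and a region size of h = 3 are arbitrary choices):

import numpy as np

n, k, h = 10, 100, 3                  # words in the document, embedding dimension, window size
x = np.random.randn(n, k)             # document matrix x_{1:n}, one word embedding per row
w = np.random.randn(h, k)             # a single convolving filter
b = 0.0                               # bias term

# c_i = f(w · x_{i:i+h-1} + b) for every window of h consecutive words
c = np.array([max(0.0, np.sum(w * x[i:i + h]) + b) for i in range(n - h + 1)])
print(c.shape)                        # (n - h + 1,) features produced by this one filter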
We chose CNNs for two main reasons. First, we believe CNNs capture latent topics, as LDA does, but in a much more powerful way, as described above. Second, CNNs are fast and relatively easy to train with limited GPU resources, which enabled us to experiment with various hyperparameter settings without extensive infrastructure setup.
The model architecture is a multilabel variant of the CNN architecture proposed in Kim (2014), using a static embedding
layer and an extra fully-connected layer. In order to do multilabel classification, we used a binary cross entropy loss function
and sigmoid activation.
Figure 1: Graph of CNN Model Architecture
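A minimal Keras sketch of such a multilabel CNN; the framework choice, filter sizes, and layer widths here are assumptions for illustration rather than our exact configuration, and glove_matrix is an assumed (vocabulary x 100) array of pre-trained embeddings. The essential pieces are the frozen ("static") embedding layer, the sigmoid output, and the binary cross-entropy loss:

from tensorflow import keras
from tensorflow.keras import layers

VOCAB, DIM, K = 50000, 100, 5          # vocabulary size, GloVe dimension, number of topics

model = keras.Sequential([
    layers.Embedding(VOCAB, DIM,
                     embeddings_initializer=keras.initializers.Constant(glove_matrix),
                     trainable=False),             # static GloVe embeddings
    layers.Conv1D(128, 3, activation="relu"),      # convolving filters over word windows
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),           # extra fully-connected layer
    layers.Dropout(0.5),
    layers.Dense(K, activation="sigmoid"),         # one independent probability per topic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, Y_train, epochs=3, validation_split=0.1)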
5.5 Long Short Term Memory (LSTM)
LSTM is a Recurrent Neural Network (RNN) architecture that is well suited to learning from sequential data when long time lags of unknown size separate important events. It accomplishes this by operating the following gates to control the flow of information:
i_t = σ(W^(i) x_t + U^(i) h_{t−1})
f_t = σ(W^(f) x_t + U^(f) h_{t−1})
o_t = σ(W^(o) x_t + U^(o) h_{t−1})
where the W’s and U’s are parameter matrices. Without going into details, the intuition is that at each time step t, given the input x_t and the hidden state from the previous step h_{t−1}, an LSTM cell can decide whether to input (i_t), forget (f_t), or output (o_t) information based on the gating function σ, thereby enabling the overall network to learn when the "important events" may be close together or far apart in a given sequence of data (Olah, 2015; Socher, 2016).
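A minimal numpy sketch of a single LSTM step in the standard formulation; the bias terms, candidate state, and cell/hidden-state updates below are the usual companions to the three gates shown above, which the equations omit for brevity:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts of parameter matrices/vectors for the gates i, f, o and candidate g.
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate cell state
    c_t = f * c_prev + i * g          # forget part of the old memory, write some new memory
    h_t = o * np.tanh(c_t)            # expose a gated view of the memory as the hidden state
    return h_t, c_t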
We tried LSTM mainly as a benchmark against our CNN results. RNN architectures are widely viewed as the go-to choice for NLP tasks because language is sequential data. Our model is defined by an embedding layer, a single layer of 200 LSTM cells, and a fully connected layer with sigmoid activation. As with the CNN, it is a multilabel variant trained with a binary cross-entropy loss.
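A minimal Keras sketch of this classifier, mirroring the architecture just described (again the framework is an assumption; the embedding layer here is left trainable, the setting Section 6.1 found to help by roughly 10%):

from tensorflow import keras
from tensorflow.keras import layers

VOCAB, DIM, K = 50000, 100, 5

model = keras.Sequential([
    layers.Embedding(VOCAB, DIM),          # embeddings updated during training
    layers.LSTM(200),                      # single layer of 200 LSTM cells
    layers.Dense(K, activation="sigmoid")  # multilabel output, one probability per topic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, Y_train, epochs=5, validation_split=0.1)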
6 Results and Discussion
The overall results from our approaches for paragraph topic classification are summarized in Figure 2 below. There are a few
points worth noting about these results.
Figure 2: Metrics for All Approaches
First, Naive Bayes was very robust despite its simplicity, and given its preeminence in topic classification, this is not at all
surprising.
Second, OvR SVM with GloVe performed very poorly on accuracy. We believe this has to do with the fact that we collapsed
the GloVe document matrix into a vector to feed into the classifier. The initial hypothesis was that the collapsed vectors
would develop and exhibit distinct patterns depending on the topic. However, we suspect they began to converge toward a
similar, overlapping pattern as more data was used. As a reference point, this model achieved much higher accuracy (40%) when trained on only 10,000 samples. Using tf-idf to collapse the GloVe vectors may have been a poor way to reduce dimensionality in this case, as opposed to a proper principal component analysis, in which maximum information is retained as high-dimensional inputs are projected onto a lower-dimensional subspace.
Third, contrary to our initial hypothesis, LDA did not work well for topic prediction. We believe the principal reason is the difficulty of controlling what an unsupervised algorithm learns. The output of LDA we used for classification, the document-topic matrix, is LDA’s best guess at the likelihood of a given document belonging to each of the five clusters it learned to differentiate. A corpus with m documents could be clustered into five groups in 5^m different ways, and it is nearly impossible to align exactly the way an unsupervised learning algorithm clusters them with the way humans do.
Fourth, both neural networks far outperformed the other approaches, with ≈ 97% accuracy. It is noteworthy that CNN and LSTM ended up with very similar metrics, but the CNN was much faster and more robust. The CNN’s training time for the entire dataset averaged 30 minutes on a standard laptop CPU, whereas the LSTM took more than 3 hours on a subset half the size. Lastly, as the next section on hyperparameter tuning discusses, the CNN showed consistently strong results across various hyperparameter settings, whereas the LSTM’s results varied widely with its hyperparameters.
6.1 Hyperparameter Tuning
Our CNN model proved relatively stable over various hyperparameter settings. For example, deeper networks performed
only marginally better, but at much higher train and test time cost. Going from one to three layers, we saw an increase of
≈ 0.5 percentage points for both recall and precision. Changing dropout layers, number of fully-connected layers, number
and size of filters, and activation function between ReLU and tanh did not yield more than 5% difference in accuracy.
On the other hand, the LSTM was very sensitive to the number of epochs, with accuracy going from 40% to 86% as we increased the number of epochs from one to five. The results also varied depending on whether we allowed the embeddings to be updated during training (≈ 10% higher when allowed), as well as on the number of LSTM cells per layer (80% for 100 cells vs. 86% for 200 cells). In the end, however, the LSTM performed marginally better than the CNN.
7 Future Work
We see two main areas of future work. First, we would like to optimize the performance of OvR SVM with GloVe and
LDA to observe how the best versions of these models compare with CNN and LSTM. Specifically, we would consider using
a different way to collapse GloVe vectors or not collapsing at all (instead using input matrices of fixed size with padding) for
the former, and implementing labeled LDA for the latter. Second, we would like to try a different dataset with much finer
topic categories (with tens to hundreds of topics) and observe how the performance of our approaches changes.
References
[1] Ting, S. L., Ip, W. H., Tsang, A. H. (2011). Is Naive Bayes a good classifier for document classification?. International Journal of
Software Engineering and Its Applications, 5(3), 37-46.
[2] Yoo, J. Y., Yang, D. (2015). Classification Scheme of Unstructured Text Document using TF-IDF and Naive Bayes Classifier.
[3] Wang, S., Manning, C. D. (2012, July). Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings
of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2 (pp. 90-94). Association for
Computational Linguistics.
[4] Blei, D. M. (2012). Introduction to Probabilistic Topic Models.
[5] Ramage, D., Hall, D., Nallapati, R., Manning, C. D. (2009, August). Labeled LDA: A supervised topic model for credit attribution
in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume
1-Volume 1 (pp. 248-256). Association for Computational Linguistics.
[6] Tan, C. M., Wang, Y. F., Lee, C. D. (2002). The use of bigrams to enhance text categorization. Information processing management,
38(4), 529-546.
[7] Blei, D. M., Ng, A. Y., Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022.
[8] Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
[9] Zhang, X., Zhao, J., LeCun, Y. (2015). Character-level convolutional networks for text classification. In Advances in Neural Information
Processing Systems (pp. 649-657).
[10] Johnson, R., Zhang, T. (2014). Effective use of word order for text categorization with convolutional neural networks. arXiv preprint
arXiv:1412.1058.
[11] Olah, C. (2015). Understanding LSTM Networks. Retrieved December 05, 2016, from http://colah.github.io/posts/2015-08-
Understanding-LSTMs/
[12] Socher, R. (2016). Lecture 8: Recap, Projects and Fancy Recurrent Neural Networks for Machine Translation. Lecture presented at CS
224D. Retrieved December 05, 2016, from http://cs224d.stanford.edu/lectures/CS224d-Lecture9.pdf