Paragraph Topic Classification
Eugene Nho
Graduate School of Business
Stanford University
Stanford, CA 94305
enho@stanford.edu
Edward Ng
Department of Electrical Engineering
Stanford University
Stanford, CA 94305
edjng@stanford.edu
1 Introduction
This project implements and compares various approaches for predicting the topics of paragraph-length texts.1
The motivation for the project is the widening information-processing gap facing modern decision makers: text is generated at an ever-increasing rate, while the human cognitive capacity to ingest it remains largely constant. Machine learning can help close this gap by automatically separating contextually relevant, useful text from the rest. We see paragraph topic classification as a first step toward that future.
We formally define the problem as constructing a model that, given a corpus with k topics, outputs a k-dimensional vector
for each paragraph-length input, indicating which topics the input is about. Note that this is a multilabel classification problem, in which an input can belong to multiple topics. Approximately 315,000 Wikipedia article summaries, categorized by
users, were used as data. We employed and compared the following approaches, spanning a wide range of complexity and
computational requirements: Naive Bayes, One-vs-Rest Support Vector Machine (OvR SVM) with GloVe vectors, Latent
Dirichlet Allocation (LDA) with OvR SVM, Convolutional Neural Networks (CNN), and Long Short Term Memory networks
(LSTM).
2 Related Work
2.1 Traditional Approaches
In topic classification with traditional machine learning, Naive Bayes has been preeminent because it is both accurate and quick to train in many contexts (Ting et al., 2011). Furthermore, many variations of Naive Bayes have been developed in recent years. For example, an algorithm developed by Wang and Manning (2012) uses Naive Bayes log-count ratios as features for an SVM model (known as NB-SVM) and uses bigrams rather than term frequency-inverse document frequency (tf-idf). This combined model (with a small modification, known as NB-LM) is considered among the state-of-the-art supervised learning algorithms for topic classification.
2.2 Topic Modeling
Topic modeling is another area of active research. Topic models are statistical models used to discover the topics that occur in documents. Latent Dirichlet Allocation (LDA) is the simplest topic model and provides the foundation for many later extensions, which relax some of LDA's assumptions and incorporate other types of metadata (Blei, 2012). Because LDA is unsupervised, labeled LDA was developed to deploy LDA in a supervised learning context (Ramage et al., 2009). While labeled LDA has performance comparable to SVM, it is not considered state-of-the-art.
1
This project was conducted for CS 229: Machine Learning at Stanford University during Autumn Quarter 2016, and shared the code
base with a project for CS 221: Artificial Intelligence: Principles and Techniques, where Sigtryggur Kjartansson was also an author.
2.3 Neural Networks
The recent surge of interest in neural networks has spilled over to topic classification. Convolutional neural networks (CNN), initially used for image classification, have been adapted to natural language processing (NLP) and are considered state-of-the-art for topic classification (Kim, 2014). Many variations of CNN have been developed since then, including one that uses individual characters as inputs (Zhang et al., 2015) and another that trains embeddings through parallel convolution layers with differing region sizes (Johnson and Zhang, 2014). However, the increased complexity of neural network topology requires significantly larger training datasets. For example, according to Zhang et al. (2015), character-level CNNs need datasets on the order of several million examples to outperform traditional approaches.
3 Dataset
We used Wikipedia article summaries as our data. Each article’s first summary paragraph served as our input, and its topic assignments as our labels. We scraped Wikipedia for a total of approximately 315,000 inputs, all belonging to one or more of the following five topics: mathematics, politics, computer science, film, and music.
We limited our scope to five topics so that we could test the efficacy of our models without requiring an unwieldy amount of data and, consequently, of computational resources. Our training and test sets contained 275,000 and 40,000 entries respectively, with an average word count of 93 per summary.
We used Wikipedia because it is one of the few available sources of topically labeled, multilabel data. We insisted on solving a multilabel problem because, in real-world applications, a piece of text is rarely about only one topic. An example entry is:
In mathematics, the minimum k-cut, is a combinatorial optimization problem that requires finding a set of
edges whose removal would partition the graph to k connected components. These edges are referred to
as k-cut. The goal is to find the minimum-weight k-cut. This partitioning can have applications in VLSI
design, data-mining, finite elements and communication in parallel computing.
(Labeled as mathematics, computer science)
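For reference, summaries like the one above can be pulled programmatically from the MediaWiki API. The sketch below is illustrative only, not our exact scraping pipeline; the endpoint and query parameters are standard MediaWiki ones, while the category name and the mapping from categories to our five topics are assumptions.

import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category, limit=50):
    # List article titles in a Wikipedia category (namespace 0 = articles only).
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": "Category:" + category, "cmlimit": limit,
              "cmnamespace": 0, "format": "json"}
    reply = requests.get(API, params=params).json()
    return [m["title"] for m in reply["query"]["categorymembers"]]

def lead_paragraph(title):
    # Fetch only the article's lead (summary) section as plain text.
    params = {"action": "query", "prop": "extracts", "exintro": 1,
              "explaintext": 1, "titles": title, "format": "json"}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

# e.g. label everything under an assumed category as "mathematics"
examples = [(lead_paragraph(t), {"mathematics"})
            for t in category_members("Combinatorial optimization")]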
4 Features
4.1 Term Frequency-Inverse Document Frequency
Tf-idf was used because it captures how important a word is to a document in a corpus: the frequency of a word in the document is offset by its frequency in the overall corpus, de-emphasizing common words like “is” and “the.” It is also a common baseline feature for information retrieval tasks.
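As a minimal illustration (assuming scikit-learn, which we used elsewhere in the project; the exact preprocessing options shown are not necessarily ours), the tf-idf design matrix can be built as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["In mathematics, the minimum k-cut is a combinatorial optimization problem.",
        "The film's score was composed for a full orchestra."]

vectorizer = TfidfVectorizer(stop_words="english")   # drops common words such as "is" and "the"
X = vectorizer.fit_transform(docs)                   # sparse (n_docs x vocab_size) tf-idf matrix
print(X.shape)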
4.2 Global Vectors for Word Representation (GloVe)
We used pre-trained GloVe vectors to feed a richer representation of the input text into our models. GloVe represents each word as a vector that encodes the word’s co-occurrence with other words over the global corpus. GloVe vectors are trained by minimizing the following cost function:
J(θ) = (1/2) Σ_{i,j=1}^{W} f(P_ij) (u_i^T v_j − log P_ij)^2

where P_ij is the probability of word i occurring in the context of word j, W is the size of the vocabulary, f is a weighting function, and u_i and v_j are the GloVe vector representations of words i and j respectively.
Our hypothesis for using GloVe vectors was that they would more effectively capture the underlying relationships between
words characterizing the same topic because related words are likely represented by similar vectors. We used pre-trained
GloVe vectors from Stanford’s NLP group (trained on Wikipedia 2014 corpus, 100-dimensions).
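A minimal sketch of loading the pre-trained vectors and embedding a document; the file name matches the public 100-dimensional Wikipedia 2014 release, and the plain mean pooling shown here is for illustration only (Section 5.2 describes the tf-idf-weighted collapse we actually used):

import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    # Each line of the GloVe file is a word followed by its 100 float components.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove()
words = "the minimum k-cut is a combinatorial optimization problem".split()
word_vectors = [glove[w] for w in words if w in glove]
doc_vector = np.mean(word_vectors, axis=0)   # 100-dimensional document representation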
5 Methods
Our overall approach to text classification was to start with a simple baseline model and gradually add more sophisticated features or model architectures to see if and how the results improved. Table 1 summarizes our approaches and rationale.
Table 1: Summary of Approaches and Rationale

Model Approach [features]   | Rationale
Naive Bayes [tf-idf]        | Common baseline model for text classification
OvR SVM [GloVe]             | One-vs-Rest supports multilabel learning; richer feature (GloVe)
LDA with OvR SVM [tf]       | To capture latent topics more effectively
CNN [GloVe]                 | Fast; could identify "shapes of ideas" in NLP context
LSTM [GloVe]                | Effective in learning from sequential experience (e.g. text)
5.1 Naive Bayes
We used Naive Bayes for our baseline implementation. Naive Bayes is often the go-to baseline given its speed and ease of training. We trained scikit-learn’s Naive Bayes for multiclass topic classification, where paragraphs that fit into multiple topics were assigned a single compound label. Since we worked with k = 5 topics, there were 2^5 = 32 possible compound labels to choose from. The input feature was a tf-idf matrix.
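A minimal sketch of this compound-label baseline using scikit-learn's MultinomialNB (the label encoding and preprocessing shown are illustrative assumptions rather than our exact configuration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

TOPICS = ["mathematics", "politics", "computer science", "film", "music"]

def compound_label(topic_set):
    # Encode a set of topics as one of the 2^5 = 32 compound labels (here, a bit string).
    return "".join("1" if t in topic_set else "0" for t in TOPICS)

texts = ["In mathematics, the minimum k-cut is a combinatorial optimization problem ...",
         "The film's score was composed for a full orchestra ..."]
labels = [compound_label({"mathematics", "computer science"}),
          compound_label({"film", "music"})]

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["A new algorithm for partitioning graphs ..."]))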
5.2 One-vs-Rest Support Vector Machine with GloVe
To test our hypothesis that a richer representation of the input text would improve the results, we used GloVe as our feature and fed it into a linear Support Vector Machine (SVM) classifier with the One-vs-Rest (OvR) strategy. OvR was chosen because it enables multilabel classification by constructing k SVMs, where k is the number of classes. Given m training examples (x^(1), y^(1)), ..., (x^(m), y^(m)), where x^(i) ∈ R^n and y^(i) ∈ {0, 1}^k is a vector indicating the classes x^(i) belongs to, the l-th SVM solves the following problem:
min_{w_l, b_l, ξ_l}  (1/2) w_l^T w_l + c Σ_{i=1}^{m} ξ_l^(i)

s.t.  w_l^T x^(i) + b_l ≥ 1 − ξ_l^(i)   if y_l^(i) = 1
      w_l^T x^(i) + b_l ≤ −1 + ξ_l^(i)  if y_l^(i) = 0

where ξ_l^(i) ≥ 0 for all i, w_l ∈ R^n, and c is a penalty parameter. In our implementation, x^(i) was constructed by first vectorizing each word of the i-th example into a 100-dimensional vector using the pre-trained GloVe weights, and then collapsing the matrix into a single vector using tf-idf as the weighting factor.
For a new example x, OvR computes, for each l = 1, ..., k,

y_l = sgn(w_l^T x + b_l)

The key difference from multiclass classification, which appears more commonly in the literature, is that OvR for multilabel classification applies the sgn operation k times, rather than taking a single argmax.
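A minimal sketch of this pipeline with scikit-learn's OneVsRestClassifier and LinearSVC; the collapse function implements the tf-idf-weighted sum of GloVe vectors described above, and glove, train_texts, test_texts, and the (n_docs x 5) indicator matrix Y_train are assumed to be prepared as in earlier sections:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def collapse(texts, glove, dim=100):
    # Collapse each document's GloVe matrix into one vector, weighting words by tf-idf.
    # (For brevity the vectorizer is refit per call; in practice a shared, fitted one would be reused.)
    tfidf = TfidfVectorizer()
    weights = tfidf.fit_transform(texts)              # (n_docs x vocab) tf-idf weights
    vocab = tfidf.get_feature_names_out()
    X = np.zeros((len(texts), dim), dtype=np.float32)
    for i in range(len(texts)):
        row = weights[i].tocoo()
        for j, w in zip(row.col, row.data):
            if vocab[j] in glove:
                X[i] += w * glove[vocab[j]]
    return X

# One LinearSVC per topic; at prediction time each decision function is thresholded
# independently (the sgn step above), so a document can receive several labels.
clf = OneVsRestClassifier(LinearSVC(C=1.0))
clf.fit(collapse(train_texts, glove), Y_train)
Y_pred = clf.predict(collapse(test_texts, glove))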
5.3 Latent Dirichlet Allocation with One-vs-Rest Support Vector Machine
LDA is a 3-level hierarchical Bayesian model often used for topic modeling in NLP. Fully explaining the model is outside the
scope of this paper, but one can get an intuition by observing how a single document w is modeled:
p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ) dθ

where α and β are parameters, θ ~ Dirichlet(α) is a random variable representing the topic mixture, z_n is the topic of the n-th word of the document, and w_n is the n-th word. The intuition is that each document w is essentially modeled through the terms p(z_n | θ) p(w_n | z_n); in other words, as a random mixture over latent topics, where each topic is in turn characterized by a distribution over words (Blei et al., 2003). Our hypothesis for using LDA was that modeling topics explicitly would help us mine intricate, underlying relationships between words and latent topics in ways classification algorithms alone cannot.
Since LDA is an unsupervised learning model, we used its output—document topic matrix (number of inputs × number of
topics)—as our design matrix X that gets fed into the OvR SVM.
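A minimal sketch of this two-stage setup with scikit-learn (the number of latent topics and other hyperparameters shown are illustrative; train_texts, test_texts, and the binary indicator matrix Y_train are assumed to be prepared as in Section 3):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Per Table 1, LDA operates on raw term counts (tf), not tf-idf.
counts = CountVectorizer(stop_words="english")
TF_train = counts.fit_transform(train_texts)

lda = LatentDirichletAllocation(n_components=5, learning_method="online", random_state=0)
X_train = lda.fit_transform(TF_train)            # (n_docs x 5) document-topic matrix

clf = OneVsRestClassifier(LinearSVC())
clf.fit(X_train, Y_train)
X_test = lda.transform(counts.transform(test_texts))
Y_pred = clf.predict(X_test)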
5.4 Convolutional Neural Network (CNN)
Convolutional Neural Networks (CNNs) apply layers of convolving filters to features. Traditionally, CNNs have been used for computer vision, but they are increasingly being applied to NLP problems. Instead of applying filters to image pixels, CNNs in NLP apply the filters to matrices representing paragraphs, typically consisting of rows of word embeddings such as word2vec or GloVe.
We will not define CNNs with full mathematical rigor in this paper, but the essence of CNNs can be glimpsed at the convolution layers. Given k-dimensional word embedding vectors x_i for i = 1, ..., n, where n is the number of words in a document, let the document matrix be x_{1:n} ∈ R^{n×k}, in which all n of the x_i’s in the document are stacked together (hence 1:n). Convolution occurs when a filter w ∈ R^{h×k} is applied to a window of h words to produce a new feature. A feature c_i is generated from each window of words x_{i:i+h−1} by calculating

c_i = f(w · x_{i:i+h−1} + b)

where f is a non-linear activation function and b ∈ R is a bias term. The filter w is applied to every possible window of words in the paragraph, producing the n − h + 1 features c_i above (Kim, 2014). Repeating this process with multiple convolution layers in a computer vision context enables the identification of edges and eventually the shapes of objects. The intuition in the NLP context is that, by applying these convolving filters and abstracting away from raw word embeddings, CNNs let us capture the "shapes of topics" in a similar way.
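As a concrete illustration of the computation above, one filter producing its n − h + 1 features can be written in a few lines of numpy (a sketch only; ReLU as the activation f and a region size of h = 3 are arbitrary choices):

import numpy as np

n, k, h = 10, 100, 3                  # words in the document, embedding dimension, window size
x = np.random.randn(n, k)             # document matrix x_{1:n}, one word embedding per row
w = np.random.randn(h, k)             # a single convolving filter
b = 0.0                               # bias term

# c_i = f(w · x_{i:i+h-1} + b) for every window of h consecutive words
c = np.array([max(0.0, np.sum(w * x[i:i + h]) + b) for i in range(n - h + 1)])
print(c.shape)                        # (n - h + 1,) features produced by this one filter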
We chose CNNs for two main reasons. First, we believe CNNs capture latent topics, as LDA does, but in a much more powerful way, as described above. Second, CNNs are fast and relatively easy to train with limited GPU resources, which enabled us to experiment with various hyperparameter settings without extensive infrastructure setup.
The model architecture is a multilabel variant of the CNN architecture proposed in Kim (2014), using a static embedding
layer and an extra fully-connected layer. In order to do multilabel classification, we used a binary cross entropy loss function
and sigmoid activation.
Figure 1: Graph of CNN Model Architecture
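A minimal Keras sketch of such a multilabel CNN; the framework choice, filter sizes, and layer widths here are assumptions for illustration rather than our exact configuration, and glove_matrix is an assumed (vocabulary x 100) array of pre-trained embeddings. The essential pieces are the frozen ("static") embedding layer, the sigmoid output, and the binary cross-entropy loss:

from tensorflow import keras
from tensorflow.keras import layers

VOCAB, DIM, K = 50000, 100, 5          # vocabulary size, GloVe dimension, number of topics

model = keras.Sequential([
    layers.Embedding(VOCAB, DIM,
                     embeddings_initializer=keras.initializers.Constant(glove_matrix),
                     trainable=False),             # static GloVe embeddings
    layers.Conv1D(128, 3, activation="relu"),      # convolving filters over word windows
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),           # extra fully-connected layer
    layers.Dropout(0.5),
    layers.Dense(K, activation="sigmoid"),         # one independent probability per topic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, Y_train, epochs=3, validation_split=0.1)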
5.5 Long Short Term Memory (LSTM)
LSTM is a Recurrent Neural Network (RNN) architecture that is well suited to learning from sequential data when long time lags of unknown size separate important events. It accomplishes this by operating the following gates to control the flow of information:
i_t = σ(W^(i) x_t + U^(i) h_{t−1})
f_t = σ(W^(f) x_t + U^(f) h_{t−1})
o_t = σ(W^(o) x_t + U^(o) h_{t−1})
where the W’s and U’s are parameter matrices. Without going into details, the intuition is that at each time step t, given the input x_t and the hidden state from the previous step h_{t−1}, an LSTM cell can decide whether to input (i_t), forget (f_t), or output (o_t) information based on the gating function σ, thereby enabling the overall network to learn when the "important events" may be close together or far apart in a given sequence of data (Olah, 2015; Socher, 2016).
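A minimal numpy sketch of a single LSTM step in the standard formulation; the bias terms, candidate state, and cell/hidden-state updates below are the usual companions to the three gates shown above, which the equations omit for brevity:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts of parameter matrices/vectors for the gates i, f, o and candidate g.
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate cell state
    c_t = f * c_prev + i * g          # forget part of the old memory, write some new memory
    h_t = o * np.tanh(c_t)            # expose a gated view of the memory as the hidden state
    return h_t, c_t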
We tried LSTM mainly as a benchmark against our CNN results. RNN architectures are widely viewed as the go-to choice for NLP tasks because language is sequential data. Our model is defined by an embedding layer, a single layer of 200 LSTM cells, and a fully connected layer with sigmoid activation. As with the CNN, it is a multilabel variant trained with a binary cross-entropy loss.
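A minimal Keras sketch of this classifier, mirroring the architecture just described (again the framework is an assumption; the embedding layer here is left trainable, the setting Section 6.1 found to help by roughly 10%):

from tensorflow import keras
from tensorflow.keras import layers

VOCAB, DIM, K = 50000, 100, 5

model = keras.Sequential([
    layers.Embedding(VOCAB, DIM),          # embeddings updated during training
    layers.LSTM(200),                      # single layer of 200 LSTM cells
    layers.Dense(K, activation="sigmoid")  # multilabel output, one probability per topic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, Y_train, epochs=5, validation_split=0.1)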
6 Results and Discussion
The overall results from our approaches for paragraph topic classification are summarized in Figure 2 below. There are a few
points worth noting about these results.
Figure 2: Metrics for All Approaches
First, Naive Bayes was very robust despite its simplicity, and given its preeminence in topic classification, this is not at all
surprising.
Second, OvR SVM with GloVe performed very poorly on accuracy. We believe this has to do with the fact that we collapsed
the GloVe document matrix into a vector to feed into the classifier. The initial hypothesis was that the collapsed vectors
would develop and exhibit distinct patterns depending on the topic. However, we suspect they began to converge toward a
similar, overlapping pattern as more data was used. As a reference point, this model achieved much higher accuracy (40%) when trained on only 10,000 samples. Using tf-idf to collapse the GloVe vectors may have been a poor way to reduce dimensionality in this case, as opposed to a proper principal component analysis, in which maximum information is retained as high-dimensional inputs are projected onto a lower-dimensional subspace.
Third, contrary to our initial hypothesis, LDA did not work well for topic prediction. We believe the principal reason is the difficulty of controlling what an unsupervised algorithm learns. The output of LDA we used for classification, the document-topic matrix, is LDA’s best guess at the likelihood of a given document belonging to each of the five clusters it learned to differentiate. A corpus with m documents could be clustered into five groups in 5^m different ways, and it is nearly impossible to align exactly the way an unsupervised learning algorithm clusters them with the way humans do.
Fourth, both neural networks far outperformed the other approaches, with ≈ 97% accuracy. It is noteworthy that CNN and LSTM ended up with very similar metrics, but the CNN was much faster and more robust. The CNN’s training time for the entire dataset averaged 30 minutes on a standard laptop CPU, whereas the LSTM took more than 3 hours on a subset half the size. Lastly, as the next section on hyperparameter tuning discusses, the CNN showed consistently strong results across various hyperparameter settings, whereas the LSTM’s results varied widely with its hyperparameters.
6.1 Hyperparameter Tuning
Our CNN model proved relatively stable over various hyperparameter settings. For example, deeper networks performed
only marginally better, but at much higher train and test time cost. Going from one to three layers, we saw an increase of
≈ 0.5 percentage points for both recall and precision. Changing dropout layers, number of fully-connected layers, number
and size of filters, and activation function between ReLU and tanh did not yield more than 5% difference in accuracy.
On the other hand, the LSTM was very sensitive to the number of epochs, with accuracy going from 40% to 86% as we increased the number of epochs from one to five. The results also varied depending on whether we allowed the embeddings to be updated during training (≈ 10% higher when allowed), as well as on the number of LSTM cells per layer (80% for 100 cells vs. 86% for 200 cells). In the end, however, the LSTM performed marginally better than the CNN.
7 Future Work
We see two main areas of future work. First, we would like to optimize the performance of OvR SVM with GloVe and
LDA to observe how the best versions of these models compare with CNN and LSTM. Specifically, we would consider using
a different way to collapse GloVe vectors or not collapsing at all (instead using input matrices of fixed size with padding) for
the former, and implementing labeled LDA for the latter. Second, we would like to try a different dataset with much finer
topic categories (with tens to hundreds of topics) and observe how the performance of our approaches changes.
References
[1] Ting, S. L., Ip, W. H., Tsang, A. H. (2011). Is Naive Bayes a good classifier for document classification?. International Journal of
Software Engineering and Its Applications, 5(3), 37-46.
[2] Yoo, J. Y., Yang, D. (2015). Classification Scheme of Unstructured Text Document using TF-IDF and Naive Bayes Classifier.
[3] Wang, S., Manning, C. D. (2012, July). Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings
of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2 (pp. 90-94). Association for
Computational Linguistics.
[4] Blei, D. M. (2012). Introduction to Probabilistic Topic Models.
[5] Ramage, D., Hall, D., Nallapati, R., Manning, C. D. (2009, August). Labeled LDA: A supervised topic model for credit attribution
in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume
1-Volume 1 (pp. 248-256). Association for Computational Linguistics.
[6] Tan, C. M., Wang, Y. F., Lee, C. D. (2002). The use of bigrams to enhance text categorization. Information processing management,
38(4), 529-546.
[7] Blei, D. M., Ng, A. Y., Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022.
[8] Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
[9] Zhang, X., Zhao, J., LeCun, Y. (2015). Character-level convolutional networks for text classification. In Advances in Neural Information
Processing Systems (pp. 649-657).
[10] Johnson, R., Zhang, T. (2014). Effective use of word order for text categorization with convolutional neural networks. arXiv preprint
arXiv:1412.1058.
[11] Olah, C. (2015). Understanding LSTM Networks. Retrieved December 05, 2016, from http://colah.github.io/posts/2015-08-
Understanding-LSTMs/
[12] Socher, R. (2016). Lecture 8: Recap, Projects and Fancy Recurrent Neural Networks for Machine Translation. Lecture presented at CS
224D. Retrieved December 05, 2016, from http://cs224d.stanford.edu/lectures/CS224d-Lecture9.pdf