A Study of Deep Learning on Sentence Classification
By: Trinh Hoang-Trieu
Supervisor: Nguyen Le-Minh
Nguyen lab, School of Information Science
Japan Advanced Institute of Science and Technology
thtrieu@apcs.vn
Abstract
The author studies different deep learning models on the task of sentence classification, using the TREC dataset. The models under investigation are: the Convolutional Neural Network proposed by Yoon Kim, a Long Short-Term Memory network, a Long Short-Term Memory network on top of a convolutional layer, and a convolutional network augmented with max-pooling position. Results show that architectures with a time-retaining mechanism work better than those without one. The author also proposes changes to the conventional final layer, namely replacing the last fully connected layer with a Linear Support Vector Machine, or replacing the discriminative nature of the last layer with a generative one.
1. TREC dataset
The TREC dataset contains 5,500 labeled questions in the training set and another 500 in the test set. The dataset has 6 coarse labels and 50 fine-grained (level-2) labels. The average sentence length is 10 words and the vocabulary size is about 8,700. A sample from TREC follows:
DESC:manner How did serfdom develop in and then leave Russia ?
ENTY:cremat What films featured the character Popeye Doyle ?
DESC:manner How can I find a list of celebrities ' real names ?
ENTY:animal What fowl grabs the spotlight after the Chinese Year of the Monkey ?
ABBR:exp What is the full form of .com ?
HUM:ind What contemptible scoundrel stole the cork from my lunch ?
We also experiment on the Vietnamese TREC dataset, which is a direct translation of the TREC dataset and therefore has very similar statistics.
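As the sample above shows, each line carries a coarse label, a fine-grained label, and the question text. A minimal parsing sketch (the function name is illustrative only, not part of any released tooling):

```python
def parse_trec_line(line):
    """Split 'DESC:manner How did serfdom ...' into (coarse, fine, question)."""
    label, question = line.split(' ', 1)
    coarse, fine = label.split(':')
    return coarse, fine, question.strip()

parse_trec_line("DESC:manner How did serfdom develop in and then leave Russia ?")
# -> ('DESC', 'manner', 'How did serfdom develop in and then leave Russia ?')
```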
2. Word2Vec dataset
In this experiment, we also used another dataset to support training. It consists of pre-trained 300-dimensional embeddings of 3 million words and phrases [1]. These embeddings are used as the input representation for training, and they cover 7,500 of the 8,700 words in the TREC vocabulary.
For the Vietnamese TREC dataset, we used another set of pre-trained embeddings, which covers 4,600 of the approximately 8,700 words in that dataset's vocabulary.
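A sketch of how the pre-trained vectors of [1] can be loaded and their coverage of the TREC vocabulary checked, assuming gensim and the usual GoogleNews-vectors-negative300.bin file; trec_vocab is a placeholder for the set of TREC words built elsewhere:

```python
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
covered = sum(1 for w in trec_vocab if w in w2v)  # trec_vocab: set of TREC words (assumed)
print(covered, '/', len(trec_vocab))              # roughly 7,500 / 8,700 according to the write-up
```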
3. Convolutional Neural Networks for Sentence Classification
Yoon Kim proposes a simple yet effective convolutional architecture that achieves remarkable results on several datasets [2]. The network consists of three layers. The first uses 300 convolutional filters of varying sizes to detect 300 different features across the sentence's length. The second performs a maximum operation to summarise the previous detection. The last is a fully connected layer with softmax output.
Figure 1. Convolutional neural network for sentence classification
The details of this network are as follows (a sketch is given after the list):
• Layer 1: 3 window sizes, 3x300, 4x300, and 5x300, with 100 feature maps each.
• Layer 2: max-pool-over-time.
• Layer 3: fully connected with softmax output; weight norm constrained to 3.
▪ Dropout with p = 0.5
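A minimal sketch of this architecture, assuming tf.keras; values not stated in the write-up (fixed sequence length for batching, padded inputs) are placeholders:

```python
import tensorflow as tf

MAX_LEN, VOCAB_SIZE, EMB_DIM, NUM_CLASSES = 40, 8700, 300, 6  # placeholders

inp = tf.keras.Input(shape=(MAX_LEN,), dtype='int32')
emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(inp)          # 300-d word vectors
pooled = []
for k in (3, 4, 5):                                                # window sizes 3x300, 4x300, 5x300
    conv = tf.keras.layers.Conv1D(100, k, activation='relu')(emb)  # 100 feature maps per window size
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))      # max-pool-over-time
feat = tf.keras.layers.Dropout(0.5)(tf.keras.layers.Concatenate()(pooled))
out = tf.keras.layers.Dense(NUM_CLASSES, activation='softmax',
                            kernel_constraint=tf.keras.constraints.MaxNorm(3))(feat)
model = tf.keras.Model(inp, out)
```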
4. First experiment: Long Short Term Memory
The author first experiments with a simple Long Short-Term Memory architecture. We use the many-to-one scheme to extract features before discriminating on these extracted features with a fully connected layer.
Figure 2. Many-to-one structure.
Source:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
The details of this network are as follows (a sketch is given after the list):
- First layer: word embedding
- Second layer: Long Short-Term Memory
- Third layer: fully connected with softmax output
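A minimal sketch of this stack, again assuming tf.keras; the hidden size (300) is an assumption, as the write-up does not state it:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(8700, 300),           # word embedding layer
    tf.keras.layers.LSTM(300),                      # many-to-one: only the final state is kept
    tf.keras.layers.Dense(6, activation='softmax')  # 6 TREC classes
])
```

The Gated Recurrent Unit variant of Section 4.a simply swaps tf.keras.layers.LSTM for tf.keras.layers.GRU; the peephole variant of Section 4.c requires a custom recurrent cell.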
For the second layer, we experimented with three different variations of the Long Short-Term Memory cell:
4.a. Gated Recurrent Unit
Figure 3: Gated Recurrent Unit
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
4.b. Basic Long Short Term Memory
Figure 4: Basic Long Short Term Memory
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
4.c. Long Short Term Memory with peepholes
Figure 5: Long Short Term Memory with peepholes
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
5. Long Short Term Memory on top of a Convolutional layer
Instead of max-pool-over-time, the author proposes a softer version in which the maximum operation is performed over a smaller region of the feature maps. Temporal information is thereby better retained: the max-pool layer no longer produces a single value but a sequence. The output of this layer is then fed into a Long Short-Term Memory cell with the many-to-one scheme, followed by a fully connected layer with softmax output. The author also experimented with the three aforementioned types of LSTM cell.
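A minimal sketch of this hybrid, assuming tf.keras; the filter width, filter count, local pool size, and LSTM size are assumptions (only the "convolution, local max-pool, LSTM, softmax" ordering is stated above):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(8700, 300),
    tf.keras.layers.Conv1D(100, 3, activation='relu'),  # convolutional feature maps
    tf.keras.layers.MaxPooling1D(pool_size=3),          # max over small local regions, keeps a sequence
    tf.keras.layers.LSTM(100),                          # many-to-one over the pooled sequence
    tf.keras.layers.Dense(6, activation='softmax')
])
```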
Figure 6: Long Short Term Memory on top of a Convolutional Layer
6. Convolutional network augmented with max-pooling position
In this network, besides performing the maximum operation at the max-pooling layer, we also apply the argmax function to produce the position at which this maximum value occurs. By supplying this position information, the author expects better performance, since this is also a time-retaining mechanism, only without any recurrent connection.
To be able to propagate gradient information through the argmax operation, the author approximates it with a differentiable function:
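One standard differentiable surrogate, consistent with the "very big constant" mentioned in Section 8, is the softmax-weighted position (this particular form is an assumption, not taken from the write-up):

$$\operatorname{argmax}_i\, x_i \;\approx\; \sum_{i=1}^{n} i \cdot \frac{e^{\beta x_i}}{\sum_{j=1}^{n} e^{\beta x_j}}, \qquad \beta \gg 1.$$

As β grows, the softmax weights concentrate on the maximal position, but the gradients become increasingly ill-behaved, which matches the instability reported in Section 8.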
Figure 7: Convolutional network with max-pooling position
7. Hyper-parameter tuning and Experimental details
The author uses 10-fold cross-validation on the training set for hyper-parameter tuning. The word embedding input is kept static throughout all models except the first one, in order to replicate Yoon Kim's experiment. Since the average sentence length is only 10 while the maximum length in the dataset is approximately 40, we also did not use padding, to avoid feeding the networks too much noise (otherwise, there would on average be about 30 padding words for every 10 real words).
Words that do not appear in the pre-trained embedding dataset are initialised with variance 0.25 to match the existing words. For the training procedure, we hold out 5% of the data for early stopping, since the dataset is not large. Training is carried out in batches of size 50 to speed up convergence.
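A sketch of this input preparation; vocab (word to index) and w2v (the loaded vectors from Section 2) are assumed to exist, and drawing unseen words from U(-0.25, 0.25), as in Kim's reference implementation, is an assumption about what "variance 0.25" means here:

```python
import numpy as np

emb = np.empty((len(vocab), 300), dtype=np.float32)
for word, i in vocab.items():
    # pre-trained vector if available, otherwise a small random vector
    emb[i] = w2v[word] if word in w2v else np.random.uniform(-0.25, 0.25, 300)

# 5% of the training data held out for early stopping, mini-batches of 50, e.g.:
# model.fit(x_train, y_train, validation_split=0.05, batch_size=50)
```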
8. Results
The LSTM cell with peepholes gives the best result when a single recurrent layer is used (described in Section 4), while the Gated Recurrent Unit gives the best result in the hybrid model of convolutional and recurrent connections (described in Section 5).
The performance of the convolutional network augmented with position is very unstable due to the presence of the argmax operator. In our implementation, argmax is approximated with a function involving a very big constant, which in turn results in unreliable gradients and poor results. In this model, the final layer is a fully connected layer, which is also not the best choice, since classification is unlikely to be a linear function of the features and their positions. The other results are reported as follows:
Model                              TREC    TRECvn
CNN (implemented on our system)    93      91.8
CNN-LSTM                           94.2    92.8
LSTM                               95.4    94.2
Table 1. Accuracy (%) of different models on the two datasets
9. Conclusion
As can be seen, models that utilise temporal information give better results than those that do not.
10. Future work: Support Vector Machine as the final layer
In all of the proposed models, the final layer is a fully connected layer with softmax output, which is equivalent to a linear classifier. In this case, the previous layers act as a feature extractor. The good performance reported above indicates that this feature extractor produces useful features, i.e. points that can be separated well by a simple linear classifier. It is highly likely that these features are also useful for other kinds of linear classifiers.
The author proposes using a Linear Support Vector Machine as the final layer; this can be done by simply replacing the usual cross-entropy loss function with the Linear Support Vector Machine's loss function.
Namely, instead of the usual cross-entropy loss on the softmax output (whose input is the feature vector extracted by the previous layers), the final layer is trained with the Support Vector Machine's hinge loss, written in terms of the ReLU (Rectified Linear Unit) function, where the target t is either -1 or 1. This loss corresponds to the one-vs-rest scheme, since TREC is a multi-class dataset. A Linear Support Vector Machine with a soft margin additionally regularises the weight vector and scales the hinge term by a penalty constant. Both loss functions are written out below.
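A sketch of these loss functions under standard conventions; the exact notation is an assumption (h_i is the extracted feature vector, w_k the final-layer weights for class k, y_i the correct class, t_i ∈ {-1, +1} the one-vs-rest target, C the soft-margin penalty):

$$\mathcal{L}_{\text{softmax}} = -\sum_i \log \frac{\exp(w_{y_i}^{\top} h_i)}{\sum_k \exp(w_k^{\top} h_i)}$$

$$\mathcal{L}_{\text{SVM}} = \sum_i \mathrm{ReLU}\!\left(1 - t_i\, w^{\top} h_i\right)$$

$$\mathcal{L}_{\text{soft-margin}} = \tfrac{1}{2}\lVert w\rVert^{2} + C \sum_i \mathrm{ReLU}\!\left(1 - t_i\, w^{\top} h_i\right)$$

In practice the swap is a one-line change: a Keras model, for example, can be compiled with a hinge-type loss (squared_hinge or categorical_hinge) instead of categorical_crossentropy.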
11. References
[1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
[2] Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).
