
Journal of Decision Systems
ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/tjds20

Sentiment classification and aspect-based sentiment analysis on Yelp reviews using deep learning and word embeddings

Eman Saeed Alamoudi & Norah Saleh Alghamdi

To cite this article: Eman Saeed Alamoudi & Norah Saleh Alghamdi (2021): Sentiment classification and aspect-based sentiment analysis on yelp reviews using deep learning and word embeddings, Journal of Decision Systems, DOI: 10.1080/12460125.2020.1864106

To link to this article: https://doi.org/10.1080/12460125.2020.1864106

Published online: 27 Jan 2021.


ARTICLE

Sentiment classification and aspect-based sentiment analysis on Yelp reviews using deep learning and word embeddings

Eman Saeed Alamoudi (a) and Norah Saleh Alghamdi (b)

(a) College of Computers and Information Technology, Taif University, Taif, Saudi Arabia; (b) College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia

ABSTRACT
Opinion mining has significantly supported knowledge discovery and decision-making. This research analysed the content of online reviews, including the review text and its ranking. Restaurant reviews from the Yelp website were analysed under two sentiment classification tasks: binary classification (positive and negative) and ternary classification (positive, negative, and neutral). Three different types of predictive models were applied: machine learning, deep learning and transfer learning models. In addition, we propose a new unsupervised approach to aspect-level sentiment classification based on semantic similarity, which allows our framework to leverage the powerful capacity of pre-trained language models such as GloVe and eliminates many of the complications associated with supervised learning models. Food, service, ambience, and price are the aspects that were categorised according to their sentiment context. In conclusion, the maximum accuracy of 98.30% was obtained using the ALBERT model, and the proposed aspect extraction method achieved an accuracy of 83.04%.

ARTICLE HISTORY: Received 31 July 2020; Accepted 8 December 2020

KEYWORDS: Sentiment analysis; bag-of-words; TF–IDF; GloVe; machine learning; deep learning; transfer learning; aspect extraction; LIME

1. Introduction
The web is a highly common information source, complemented by the development of social media. The number of people regularly involved in social networking has been shown to be growing steadily (Hemmatian & Sohrabi, 2019). Blogs, tweets, articles, reviews, and social media conversations are analysed to collect people's opinions. Online reviews are a type of user-generated content focusing on personal experiences of a product. Online reviews are considered the new generation of word-of-mouth, known as electronic word-of-mouth (eWOM; Jeong & Jang, 2011). Nowadays, online shopping websites provide forums for product reviews and for expressing opinions. People share their opinions on various topics in the form of comments, tweets, posts, and reviews (Hemmatian & Sohrabi, 2019). A study conducted by Li and Liu (2014) indicated that 81% of Internet users have searched for relevant comments at least once before purchasing a product. Search levels for corresponding comments were registered between 73% and 87% prior to using restaurants, hotels, and a range of other services. It is important to note that these online contents have a major impact on consumer decisions,

as new consumers usually trust previous clients' reviews more than the owners' product ads or product details (Jeong & Jang, 2011). Both online users and companies have increasingly started paying attention to reviews. Online customers take previous customer experiences into account when choosing to buy a product or service. Moreover, public opinion about businesses plays an important role in marketing products, generating new opportunities, and forecasting sales. In addition, management teams within organisations require this online feedback to be evaluated in order to determine the level of customer satisfaction.
Digital transformation (e.g. online shopping and social media platforms) integrates digital technology into business operations in order to solve problems related to old operating methods, attract new opportunities, and deliver valuable services to customers (Verhoef et al., 2021). Furthermore, incorporating big data analysis into business processes leads to more accurate results produced through decision support systems (DSS; Osuszek et al., 2016). However, in the case of online reviews, their vast number has made it difficult for interested parties to read all the reviews and determine the opinions they express and the quality of the products. Consequently, artificial intelligence (AI) techniques such as sentiment analysis offer an automated way to classify the emotions in review text. Sentiment analysis is a method of extracting information from text: it seeks to turn large, unstructured datasets into observable indices of sentiment (e.g. positive, negative, or neutral). The extracted information can be summarised opinions presented in the form of numbers or graphs, which makes it easy and quick for interested persons, such as managers or customers, to obtain the required information. This underlines the inspiration behind sentiment analysis and generates greater interest in this field of research.
In this research, restaurant reviews are analysed from the Yelp website. This research
employs natural language processing (NLP) and opinion mining, known as sentiment
analysis, to analyse online users’ reviews. The study aims to examine the relationship
between reviews’ contents and the ratings that are assigned to them by users in order to
automate the sentiment classification by building models that can predict sentiments
involved in the reviews.
The primary contributions of this paper are as follows:
(1) This paper compares three different prediction techniques for sentiment classification: machine learning, deep learning, and transfer learning.
(2) A novel unsupervised technique is applied to aspect extraction based on pre-trained language models and semantic similarity, which eliminates the labour of data annotation and the need for supervised model training.
(3) The aspect polarity (or sentiment) is detected and the aspect average rating is computed, enabling a comparison of restaurants based on different aspects, such as food, service, ambience, and price, instead of comparing them only on the overall rating.
(4) The analysis findings are presented simply and conveniently to help customers find valuable information and make a confident purchasing decision and, on the other hand, to enable organisations to identify the level of customer satisfaction in order to make appropriate decisions.
The remainder of this paper is organised as follows: The Related Work section reviews the latest work in the field. The Research Methodology section presents the background of the methods used in this study. The Implementation and Experiment section describes the dataset and experimental settings. The Results and Evaluation section evaluates and discusses the study results. Finally, the conclusions, along with future work, are presented in Conclusion and Future Work.

2. Related work
All the approaches used in the field of sentiment classification and extraction can be
grouped into three main classes (Yadav & Vishwakarma, 2020): prediction-based methods,
lexicon-based methods, and hybrid methods.
Machine learning-based methods have been implemented in various studies. In Vairetti et al. (2020), the authors proposed a modified version of the support vector machine (SVM) algorithm, which introduces a new parameter, γ, for weighting the contribution of each part of a review, title and body. To extract aspects from restaurant reviews (e.g. food, costs, service, environment, and anecdotes/miscellaneous), Kiritchenko et al. (2014) used five binary one-vs-all SVM classifiers, treating the problem as multi-label text classification.
Lexicon-based methods have also appeared in the literature, either purely or jointly with prediction models. Yu et al. (2018) proposed a word vector refinement model to refine existing pre-trained word vectors using real-valued sentiment intensity scores provided by the extended version of the Affective Norms of English Words (E-ANEW) sentiment lexicon. Rezaeinia et al. (2019) introduced a novel method, improved word vectors (IWV), to increase the accuracy of pre-trained word embeddings in sentiment analysis. This is done by concatenating traditional word2vec/GloVe (global vectors for word representation) embeddings for every word in a sentence with three other embeddings built using six different lexicons. In order to improve the accuracy and analysis speed of Twitter sentiment classification, Jianqiang et al. (2018) introduced a word embedding method called GloVe-DCNN (GloVe-deep convolution neural networks) using the AFINN lexicon. The lexicon-enhanced LSTM (LE-LSTM) model was introduced by Fu et al. (2018) to enhance LSTM networks for sentiment classification tasks on unlabelled datasets.
The majority of recent research relies on deep learning techniques, which have widely proven their potential to obtain competitive results. One of the primary methods of deep learning is the convolutional neural network (CNN), which was originally developed for the field of computer vision. A novel and simple CNN model, which uses two forms of pre-trained embedding for aspect extraction (general-purpose embedding and domain-specific embedding), was applied in Xu et al. (2018). An approach named semantic-based padding was proposed by Giménez et al. (2020) to improve the performance of CNNs in NLP tasks.
Regarding deep learning sequence models, the study by Rao et al. (2018) aimed to address one of the difficulties in deploying long short-term memory networks (LSTM) for document-level sentiment classification, namely modelling the semantic relations between sentences. To solve this problem, two improvements were introduced. The first is SR-LSTM, which stands for sentence representation LSTM. The second is SSR-LSTM (sorted SR-LSTM), which improves SR-LSTM by first removing sentences with low emotional polarity before the data are input to the SR-LSTM model. The idea is that although LSTMs can theoretically handle long sequences, they can still benefit from removing the parts of a sequence that are not relevant to the task at hand.
A new form of deep contextualised word representation, embeddings from language
model (ELMo) representations, was introduced by Peters et al. (2018). This differs from
conventional word embedding as every word is represented by an embedding vector,
which is a function of the entire input sequence. A bidirectional LSTM (BiLSTM) network
was used to train a language model (LM) on a large text corpus; thus, each word in the
downstream function is represented by a weighted sum of all the hidden vectors that
match the same word in the BiLSTM layers. The aim of the downstream task model is,
therefore, to learn the linear combination of these hidden vectors. Howard and Ruder (2018) introduced ULMFiT (universal language model fine-tuning for text classification), an efficient form of TL that can be applied to any NLP task. There are three key steps in the algorithm. First, training an LM based on the state-of-the-art AWD-LSTM network on a broad general text domain to learn general language characteristics (Chen et al., 2010). Second, fine-tuning the LM on the downstream task data to learn task-specific features using discriminative fine-tuning and slanted triangular learning rates (STLR). Finally, adding two fully connected layers to the LM, which together form the classifier. The classifier is fine-tuned to the task-specific data using STLR while gradually unfreezing the layers in order to preserve low-level representations and adjust high-level representations. Devlin et al. (2018) introduced bidirectional encoder representations from transformers (BERT) by pre-training a multi-layer bidirectional Transformer encoder based on the original implementation described by Vaswani et al. (2017). BERT primarily consists of two stages: pre-training and fine-tuning. BERT is pre-trained with two unsupervised tasks: the masked language model (MLM), in which some percentage of input tokens are randomly masked and the model's goal is to predict the masked tokens, and next sentence prediction (NSP), which aims to answer whether sentence B follows sentence A in the sequence. BERT is later fine-tuned for the downstream task by adding one additional output layer appropriate for the task and fine-tuning all parameters end-to-end.
By reviewing the literature, we found that sentiment classification and aspect extraction have not been studied satisfactorily on the Yelp dataset. No previous research has compared the performance of different types of prediction models (machine learning, deep learning, and transfer learning models) on two different classification tasks, binary classification and ternary classification. In addition, no simple way has been suggested to extract aspects from an unlabelled dataset such as Yelp. This research gap will therefore be filled by our proposed study. This paper aims to compare the efficiency of three different classes of models using different feature extractions. Furthermore, a new approach to extract aspects from restaurant reviews will be introduced, which can be generalised easily to other aspect extraction domains.

3. Research methodology
This section discusses two methodologies: the first is for review classification, and the second is for aspect extraction and polarity detection. The sequences of the methodologies are described in Figures 1 and 2.

3.1. Sentiment analysis


The development of NLP dates back to the 1950s. Until the 1980s, most NLP programs relied on complicated collections of hand-written rules. However, there was a breakthrough in the late 1980s with the advent of machine learning algorithms for language processing (Abdi et al., 2019).
Sentiment analysis (or opinion mining) is considered as one of the NLP applications.
The main objective of sentiment analysis is to automatically extract users’ feelings from
unstructured texts. Sentiment analysis can be defined as extracting people’s opinions
from the web. It discusses the perceptions, attitudes and emotions of people towards
organisations, entities, individuals, problems, actions, topics and their attributes.
Sentiment analysis also includes the examination of emotions related to any entity.
The word object has been used in research by Liu (2010) as a reflection of the target
entity in a document (or review). An object (or entity) is constituted of aspects. For
example, consider the statement ‘The food was delicious at the restaurant’. The target
entity here is the restaurant having food as an aspect. The feelings or opinions
expressed in the text are further classified into positive, negative, and neutral, or
more fine-grained categories (e.g. most positive, least positive, most negative, and
least negative). Hence, positive sentiment for the restaurant entity is expressed in the
above case. Sentiment analysis can be carried out at three levels (Do et al., 2019). The first is the document level, where feelings are summarised across the whole document as, for example, either positive or negative. The second is the sentence level, where the purpose is to assess whether a statement is subjective or factual and to define the polarity of the sentence as, for example, either positive or negative. The third is the aspect level, where the goal is to determine and extract aspects from the text and then specify their polarity; this level is commonly known as aspect-based sentiment analysis (ABSA). Through the proposed method of aspect extraction in this
research, the aim is to obtain more comprehensive information about customers’
opinions by investigating their opinions towards the restaurants’ common aspects
(e.g. food, service, ambience, and price) that are mentioned in their online reviews,
for example, examining the reviews to understand more about the customer experi­
ence. Was the customer rating high or low because of the food? Or service? Or the
place? Or the price? Or perhaps the food was really good, but the manager was awful!
Clearly, such a nuanced assessment of a product or service is invaluable both to
customers and business owners. To perform aspect-level sentiment classification, the
common approach consists of four fundamental steps: First, hiring experts to manually
segment each customer review into sentences. Second, annotating each sentence
according to the aspect it entails, as well as its underlying sentiment. Third, training
a supervised machine learning or deep learning model on the annotated (labelled) data.
Finally, using the trained model to extract the aspect and predict the sentiment of
a newly written review. This is an extremely time-consuming and costly procedure,
especially if the business is providing different lines of products and services, e.g.
a retailer. As a solution to these obstacles, a novel and universal approach is proposed
that significantly reduces both time and cost, by reformulating the problem of aspect
extraction from a supervised text classification to an unsupervised text classification

based on the approach of semantic similarity. To the best of our knowledge, we are the
first to use this method in the field of aspect extraction.

3.2. Feature extraction


The dataset used in this research is textual; therefore, it needs special pre-processing before being entered into classification models. Texts need to be cleaned and converted to numerical features to be used by prediction models. Different vectorisation techniques are performed in this study. The bag-of-words (BOW) model describes the presence of words in a document while ignoring all information about the order or meaning of the words (Zhao & Mao, 2018). The n-gram is another useful feature extraction model, built by counting word sequences in a text corpus and then calculating their probability (Wei et al., 2009). Term frequency–inverse document frequency (TF–IDF) is a widely used textual feature extraction method that calculates the relative frequency of terms in a given document by an inverse proportion of the terms over the whole document collection (Ramos, 2003).
While the aforementioned feature extraction techniques are common in many NLP applications, they ignore the contextual and semantic similarity between the extracted words. Word embedding, on the other hand, is a powerful feature extraction method built with deep learning techniques in the NLP domain. Words are represented by vectors of real numbers, and words of similar meaning have similar representations (Joulin et al., 2017). There are many commonly used pre-trained word embeddings (or LMs), such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), both included in this research.
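To make the contrast concrete, the minimal sketch below (our illustration, not code from the paper) builds sparse TF–IDF features with scikit-learn and compares two sentences using spaCy's pre-trained vectors; the model name en_core_web_lg and the toy corpus are assumptions.

```python
# Sketch: sparse n-gram features vs. dense pre-trained word vectors.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The food was delicious", "The waiter was delightful"]  # toy reviews

# Sparse features: each column is an n-gram; word order and semantics are lost.
tfidf = TfidfVectorizer(ngram_range=(1, 3))
X_sparse = tfidf.fit_transform(corpus)          # shape: (2, n_features)

# Dense features: 300-dimensional vectors averaged over the tokens.
# Requires a vectors-enabled model: `python -m spacy download en_core_web_lg`.
nlp = spacy.load("en_core_web_lg")
doc1, doc2 = nlp(corpus[0]), nlp(corpus[1])
print(doc1.similarity(doc2))  # semantic similarity, unavailable to plain BOW
```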

3.3. Classification models


In this research, a variety of models have been implemented.
Machine learning models include logistic regression (LR) and naive Bayes (NB). The goal
of LR is to train an algorithm that can determine a new observation class by learning
a weight vector and a bias term from a training set. An NB model defines an event’s
likelihood based on previous knowledge of the event’s conditions (Zhang, 2005).
Deep learning models include CNN, BERT, and ALBERT. The CNN was introduced with AlexNet in 2012 and is an effective method for the analysis of images and texts without a large amount of data (Krizhevsky et al., 2012). BERT is designed to pre-train deep bidirectional LM representations of unlabelled text. The BERT model can thus be fine-tuned with only one external output layer to construct state-of-the-art structures for a wide range of tasks, including text classification, without major task-specific changes to the model's architecture (Devlin et al., 2018). A Lite BERT (ALBERT) has substantially fewer parameters than BERT. Furthermore, a self-supervised task for sentence-order prediction (SOP) was added to further enhance ALBERT's results (Lan et al., 2019).
TL is implemented by storing knowledge gained while solving one problem and transferring it to solve a new, related problem. In our case, TL was implemented by using pre-trained word embeddings as an embedding layer in the CNN, BERT and ALBERT models.

Figure 1. Illustrations for sentiment classification process.

4. Implementation and experiment


4.1. Dataset
4.1.1. Data collection
(1) Yelp Dataset: Yelp.com is an online directory and a large review platform. Data was collected from the Yelp Dataset Challenge, which is publicly available and accessible from the Yelp1 and Kaggle2 websites. In this project, the dataset was downloaded from the Kaggle repository. The Yelp dataset consists of five CSV files: business, users, reviews, check-in, and tips. There are three files related to business: yelp_business, yelp_business_attributes, and yelp_business_hours. The research focuses on businesses and reviews, for which only the yelp_business and yelp_review files are used. Among businesses, restaurants are the most popular category, so this research focuses specifically on this category, and the Yelp review dataset was filtered to show only businesses within the restaurant group. There is a high variation in review numbers among restaurants: the lowest number of reviews is 2, and the highest is 10,323. Due to this, further filtering was performed to retain only restaurants with over 500 reviews, as this criterion results in an appropriate number of reviews for comparable analysis.

Figure 2. Illustrations for the aspect extraction and polarity process.

(2) SemEval-2014 Dataset: SemEval-2014 was used in order to
examine the approach proposed for aspect extraction as the Yelp dataset is not annotated
in relation to the restaurants' aspects. The SemEval-2014 dataset consists of annotated restaurant review data, provided for Task 4 of SemEval-2014, Aspect-Based Sentiment Analysis (ABSA). It comprises 3,044 English sentences from the Ganu et al.
(2009) restaurant reviews. There are descriptions for many details (e.g. aspect terms,
aspect categories). This research only focuses on aspect categories for each sentence.
The dataset was downloaded from the META-SHARE website.3

4.1.2. Data splitting


Two distinct techniques were implemented to modify the Yelp dataset from star rating
classification to sentiment rating classification. First, the Yelp dataset was tuned to the
ternary sentiment task: positive, negative, and neutral classes. The positive class

represents a rating of 4 stars and above, whereas the negative class corresponded to 2
stars and below. The 3-star rating was allocated to the neutral class. Second, the Yelp
dataset was adapted to the binary-class task, positive and negative reviews. The positive
class represents a rating of 4 stars and above, whereas the negative class was assigned to
2 stars and below. The 3-star rating was dropped. The data splitting mechanism is
explained in detail in Table 1.
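As a concrete illustration of this star-to-sentiment mapping, the following sketch (ours, not the authors' code) derives both label sets from a star column; the DataFrame and column names are hypothetical.

```python
# Sketch: mapping Yelp star ratings to ternary and binary sentiment labels.
import pandas as pd

def to_ternary(stars: int) -> str:
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"          # 3-star ratings

reviews = pd.DataFrame({"stars": [1, 3, 5], "text": ["...", "...", "..."]})
reviews["label3"] = reviews["stars"].apply(to_ternary)

# Binary task: same thresholds, but the 3-star (neutral) reviews are dropped.
binary = reviews[reviews["stars"] != 3].copy()
binary["label2"] = binary["stars"].apply(lambda s: "positive" if s >= 4 else "negative")
```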

4.2. Sentiment classification


4.2.1. Data preprocessing
The main objective of the predictive models is to deliver good performance. Data preparation is therefore important before entering the data into the models, as it helps to achieve high accuracy. The textual data required specific procedures. In this research, two data preparation steps were applied: (1) name normalisation, implemented to decrease the variance in restaurants' names due to spelling variations; in this step, each name was converted to its lowercase form and all hyphens were removed; (2) data examination, in which the data were checked for null values and duplicate values; the results showed the data were free of these. After preparing the data, the pre-processing procedures were applied, which include: (1) tokenisation, which splits a document into appropriate parts, called tokens, based on the whitespace between characters and language-specific rules; (2) lemmatisation, which reduces words to their base forms; it involves removing prefixes and suffixes and converting text to lowercase, except for entity names (e.g. meal names and city names) that are already written in reviews starting with a capital letter; (3) removal of uninformative words, which cleans the text of less informative words; the common stop-word set available from spaCy,4 a Python library (e.g. the, always, anyway, etc.), was removed, as were punctuation, non-alphabetical characters, and personal pronouns. Figure 3 provides an illustration of a review text before and after the data pre-processing stage for machine learning models.
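A minimal sketch of this machine-learning pre-processing pipeline (tokenisation, lemmatisation, stop-word/punctuation/pronoun removal) using spaCy is shown below; it approximates the steps described above rather than reproducing the authors' exact rules.

```python
# Sketch: spaCy-based cleaning for the machine learning models.
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(review: str) -> str:
    doc = nlp(review)
    kept = [
        # lemmatise and lowercase, but keep the original casing of entities
        tok.lemma_.lower() if tok.ent_type_ == "" else tok.text
        for tok in doc
        if not (tok.is_stop or tok.is_punct or tok.is_space
                or not tok.is_alpha or tok.pos_ == "PRON")
    ]
    return " ".join(kept)

print(preprocess("Love coming here. The food speaks for itself, so good!"))
# -> e.g. "love come food speak good"
```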
The text pre-processing step for deep learning models was done using a function called Standard Text Preprocessor in ktrain,5 a Python library. This involves (a) removing punctuation, (b) text splitting and lowercasing, and (c) vectorisation, which translates each text into an integer sequence (each integer is a token index in a dictionary). Figure 4 provides pre- and post-processing examples of a sample text for deep learning models.

Table 1. Data splitting details.

Machine learning models (70% training / 15% validation / 15% testing):
3-class Yelp dataset | 862,291 training | 184,777 validation | 184,777 testing | 1,231,845 total reviews
2-class Yelp dataset | 744,758 training | 159,591 validation | 159,591 testing | 1,063,940 total reviews

Deep learning models (99% training / 0.5% validation / 0.5% testing):
3-class Yelp dataset | 1,219,618 training | 6,160 validation | 6,160 testing | 1,231,938 total reviews
2-class Yelp dataset | 1,053,375 training | 5,320 validation | 5,321 testing | 1,064,016 total reviews

Before: "Love coming here. Yes the place always needs the floor swept but when you give out peanuts in the shell how won't it always be a bit dirty. The food speaks for itself, so good. Burgers are made to order and the meat is put on the grill when you order your sandwich. Geng the Cajun fries"

After: "love come yes place need floor sweep peanut shell bit dirty food speak good burger order meat grill order sandwich Cajun fry"

Figure 3. A review text before and after the pre-processing stage for machine learning models.

'I like this restaurant !? .......' → 'I like this restaurant' → [[3, 40, 16, 75]]

Figure 4. A review text before and after the pre-processing stage for deep learning models.

4.2.2. Feature extraction


The clean data resulting from the pre-processing phase were used in this stage to apply the feature extraction steps as follows: (1) Term frequency (BOW): the data were transformed into numerical vectors, considering how frequently each word in the dataset is repeated while disregarding the relative positions of words in the text. The unigram (U), bigram (B) and trigram (T) features were also extracted from the BOW. (2) TF–IDF: used to generate the most representative words among the word sets from the BOW. (3) GloVe: in this stage, a pre-trained TL package from the spaCy library was used. The model is called en_vectors_web_lg and was trained on a common web crawl (written text, e.g. blogs, news, and comments) using GloVe. The model provides 1,070,971 unique vectors (each vector representing one word); each vector has 300 dimensions (columns).
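The two BOW/TF–IDF configurations described above can be sketched with scikit-learn as follows; the pipeline layout is our assumption, not the authors' published code.

```python
# Sketch: the two n-gram/TF-IDF feature pipelines used with LR and NB.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Pipeline 1: unigrams-trigrams, unlimited vocabulary, TF-IDF weighting.
clf_unlimited = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)

# Pipeline 2: same n-grams, capped at the 20,000 most frequent terms.
clf_capped = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), max_features=20_000),
    LogisticRegression(max_iter=1000),
)

# clf_unlimited.fit(train_texts, train_labels)  # train_texts: cleaned reviews
```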

4.2.3. Classification models


Machine learning model implementation: Machine learning models were implemented using two different pipelines for each model. LR and NB models were applied in this study. Each pipeline involves BOW with a combination of unigram, bigram and trigram terms, and TF–IDF. The first pipeline used an unlimited number of n-grams produced by the BOW; the second used a fixed vocabulary of the 20,000 most frequent n-grams.

Deep learning model implementation: The first model implemented is a CNN. The model primarily consists of six layers. First, an embedding layer with a dimension of 30,000 rows and 100 columns and a vocabulary size of 30,000 words. The input length is 256 words, so reviews longer than 256 words are truncated from the beginning, keeping the most important words, which usually appear at the end; reviews shorter than the input length are padded with zeros at the beginning. Second, a 1D-convolutional layer of 256 filters, each with a kernel size of 5, and ReLU as the activation function. Third, a 1D-convolutional layer of 128 filters, each with a kernel size of 3, and ReLU as the activation function. Fourth, a 1D global average-pooling layer. Fifth, a dense layer of 128 neurons with ReLU activation. The final layer is a dense layer (three neurons for the ternary Yelp dataset, two for the binary Yelp dataset) with softmax activation. Dropout and batch normalisation were applied after both 1D-convolutional layers and the 128-neuron dense layer, with a dropout rate of 0.25 for the first 1D-convolutional layer and 0.1 for both the second 1D-convolutional layer and the 128-neuron dense layer. Figure 5 shows the architecture of the CNN model.

Figure 5. Plot of the CNN classification model: Embedding (30,000 × 100) → 1D-Conv (256 filters) → 1D-Conv (128 filters) → 1D average pooling → Dense (128 neurons) → Dense (3 neurons).

Other hyperparameters were: a maximum learning rate of 0.001 (1e−3), weight decay (a regularisation for overfitting reduction) of 0.01, categorical cross-entropy loss, the Adam optimiser, a batch size of 256, and four epochs.
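The described architecture can be sketched in Keras as follows. This is our reconstruction from the text, not the authors' code; in particular, the exact placement of dropout/batch-normalisation and the weight-decay setting are assumptions, and weight decay is omitted here for brevity.

```python
# Sketch: the six-layer CNN described in the text, in tf.keras.
from tensorflow.keras import layers, models, optimizers

def build_cnn(num_classes: int = 3, vocab_size: int = 30_000,
              embed_dim: int = 100, max_len: int = 256):
    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim, input_length=max_len),
        layers.Conv1D(256, 5, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.25),
        layers.Conv1D(128, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.1),
        layers.GlobalAveragePooling1D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_cnn(num_classes=3)  # three output neurons for the ternary task
```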
Three different experiments were run with the CNN model. In the first experiment, the embedding layer's weights were assigned randomly and were fine-tuned during training. In the second experiment, the GloVe model (word-embedding model) from spaCy was used to initialise the weights of the embedding layer (30,000 rows, 300 columns), and the embedding weights were frozen, i.e. not fine-tuned during training. In the third experiment, the same GloVe model from spaCy was used but with the weights fine-tuned during the training phase.

The second model is BERT; specifically, the BERT-base model was used. Before the BERT model was trained on the entire dataset, a sample of 25% of the data was selected to train the model as an initial training stage, ensuring that the classes were represented in the same proportions as in the full dataset. The input length was set to 256 words with an embedding layer of 30,000 rows and 768 columns. During this stage, the model was fine-tuned using the recommended hyperparameter values mentioned in Zhang et al. (2015): a maximum learning rate of 2e−5, the Adam optimiser, four epochs, and a batch size of 32. In the second training stage, the BERT model was trained on the full dataset using the same architecture and the optimal weights obtained from the first stage. The hyperparameter values used in the second stage were a batch size of 32, a maximum learning rate of 1e−6, and one epoch. The third model is ALBERT, for self-supervised learning of language representations. We used the ALBERT-base model and applied the same training methodology and hyperparameter values as in the BERT experiment, with a maximum learning rate of 5e−6 for the three-class dataset and 6e−6 for the binary-class dataset in the first stage, and a maximum learning rate of 2e−7 for both datasets in the second stage. All model experiments were carried out on a Tesla P100-16GB GPU by NVIDIA.
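Since the paper uses the ktrain library for the transformer experiments, a minimal fine-tuning sketch might look as follows; the tiny placeholder data and the single-run schedule are ours (the paper's two-stage schedule with a 25% warm-up sample is simplified here).

```python
# Sketch: fine-tuning BERT-base with ktrain on the ternary task.
import ktrain
from ktrain import text

x_train = ["Great food!", "Terrible service.", "It was okay."]  # placeholders
y_train = ["positive", "negative", "neutral"]

t = text.Transformer("bert-base-uncased", maxlen=256,
                     class_names=["negative", "neutral", "positive"])
trn = t.preprocess_train(x_train, y_train)

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, batch_size=32)
learner.fit_onecycle(2e-5, 4)  # max learning rate 2e-5, four epochs
```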

4.3. Aspect extraction


4.3.1. Clustering
As an initial step, it is good practice to investigate the data to find the most common topics mentioned in customer reviews. A k-means clustering method, a form of unsupervised learning that can easily interpret data and capture clear and consistent patterns (Huang, 2008), was implemented. Four clusters were computed on trigrams generated from the BOW to determine the most significant aspects that customers discussed in their comments. The results showed three obvious patterns, related to three aspects: food, service, and place, whereas the fourth cluster consisted of various trigrams not tied to a particular aspect. Table 2 displays examples of three trigrams from each cluster, and a clustering sketch follows the table.

Table 2. Illustrations of trigram clustering.
Trigram | Cluster
Food great service | 1
Staff super friendly | 1
Good customer service | 1
Highly recommend place | 2
Definitely recommend place | 2
Absolutely love place | 2
Sweet potato fry | 3
Vanilla ice cream | 3
Thin crust pizza | 3
Great food great | 4
Food great price | 4
Favourite place eat | 4
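The text does not specify how trigrams were represented for clustering; one plausible choice, used in this sketch of ours, is to cluster trigrams by their co-occurrence pattern across reviews.

```python
# Sketch: k-means with four clusters over BOW trigrams.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["food great service staff super friendly",
           "sweet potato fry was amazing",
           "highly recommend place absolutely love place",
           "great food great price favourite place eat"]

vec = CountVectorizer(ngram_range=(3, 3))        # trigrams only
X = vec.fit_transform(reviews)                   # reviews x trigrams

km = KMeans(n_clusters=4, random_state=0, n_init=10)
labels = km.fit_predict(X.T)   # cluster the trigrams (columns), not the reviews
for trigram, cluster in zip(vec.get_feature_names_out(), labels):
    print(cluster, trigram)
```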

4.3.2. Aspect embedding


An embedding vector was built to represent each aspect. A GloVe pre-trained model available from the spaCy library was used. Four vectors were created, each representing one aspect. The aspects defined were food, service, ambience, and price. Each aspect was defined using a group of words, as shown in Figure 6, and therefore each aspect's vector is the average of the vectors of its words (tokens). Different forms of defining the aspects were tried, and the form that gave the best result was chosen.

4.3.3. Review embedding


In order to build the vectors for reviews, a sequence of steps was implemented. First, all reviews were entered into the NLP pipeline provided by spaCy; the result of the default pipeline is a spaCy document object, which can be defined as a sequence of tokens. Second, each review was broken down into sentences using a method called Doc.sents, which works on the basis of dependency parsing. Third, for each sentence, the GloVe vector embedding of the full sentence was computed by taking the average of all the word vectors in the sentence. Finally, words not found in the training corpus were represented with zero vectors.

4.3.4. Semantic similarity


In this phase, the semantic similarity was computed between the vectors of the aspects and the vector of each sentence in a set of reviews related to a particular restaurant. The aspect that gained the highest score (higher being more similar) with the sentence was chosen as the aspect mentioned in the sentence.

{'food': 'food drinks',
 'service': 'service staff',
 'ambience': 'ambience music location',
 'price': 'price money $'}

Figure 6. Aspects definitions.
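The whole unsupervised pipeline of Sections 4.3.2-4.3.4 can be sketched compactly with spaCy, as below. This is our illustration under the aspect definitions of Figure 6, not the authors' code; spaCy's Doc and Span vectors are already averages of their token vectors, and Span.similarity computes cosine similarity.

```python
# Sketch: aspect embedding, review sentence embedding, cosine-based assignment.
import spacy

nlp = spacy.load("en_core_web_lg")   # vectors-enabled pre-trained model

aspect_defs = {
    "food": "food drinks",
    "service": "service staff",
    "ambience": "ambience music location",
    "price": "price money $",
}
# Each aspect vector is the average of its defining words' vectors.
aspect_docs = {name: nlp(words) for name, words in aspect_defs.items()}

def extract_aspects(review: str):
    doc = nlp(review)
    results = []
    for sent in doc.sents:                       # dependency-parse sentence split
        scores = {name: sent.similarity(adoc)    # cosine similarity
                  for name, adoc in aspect_docs.items()}
        results.append((sent.text, max(scores, key=scores.get)))
    return results

print(extract_aspects("The staff was friendly. The pasta was cold."))
# -> e.g. [('The staff was friendly.', 'service'), ('The pasta was cold.', 'food')]
```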



Table 3. Illustration of aspect extraction and aspect sentiment detection.
Sentence | Food | Service | Ambience | Price | Aspect | Sentiment
I ordered the grilled chicken combo but received fried chicken. | 0.5631 | 0.3956 | 0.3732 | 0.4056 | Food | Negative
The inside is dirty in the afternoon like they never clean. | 0.5098 | 0.4981 | 0.5602 | 0.4955 | Ambience | Negative
The people behind the counter are usually polite and fast. | 0.5191 | 0.5282 | 0.5193 | 0.5035 | Service | Positive
I have never had an order messed-up. | 0.4992 | 0.5852 | 0.5232 | 0.5787 | Service | Negative

Table 4. Model results on the 3-class Yelp dataset.
Features | Model | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Log loss
BOW (U + B + T) + TF-IDF, unlimited phrases | Logistic regression | 82.815 | 85.105 | 82.815 | 83.726 | 0.427
BOW (U + B + T) + TF-IDF, 20,000 phrases | Logistic regression | 81.024 | 85.123 | 81.024 | 82.513 | 0.460
BOW (U + B + T) + TF-IDF, unlimited phrases | Naive Bayes | 79.378 | 78.630 | 79.378 | 73.359 | 0.603
BOW (U + B + T) + TF-IDF, 20,000 phrases | Naive Bayes | 81.112 | 78.117 | 81.112 | 77.964 | 0.471
Random weights | CNN | 88.295 | 87.600 | 88.295 | 87.859 | 0.294
GloVe (frozen weights) | CNN | 87.987 | 87.164 | 87.987 | 87.433 | 0.301
GloVe (fine-tuned weights) | CNN | 88.620 | 87.91 | 88.620 | 88.160 | 0.290
Pre-trained weights | BERT | 89.626 | 89.357 | 89.626 | 89.477 | 0.262
Pre-trained weights | ALBERT | 89.496 | 89.023 | 89.496 | 89.211 | 0.262

Table 5. Model results on the 2-class Yelp dataset.
Features | Model | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Log loss
BOW (U + B + T) + TF-IDF, unlimited phrases | Logistic regression | 95.674 | 95.809 | 95.674 | 95.715 | 0.119
BOW (U + B + T) + TF-IDF, 20,000 phrases | Logistic regression | 95.298 | 95.526 | 95.298 | 95.359 | 0.126
BOW (U + B + T) + TF-IDF, unlimited phrases | Naive Bayes | 92.276 | 92.483 | 92.276 | 91.914 | 0.185
BOW (U + B + T) + TF-IDF, 20,000 phrases | Naive Bayes | 92.663 | 92.581 | 92.663 | 92.489 | 0.185
Random weights | CNN | 97.706 | 97.698 | 97.706 | 97.695 | 0.068
GloVe (frozen weights) | CNN | 97.744 | 97.735 | 97.744 | 97.737 | 0.065
GloVe (fine-tuned weights) | CNN | 98.045 | 98.038 | 98.045 | 98.039 | 0.061
Pre-trained weights | BERT | 98.120 | 98.117 | 98.120 | 98.118 | 0.055
Pre-trained weights | ALBERT | 98.308 | 98.303 | 98.308 | 98.304 | 0.048

The semantic similarity score was computed using cosine similarity. An example of the semantic similarity implementation is shown in Table 3.

4.3.5. Aspect polarity


The LR model, which was trained on full reviews (document level), was applied to predict the sentiment of each sentence (sentence level), thereby determining the aspect polarity related to each sentence. Examples of sentiment (polarity) detection are shown in Table 3.

5. Results and evaluation


5.1. Evaluation of sentiment classification
Tables 4 and 5 provide the results obtained from all models on the two sentiment classification problems: the ternary Yelp dataset and the binary Yelp dataset. The best results for each classification task in terms of each evaluation metric are highlighted in bold. Among the machine learning models, LR achieved the best results across the two classification problems, with an accuracy of 82.815% on the ternary Yelp dataset and 95.674% on the binary Yelp dataset. Among the deep learning models, BERT performed best on the ternary classification task with an accuracy of 89.626%, whereas ALBERT boosted performance to the best result on the binary classification task, with an accuracy of 98.308%. Notable results emerge when comparing model performance according to the vector length in the BOW: LR performed better with unlimited vector length, whereas NB produced its best results with a fixed length of 20,000 n-grams. Moreover, the CNN model with fine-tuned GloVe embeddings performed best among the CNN experiments in both classification problems. Generally, the CNN with fine-tuned GloVe, BERT, and ALBERT models obtained remarkably close results on both datasets, with error rates of 1.95, 1.87, and 1.69, respectively, on the binary Yelp dataset. The accuracy, precision, recall, F1 score, and log loss were calculated using the following formulas:
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of input samples}} \quad (1)$$

$$\text{Precision} = \frac{\text{Number of correct positive results}}{\text{Number of all results predicted as positive}} \quad (2)$$

$$\text{Recall} = \frac{\text{Number of correct positive results}}{\text{Number of all positive samples}} \quad (3)$$

$$F_1\,\text{Score} = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} \quad (4)$$

$$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log\!\left(p_{ij}\right) \quad (5)$$
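Equations (1)-(5) map directly onto scikit-learn's metric functions, as in the sketch below; the toy labels are ours, and the weighted averaging for precision/recall/F1 is an assumption (it is consistent with the tables, where recall equals accuracy).

```python
# Sketch: computing the five evaluation metrics with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, log_loss)

y_true = ["pos", "neg", "pos", "neu"]
y_pred = ["pos", "neg", "neu", "neu"]
proba = [[0.1, 0.1, 0.8], [0.7, 0.2, 0.1],   # columns ordered neg, neu, pos
         [0.2, 0.5, 0.3], [0.1, 0.8, 0.1]]

print(accuracy_score(y_true, y_pred))                          # Eq. (1)
print(precision_score(y_true, y_pred, average="weighted"))     # Eq. (2)
print(recall_score(y_true, y_pred, average="weighted"))        # Eq. (3)
print(f1_score(y_true, y_pred, average="weighted"))            # Eq. (4)
print(log_loss(y_true, proba, labels=["neg", "neu", "pos"]))   # Eq. (5)
```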

5.1.1. Interpretation of the classification mechanism using LIME

The classification mechanism of the best model, ALBERT, is interpreted in Table 6 and Figure 7 using the LIME library.6 The interpretable representation graph of two selected reviews offers insight into the model's approach when classifying reviews, e.g. what is the impact of each word or phrase on the model's prediction? The words highlighted in dark green have the largest effect on the probability of the model's current decision, whereas the words highlighted in dark red have the greatest effect on the opposite decision. Thus, the words great and good had a strong influence in pushing the model to predict positive. Conversely, the word price pushed the model to predict the review as negative.
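A minimal LIME sketch for this interpretation step is shown below; predict_proba is a placeholder for the fine-tuned model's batch probability function (here stubbed with fixed values so the snippet runs).

```python
# Sketch: explaining a single review with LIME's text explainer.
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    # Placeholder: a real run would call the fine-tuned classifier here.
    return np.tile([0.3, 0.7], (len(texts), 1))

explainer = LimeTextExplainer(class_names=["negative", "positive"])
exp = explainer.explain_instance(
    "Great price and decent food selection.",
    predict_proba,
    num_features=10,          # top words contributing to the decision
)
print(exp.as_list())          # (word, weight) pairs; sign shows direction
```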

Table 6. Reviews' details.
Review ID | Review text | Stars | Sentiment | Model prediction
2578 | Great price and decent food selection. Discount available if you have Player's Card. 6.99 Breakfast 8.99 Lunch. | 2 | Negative | Positive
525 | This place was really good. Great food. Good price. Trust the reviews. Place fills up quick so hurry over. | 4 | Positive | Positive

Figure 7. Interpretation of the classification mechanism of the ALBERT model.

5.1.2. Overfitting checking


Testing the model against the overfitting is a common practice in the field of machine
learning and deep learning prediction. A strong indication of a model overfitting is that its
error in the test set is substantially greater than the error in the training set. That means
the model knows the training data but does not generalise well on unseen data. This
makes the model useless for predictive purposes. All models were checked against the
overfitting and the results showed that all models are free from it. For illustration on the
best-performed model ALBERT, the values of the accuracy and log loss were compared in
the three splits of data: training, validation, and testing data on the binary Yelp dataset.
The metric values were very close, indicating that the model was not suffering from
overfitting to the trained data. Table 7 explains the evaluation metric values of the three
types of data splitting for the ALBERT model on the binary Yelp dataset.

Table 7. The accuracy and log loss values of the training, validation and testing datasets of the ALBERT model on the 2-class Yelp dataset.
Dataset | Accuracy (%) | Log loss
Training data | 98.45 | 0.044
Validation data | 98.30 | 0.048
Test data | 98.10 | 0.051

5.1.3. Testing models with unseen examples (case study)


To test the effectiveness of the best-performed model ALBERT, its prediction ability was
examined on a new and challenging dataset. The prediction results were acceptable, and
they reflect the model generalisation efficiency. Table 8 provides an illustration of the
model test experiment.
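Scoring new, unseen reviews with the trained model can be sketched via ktrain's predictor interface; learner and t are assumed to be the objects from the fine-tuning sketch in Section 4.2.3.

```python
# Sketch: predicting sentiment for unseen reviews with a ktrain predictor.
import ktrain

predictor = ktrain.get_predictor(learner.model, preproc=t)
for review in ["Superb food", "Such disgusting food"]:
    print(review, predictor.predict(review), predictor.predict_proba(review))
```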

5.2. Evaluation of aspect extraction


The Yelp dataset is not annotated according to the restaurants' common aspects; therefore, it cannot be used for training supervised aspect extraction models or for testing with the standard evaluation metrics of supervised models. Three evaluation methods are proposed here to help assess the performance of the proposed aspect extraction method.

5.2.1. Comparing overall average rating with aspect average rating


First, the overall average rating for both the highest- and the lowest-rated restaurants was computed. The actual average rating was computed by applying Equation (6) (Ganu et al., 2009) to the review sets of both restaurants. The number of reviews in each sentiment class was taken from the ternary Yelp dataset. Second, the average rating for each aspect of the aforementioned restaurants was computed using Equation (6). The equation is based only on the number of positive and negative sentences in each restaurant's reviews; neutral-classed sentences were excluded as they do not provide detail on the quality of the restaurant. Finally, the actual rating and the mean of the aspect ratings were compared. Table 9 shows the average sentiment rating for the highest- and lowest-rated restaurants, and Table 10 shows the average rating per aspect for KFC.

Table 8. Illustration of the ALBERT model test experiment on a new dataset.
Tested data | Model prediction | Negative probability | Positive probability
Superb food | Positive | 0.0025 | 0.9974
Great food | Positive | 0.0033 | 0.9966
How wonderful place | Positive | 0.0083 | 0.9916
Such disgusting food | Negative | 0.9809 | 0.0190
I'm not sure what to say? The food is nice, but the service is not that great | Negative | 0.7487 | 0.2512
I'm not sure what to say? The food is the best, but the service is not that great | Positive | 0.4740 | 0.5259

Table 9. Average sentiment rating for the highest and lowest rated restaurants in the 3-class Yelp dataset.
Restaurant's name | No. of negative reviews | No. of neutral reviews | No. of positive reviews | Average sentiment rating
Brew Tea Bar | 10 | 15 | 1140 | 4.965
KFC | 1769 | 208 | 306 | 1.589

Table 10. An example of the aspects' details related to a KFC restaurant.
Aspect | No. of negative sentences | No. of neutral sentences | No. of positive sentences | Average sentiment rating
Ambience | 1137 | 313 | 832 | 2.690198
Food | 4061 | 991 | 2053 | 2.343147
Price | 1664 | 321 | 969 | 2.472085
Service | 3877 | 615 | 1785 | 2.261039

$$\text{Rating} = \left(\frac{P}{P+N}\right) \times 4 + 1 = \left(\frac{2}{2+0}\right) \times 4 + 1 = 5.00 \quad (6)$$

P represents the number of positive reviews (or positive sentences in all reviews), and N represents the number of negative reviews (or negative sentences in all reviews). The computed average rating is scaled to the 1-5 range to be comparable with the original rating used in the Yelp evaluation system. We compared the overall average rating and the aspect average rating of the highest-rated restaurant, Brew Tea Bar, and the lowest-rated restaurant, KFC. Brew Tea Bar's aspect average rating is 4.3974, whereas the actual one is 4.9652. KFC's aspect average rating is 2.2204, whereas the actual value is 1.5898. There is a difference of only about 0.6 points between the two readings. The radar graph in Figure 8 shows the average rating based on the aspects of Brew Tea Bar and KFC. The close results indicate that a reasonable aspect average rating for a restaurant can be obtained, which provides a better understanding of the important information in customer reviews and can also be utilised to rank restaurants based on a selected aspect.

Figure 8. Average rating of 'Brew Tea Bar' and 'KFC' based on aspects.
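Equation (6) applied per aspect is a one-line function; the sketch below (ours) reproduces the per-aspect ratings of Table 10 from its positive/negative sentence counts. How the per-aspect ratings are aggregated into the single aspect average quoted above is not spelled out in the text, so no aggregation is shown.

```python
# Sketch: Equation (6) computed per aspect from sentence counts (Table 10, KFC).
def sentiment_rating(pos: int, neg: int) -> float:
    """Scale the positive share of sentences into the 1-5 star range."""
    return (pos / (pos + neg)) * 4 + 1

kfc_counts = {"ambience": (832, 1137), "food": (2053, 4061),
              "price": (969, 1664), "service": (1785, 3877)}
aspect_ratings = {a: sentiment_rating(p, n) for a, (p, n) in kfc_counts.items()}
print(aspect_ratings)   # e.g. food -> ~2.343, service -> ~2.261
```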

5.2.2. Comparing rating using scatter plot


We compared the overall average rating and the aspect average rating for a sample of the first 50 restaurants in the ternary Yelp dataset using a scatter plot. The scatter plot in Figure 9 shows a strong positive relationship between the two sets of values.

5.2.3. Applying the aspect extraction method on SemEval dataset


The performance of the proposed aspect extraction approach was assessed by applying it to the SemEval-2014 dataset, an annotated restaurant dataset. The prediction robustness was assessed by comparing the aspect values predicted by the proposed method with the actual values in the dataset. First, the aspect called anecdotes/miscellaneous was dropped when it appeared alone, because it does not provide value when assessing the quality of a restaurant. Second, any sentence in the SemEval-2014 dataset can have more than one annotated aspect, whereas in the proposed method each sentence is categorised with only one aspect. Therefore, when the predicted outcome existed in the actual value list, which could have more than one aspect, it was treated as a valid prediction. The accuracy of this implementation was 83.04%, calculated by dividing the number of correct rows by the total number of rows. Table 11 shows an example of applying the proposed method to the first five sentences of the SemEval-2014 dataset.
Figure 9. Relationship between the overall and the aspect average rating for 50 restaurants (axes: average sentiment rating from the Yelp dataset vs. average sentiment rating from the aspect extraction method).

Table 11. Illustration of the proposed aspect extraction method on the SemEval-2014 dataset.
Sentence | Food | Service | Ambience | Price | Predicted aspect | Actual aspect | Match
But the staff was so horrible to us. | 0.48779 | 0.49894 | 0.25924 | 0.37803 | Service | [Service] | True
To be completely fair, the only redeeming factor was the food, which was above average, but could not make up for all the other deficiencies of Teodora. | 0.53761 | 0.47861 | 0.23591 | 0.44604 | Food | [Food, anecdotes/miscellaneous] | True
The food is uniformly exceptional, with a very capable kitchen which will proudly whip up whatever you feel like eating, whether it's on the menu or not. | 0.59301 | 0.48844 | 0.32160 | 0.40226 | Food | [Food] | True
Where Gabriela personally greets you and recommends you what to eat. | 0.51113 | 0.38971 | 0.28075 | 0.31797 | Food | [Service] | False
Not only was the food outstanding, but the little 'perks' were great. | 0.55962 | 0.48798 | 0.28862 | 0.41952 | Food | [Food, service] | True

5.3. Our result versus state-of-the-art results


This study obtained new state-of-the-art results on the 2-class Yelp dataset. Comparing our results with previous works, we obtained an error rate (1 − accuracy; lower is better; Batista et al., 2004) of 1.69 on the binary classification task with the ALBERT-base model. In previous work (Sun et al., 2019), error rates of 1.92 were obtained with BERT-base and 1.81 with BERT-large on the Yelp binary classification task. Moreover, compared with other research, we achieved lower binary-classification error rates of 1.69 with the ALBERT-base model, 1.87 with the BERT-base model, and 1.95 with the CNN with fine-tuned GloVe, against 2.16 with ULMFiT in Howard and Ruder (2018). Furthermore, the CNN with random weights and the CNN with frozen GloVe, with error rates of 2.29 and 2.25, respectively, on the binary classification problem, outperformed the 2.64 of DPCNN in Johnson and Zhang (2017). In addition, using the LR model with unlimited-length n-grams, we obtained a better binary-classification error rate of 4.32, versus 4.60 with CNN, 4.36 with LR with n-grams, and 5.26 with LSTM in the prior work of Zhang et al. (2015). As regards evaluating the proposed aspect extraction on the SemEval-2014 dataset, task 4, subtask 1 (aspect extraction) against previous studies, our proposed method obtained an accuracy of 83.04% (using at-least-one match: a prediction is true if it has one string match with the human-annotated answer spans provided in SemEval-2014; otherwise it is false). It outperformed the 74.37% F1 score in Xu et al. (2018).

6. Conclusion and future work


Experiments on the Yelp dataset with two different classification problems have been
conducted: the ternary Yelp dataset consists of positive, negative and neutral classes, and
binary Yelp dataset consists of only positive and negative classes. Various machine
learning models with different feature extraction techniques have been used. LR and
NB were implemented with BOW, using a gram range of one to three (i.e. unigram,
bigram, and trigram) and TF–IDF. Multiple deep learning methods were included in the
research experiments. Deep learning experiments involved non-transferred learning (e.g.
CNN model with the random weights initialisation) and TL mechanism (all models
20 E. S. ALAMOUDI AND N. S. ALGHAMDI

experiments that involved pre-trained word embeddings). TL methods include CNN with
fine-tuning GloVe and frozen GloVe and a fine-tuning technique for BERT base and
ALBERT base models. A range of different evaluation metrics was applied to assess the
models’ performances. The results were evaluated using several metrics such as accuracy,
precision, recall, F1 score, and log loss. The results of all experiments were discussed and
compared to the current state-of-the-art results in the NLP sentiment analysis domain.
A novel and universal method of aspect extraction was introduced, evaluated, discussed,
and compared to the current state-of-the-art results in aspect extraction studies. Finally, it
can be observed that each classification method and each predicted model have their
own advantages and drawbacks, and choosing the appropriate approach is a difficult
decision involving a degree of compromise. Deep learning approaches have provided
accurate performance and enabled the skipping of the complicated feature extraction
process, while it often requires long training time. Machine learning models, on the other
hand, required less computational sophistication, but with high data preparing require­
ments. Moreover, the unsupervised method with the competitive results offers an easy
and universal approach that can be adapted to distinct tasks. The entire used pipeline for
the data pre-processing, the feature extraction and the built learning models by adopting
specific structures and appropriate values for hyperparameters have been contributed to
improve results. However, it was found that transfer learning (pre-trained word embed­
dings) applied when training the models and when finding the semantic similarity had the
most effect in improving the results for both the sentiment classification and the aspect
extraction.
Recommendations for future research include:
• Conducting more deep learning approaches, such as sequence-to-sequence approaches (e.g. recurrent neural networks (RNN), gated recurrent units (GRUs), and LSTM) and other transformer models (e.g. RoBERTa and DistilBERT).
• Implementing human-labelled reviews, where the sentiments are classified by expert human annotators, which could help overcome the problem of mislabelled reviews.
• Learning word embeddings on specific-domain tasks, which could boost the models' performance.
• Training word embeddings on informal dialects, which may help improve the models' performance.
• Expanding the current research to other languages, such as Arabic.
• Applying the approach to reviews in other domains, such as online products.
• Implementing attention mechanisms in the proposed method of aspect extraction.

Notes
1. https://www.yelp.com/.
2. https://www.kaggle.com/.
3. http://metashare.ilsp.gr:8080/.
4. https://spacy.io/.
5. https://pypi.org/project/ktrain/.
6. https://github.com/marcotcr/lime/.

Acknowledgments
The authors offer their sincere thanks to Bank Albilad Chair of Electronic Commerce (BACEC) for their
financial support to conduct this successful research. This research was also funded by the Deanship
of Scientific Research at Princess Nourah bint Abdulrahman University through the Fast-track
Research Funding Program.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported by the Princess Nourah Bint Abdulrahman University [1000-FTFP-20].

ORCID
Eman Saeed Alamoudi http://orcid.org/0000-0001-7186-9406
Norah Saleh Alghamdi http://orcid.org/0000-0001-6421-6001

References
Abdi, A., Shamsuddin, S.M., Hasan, S., & Piran, J. (2019). Deep learning-based sentiment classification
of evaluative text based on multi-feature fusion. Information Processing and Management, 56(4),
1245–1259. https://doi.org/10.1016/j.ipm.2019.02.018
Batista, G.E.A.P.A., Prati, R.C., & Monard, M.C. (2004). A study of the behavior of several methods for
balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1), 20–29.
https://doi.org/10.1145/1007730.1007735
Chen, D., Denman, S., Fookes, C., & Sridharan, S. (2010). AWD-LSTM. Proceedings of the 2010 digital
image computing: Techniques and applications (DICTA 2010), 369–374. Sydney, Australia. https://
doi.org/10.1109/DICTA.2010.69
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional
transformers for language understanding. ArXiv. http://arxiv.org/abs/1810.04805
Do, H.H., Prasad, P.W.C., Maag, A., & Alsadoon, A. (2019). Deep learning for aspect-based sentiment
analysis: A comparative review. Expert Systems with Applications, 118, 272–299. https://doi.org/10.
1016/j.eswa.2018.10.003
Fu, X., Yang, J., Li, J., Fang, M., & Wang, H. (2018). Lexicon-enhanced LSTM with attention for general
sentiment analysis. IEEE Access, 6, 71884–71891. https://doi.org/10.1109/ACCESS.2018.2878425
Ganu, G., Elhadad, N., & Marian, A. (2009). Beyond the stars: Improving rating predictions using
review text content. WebDB, 9, 1–6.
Giménez, M., Palanca, J., & Botti, V. (2020). Semantic-based padding in convolutional neural net­
works for improving the performance in natural language processing: A case of study in senti­
ment analysis. Neurocomputing, 378, 315–323. https://doi.org/10.1016/j.neucom.2019.08.096
Hemmatian, F., & Sohrabi, M.K. (2019). A survey on classification techniques for opinion mining and
sentiment analysis. Artificial Intelligence Review, 52(3), 1495–1545. https://doi.org/10.1007/
s10462-017-9599-6
Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. ACL 2018:
56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the
Conference, 1, 328–339. https://doi.org/10.18653/v1/p18-1031
Huang, A. (2008). Similarity measures for text document clustering. New Zealand computer science
research student conference, NZCSRSC 2008, 49–56. Christchurch, New Zealand.

Jeong, E.H., & Jang, S.C.S. (2011). Restaurant experiences triggering positive electronic word-of-
mouth (eWOM) motivations. International Journal of Hospitality Management, 30(2), 356–366.
https://doi.org/10.1016/j.ijhm.2010.08.005
Jianqiang, Z., Xiaolin, G., & Xuejun, Z. (2018). Deep convolution neural networks for twitter senti­
ment analysis. IEEE Access, 6, 23253–23260. https://doi.org/10.1109/ACCESS.2017.2776930
Johnson, R., & Zhang, T. (2017). Deep pyramid convolutional neural networks for text categorisation.
ACL 2017: 55th annual meeting of the association for computational linguistics, proceedings of the
conference (Long papers), 1, 562–570. Vancouver, Canada. https://doi.org/10.18653/v1/P17-1052
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of tricks for efficient text classification.
15th conference of the European chapter of the association for computational linguistics, EACL 2017,
2, 427–431. Valencia, Spain. https://doi.org/10.18653/v1/e17-2068
Kiritchenko, S., Zhu, X., Cherry, C., & Mohammad, S. (2014). NRC-Canada-2014: Detecting aspects and
sentiment in customer reviews. Proceedings of the 8th international workshop on semantic
evaluation (SemEval 2014), 437–442. Dublin, Ireland. https://doi.org/10.3115/v1/s14-2076
Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional
neural networks. Advances in Neural Information Processing Systems, 2, 1097–1105.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for
self-supervised learning of language representations. ArXiv. http://arxiv.org/abs/1909.11942
Li, G., & Liu, F. (2014). Sentiment analysis based on clustering: A framework in improving accuracy
and recognising neutral opinions. Applied Intelligence, 40(3), 441–452. https://doi.org/10.1007/
s10489-013-0463-3
Liu, B. (2010). Sentiment analysis and subjectivity. In N. Indurkhya & F. Damerau (Eds.), Handbook of natural language processing (2nd ed., pp. 627–666). Chapman & Hall/CRC Press.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words
and phrases and their compositionality. Advances in Neural Information Processing Systems,
3111–3119. http://arxiv.org/abs/1310.4546
Osuszek, L., Stanek, S., & Twardowski, Z. (2016). Leverage big data analytics for dynamic informed
decisions with advanced case management. Journal of Decision Systems, 25(Suppl. 1), 436–449.
https://doi.org/10.1080/12460125.2016.1187401
Pennington, J., Socher, R., & Manning, C.D. (2014). GloVe: Global vectors for word representation.
EMNLP 2014: 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543.
https://doi.org/10.3115/v1/d14-1162
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep
contextualised word representations. Proceedings of the 2018 conference of the North American
chapter of the association for computational linguistics: Human language technologies, 1,
2227–2237. New Orleans, Louisiana. https://doi.org/10.18653/v1/n18-1202
Ramos, J. (2003). Using TF–IDF to determine word relevance in document queries. Proceedings of the
first instructional conference on machine learning, 242, 133–142.
Rao, G., Huang, W., Feng, Z., & Cong, Q. (2018). LSTM with sentence representations for
document-level sentiment classification. Neurocomputing, 308, 49–57. https://doi.org/10.1016/j.
neucom.2018.04.045
Rezaeinia, S.M., Rahmani, R., Ghodsi, A., & Veisi, H. (2019). Sentiment analysis based on improved
pre-trained word embeddings. Expert Systems With Applications, 117, 139–147. https://doi.org/10.
1016/j.eswa.2018.08.044
Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to fine-tune BERT for text classification? Lecture
Notes in Computer Science, 11856, 194–206. https://doi.org/10.1007/978-3-030-32381-3_16
Vairetti, C., Martínez-Cámara, E., Maldonado, S., Luzón, V., & Herrera, F. (2020). Enhancing the
classification of social media opinions by optimizing the structural information. Future
Generation Computer Systems, 102, 838–846. https://doi.org/10.1016/j.future.2019.09.023
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I.
(2017). Attention is all you need. Advances in neural information processing systems. NIPS’17:
Proceedings of the 31st international conference on neural information processing systems,
6000–6010. Long Beach, CA, USA. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

Verhoef, P.C., Broekhuizen, T., Bart, Y., Bhattacharya, A., Qi Dong, J., Fabian, N., & Haenlein, M. (2021).
Digital transformation: A multidisciplinary reflection and research agenda. Journal of Business
Research, 122, 889–901. https://doi.org/10.1016/j.jbusres.2019.09.022
Wei, Z., Miao, D., Chauchat, J.H., Zhao, R., & Li, W. (2009). N-grams based feature selection and text
representation for Chinese text classification. International Journal of Computational Intelligence
Systems, 2(4), 365–374. https://doi.org/10.1080/18756891.2009.9727668
Xu, H., Liu, B., Shu, L., & Yu, P.S. (2018). Double embeddings and CNN-based sequence labeling for
aspect extraction. ACL 2018: 56th annual meeting of the association for computational linguistics,
proceedings of the conference, 2, 592–598. Melbourne, Australia. https://doi.org/10.18653/v1/p18-
2094
Yadav, A., & Vishwakarma, D.K. (2020). Sentiment analysis using deep learning architectures: A
review. Artificial Intelligence Review, 53, 4335–4385. https://doi.org/10.1007/s10462-019-09794-5
Yu, L.C., Wang, J., Lai, K.R., & Zhang, X. (2018). Refining word embeddings using intensity scores for
sentiment analysis. IEEE/ACM Transactions on Audio Speech and Language Processing, 26(3),
671–681. https://doi.org/10.1109/TASLP.2017.2788182
Zhang, H. (2005). Exploring conditions for the optimality of naïve Bayes. International Journal of
Pattern Recognition and Artificial Intelligence, 19(2), 183–198. https://doi.org/10.1142/
S0218001405003983
Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level convolutional networks for text classification.
Advances in Neural Information Processing Systems, 649–657.
Zhao, R., & Mao, K. (2018). Fuzzy bag-of-words model for document representation. IEEE Transactions
on Fuzzy Systems, 26(2), 794–804. https://doi.org/10.1109/TFUZZ.2017.2690222
