Estimating the Efficacy of Efficient Machine Learning Classifiers for Twitter Sentiments Analysis

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 09 Issue: 05 | May 2022 www.irjet.net p-ISSN: 2395-0072
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 3050
Estimating the Efficacy of Efficient Machine Learning Classifiers for
Twitter Sentiments Analysis
Gaurav D Saxena1, Dharani J2, Surabhi Kakade3, Shilpy Sharma4, Parmod5
Department of Computer Science, Kamla Nehru Mahavidyalaya, Nagpur, India1
Assistant Professor, Department of Information Technology, Dr. N. G. P Institute of Technology, Coimbatore,
Tamil Nadu, India2
Assistant Professor, Department of Computer Science and Engineering, Vishwakarma Institute of Technology,
Pune, India3
Assistant Professor, Department of Computer Science and Engineering, College of Engineering, Roorkee, India4
Research Scholar, Department of Computer Science and Engineering, Chaudhary Devi Lal University, Sirsa,
Haryana, India5
--------------------------------------------------------------------***-------------------------------------------------------------------
Abstract- Sentiment Analysis is an expression that alludes to an assortment of methodology for classifying feeling addressed
in text. Sentiments investigation, frequently known as data mining, is a Natural language Processing (NLP) technique for
characterizing tweets as good, pessimistic, or nonpartisan. The fundamental issue while managing Twitter is the tweets. We
have presented the methodology of breaking down the sentiments of tweets obtained from the sentiments 140 dataset
utilizing the most significant and reliable machine learning classifiers: Decision trees, random forest, support vector machine,
Naive Bayes, logistic regression and XGBoost. The accuracy estimations are then used to assess the capacity of the calculation
we developed in this work. The dataset we used in this study comprises of 1,600,000 tweets recovered utilizing the Twitter API.
The issue with this approach is figuring out which model is generally appropriate for breaking down tweet Sentiments. In this
space, various methodologies for sentiment analysis are presently being used, which are momentarily looked into in this
article we utilized an supervised machine learning approach in this scenario and evaluated about the few calculations and
stated below. We evaluated every one of the most significant chose classifiers from a wide assortment of classifiers to see
which of the carried out models gave us the best exactness.
Keywords: Estimation, Efficacy, Sentiment analyzing, dataset, Machine Learning algorithms, Twitter tweets.
1. INTRODUCTION
Sentiment Analysis is a word that alludes to the method involved with following client perspectives or opinions across
a few channels. In the present advanced world, different regions or significant innovation spaces have created ways of
inspiring client or client feelings or sentiments about an item or administration. The universe of computerized stages is
quick expanding or filling in the present conditions. Quite possibly the most important thing in laying out a presence in a
serious market is one's standing. Sentiment examination permits us to monitor what client or client sees are on a virtual
entertainment stage for a specific help or item.
Since it works with human-composed content, feeling examination requires natural language processing. While
managing human-composed material, we generally utilize the natural language toolkit, which is utilized to deal with the
text. There are numerous web-based entertainment stage, we are individuals used to share their insight and point of view.
There are numerous utilization of the sentiment analysis , for example, securities exchange expectations, governmental
issues, Health care issues, advertising, film suggestions and some of them are commended in this study.
The first is Twitter, which is a stage that consolidates sentiments investigation. Twitter is a fundamental stage where a
great many people from everywhere the world examine or distribute their considerations or articulations on a specific
subject. Consistently, a large number of clients use Twitter to convey tweets.
Then, at that point, there's business, where organizations have conceived strategies pointed toward zeroing in on client
criticism or sentiments in regards to newly sent off items. This is the main standard, as it permits the association to
support item creation to some even out just based on great client criticism or mind-set toward the item.

Political missions are one more huge angle to think about while leading sentiment analyzing. Consistently, around 60%
of the populace in India tweets political comments on Twitter. Individuals are acquainted with communicating their
perspectives on decisions. We will actually want to observe individuals' contemplations on a given applicant utilizing
Twitter opinion on governmental issues.
We can distinguish the opponent's approach that is producing ideal input from clients in the market involving feeling
investigation for advertising or cutthroat exploration. We will actually want to inspect adversaries' promoting systems to
work on their administration or keep up with their situation in the serious advertising climate.
Fig-1: Flow diagram of overall process.
2. DATASET
The sentiment 140 dataset, which can be found on kaggle, was utilized in this examination. There are 1.6 million tweets
in the dataset. The Twitter API might be utilized to acquire these tweets from Twitter. There are three objective classes for
the tweets. They are marked '0' for negative tweets, '2' for neutral tweets, and '4' for positive tweets.
Simply by giving the suitable username, the Twitter API is used to recover client profile components and other
indispensable realities from Twitter. The information is accumulated and converted into a configuration that can be
advantageously taken care of into the prepared model. The Twitter API content's fundamental capacity is to return this
information, which is then utilized as contribution to the model.[20]
The NLTK comprises for the most part of a few libraries for text classification, marking, stemming, tokenization, and
parsing. The NLTK library was used broadly in the classification, and it additionally directed lemmatization. For execution,
we used pandas, scikit-Learn, Matplotlib, Numpy, and different libraries. [21]
3. PREPROCESSING
We should preprocess our information without preprocessing while at the same time working with text based
information. The recommended model can't be taken care of the dataset.
3.1 Casing
We at first believer each tweet in the twitter dataset from upper to bring down case in this stage.
3.2 Noise Removal
The text cleaning method begins with erasing any commotion, for example, HTML labels <>, URL joins, Hash labels #,
the @ sign, trailed by a username, dates, or a particular words like ('rt').[1]
3.3 Tokenization
Tokenization is done out after the noise removal stage. Tokenization is the most common way of separating message,
words, or sentences into more modest parts known as tokens.
Dataset
Sentiment 140
Dataset
Preprcoessing
Feature
Vectoririzationn
Dataset
Segmentation
Machine Learning
Classifiers
Result Visualization

3.4 Stop Word Removal
Stop words will be words that are routinely utilized in a language however have little effect. The tokenized words for
each tweet were then handled for stop word evacuation after tokenization. [2]
3.5 Normalization
The Normalization technique is a vital stage in preprocessing. Standardization is the method involved with changing
over an assortment of words into a more coherent request. The calculation accomplishes a more exact order by changing
the words over to a standard configuration. [3]
3.5.1 Stemming
In this stage, the Porters Stemmer Algorithm will be utilized. The stemming technique's primary objective is to
separate a word into its stem word or root word. Consider the drawings above for instance. The stem word "wait" might
be utilized to change over the words "waiting", "waited" and "waits." "Computing" "computerized" and "computer" may be
generally decreased to their stem word, "compute".
3.5.2 Lemmatization
In this stage, the word goes through a kind of progress to get back to its unique structure. For example, 'drove' was
renamed 'drive,' and 'driving' was renamed 'drive.' The name WordNetLemmatizer is utilized in this stage to depict the
lemmatization interaction. [4]
4. FEATURE VECTORIZATION
To work on the model's computational rightness, we utilized term frequency and inverse document frequency
vectorizer [22].The Term Frequency-Inverse Document Frequency Vectorizer is a capacity or procedure for changing
textual information over to numeric information design. We should manage numeric information in the AI model; along
these lines we use a Term Frequency-Inverse Document Frequency Vectorizer to change our text based input over to
numeric information.
5. DATASET SEGMENTATION
Preceding taking care of the utilized dataset to the model, we should isolate it into two essential parts: the preparation
dataset and the testing dataset, with a general parcel proportion of 80% of the dataset for training and 20% of the testing
dataset for testing.
6. CLASSIFICATION MODELS
Order in AI, there is different classifiers accessible for arrangement; some of them are examined above in this part.
6.1 Decision Trees Classifier
This classifier creates decides in English expressions that are easy to comprehend. Decision trees can be utilized to take
care of issues including arrangement or relapse. No presumptions are made about the fundamental conveyance of
information in Decision trees. The model's shape isn't foreordained; all things being equal, it squeezes into the best
practical arrangement relying upon the information. [5]
Fig-2: Decision Tree Classifier. [6]

A Decision tree, as the name infers, is a choice emotionally supportive network that is addressed by a tree-like design.
The root hub, otherwise called the root or starting hub of the tree, is the main piece of a choice tree. A preparation dataset
is isolated into branches by utilizing decision trees, and afterward further separated into sub branches. The decision will
keep on developing until it arrives at the issue's last stage. It stops at the point that the leaf hub, otherwise called the
terminal hub, is joined in, and the calculation ends its execution.
The decision tree is outlined in the chart above, with the root hub isolated into decision hubs, which are their sub
hubs, and the decision hubs isolated into the terminal hub, which is the last advance of the calculation and where the
strategy execution wraps up.
The random forest technique is made utilizing Decision trees, in which countless Decision trees are joined to give an
outcome relying upon most of votes. As recently said, an arbitrary random forest calculation is talked about.
6.2 Random forest
Random forest is an essential yet successful decision tree-based group approach. It creates various Decision trees and
arranges the information tests independently. Their decisions are then added together to distinguish the class with the
most votes, bringing down all out botches. Packing is the method involved with brushing Decision trees. [7]
Fig-3: Random forest Classifier. [9]
When another information thing is gotten, every one of these trees groups it, and the outcomes given by them act as
decisions in favor of each class. At long last, every one of the votes is counted, and the class with the biggest number of
votes is picked as the new data point class. [8]
At long last, the Decision tree is found to be the fundamental and building square of the random forest classifier.
6.3 Support Vector Machine
A Support Vector Machine is a directed AI procedure that is utilized to tackle issues like classification and Regression.
In any case, it is for the most part utilized for arrangement challenges in most of circumstances.
The hyperplane is utilized to arrange the two separate classes addressed in the image underneath.
Fig-4: Support Vector Machine. [10]

The objective of this approach is to create a hyper plane that can be utilized to segment n-layered space into classes so
the calculation can without much of a stretch spot elements in the suitable classification. The support vectors that
contribute in the making of the choice limit are picked by the calculation.
Order in Support vector machine depends on the translation of the hyper plane, what separates items into two
classifications. Corresponding to the hyper plane, quite a few peripheral planes can be drawn. The algorithm picks the best
peripheral plane with the best separation from the hyperplane, and the information focuses nearest to it are alluded to as
support vectors. [11]
The margin is the distance between the hyperplane and the closest support vectors. The distance between the two
classes isolated by the best not set in stone. The most extreme margin hyperplane is the plane that has the best distance
between the two classes. [12]
The most fundamental factors that assist with arranging the elements in the calculations are hyperplane. The two
delineations outlined in the above pictures are the first where there are two information highlights, and the hyperplane is
just a line with support vectors on the two sides of the hyper plane. Consider the situation where the quantity of
information attributes is bigger than two, say three, and the hyper plane turns into a two-layered plane. At the point when
the quantity of qualities surpasses three, as found in the image above, it turns out to be incredibly hard to anticipate the
result.[13]
Fig-5: Support Vectors [13].
For instant the data points
   ,
,
........
, 1
1 n
n
y
x
y
x


where xi addresses a genuine vector and yi indicates the class to which xi
has a place with a worth of - 1, 1. A hyper plane is characterized as follows and is utilized to upgrade the distance between
the two classes y=1, - 1. [12]
0
. 
b
x
w


Where w

signifies the ordinary vector and x
b

is the hyperplane balanced along w. [12]
6.4 Naive Bayes
The Bayes hypothesis is utilized to make a kind of probabilistic classifier known as Naive Bayes. It's a model of
contingent likelihood. In many managed learning settings, a likelihood model is utilized. Naive's Bayes classifiers can
prepare in an exceptionally productive and viable way. [14]
The Naive Bayes calculation is a direct methodology for applying the Bayes hypothesis to grouping issues. [16] What
makes it novel is that it is named after the Bayes: It utilizes Bayes hypothesis to move the probabilities of seeing info
attributes that compare to classes to a likelihood dissemination over classes. It is gullible on the grounds that it accepts
that prescient qualities are free together, which improves on likelihood calculations. [17]

Fig-6: Naive Bayes Algorithm. [15]
6.5 XGBOOST
The XGBoost technique is an inclination supported decision tree execution that produces extraordinary speed and
precision and overwhelms datasets on characterization and relapse issues. XGBoost is a decision tree-based machine
learning calculation that might be utilized to deal with relapse and characterization issues. To expand its presentation, it
utilizes the inclination supporting engineering.[20] The XGBoost calculation is an adaptable, convenient, and productive
technique. XGBoost utilizes the inclination supporting structure to chip away at Machine learning procedures. XGBoost is
an answer for an assortment of information science issues, giving speedy and dependable responses. Ensemble method,
then again, blends different machine learning techniques to resolve a particular issue. It outflanks single machine learning
techniques with regards to precision. [23]
Fig-7: XGBoost Algorithm. [25]
Weights assume a significant part in XGBoost algorithm. Every one of the autonomous factors are then stacked into the
decision trees, which utilizes the predefined weights to anticipate the results. The loads of the tree erroneously evaluated
were changed and placed into the subsequent decision tree. Individual classifiers are then connected together to make a
more proficient and compare the model, as found in the outline underneath. The XGBoost algorithm can promptly address
relapse, grouping, rating and client determined expectation challenges. [25]
6.6 LOGISTIC REGRESSION
It is the best and direct methodology for anticipating the classification mark. It finds choice limits in a component space
and endeavors to anticipate the name in light of those cutoff points. Sigmoid function is applied on the given inputs in
binary classification.[24]

Fig-8: Logistic Regression. [26]
7. ASSESSMENT OF THE MODEL
Subsequent to taking care of the informational index into the fitting model, evaluating the model's performance is
fundamental.
7.1 Confusion Matrix
True positives, true negatives, and false positive, false negative make up the disarray lattice, which has a plain like
design. The disarray lattice shows genuine classes of examples in the line and anticipated classes of events in the section.
Table 1. Confusion Matrix
7.1.1 Elements of Confusion Matrix Description
 Upsides that are valid-True Positives
The model's projected anticipated worth and the real worth are both positive.
 Negatives that are valid-True Negatives
The model's projected anticipated worth and the real worth are both negative.
 Up-sides that aren't accurate- False Positive
Albeit the model projected a positive outcome, the genuine outcome is negative.
 Negatives that aren't correct- False Negatives
Albeit the model projected an adverse outcome, the genuine outcome is positive.
7.2 Performance Metrics [18]
After we've finished our expectation, we'll have to make a grouping report that remembers data for execution
boundaries like Precision, Recall, and F1-Score.
7.2.1 Accuracy
It's determined by partitioning the quantity of right expectations made by the all out number of figures made. Above is
the condition for ascertaining
Actual
Positive (1) Negative (0)
Predicted
Positive
(1)
True Positive
(TP)
False Positives
(FP)
Negative
(0)
False
Negatives (FN)
True Negatives
(TN)

FN
FP
TN
TP
TN
TP
Accuracy





... (1)
We wanted True up-sides, True Negatives, False Positives, and False Negatives before we could decide the accuracy.
The preceding things can be obtained by confusion matrix.
7.2.2 Precision
Precision is characterized as the amount of genuine positive and false positive expectations isolated by the quantity of
genuine up-sides.
FP
TP
TP
ecision
Pr


... (2)
7.2.3 Recall
The quantity of genuine up-sides isolated by the absolute of genuine up-sides and false negatives is known as recall.
FN
TP
TP
Recall


.... (3)
7.2.4 F1-Score
An F1 Score is defined as the Harmonic mean of Precision and Recall.
Recall)
Precision
Recall))/(
*
(Precision
*
(2
F1_Score 

..... (4)
8. RESULTS & DISCUSSION
In this study, we have utilized different classifiers to do the sentiment analysis on tweets utilizing the sentiment 140
dataset available on Kaggle. The dataset used here comprises of 1.6 million tweets that have been partitioned into two
gatherings from training and testing segments: 80% for training dataset and 20% to conjecture exactness or accuracy. We
procure an accuracy score of support vector machine of 94.01%, random forest of 89.71% , Naive Bayes of 83.34%,
Decision tree of 74.03%, XGBoost of 92.14% & Logistic regression of 90.70%.
Table 2. Results of Implemented Models
Algorithm Accuracy
Support Vector Machine 94.01 %
Random forest 89.71 %
Naive Bayes Theorem 83.34 %
Decision Trees 74.03 %
XGBoost 92.14 %
Logistic Regression 90.70 %
8.1 Visualization of Results
The accuracies obtained by the distinct algorithm is plotted in graphical format.

Fig-9: Graphical illustration of Results.
9. CONCLUSION
This examination tries to offer an extensive comprehension of machine learning techniques that might be utilized to
dissect sentiment in Twitter messages. In wake of assessing the models we have a carried out in this study we got the
unmistakable accuracy of every classifiers. We come out to the conclusion that the support vector machine and XGBoost
has accomplished the most noteworthy accuracy of 94.01% and 92.14% which states that the model is equipped for
analyzing the sentiments of the tweets in a critical way.
REFERENCES
[1]. Hamoud AA, Alwehaibi A, Roy K, Bikdash M. Classifying Political Tweets using Naive's Bayes and Support Vector
Machine. InInternational Conference on Industrial, Engineering and other Applications of Applied Intelligent
Systems 2018 Jun 25 (pp. 736-744). Springer, Cham
[2]. Rushee KI, Rahim MS, Levula A, Mahadavi M. How Australian Are Copying with the Longest Restrictions: An
Exploratory Analysis of Emotion & Sentiment from Tweets. InAdvanced Information Networking and Applications
ANA 2022. Lecture Notes in Networks and Systems, Vol 451. Springer, cham.
[3]. Naing Hw, Thwe P, Mon AC, Nw N. Analyzing Sentiment Level of Social Media data based on Sum and Naive Bayes
algorithms. InInternational Conference on Big Data Analysis and Deep Learning Applications 2018 May 14 (pp.
68-76). Springer, Singapore.
[4]. Ye Z, Liv W, Jiang Q, Pan Y A. Cyrptocurrency Price Prediction model based on Twitter Sentiment Indicators.
InInternational Conference on BigData and Security 2021 Nov 26 (pp.411-425). Springer, Singapore.
[5]. Dangeti P. Statistics for machine learning. Packt Publishing Ltd; 2017 Jul 21.
[6]. [internet]. 2022 [cited 4 April 2022]. Available from:
https://www.google.com/amp/s/techvidvan.com/tutorials/decision-tree-in-r/%3famp=1.
[7]. Sakib S, Yasmin N, Tanzeem AK, Shorna F, Alam SB. Breast Cancer Detection and Classification: A Comparative
Analysis Using Machine Learning Algorithms. InProceedings of Third International Conference on Communication,
Computing and Electronics Systems 2022 (pp. 703-717). Springer, Singapore.
Support
Vector
Machine
Random forest
Naive Bayes
Theorem
Decision Trees XGBoost
Logistic
Regression
Series1 94.01% 89.71% 83.34% 74.03% 92.14% 90.70%
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
80.00%
90.00%
100.00%
Plot of Classifiers with Accuracies

[8]. Kumar P, Bhatnagar A, Jameel R, Mourya AK. Machine Learning Algorithms for Breast Cancer Detection and
Prediction. InAdvances in Intelligent Computing and Communication 2021 (pp. 133-141). Springer, Singapore.
[9]. Introduction to Random forest in Machine learning [internet]. Engineering Education (EngEd) Program ‫׀‬ Section.
2022 [cited 4 April 2022]. Available from: https://www.section.io/engineering-education/introduction-to-
random-forest-in-machine-learning/.
[10]. SVM Interview Questions‫׀‬ Questions on SVM to Test your skills [internet]. Analytics vidhya. 2022 [cited
4April 2022]. Available from: https://www.analyticsvidhya.com/blog/2021/05/top-15-questions-to-test-your-
data-science-skills-on-svm/.
[11]. Ruth JA, Mahesh VG, Uma R, Ramkumar P. A Hierarchical Machine Learning Frame work to classify Breast
Tissue for Identification of Cancer. InProceedings of the 11th International Conference on Computer Engineering
and Networks 2022 (pp. 504515). Springer, Singapore.
[12]. Padmavathi MS, Sumathi CP. A Novel Approach Using Support Vector Machine for Outlier Removal and
Multilayer Perceptron in Classifying Medical Datasets. InInternational Conference on Soft Computing and Signal
Processing 2019 Jun21 (pp.339-352). Springer, Singapore.
[13]. Support Vector Machine-Introduction to Machine Learning Algorithms [internet]. Medium. 2022 [cited 4
April 2022]. Available from: htttps://towardsdatascience/support-vector-machine-introduction-to-machine-
learning-algorithms-934a444fca47.
[14]. Shrivastava AK, Singh PK, Kumar Y. A taxonomy on machine learning based technique to identify heart
disease. InInternational Conference on Next Generation Computing Technologies 2018 Nov 21 (pp. 13-25).
Springer, Singapore.
[15]. Navie Bayes Algorithm [internet]. Medium. 2022 [cited 4 April 2022]. Available from:
https://kdagiit.medium.com/naive-bayes-algorithm-4b8b990c7319.
[16]. Omondiagbbe DA, Veeramani S, Sidhu AS. Machine Learning classification techniques for breast cancer
diagnosis. InIOP Conference Series: Materials Science and Engineering 2019 Apr 1 ( Vol. 495, No. 1, p. 012022).
IOP Publishing.
[17]. Liu YH. Python Machine Learning By Example. Packt Publishing Ltd; 2017 May 31.
[18]. Gupta P, Sehgal NK. Introduction to Machine Learning in the Cloud with Python: Concepts and practices.
Springer Nature; 2021.
[19]. Twitter sentiment analysis: Implement twitter sentiment analysis model [Internet]. Analytics Vidhya.
2021 [cited 2022May20]. Available from: https://www.analyticsvidhya.com/blog/2021/06/twitter-sentiment-
analysis-a-nlp-use-case-for-
beginners/#:~:text=Sentiment%20analysis%20refers%20to%20identifying,about%20a%20variety%20of%20to
pics.
[20]. Joshi MM, Kambale Ms, Shastry NS, Khan MO, Nagarathna A. A Comprehensive approach to
misinformation analysis and detection of low creditability News. InInternational Conference on Soft Computing
and Signal Processing 2021 Jun18 (pp. 23-33. Springer, Singapore.
[21]. Brijpuriya S, Rajalakshmi M. Deployment of Sentiment Analysis of tweets using various classifiers.
InProceedings of International Conference on Deep Learning, Computing, and Intelligence 2022 (pp. 167-178).
Springer, Singapore.
[22]. Bhargavi K, Mashankar P, Sreevarsh PV, Biolikar R, Ranganathan P. Machine Learning Based Sentiment
Analysis Twords Indian Ministry. In Computational Vision and Bio-Inspired Computing 2022 (pp. 381-391).
Springer, singapore.

[23]. Kunte A, Panicker S. Personality prediction of social network users using ensemble and XGBoost.
InProgress in Computing, analytics and networking 2020 (pp. 133-140). Springer, Singapore.
[24]. vasiyani P, Prakash P, Santhivel V. A comparative Study of Students Online Learning During Pandemic
Using Machine Learning Model. InICCCE 2021, 2022 (pp.17-27). Springer , Singapore.
[25]. Nisha KA, Kulsum U, Rahman S, Hossain M, ChakrabortyP, Choudhary T. A Comparative Analysis of
Machine ELarnign Approaches in Personality Prediction Using MBTI. InComputational Intelligence in Pattern
Recognition 2022 (pp. 13-23). Springer, Singapore.
[26]. Saxena GD, Tembhare NP. Analytical and Systematic Study of Artificial Neural Network. International
Research Journal of Engineering and Technology. 2022; 9(3); 653-658.

Estimating the Efficacy of Efficient Machine Learning Classifiers for Twitter Sentiments Analysis

More Related Content

Similar to Estimating the Efficacy of Efficient Machine Learning Classifiers for Twitter Sentiments Analysis (20)

More from IRJET Journal (20)

Recently uploaded (20)

Estimating the Efficacy of Efficient Machine Learning Classifiers for Twitter Sentiments Analysis