Business Analytics Final Capstone Project Presentation PPT.pptx
1. Group Members:
Mohan Kamal Hassan
Vinay Mathukumalli
Sai Krishna Mannepalli
Sentimental Analysis Performance Advanced
approaches of Machine Learning
(Impacting People’s Daily lives with Products and
Services Reviews)
Guided By
Ahmet Ozkul
Professor, Ph.D
University Of New Haven
3. Abstract
Sentiment analysis or opinion mining is one of the major tasks of NLP
(Natural Language Processing). Sentiment analysis has gained much
attention in recent years. Sentiment analysis systems are being applied
in almost every business and social domain because opinions are central
to almost all human activities and are key influencers of our behaviors.
Our beliefs and perceptions of reality, and the choices we make, are
largely conditioned on how others see and evaluate the world. For this
reason, when we need to make a decision we often seek out the
opinions of others. This is true not only for individuals but also for
organizations.
Experiments for both sentence-level categorization and
review-level categorization are performed with promising outcomes.
4. Introduction
• Objective: To determine the sentiment embedded in product reviews using
computational methods.
• Approaches: Utilizing two distinct sentiment analysis strategies:
• A traditional method employing the AFINN sentiment score lexicon.
• Advanced machine learning algorithms for a more nuanced analysis.
• Scope: Analysis spans across different sentiment levels, from negative to
positive (1 to 5 scale).
• Tools: Implementing various R packages and tools for text mining, data
manipulation, and machine learning.
• Outcome: Aiming to accurately classify the sentiment of reviews and
understand customer feedback.
• Application: Enhancing customer experience and business insights through
data-driven sentiment evaluation.
5. Problem Statement
Many e-commerce sites now sell a wide range of products,
and a single product can attract a very large number of
reviews. This makes it hard for both customers and
e-commerce companies to understand what the reviews
say.
Although star ratings exist, most customers still read
the reviews, so classifying reviews with adequate
accuracy is needed to retain customers.
6. • How to extract useful information and automatically build
an objective product-quality assessment system that can
deal with massive amounts of text is an emerging question
in the related research field. Opinion mining is a new
technology based on text mining and natural language
processing.
• It provides an approach for generating a summary of
products. It recognizes the opinions expressed in the
content. The discussion here focuses mainly on
sentence-level opinion mining and treats the statements
about product features for each viewpoint as the objects
of analysis, from which the authors’ opinion inclinations
can be found.
7. Literature Review
Paper title: “Sentiment analysis using product review data”
Authors: Xing Fang and Justin Zhan
• The paper aims to tackle the problem of sentiment polarity
categorization, which is one of the fundamental problems of
sentiment analysis.
• A general process for sentiment polarity categorization is
proposed with detailed process descriptions.
• Data used in this study are online product reviews collected
from Amazon.com.
• Experiments for both sentence-level categorization and review-
level categorization are performed with promising outcomes.
8. Paper title: “Mining the customer behavior using web usage
mining in e-commerce”
Author: Yadav, M. P.
• Explains customer behavior for e-commerce companies using
K-means.
• With the drastic growth of the WWW, users can easily find,
extract, filter and evaluate whatever they want.
• With advances in technology, servers are now able to collect and
store large amounts of data, which can help companies learn about
customers’ perceptions.
• The paper therefore examines the relationship between web mining
data and e-commerce. Consumers mostly prefer to choose among
millions of products in an online store to satisfy their demands
instead of choosing from a superstore. It shows that consumers
have taken an interest in e-commerce sites as a way to engage in
international trade.
9. Paper title: “Web Mining Techniques in E-Commerce Applications”
Author: Ahmad Tasnim Siddiqui
• Explains that the web is today the best medium of communication in
modern business. Online purchasing has increased compared to
window shopping because it offers millions of product ranges.
Companies are able to attract most customers because e-commerce
is not just buying and selling over the internet; it also acts as
a way to gain an advantage over the big giants of the market.
• For this purpose, data mining, sometimes called knowledge
discovery, is used. The vast amount of information available on
the internet helps to improve e-commerce applications. The paper
then explains the proposed architecture, which contains four main
components: business data, data obtained from consumers’
interactions, a data warehouse, and data analysis. After the data
analysis module finishes its task, it produces a report which can
be used by consumers as well as e-commerce application owners.
10. Proposed Methodology
• Data Preprocessing: Cleaned and prepared the text data by tokenizing, removing
stopwords, and stemming to ensure quality input for analysis.
• Keyword-Based Analysis: Applied the AFINN lexicon to assign sentiment scores to
individual words and aggregated these to determine the overall sentiment of each
review.
• Machine Learning Models: Trained multiple models including Logistic Regression,
Naive Bayes, SVM, Random Forest, and XGBoost to classify sentiments into different
levels.
• Document-Term Matrix (DTM): Converted text reviews into a DTM to create a
structured numerical representation of the text for machine learning processing.
• Model Evaluation: Assessed the performance of each model using confusion
matrices and ROC curves, evaluating their predictive accuracy and ability to
generalize.
• Binary and Multiclass Classification: Implemented Logistic Regression for binary
(positive/negative) sentiment analysis and other models for multiclass (1 to 5 scale)
sentiment rating.
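The preprocessing step listed in the methodology (tokenizing, removing stopwords, stemming) can be sketched as below. This is a minimal Python illustration, not the project's R code. The stopword list and the `crude_stem` suffix stripper are hypothetical stand-ins for a full stopword list and a real stemmer such as Porter's.

```python
import re

# Hypothetical mini stopword list; a real pipeline would use a full list.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "was", "i", "to", "of"}

def crude_stem(token):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "ly", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(review):
    """Lowercase, tokenize on letter runs, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

tokens = preprocess("This charger stopped working and it was badly packaged.")
print(tokens)
```

The stems are not always dictionary words; what matters for the later steps is that different forms of a word map to the same token.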
11. NB Maxent Classifier:
• The NB Maximum Entropy classifier, or Maxent as some people prefer
to call it, is another well-known classifier. The idea behind Maxent
classifiers is to prefer the most uniform model that satisfies the
given constraints. Maxent models are feature-based models. We use
these features (derived from product reviews, which in turn discuss
product features) to find a distribution over the different classes
using logistic regression.
• The probability of a data point belonging to a particular class is
calculated, and the accuracy of Maxent is measured with three
parameters: Precision, Recall and F-score. These let us determine
its accuracy in the classification process.
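How a Maxent model turns features into a distribution over classes can be sketched as a softmax over per-class weighted feature sums (multinomial logistic regression). This is an illustrative Python sketch; the weights, feature names and the function `maxent_probabilities` are hypothetical and are not taken from the project.

```python
import math

def maxent_probabilities(features, weights):
    """Return P(class | features) under a maximum-entropy (softmax) model.

    features: dict feature_name -> value
    weights:  dict class_label -> dict feature_name -> weight
    """
    scores = {
        label: sum(w.get(name, 0.0) * value for name, value in features.items())
        for label, w in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical weights; a trained model would learn these from labeled reviews.
WEIGHTS = {
    "positive": {"great": 2.0, "broken": -1.5},
    "negative": {"great": -1.0, "broken": 2.5},
    "neutral": {},
}

probs = maxent_probabilities({"great": 1.0}, WEIGHTS)
print(max(probs, key=probs.get), round(sum(probs.values()), 6))
```

With no active features the model falls back to the uniform distribution, which is the "most uniform model" idea in its simplest form.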
12. NB Tree Classifier:
• The NB Boosted tree classifier is a combination of boosting and decision
trees. Boosting is a machine learning meta-algorithm for reducing bias in
supervised learning. In boosting, predictive classifiers are used to build
weighted trees (over the positive, negative and neutral classes), which are
then combined into a single prediction. Its accuracy is measured with three
parameters: Precision, Recall and F-score.
• Boosted trees combine the strengths of the two algorithms and yield an
accuracy value that lets us assess their performance in the classification
process.
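The idea of combining weighted weak trees into one prediction can be sketched with AdaBoost over one-split decision stumps. This is a simplified Python illustration with two classes (+1 and -1) and a single numeric feature, not the project's boosted-tree model. The data and function names are hypothetical.

```python
import math

def fit_stump(xs, ys, weights):
    """Pick the threshold and sign with the lowest weighted error (ys in {-1, +1})."""
    best = None
    for cut in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x > cut else -sign for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, cut, sign)
    return best

def adaboost(xs, ys, rounds=3):
    """Train `rounds` stumps, reweighting misclassified points after each one."""
    n = len(xs)
    weights = [1 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, cut, sign = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, cut, sign))
        weights = [w * math.exp(-alpha * y * (sign if x > cut else -sign))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * (sign if x > cut else -sign) for alpha, cut, sign in ensemble)
    return 1 if score > 0 else -1

# Hypothetical 1-D data that no single threshold separates.
XS = [1, 2, 3, 4, 5, 6]
YS = [1, 1, -1, -1, 1, 1]
model = adaboost(XS, YS)
print([predict(model, x) for x in XS])
```

Each single stump misclassifies at least two of the six points here, while the weighted vote of three stumps classifies all six correctly.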
13. NB Support Vector Machine (SVM) Classifier:
• The NB Support Vector Machine algorithm has a defined input and
output format. The input is a vector space and the output is 0 or 1
(positive/negative). Text reviews (particularly user reviews of
purchased products) in their original form are not suitable for
learning. They are transformed into a format that matches the input
of the machine learning algorithm. For this, the text reviews are
first pre-processed and then transformed.
• Each word corresponds to one class (positive, negative or neutral),
and words that bear the same meaning belong to the same class.
TF-IDF can be calculated for this purpose. The accuracy of the
support vector machine algorithm can then be determined and compared
with the other algorithms.
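The TF-IDF transformation mentioned above can be sketched as follows. This Python illustration uses one common variant (term frequency as count divided by document length, inverse document frequency as the natural log of N over document frequency). Which variant the project used is not stated, so this choice is an assumption, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

docs = [["great", "phone"], ["bad", "phone"], ["great", "battery", "great"]]
vectors = tf_idf(docs)
print(vectors[1])
```

In the second document, "bad" gets a higher weight than "phone" because it occurs in fewer documents, which is the property that makes TF-IDF vectors useful as classifier input.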
14. NB Random Forest Classifier:
• NB Random Forest is an ensemble learning method for classification
that operates by constructing a multitude of decision trees from the
pool of words belonging to the different classes, using the
parameters on which the classification accuracy depends. The
correlation between trees is reduced by randomly selecting the bag
(pool) of words, which increases predictive power and efficiency.
• Predictions are made by aggregating the predictions of the ensemble
members trained on different data sets. The accuracy of the Random
Forest algorithm can then be determined and compared with the other
algorithms.
15. NB Bagging Classifier:
• NB Bagging is an ensemble machine learning technique that employs the
simplest way of combining predictions from models of the same type. Here it
is used to identify which class (positive, negative or neutral) the
text-level representation of a word belongs to. The bagging method thus
trains its classifiers from the given dataset.
• It therefore improves performance when identifying which class each item
belongs to. With these parameters we can quantify the accuracy, so that in
future the bagging technique can be used for sentiment analysis of product
reviews.
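Bagging can be sketched as training the same simple classifier on several bootstrap samples and taking a majority vote. This Python illustration uses a one-feature threshold classifier on a hypothetical sentiment score and only two classes. The data, the base learner and the function names are assumptions made for illustration.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a threshold classifier: predict 'positive' when the score exceeds the cut."""
    best_cut, best_correct = 0.0, -1
    for cut, _ in sample:
        correct = sum((x > cut) == (label == "positive") for x, label in sample)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut

def bagging_fit(data, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def bagging_predict(models, x):
    """Majority vote over all stumps."""
    votes = Counter("positive" if x > cut else "negative" for cut in models)
    return votes.most_common(1)[0][0]

# Hypothetical (score, label) pairs.
DATA = [(-4, "negative"), (-2, "negative"), (-1, "negative"),
        (1, "positive"), (3, "positive"), (5, "positive")]
models = bagging_fit(DATA)
print(bagging_predict(models, 4), bagging_predict(models, -3))
```

Individual stumps differ because each sees a different resample; the vote smooths out those differences.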
16. Results, Comparison and Analysis
Ratings / Wordcloud
• Aggregated sentiment scores for each rating level. There are
more instances of positive sentiments than negative
sentiments.
17. 1 - Keyword-Based Analysis
Sentimental Analysis
• The AFINN lexicon is a list of words each rated with a
sentiment score, reflecting the positivity or negativity of the
word.
• Reviews are scanned, and words are matched with the AFINN
lexicon to assign a sentiment score based on the lexicon’s
predefined ratings.
• Individual word scores are summed to determine an overall
sentiment score for each review, indicating its positive or
negative nature.
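The keyword-based scoring described on this slide can be sketched as a lexicon lookup followed by a sum. This Python illustration uses a hypothetical mini-lexicon in the AFINN style (integer scores between -5 and +5); the real list is far larger. Treating a score of exactly 0 as "neutral" is an added assumption.

```python
import re

# Hypothetical mini-lexicon in the AFINN style; not the full AFINN list.
MINI_LEXICON = {"good": 3, "love": 3, "great": 3,
                "bad": -3, "terrible": -3, "broken": -1}

def review_score(review, lexicon=MINI_LEXICON):
    """Sum the lexicon scores of all matched words; unmatched words add 0."""
    words = re.findall(r"[a-z']+", review.lower())
    return sum(lexicon.get(word, 0) for word in words)

def polarity(score):
    """Map a summed score to a label (the 'neutral' case is an assumption)."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for text in ("I love it, great sound but terrible battery",
             "Broken and bad",
             "Arrived on time"):
    s = review_score(text)
    print(s, polarity(s))
```

The first review shows the main limitation of summing: one negative word is outweighed by two positive ones, even though the complaint may matter more to the customer.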
18. 2 - NB and Machine Learning Classifiers
• Machine learning models are used to predict the sentiment of
reviews beyond simple positive or negative, often categorizing
into multiple levels of sentiment.
• Textual data is transformed into numerical form through a
Document-Term Matrix (DTM) to facilitate machine learning.
• Sparse columns in the Document-Term Matrix, where terms
appear in less than 2% of the documents, are removed to reduce
noise and computational complexity, enhancing model
performance.
• Each model is trained using a subset of the data (training set -
80% of total data) to learn the patterns associated with various
sentiment levels, and evaluated on testing set (remaining 20%
data).
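The three data-preparation steps on this slide (building a Document-Term Matrix, dropping terms that occur in less than 2% of documents, and an 80/20 train/test split) can be sketched as below. This is a plain-Python illustration on hypothetical toy documents, not the project's R pipeline. The function names and the random seed are assumptions.

```python
import random
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    vocabulary = sorted({term for doc in docs for term in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

def drop_sparse_terms(vocabulary, rows, min_doc_share=0.02):
    """Keep only columns whose term occurs in at least min_doc_share of documents."""
    n_docs = len(rows)
    keep = [j for j in range(len(vocabulary))
            if sum(1 for row in rows if row[j] > 0) / n_docs >= min_doc_share]
    return [vocabulary[j] for j in keep], [[row[j] for j in keep] for row in rows]

def train_test_split(items, train_share=0.8, seed=42):
    """Shuffle a copy of the items and cut it into training and testing parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

# 100 toy documents; "zebra" occurs in exactly one of them (1%, below the 2% cut).
docs = [["phone", "good"] if i % 2 else ["phone", "bad"] for i in range(100)]
docs[0] = ["phone", "bad", "zebra"]
vocab, rows = build_dtm(docs)
vocab, rows = drop_sparse_terms(vocab, rows)
train, test = train_test_split(list(range(len(rows))))
print(vocab, len(train), len(test))
```

Splitting row indices rather than the rows themselves keeps the labels and the matrix aligned when both are subset later.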
19. Logistic Regression Classifier
• The Logistic Regression classifier achieved an
accuracy of 71.15% in binary sentiment
classification.
• Sensitivity (recall) of 81.31% shows the model
is good at identifying positive reviews, but a
specificity of 54.71% shows it is much less
effective at recognizing negative ones.
• The model's precision is 74.39%.
• Balanced accuracy, the mean of sensitivity and
specificity, is 68.01%.
• The Matthews Correlation Coefficient (MCC)
stands at 0.3738, indicating that the classifier
performs better than random guessing but
still has room for improvement.
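The metrics quoted for the binary classifier all derive from the four cells of a confusion matrix. The Python sketch below shows the standard formulas. The counts passed in are hypothetical and are not the project's confusion matrix, which is not given in the slides.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)     # positive predictive value
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced_accuracy = (sensitivity + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "balanced_accuracy": balanced_accuracy, "mcc": mcc}

# Hypothetical counts for illustration only.
m = binary_metrics(tp=80, fn=20, fp=30, tn=70)
print({name: round(value, 4) for name, value in m.items()})

# Consistency check on the reported figures: the mean of the reported
# sensitivity and specificity reproduces the reported balanced accuracy.
reported_balanced = (0.8131 + 0.5471) / 2
print(round(reported_balanced, 4))
```

The last two lines use only numbers from the slide: (81.31% + 54.71%) / 2 = 68.01%, which matches the balanced accuracy reported for Logistic Regression.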
20. Naive Bayes Classifier and Random Forest
• The Naive Bayes classifier achieved an overall
accuracy of 33.75%, indicating weak performance in
multi-class sentiment classification.
• The sensitivity (recall) averaged across classes is
33.95%, so the model correctly identifies only about
a third of the reviews in each sentiment class.
• The model's balanced accuracy, which averages
sensitivity and specificity, is 58.72%. It is higher
than the sensitivity because the average specificity
is much higher than the sensitivity.
• The Matthews Correlation Coefficient (MCC) for the
multi-class classification is 0.1792, which is low and
suggests that there's considerable room for
improvement in the model's predictive capability.
• The model's precision (PPV) is low at 33.38%,
indicating challenges in the accuracy of predictions
across the various sentiment classes.
21. NB and Support Vector Machines
• The Support Vector Machine (SVM) classifier shows an
accuracy of 38.25% in multi-class sentiment classification,
suggesting a modest ability to correctly classify
sentiment levels.
• With a balanced accuracy of 61.35%, the SVM classifier
demonstrates a moderate level of consistency in correctly
identifying sentiment across different classes.
• The sensitivity (recall) and precision (PPV) of the model
are both around 38%: the model finds about 38% of the
reviews in each class, and about 38% of its predictions
for a class are correct.
• The Matthews Correlation Coefficient (MCC) at 0.2283
reflects a low but better than random correlation
between the observed and predicted classifications.
• The reported detection prevalence of 20% is not
informative by itself: if it is the average over the five
sentiment classes, it equals 1/5 by construction, because
every review receives exactly one predicted class.
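For the multi-class models, single-number sensitivity, specificity and detection prevalence are typically obtained by computing each metric one class against the rest and averaging over classes. The Python sketch below assumes that macro-averaging convention, which the slides do not state explicitly. The 5x5 confusion matrix is hypothetical.

```python
def macro_metrics(matrix):
    """matrix[i][j] = number of reviews with true class i predicted as class j."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    sens, spec, prev = [], [], []
    for c in range(k):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                         # true class c, predicted other
        fp = sum(matrix[r][c] for r in range(k)) - tp    # other class, predicted c
        tn = total - tp - fn - fp
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
        prev.append((tp + fp) / total)                   # share of predictions that are c
    accuracy = sum(matrix[c][c] for c in range(k)) / total
    return {"accuracy": accuracy,
            "sensitivity": sum(sens) / k,
            "specificity": sum(spec) / k,
            "detection_prevalence": sum(prev) / k}

# Hypothetical confusion matrix for ratings 1..5 with ten reviews per class.
MATRIX = [[6, 2, 1, 1, 0],
          [2, 5, 2, 1, 0],
          [1, 2, 4, 2, 1],
          [0, 1, 2, 5, 2],
          [0, 1, 1, 2, 6]]
mm = macro_metrics(MATRIX)
print({name: round(value, 4) for name, value in mm.items()})
```

Two properties are visible in the output: with equally sized classes the macro-averaged sensitivity equals the accuracy, and the averaged detection prevalence is 1/5 regardless of how good the classifier is.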
22. NB and Random Forest Classifier
• The Random Forest classifier demonstrates comparatively
strong performance with an accuracy of 65.85%, indicating
it is well-suited for multi-class sentiment classification.
• High balanced accuracy of 78.56% suggests a consistent
performance by the model in identifying various
sentiment classes.
• The model achieves good sensitivity (recall) and
precision (PPV), both approximately 65.87%, showing its
effectiveness in correctly identifying and predicting
sentiment classes.
• The Matthews Correlation Coefficient (MCC) is 0.5745,
indicating a substantial correlation between the
observed and predicted classifications, well above what
would be expected by chance.
• The model's specificity is very high at 91.45%, indicating
it is very effective at correctly identifying negative cases
for each sentiment class.
23. XGBoost Classifier
• The XGBoost classifier has achieved perfect performance
metrics across the board on the training set, indicating
that it has learned to classify the training data with 100%
accuracy.
• For the training data, every metric including accuracy,
sensitivity (recall), specificity, precision (PPV), and
Matthews Correlation Coefficient (MCC) scored a
maximum of 1.0, implying no misclassification.
• The confusion matrix for the training data shows
complete accuracy with no instances of false positives or
false negatives, which may suggest overfitting to the
training data.
• The confusion matrix for the testing data also shows
100% accuracy, as indicated by zeros in all off-diagonal
cells. Because the test score is also perfect, overfitting
alone does not explain the result; information leaking
from the rating into the features is a possibility that
should be ruled out.
• The detection prevalence of 0.2 is the same on the
training and test sets. If it is the average over the five
sentiment classes, it equals 1/5 by construction and says
nothing about the model's tendencies.
24. Conclusion
• The project showcased the use of various advanced hybrid
machine learning models to analyze sentiment in product
reviews, with different levels of effectiveness.
• Machine learning models, notably NB with Random Forest and
NB with XGBoost, outperformed the traditional keyword-based
method, providing a deeper analysis of sentiments.
• NB hybrid Random Forest demonstrated high accuracy and
robust performance, marking it as a reliable classifier for
sentiment analysis tasks.
• NB with XGBoost achieved perfect scores on the training
data, but such perfect scores may indicate overfitting.