Sentiment Knowledge Discovery in Twitter Streaming Data
Albert Bifet and Eibe Frank
University of Waikato
Hamilton, New Zealand
Canberra, 7 October 2010
Discovery Science 2010
Twitter: A Massive Data Stream
Web 2.0
Micro-blogging service
Built to discover what is happening at any moment in time,
anywhere in the world.
106 million registered users
600 million search queries per day
3 billion requests a day via its API.
Outline
1 Twitter Streaming Data
2 Twitter Sentiment Classification: Metrics and Methods
3 Empirical results
Data stream classification cycle
1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
Data stream classification cycle
Evaluation procedures for Data Streams
Holdout
Interleaved Test-Then-Train ("Prequential" Evaluation)
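Interleaved test-then-train evaluation can be sketched in a few lines of Python. This is only an illustration: the `MajorityClassLearner` class and the four-example stream are hypothetical stand-ins, not part of the presentation, and any learner exposing `predict` and `learn` could be substituted.

```python
from collections import Counter


class MajorityClassLearner:
    """Hypothetical toy learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Nothing has been learned yet, so no prediction can be made.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test on each example first, then train on it (each example is seen once)."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:  # 1. test
            correct += 1
        learner.learn(x, y)          # 2. then train
        total += 1
    return correct / total if total else 0.0


# Made-up stream of (tweet, label) pairs, for illustration only.
stream = [("t1", "+"), ("t2", "+"), ("t3", "-"), ("t4", "+")]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
print(accuracy)
```

Because every example is used for testing before it is used for training, no separate holdout set is needed and the model is ready to predict at any point in the stream.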
Twitter Streaming API
Twitter APIs
Streaming API
Two discrete REST APIs
Real-time access to Tweets
sampled form
filtered form
HTTP based
GET
POST
DELETE
Sentiment Analysis on Twitter
Sentiment analysis
Classifying messages into two categories depending on
whether they convey positive or negative feelings
Emoticons are visual cues associated with emotional states,
which can be used to define class labels for sentiment
classification
Positive Emoticons    Negative Emoticons
:)                    :(
:-)                   :-(
: )                   : (
:D
=)
Table: List of positive and negative emoticons.
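A minimal sketch of how the emoticons in the table could be turned into class labels. The function name `emoticon_label` is hypothetical, and discarding tweets that contain both kinds of emoticon is an assumption made here, not something the slide states.

```python
# Emoticon lists copied from the table above.
POSITIVE = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE = (":(", ":-(", ": (")


def emoticon_label(text):
    """Return '+', '-', or None when the emoticons give no unambiguous label."""
    has_positive = any(e in text for e in POSITIVE)
    has_negative = any(e in text for e in NEGATIVE)
    # Assumption: tweets with no emoticon, or with both kinds, are left unlabeled.
    if has_positive == has_negative:
        return None
    return "+" if has_positive else "-"


print(emoticon_label("great day :)"))
print(emoticon_label("missed the bus :-("))
```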
Streaming Data Evaluation with Unbalanced Classes
                  Predicted   Predicted
                  Class+      Class-      Total
Correct Class+    75          8           83
Correct Class-    7           10          17
Total             82          18          100
Table: Simple confusion matrix example
                  Predicted   Predicted
                  Class+      Class-      Total
Correct Class+    68.06       14.94       83
Correct Class-    13.94       3.06        17
Total             82          18          100
Table: Confusion matrix for chance predictor
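The chance-predictor matrix follows from the row and column totals of the observed matrix: each cell is (row total × column total) / grand total. A short sketch, with the hypothetical helper name `chance_matrix`, that reproduces the second table from the first:

```python
def chance_matrix(confusion):
    """Expected confusion matrix of a chance predictor with the same marginals."""
    total = sum(sum(row) for row in confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    return [[r * c / total for c in col_totals] for r in row_totals]


# Rows are the correct class, columns the predicted class (values from the slide).
observed = [[75, 8], [7, 10]]
expected = chance_matrix(observed)
for row in expected:
    print([round(cell, 2) for cell in row])
```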
Streaming Data Evaluation with Unbalanced Classes
Kappa Statistic
p0: classifier’s prequential accuracy
pc: probability that a chance classifier makes a correct
prediction.
κ statistic
\[ \kappa = \frac{p_0 - p_c}{1 - p_c} \]
κ = 1 if the classifier is always correct
κ = 0 if the predictions coincide with the correct ones as
often as those of the chance classifier
Forgetting mechanism for estimating prequential kappa
Sliding window of size w with the most recent observations
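A sketch of a sliding-window Kappa estimator, assuming the window simply stores the last w (true label, predicted label) pairs. The class name `SlidingWindowKappa` is hypothetical, and recomputing the counts on every call is chosen for clarity rather than speed.

```python
from collections import Counter, deque


class SlidingWindowKappa:
    """Kappa statistic over the most recent `window_size` predictions."""

    def __init__(self, window_size):
        # A bounded deque drops the oldest observation automatically (forgetting).
        self.window = deque(maxlen=window_size)

    def add(self, true_label, predicted_label):
        self.window.append((true_label, predicted_label))

    def kappa(self):
        n = len(self.window)
        if n == 0:
            return 0.0
        true_counts = Counter(t for t, _ in self.window)
        pred_counts = Counter(p for _, p in self.window)
        # p0: fraction of correct predictions inside the window.
        p0 = sum(t == p for t, p in self.window) / n
        # pc: chance agreement computed from the window's marginals.
        pc = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
        if pc == 1.0:
            # Kappa is undefined here; returning 0.0 is an assumption of this sketch.
            return 0.0
        return (p0 - pc) / (1.0 - pc)


# Replay the confusion matrix example from the previous slide.
window = SlidingWindowKappa(window_size=100)
cells = [("+", "+", 75), ("+", "-", 8), ("-", "+", 7), ("-", "-", 10)]
for true_label, predicted_label, count in cells:
    for _ in range(count):
        window.add(true_label, predicted_label)
print(round(window.kappa(), 4))
```

For the example matrix, p0 = 0.85 and pc = (68.06 + 3.06) / 100 = 0.7112, so Kappa is about 0.48 even though accuracy is 85%.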
Data Stream Mining Methods
Multinomial Naïve Bayes
Considers a document as a bag-of-words.
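Estimates P(w|c), the probability of observing word w given class c, and the prior probability P(c)
Probability of class c given a test document d, where n_{wd} is the number of times word w occurs in d:
\[ P(c \mid d) = \frac{P(c) \prod_{w \in d} P(w \mid c)^{n_{wd}}}{P(d)} \]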
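An incremental version of this classifier only needs running counts, which is what makes it suitable for streams. The sketch below is an illustration under stated assumptions: the class name `StreamingMultinomialNB` is hypothetical, add-one (Laplace) smoothing is assumed for P(w|c), and P(d) is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


class StreamingMultinomialNB:
    """Multinomial Naive Bayes updated one document at a time."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per class
        self.word_counts = defaultdict(Counter)  # word occurrences per class
        self.total_words = Counter()             # total word occurrences per class
        self.vocabulary = set()

    def learn(self, words, label):
        self.class_counts[label] += 1
        for w in words:
            self.word_counts[label][w] += 1
            self.total_words[label] += 1
            self.vocabulary.add(w)

    def log_scores(self, words):
        """log P(c) + sum over words of n_wd * log P(w|c), for every class c."""
        n_docs = sum(self.class_counts.values())
        v = len(self.vocabulary) or 1  # guard against an empty vocabulary
        scores = {}
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / n_docs)
            for w, n_wd in Counter(words).items():
                # Assumed add-one smoothing so unseen words do not give log(0).
                p_w_c = (self.word_counts[c][w] + 1) / (self.total_words[c] + v)
                score += n_wd * math.log(p_w_c)
            scores[c] = score
        return scores

    def predict(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get) if scores else None


model = StreamingMultinomialNB()
model.learn(["good", "happy"], "+")
model.learn(["bad", "sad"], "-")
print(model.predict(["good"]))
```

Working in log space avoids numerical underflow when a document contains many words.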
Data Stream Mining Methods
Stochastic Gradient Descent
Vanilla stochastic gradient descent with a fixed learning
rate
Optimizing the hinge loss with an L2 penalty commonly
applied to SVM
Loss function to optimize:
\[ \frac{\lambda}{2} \|\mathbf{w}\|^2 + \sum \left[ 1 - (y\mathbf{x}\mathbf{w} + b) \right]_+ \]
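A single-example update for this kind of objective can be sketched as follows. The sketch assumes labels y in {-1, +1}, sparse feature dictionaries, and the margin written in its usual SVM form y(w·x + b); the class name `HingeSGD` and the default values of `learning_rate` and `lam` are hypothetical choices for illustration, not values from the presentation.

```python
class HingeSGD:
    """Linear model trained by SGD on hinge loss with an L2 penalty."""

    def __init__(self, learning_rate=0.1, lam=0.0001):
        self.learning_rate = learning_rate  # fixed learning rate
        self.lam = lam                      # L2 regularization strength
        self.w = {}                         # sparse weight vector
        self.b = 0.0

    def margin(self, x):
        return sum(self.w.get(f, 0.0) * v for f, v in x.items()) + self.b

    def predict(self, x):
        return 1 if self.margin(x) >= 0 else -1

    def learn(self, x, y):
        # The hinge term is active only when the example is inside the margin.
        violated = y * self.margin(x) < 1
        # Gradient step on the L2 penalty: shrink every weight.
        for f in self.w:
            self.w[f] *= 1 - self.learning_rate * self.lam
        if violated:
            # Subgradient step on the hinge term.
            for f, v in x.items():
                self.w[f] = self.w.get(f, 0.0) + self.learning_rate * y * v
            self.b += self.learning_rate * y


sgd = HingeSGD()
for _ in range(20):
    sgd.learn({"good": 1.0}, +1)
    sgd.learn({"bad": 1.0}, -1)
print(sgd.predict({"good": 1.0}), sgd.predict({"bad": 1.0}))
```

Each update touches only the weights already stored and the features of the current example, so the per-example cost stays small on sparse bag-of-words input.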
Data Stream Mining Methods
Hoeffding Tree
Incremental decision tree for data streams.
Strategy based on the Hoeffding bound
\[ \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} \]
A node is expanded by splitting as soon as there is
sufficient statistical evidence
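In the bound, R is the range of the observed quantity, δ is the allowed probability of error, and n is the number of examples seen at the node. The sketch below computes ε and applies the usual split test from the Hoeffding tree literature, in which a node splits once the gap between the two best attributes exceeds ε. The function names and the example gain values are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_best_gain, value_range, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return best_gain - second_best_gain > hoeffding_bound(value_range, delta, n)


# Made-up gains, for illustration only.
print(round(hoeffding_bound(1.0, 1e-7, 1000), 4))
print(should_split(0.30, 0.15, 1.0, 1e-7, 1000))
print(should_split(0.30, 0.25, 1.0, 1e-7, 1000))
```

Since ε shrinks as n grows, a node that cannot yet separate two close attributes simply waits for more examples.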
What is MOA?
Massive Online Analysis (MOA) is a framework for mining data streams.
Based on experience with Weka and VFML
Focussed on classification trees, but lots of active
development: clustering, item set and sequence mining,
regression
Easy to extend
Easy to design and run experiments
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the
Weka, but also extinct.
Twitter Sentiment Corpora
Twitter Sentiment Corpus
twittersentiment.appspot.com
Alec Go, Richa Bhayani, Karthik Raghunathan, and Lei
Huang
Website to research the sentiment for a brand, product, or
topic.
Training dataset with messages between April 2009 and
June 25, 2009
800,000 tweets with positive emoticons
800,000 tweets with negative emoticons
Test dataset manually annotated
177 negative tweets
182 positive ones
Twitter Sentiment Corpora
Edinburgh Corpus
http://demeter.inf.ed.ac.uk
Sasa Petrovic, Miles Osborne, and Victor Lavrenko
97 million tweets (14 GB)
Each tweet contains
timestamp of the tweet
anonymized user name
the tweet’s text
the posting method that was used
Collected between November 11th 2009 and February 1st
2010, using Twitter’s streaming API.
Twitter Empirical Evaluation
Figure: Accuracy and Kappa Statistic on twittersentiment corpus. Two panels, Sliding Window Prequential Accuracy (Accuracy %) and Sliding Window Kappa Statistic, each plotted against Millions of Instances (0.01 to 1.55) for NB Multinomial, SGD, Hoeffding Tree, and Class Distribution.
Twitter Empirical Evaluation
Figure: Accuracy and Kappa Statistic on Edinburgh corpus. Two panels, Sliding Window Prequential Accuracy (Accuracy %) and Sliding Window Kappa Statistic, each plotted against Millions of Instances (0.01 to 2.08) for NB Multinomial, SGD, Hoeffding Tree, and Class Distribution.
twittersentiment Corpus
Prequential Accuracy and Kappa
                          Accuracy   Kappa     Time
Multinomial Naïve Bayes   75.05%     50.10%    116.62 sec.
SGD                       82.80%     62.60%    219.54 sec.
Hoeffding Tree            73.11%     46.23%    5525.51 sec.
Total prequential accuracy and Kappa measured on the
twittersentiment data stream
Edinburgh Corpus
Prequential Accuracy and Kappa
                          Accuracy   Kappa     Time
Multinomial Naïve Bayes   86.11%     36.15%    173.28 sec.
SGD                       86.26%     31.88%    293.98 sec.
Hoeffding Tree            84.76%     20.40%    6151.51 sec.
Total prequential accuracy and Kappa obtained on the
Edinburgh corpus data stream.
SGD coefficient variations on the Edinburgh corpus
             Middle of Stream   End of Stream
Tags         Coefficient        Coefficient     Variation
apple         0.3                0.7             0.4
microsoft    -0.4               -0.1             0.3
facebook     -0.3                0.4             0.7
mcdonalds     0.5                0.1            -0.4
google        0.3                0.6             0.3
disney        0.0                0.0             0.0
bmw           0.0               -0.2            -0.2
pepsi         0.1               -0.6            -0.7
dell          0.2                0.0            -0.2
gucci        -0.4                0.6             1.0
amazon       -0.1               -0.4            -0.3
Summary
Twitter is a new “what’s-happening-right-now” tool
Twitter as a stream mining dataset for real-time predictions
Sliding window Kappa statistic
Recommend SGD-based model
twittersentiment Corpus
Hold-out Accuracy and Kappa
                          Accuracy   Kappa
Multinomial Naïve Bayes   82.45%     64.89%
SGD                       78.55%     57.23%
Hoeffding Tree            69.36%     38.73%
Accuracy and Kappa for the test dataset obtained from
twittersentiment
Edinburgh Corpus
Hold-out Accuracy and Kappa
                          Accuracy   Kappa
Multinomial Naïve Bayes   73.81%     47.28%
SGD                       67.41%     34.23%
Hoeffding Tree            60.72%     20.59%
Accuracy and Kappa for the test dataset obtained from
twittersentiment using the Edinburgh corpus as training
data stream.
