05 Text Classification - Naive Bayes

The document discusses text classification, focusing on the Naïve Bayes classifier and its applications in various domains such as spam detection and sentiment analysis. It covers the principles of the Naïve Bayes method, including the Bag of Words representation, learning process, and evaluation metrics like precision, recall, and F1 score. Additionally, it highlights the importance of handling unknown words and negation in sentiment classification, as well as the relationship between Naïve Bayes and language modeling.

TM340 Natural Language Processing

Text Classification

Naïve Bayes

Based on slides by Dan Jurafsky and Chris Manning


Agenda

 The Task of Text Classification

 The Naive Bayes Classifier

 Naive Bayes: Learning

 Sentiment and Binary Naive Bayes

 More on Sentiment Classification

 Naïve Bayes: Relationship to Language Modeling

 Text Classification Evaluation: Precision, Recall, and F1


2
The Task of Text Classification

3
Is this spam?

4
Positive or negative movie review?

 unbelievably disappointing

 Full of zany characters and richly applied satire, and some great plot twists

 this is the greatest screwball comedy ever filmed

 It was pathetic. The worst part about it was the boxing scenes.
5
What is the subject of this article?

 MeSH Subject Category Hierarchy

• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
6
Text Classification

 Assigning subject categories, topics, or genres

• Spam detection
• Authorship identification
• Age/gender identification
• Language Identification
• Sentiment analysis
•…
7
Text Classification: definition

 Input:

• a document d
• a fixed set of classes C = {c1, c2,…, cJ}

 Output: a predicted class

8
Classification Methods:
Hand-coded rules
 Rules based on combinations of words or other features

• Spam: black-list-address OR (“dollars” AND “have been selected”)


 Accuracy can be high, if the rules are carefully refined by an expert

 But building and maintaining these rules is expensive

9
Classification Methods:
Supervised Machine Learning
 Input:

• a document d

• a fixed set of classes C = {c1, c2,…, cJ}

• A training set of m hand-labeled documents (d1,c1),....,(dm,cm)

 Output:

• a learned classifier γ: d → c

10
Classification Methods:
Supervised Machine Learning
 Any kind of classifier can be used:

• Naïve Bayes
• Logistic regression
• Support-vector machines
• k-Nearest Neighbors
•…

11
The Naive Bayes Classifier

12
Naive Bayes Intuition

 Simple ("naive") classification method based on Bayes rule

 Relies on a very simple representation of the document (bag of words)
13
The Bag of Words Representation

[Figure: a movie review is reduced to word counts, e.g., seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …; the classifier γ maps this bag of words to a class c]
14
Bayes’ Rule Applied to Documents and Classes

 For a document d and a class c

P(c | d) = P(d | c) · P(c) / P(d)

15
Naive Bayes Classifier

MAP is “maximum a posteriori” = the most likely class:

   c_MAP = argmax_{c ∈ C} P(c | d)

Bayes rule:

   c_MAP = argmax_{c ∈ C} P(d | c) P(c) / P(d)

Dropping the denominator (P(d) is identical for every class, so it cannot change the argmax):

   c_MAP = argmax_{c ∈ C} P(d | c) P(c)
16
Naive Bayes Classifier

"Likelihood "Prior
" "

argmax P( x1 , x2 ,  , xn | c) P (c)
cC
Document d represented as features
x1..xn
17
Naive Bayes Classifier

cMAP argmax P ( x1 , x2 ,..., xn | c) P (c)


cC

O(|X| •|C|) parameters


n
How often does this class
Could only be occur?
estimated if a very, We can just count
very large number of the relative
training examples frequencies in a
was available. corpus
18
Multinomial Naive Bayes Independence Assumptions

   P(x1, x2, …, xn | c)

 Bag of Words assumption: assume position doesn't matter

 Conditional Independence: assume the feature probabilities P(xi | cj) are independent given the class c:

   P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)
19
Multinomial Naive Bayes Classifier

cMAP argmax P ( x1 , x2 ,..., xn | c) P (c)


cC

20
Applying Multinomial Naive Bayes Classifiers to Text Classification

positions ← all word positions in the test document

   c_NB = argmax_{c_j ∈ C} P(c_j) · ∏_{i ∈ positions} P(x_i | c_j)

21
Problems with multiplying lots of probs

 There's a problem with this:

 Multiplying lots of probabilities can result in floating-point underflow!

0.0006 * 0.0007 * 0.0009 * 0.01 * 0.5 * 0.000008….

 Idea: Use logs, because log(ab) = log(a) + log(b)

We'll sum logs of probabilities instead of multiplying probabilities!

22
We do everything in log space

 Instead of this:

   c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(x_i | c_j)

 This:

   c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]

 Notes:
 1) Taking the log doesn't change the ranking of classes!
    The class with the highest probability also has the highest log probability.
 2) It's a linear model:
    just a max of a sum of weights, i.e., a linear function of the inputs.
    So Naive Bayes is a linear classifier.
23
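To make the log trick concrete, here is a minimal Python sketch (mine, not from the slides) using the made-up probabilities shown on the previous slide:

```python
import math

probs = [0.0006, 0.0007, 0.0009, 0.01, 0.5, 0.000008]

product = 1.0
for p in probs:
    product *= p
print(product)      # ~1.5e-17: already tiny; thousands of factors underflow to 0.0

log_score = sum(math.log(p) for p in probs)
print(log_score)    # ~-38.7: stays in a safe numeric range, and argmax is unchanged
```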
Naive Bayes: Learning

24
Learning the Multinomial Naive Bayes Model

 First attempt: maximum likelihood estimates

• simply use the frequencies in the data


   P̂(c_j) = N_{c_j} / N_total

25
Parameter Estimation

   P̂(w_i | c_j) = count(w_i, c_j) / Σ_w count(w, c_j)

   (the fraction of times word w_i appears among all words in documents of topic c_j)

 Create a mega-document for topic j by concatenating all docs in this topic
 • Use the frequency of w in the mega-document
26
Problem with Maximum Likelihood

 What if we have seen no training documents with the word fantastic classified in the topic positive?

   P̂("fantastic" | positive) = count("fantastic", positive) / Σ_w count(w, positive) = 0

 Zero probabilities cannot be conditioned away, no matter the other evidence: a single zero factor forces the whole product to zero!

27
Laplace (add-1) smoothing for Naïve Bayes

 The solution: apply Laplace (add-1) smoothing for Naïve Bayes:

   P̂(w_i | c) = (count(w_i, c) + 1) / (Σ_{w ∈ V} (count(w, c) + 1))
              = (count(w_i, c) + 1) / ((Σ_{w ∈ V} count(w, c)) + |V|)
28
Multinomial Naïve Bayes: Learning

 From the training corpus, extract the Vocabulary

 Calculate the P(cj) terms:
 • For each cj in C do
     docsj ← all docs with class = cj
     P(cj) ← |docsj| / |total # of documents|

 Calculate the P(wk | cj) terms:
 • Textj ← single doc containing all docsj
 • For each word wk in Vocabulary
     nk ← # of occurrences of wk in Textj
     P(wk | cj) ← (nk + 1) / (n + |Vocabulary|), where n = total # of tokens in Textj
29
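As an illustration of this training loop, here is a hedged Python sketch with add-1 smoothing; the names train_nb, docs, labels are hypothetical, not from the slides:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class names."""
    vocab = {w for doc in docs for w in doc}
    log_prior, log_likelihood = {}, defaultdict(dict)
    for c in set(labels):
        docs_c = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(docs_c) / len(docs))   # P(c) = N_c / N_total
        counts = Counter(w for d in docs_c for w in d)     # mega-document counts
        n = sum(counts.values())
        for w in vocab:
            # add-1 smoothing: (n_k + 1) / (n + |V|), so nothing gets probability 0
            log_likelihood[c][w] = math.log((counts[w] + 1) / (n + len(vocab)))
    return log_prior, log_likelihood, vocab
```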
Unknown words
 What about unknown words
 • that appear in our test data
 • but not in our training data or vocabulary?

 We ignore them

• Remove them from the test document!


• Pretend they weren't there!
• Don't include any probability for them at all!

 Why don't we build an unknown word model?

• It doesn't help: knowing which class has more unknown words is not generally
helpful!
30
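A matching classification sketch (same hypothetical names as the training sketch above) that silently skips unknown test words, exactly as this slide prescribes, and sums log probabilities as on slide 23:

```python
def classify_nb(doc, log_prior, log_likelihood, vocab):
    best_class, best_score = None, float("-inf")
    for c in log_prior:
        # words outside the training vocabulary contribute nothing at all
        score = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in vocab)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```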
Stop words

 Some systems ignore stop words

• Stop words: very frequent words like the and a.


- Sort the vocabulary by word frequency in training set
- Call the top 10 or 50 words the stopword list.
- Remove all stop words from both training and test sets
• As if they were never there!

 But removing stop words doesn't usually help

• So, in practice most NB algorithms use all words and don't use
stopword lists
31
Sentiment and Binary Naive Bayes

32
Let's do a worked sentiment example!

33
A worked sentiment example with add-1 smoothing

1. Priors from training:

   P̂(c_j) = N_{c_j} / N_total    →    P(−) = 3/5,  P(+) = 2/5

2. Drop "with" (an unknown word: it does not appear in the training vocabulary)

3. Likelihoods from training:

   p(w_i | c) = (count(w_i, c) + 1) / ((Σ_{w ∈ V} count(w, c)) + |V|)

4. Score the test set: pick the class whose prior times likelihoods is largest.
34
Optimizing for sentiment analysis

 For tasks like sentiment, word occurrence seems to be more important than word frequency.
 • The occurrence of the word fantastic tells us a lot
 • The fact that it occurs 5 times may not tell us much more

 Binary multinomial Naive Bayes, or binary NB
 • Clip our word counts at 1

35
Binary Multinomial Naïve Bayes: Learning

 From the training corpus, extract the Vocabulary

 Calculate the P(cj) terms:
 • For each cj in C do
     docsj ← all docs with class = cj

 Calculate the P(wk | cj) terms:
 • Textj ← single doc containing all docsj
 • For each word wk in Vocabulary
     nk ← # of occurrences of wk in Textj
36
Binary Multinomial Naïve Bayes: Learning

 From the training corpus, extract the Vocabulary

 Calculate the P(cj) terms:
 • For each cj in C do
     docsj ← all docs with class = cj

 Calculate the P(wk | cj) terms:
 • Remove duplicates in each doc:
     for each word type w in docj, retain only a single instance of w
 • Textj ← single doc containing all docsj (now deduplicated)
 • For each word wk in Vocabulary
     nk ← # of occurrences of wk in Textj
37
Binary Multinomial Naive Bayes on a test document d

 First remove all duplicate words from d

 Then compute NB using the same equation:

   c_NB = argmax_{c_j ∈ C} P(c_j) · ∏_{i ∈ positions} P(w_i | c_j)
38
Binary multinomial Naive Bayes

Counts can still be 2! Binarization is within-doc!
39
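Under the same hypothetical names as the earlier training sketch, binary NB is a one-line preprocessing change; a sketch:

```python
def binarize(docs):
    # within-document binarization: keep each word type at most once per doc;
    # across the corpus a word can still be counted 2+ times (once per doc)
    return [list(set(doc)) for doc in docs]

# binary NB = plain multinomial NB trained on deduplicated documents:
# log_prior, log_likelihood, vocab = train_nb(binarize(train_docs), train_labels)
```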


More on Sentiment Classification

40
Sentiment Classification: Dealing with Negation

 I really like this movie

 I really don't like this movie

 Negation changes the meaning of "like" to negative.

 Negation can also change negative to positive.

• Don't dismiss this film


• Doesn't let us get bored
41
Sentiment Classification: Dealing with Negation

 Simple baseline method:

 Add NOT_ to every word between negation and following punctuation:

 didn’t like this movie , but I

 didn’t NOT_like NOT_this NOT_movie but I


42
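A minimal sketch of this NOT_ baseline with a small, illustrative negation list; a real system would use a fuller trigger list and better tokenization:

```python
import re

# tiny illustrative negation trigger list, not an exhaustive one
NEGATION = re.compile(r"(?:\bnot\b|\bno\b|\bnever\b|n't\b)", re.IGNORECASE)

def mark_negation(text):
    out, negating = [], False
    for token in re.findall(r"[\w']+|[.,!?;]", text):
        if token in ".,!?;":
            negating = False             # punctuation closes the negation scope
            out.append(token)
        elif NEGATION.search(token):
            negating = True              # start prefixing the following words
            out.append(token)
        else:
            out.append("NOT_" + token if negating else token)
    return out

print(mark_negation("didn't like this movie , but I"))
# ["didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'I']
```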
Sentiment Classification: Lexicons

 Sometimes we don't have enough labeled training data

 In that case, we can make use of pre-built word lists called lexicons

 There are various publicly available lexicons

43
MPQA Subjectivity Cues Lexicon

 Home page: https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/

 6885 words from 8221 lemmas, annotated for intensity (strong/weak)

• 2718 positive
• 4912 negative
 + : admirable, beautiful, confident, dazzling, ecstatic, favor, glee, great

 − : awful, bad, bias, catastrophe, cheat, deny, envious, foul, harsh, hate

44
The General Inquirer

 Home page: http://www.wjh.harvard.edu/~inquirer


 List of Categories: http://www.wjh.harvard.edu/~inquirer/homecat.htm
 Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
 Categories:

• Positiv (1915 words) and Negativ (2291 words)


• Strong vs Weak, Active vs Passive, Overstated versus Understated
• Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc

 Free for Research Use

45
Using Lexicons in Sentiment Classification

 Add a feature that gets a count whenever a word from the lexicon
occurs
• E.g., a feature called "this word occurs in the positive lexicon" or "this
word occurs in the negative lexicon"

 Now all positive words (good, great, beautiful, wonderful) or negative words count for that feature.

 Using 1-2 features isn't as good as using all the words.

• But when training data is sparse or not representative of the test set, dense
lexicon features can help
46
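One hedged sketch of how such lexicon features might be counted; the two word sets here are tiny samples drawn from the lexicons above, not the full lists:

```python
# tiny samples standing in for full positive/negative lexicons
POSITIVE = {"good", "great", "beautiful", "wonderful", "admirable", "dazzling"}
NEGATIVE = {"awful", "bad", "catastrophe", "hate", "harsh", "foul"}

def lexicon_features(doc):
    # two dense features: all positive words share one count and all negative
    # words share another, instead of one sparse feature per word
    return {
        "n_pos_lexicon": sum(w in POSITIVE for w in doc),
        "n_neg_lexicon": sum(w in NEGATIVE for w in doc),
    }

print(lexicon_features("a great and beautiful but harsh film".split()))
# {'n_pos_lexicon': 2, 'n_neg_lexicon': 1}
```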
Naive Bayes in Spam Filtering

 Spam Assassin Features:

• Mentions millions of dollars ($NN,NNN,NNN.NN)


• From: starts with many numbers
• Subject is all capitals
• HTML has a low ratio of text to image area
• "One hundred percent guaranteed"
• Claims you can be removed from the list
47
Naive Bayes in Language ID

 Determining what language a piece of text is written in.

 Features based on character n-grams do very well

 Important to train on lots of varieties of each language

• (e.g., American English varieties like African-American English, or English varieties around the world like Indian English)
48
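A sketch of the character n-gram features the slide mentions; the space padding is my own assumption, not something the slides specify:

```python
from collections import Counter

def char_ngrams(text, n=3):
    # overlapping character n-grams; padding with spaces lets
    # the n-grams see word boundaries
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

# the resulting counts slot into the same multinomial NB machinery,
# with character n-grams playing the role of words
print(char_ngrams("the cat").most_common(3))
```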
Summary: Naive Bayes is Not So Naive

 Very Fast, low storage requirements

 Works well with very small amounts of training data

 Robust to Irrelevant Features

 Very good in domains with many equally important features

 Optimal if the independence assumptions hold

 A good dependable baseline for text classification

49
Naïve Bayes: Relationship to Language Modeling

50
Naïve Bayes and Language Modeling

 Naïve Bayes classifiers can use any sort of feature
 • URL, email address, dictionaries, network features

 But if:
 • we use only word features
 • we use all of the words in the text (not a subset)

 Then:
 • Naïve Bayes has an important similarity to language modeling.
51
Each class = a unigram language model

Assigning each word: P(word | c)
Assigning each sentence: P(s | c) = Π P(word | c)

Class pos:  P(I) = 0.1,  P(love) = 0.1,  P(this) = 0.01,  P(fun) = 0.05,  P(film) = 0.1

Sentence s = "I love this fun film":
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
52
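The arithmetic on this slide can be checked with a few lines of Python (probabilities copied from the table above):

```python
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}

p = 1.0
for w in "I love this fun film".split():
    p *= pos[w]
print(p)   # ~5e-07, i.e. the 0.0000005 shown above
```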
Naïve Bayes as a Language Model

 Which class assigns the higher probability to s?

Model pos:  I 0.1   love 0.1    this 0.01   fun 0.05    film 0.1
Model neg:  I 0.2   love 0.001  this 0.01   fun 0.005   film 0.1

Sentence s = "I love this fun film":
P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1
P(s | neg) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1
P(s | pos) > P(s | neg)
53
Evaluation of Text Classification: Precision, Recall, and F1

54
Evaluating Classifiers
How well does our classifier work?
Let's first address binary classifiers:

• Is this email spam?
    spam (+) or not spam (−)

• Is this post about Delicious Pie Company?
    about Del. Pie Co. (+) or not about Del. Pie Co. (−)

We'll need to know

1. What did our classifier say about each email or post?


2. What should our classifier have said, i.e., the correct answer, usually as
defined by humans ("gold label")

55
First step in evaluation: The confusion matrix

A 2×2 table tallying the classifier's decisions against the gold labels: true positives (system says +, gold is +), false positives (system says +, gold is −), false negatives (system says −, gold is +), and true negatives (system says −, gold is −).

56
Why don't we use accuracy?

Accuracy doesn't work well when we're dealing with uncommon or imbalanced classes

Suppose we look at 1,000,000 social media posts to find Delicious Pie lovers (or haters)

• 100 of them talk about our pie
• 999,900 are posts about something unrelated

Imagine the following simple classifier:

Every post is "not about pie"

57
Why don't we use accuracy?

Accuracy of our "nothing is pie" classifier

999,900 true negatives and 100 false negatives

Accuracy is 999,900/1,000,000 = 99.99%!

But useless at finding pie-lovers (or haters)!!

Which was our goal!

Accuracy doesn't work well for unbalanced classes

Most tweets are not about pie!


58
Instead of accuracy we use precision and recall

Precision: % of selected items that are correct

   Precision = TP / (TP + FP)

Recall: % of correct items that are selected

   Recall = TP / (TP + FN)
59
Precision/Recall aren't fooled by the "just call everything negative" classifier!
Stupid classifier: Just say no: every tweet is "not about pie"

•100 tweets talk about pie, 999,900 tweets don't

•Accuracy = 999,900/1,000,000 = 99.99%

But the Recall and Precision for this classifier are terrible:

   Recall = 0 / (0 + 100) = 0%  (none of the 100 pie posts is found)
   Precision = 0 / (0 + 0): undefined, since nothing is ever selected
60
A combined measure: F1

 F1 is a combination of precision and recall: their harmonic mean

   F1 = 2PR / (P + R)

61
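A small sketch tying the three metrics together and re-checking the "nothing is pie" classifier from the previous slides (the function name prf1 is mine):

```python
def prf1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# "nothing is pie" classifier: 0 TP, 0 FP, 100 FN
print(prf1(tp=0, fp=0, fn=100))   # (0.0, 0.0, 0.0) despite 99.99% accuracy
```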
Suppose we have more than 2 classes?
 Lots of text classification tasks have more than two classes.

 Sentiment analysis (positive, negative, neutral), named entities (person, location, organization)

 We can define precision and recall for multiple classes like this 3-way email task:

62
How to combine P/R values for different classes:
Microaveraging vs Macroaveraging

 Macroaveraging: compute precision/recall for each class separately, then average the per-class values, so every class counts equally.
 Microaveraging: pool all classes' counts into a single contingency table, then compute precision/recall from the pooled counts, so frequent classes dominate.
63
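A hedged sketch of both averaging schemes over hypothetical per-class (TP, FP, FN) counts:

```python
def macro_precision(per_class):
    # average of per-class precisions: every class weighs equally
    ps = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, fn in per_class]
    return sum(ps) / len(ps)

def micro_precision(per_class):
    # pool the counts first: frequent classes dominate the result
    tp = sum(t for t, f, n in per_class)
    fp = sum(f for t, f, n in per_class)
    return tp / (tp + fp) if tp + fp else 0.0

counts = [(10, 2, 3), (150, 50, 10), (5, 5, 4)]   # (TP, FP, FN) per class
print(macro_precision(counts))   # ~0.694: the weak small class drags it down
print(micro_precision(counts))   # ~0.743: dominated by the large middle class
```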
Thank You
