Introduction to text classification using naive bayes

Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.

What is the subject of this article?
• Management/mba
• admission
• arts
• exam preparation
• nursing
• technology
• …
Subject category: ?

Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language Identification
• Sentiment analysis
• …

Text Classification: definition
• Input:
• a document d
• a fixed set of classes C = {c1, c2,…, cJ}
• Output: a predicted class c ∈ C

Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
• spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
• if the rules are carefully refined by an expert
• But building and maintaining these rules is expensive
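
The spam rule above could be written as a small predicate. This is only a sketch: the `BLACKLIST` contents, the function name `is_spam`, and the choice of case-insensitive substring matching are assumptions made for illustration, not part of the original slide.

```python
# Hypothetical black list; a real system would load maintained data.
BLACKLIST = {"promo@example.com"}


def is_spam(sender: str, body: str) -> bool:
    """Hand-coded rule: black-listed address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    return sender.lower() in BLACKLIST or (
        "dollars" in text and "have been selected" in text
    )


print(is_spam("promo@example.com", "hello"))
print(is_spam("friend@example.com", "You have been selected to win a million dollars"))
print(is_spam("friend@example.com", "Lunch costs ten dollars"))
```

Every new spam pattern needs another hand-written clause, which is where the maintenance cost comes from.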

Supervised Machine Learning
• Input:
• a document d
• a fixed set of classes C = {c1, c2,…, cJ}
• A training set of m hand-labeled documents (d1,c1),…,(dm,cm)
• Output:
• a learned classifier γ: d → c

Supervised Machine Learning
• Any kind of classifier
• Naïve Bayes
• Logistic regression
• Support-vector machines
• Maximum Entropy Model
• Generative Vs Discriminative
• …

Naïve Bayes Intuition
• Simple (“naïve”) classification method based on Bayes rule
• Relies on very simple representation of document
• Bag of words

The bag of words representation
I love this movie! It's sweet,
but with satirical humor. The
dialogue is great and the
adventure scenes are fun… It
manages to be whimsical and
romantic while laughing at the
conventions of the fairy tale
genre. I would recommend it to
just about anyone. I've seen
it several times, and I'm
always happy to see it again
whenever I have a friend who
hasn't seen it yet.
γ(d) = c
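
A bag of words keeps only how often each word occurs and discards word order. One possible sketch uses `collections.Counter`; the function name `bag_of_words` and the simple lowercase regex tokenizer are assumptions for illustration, and a real tokenizer may behave differently.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map a document to word counts, discarding word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


review = (
    "I love this movie! It's sweet, but with satirical humor. "
    "I would recommend it. I've seen it several times."
)
bag = bag_of_words(review)
print(bag.most_common(3))

# Two documents with the same words in a different order get the same bag.
print(bag_of_words("great fun") == bag_of_words("fun great"))
```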

The bag of words representation: using a subset of words
x love xxxxxxxxxxxxxxxx sweet
xxxxxxx satirical xxxxxxxxxx
xxxxxxxxxxx great xxxxxxx
xxxxxxxxxxxxxxxxxxx fun xxxx
xxxxxxxxxxxxx whimsical xxxx
romantic xxxx laughing
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxx recommend xxxxx
xx several xxxxxxxxxxxxxxxxx
xxxxx happy xxxxxxxxx again
xxxxxxxxxxxxxxxxx
γ(d) = c

Bag of words for document classification
Test document: parser, language, label, translation, … → class ?
Candidate classes and their characteristic words:
• Machine Learning: learning, training, algorithm, shrinkage, network...
• NLP: parser, tag, training, translation, language...
• Garbage Collection: garbage, collection, memory, optimization, region...
• Planning: planning, temporal, reasoning, plan, language...
• GUI: ...

Bayes’ Rule Applied to Documents and Classes
• For a document d and a class c
$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$
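
To make the rule concrete, here is a small numeric sketch. The class names and all probabilities are made-up values chosen for illustration; P(d) is obtained by summing P(d|c)P(c) over the classes.

```python
# Hypothetical numbers for one document d and two classes.
prior = {"pos": 0.5, "neg": 0.5}          # P(c)
likelihood = {"pos": 0.02, "neg": 0.005}  # P(d | c)

# P(d) = sum over classes of P(d | c) * P(c)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Bayes' rule: P(c | d) = P(d | c) * P(c) / P(d)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)
```

With these values the posterior for "pos" is 0.8 and for "neg" is 0.2, so the two posteriors sum to one.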

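Naïve Bayes Classifier (I)
$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(c \mid d) \\
&= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} \\
&= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c)
\end{aligned}
$$
• Line 1: MAP is “maximum a posteriori” = most likely class
• Line 2: Bayes rule
• Line 3: Dropping the denominator (P(d) is the same for every class, so it does not change the argmax)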
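
The following sketch checks the last step numerically: ranking classes by P(d|c)P(c) picks the same class as ranking by the full posterior. The three topic names and all probabilities are hypothetical values for illustration.

```python
# Hypothetical priors P(c) and likelihoods P(d | c) for one document.
prior = {"sports": 0.3, "politics": 0.5, "tech": 0.2}
likelihood = {"sports": 0.001, "politics": 0.0004, "tech": 0.002}

# Unnormalized scores P(d | c) * P(c)
joint = {c: likelihood[c] * prior[c] for c in prior}

# Full posterior, dividing by P(d)
p_d = sum(joint.values())
posterior = {c: joint[c] / p_d for c in joint}

c_map_from_posterior = max(posterior, key=posterior.get)
c_map_from_joint = max(joint, key=joint.get)
print(c_map_from_posterior, c_map_from_joint)
```

Dividing every score by the same positive constant cannot change which score is largest, so the normalization can be skipped when only the predicted class is needed.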

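Naïve Bayes Classifier (II)
$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) \\
&= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)
\end{aligned}
$$
• Document d represented as features x1..xn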

Naïve Bayes Classifier (IV)
$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
• P(x1, x2,…, xn | c): O(|X|^n • |C|) parameters. Could only be estimated if a very, very large number of training examples was available.
• P(c): How often does this class occur? We can just count the relative frequencies in a corpus.
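
To get a feel for the O(|X|^n • |C|) figure, the sketch below counts parameters for some illustrative sizes. The vocabulary size, document length, and class count are assumptions, as are the function names. The second function anticipates the independence assumptions of the Multinomial Naïve Bayes slides, under which one P(x|c) per word and class suffices.

```python
def full_joint_parameters(vocab_size: int, doc_length: int, num_classes: int) -> int:
    """Number of entries in P(x1, ..., xn | c) without any independence assumption."""
    return vocab_size ** doc_length * num_classes


def naive_bayes_parameters(vocab_size: int, num_classes: int) -> int:
    """Number of entries in P(x | c) when positions are ignored and features are independent."""
    return vocab_size * num_classes


# Hypothetical sizes: 10,000-word vocabulary, 20-word documents, 2 classes.
print(full_joint_parameters(10_000, 20, 2))
print(naive_bayes_parameters(10_000, 2))
```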

Multinomial Naïve Bayes Independence Assumptions
• Bag of Words assumption: Assume position doesn’t matter
• Conditional Independence: Assume the feature probabilities P(xi|c) are independent given the class c.
$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdots P(x_n \mid c)$$

Multinomial Naïve Bayes Classifier
$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
$$c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$$
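
The decision rule for c_NB can be sketched directly as a product of per-word probabilities. The function name `classify`, the class names, and the probability tables are hypothetical values for illustration. This sketch multiplies raw probabilities to mirror the formula; practical implementations usually sum log probabilities instead, because long products underflow.

```python
import math


def classify(words, priors, likelihoods):
    """Return (best class, scores) where score(c) = P(c) * product of P(x | c)."""
    scores = {
        c: priors[c] * math.prod(likelihoods[c][w] for w in words)
        for c in priors
    }
    return max(scores, key=scores.get), scores


# Hypothetical model over a three-word vocabulary.
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"great": 0.4, "fun": 0.4, "boring": 0.2},
    "neg": {"great": 0.1, "fun": 0.2, "boring": 0.7},
}

label, scores = classify(["great", "fun"], priors, likelihoods)
print(label, scores)
```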

Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
• simply use the frequencies in the data
$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$
$$\hat{P}(c_j) = \frac{\mathrm{doccount}(C = c_j)}{N_{doc}}$$
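
The two maximum likelihood estimates amount to counting. Below is a minimal sketch, assuming documents arrive already tokenized as `(words, label)` pairs; the function name `train_mle` and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_mle(labeled_docs):
    """Estimate P(c) and P(w | c) by relative frequency."""
    doc_counts = Counter()              # doccount(C = c_j)
    word_counts = defaultdict(Counter)  # count(w, c_j)
    for words, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(words)

    n_doc = sum(doc_counts.values())
    priors = {c: doc_counts[c] / n_doc for c in doc_counts}
    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())    # sum over w in V of count(w, c_j)
        likelihoods[c] = {w: n / total for w, n in counts.items()}
    return priors, likelihoods


# Hypothetical toy corpus.
training = [
    (["great", "fun", "great"], "pos"),
    (["fun", "plot"], "pos"),
    (["boring", "plot"], "neg"),
]
priors, likelihoods = train_mle(training)
print(priors)
print(likelihoods)
```

Note that a word never seen with a class (here "boring" with "pos") has no entry, which corresponds to an estimated probability of zero.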

Parameter estimation
$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$
fraction of times word wi appears among all words in documents of topic cj
• Create mega-document for topic j by concatenating all docs in this topic
• Use frequency of w in mega-document
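
The mega-document view of this estimate can be sketched as follows. The helper name `mega_document`, the topic labels, and the toy documents are assumptions for illustration.

```python
from collections import Counter


def mega_document(labeled_docs, topic):
    """Concatenate the words of every document labeled with the given topic."""
    mega = []
    for words, label in labeled_docs:
        if label == topic:
            mega.extend(words)
    return mega


# Hypothetical tokenized documents.
docs = [
    (["parser", "tag", "language"], "nlp"),
    (["translation", "language"], "nlp"),
    (["garbage", "memory"], "gc"),
]

mega = mega_document(docs, "nlp")
freq = Counter(mega)

# Relative frequency of a word in the mega-document estimates P(w | topic).
p_language = freq["language"] / len(mega)
print(mega, p_language)
```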

Summary: Naive Bayes is Not So Naive
• Very fast, low storage requirements
• Robust to irrelevant features
Irrelevant features cancel each other without affecting results
• Very good in domains with many equally important features
Decision trees suffer from fragmentation in such cases, especially if there is little data
• Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem
• A good dependable baseline for text classification

Real-world systems generally combine:
• Automatic classification
• Manual review of uncertain/difficult/“new” cases
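
One simple way to combine the two is to accept the classifier's answer only when it is confident and to queue everything else for a person. This is a sketch of that idea, not a description of any particular system: the function name `route`, the 0.9 threshold, and the example posteriors are all assumptions.

```python
def route(posteriors, threshold=0.9):
    """Return ("auto", label) for confident predictions, else ("manual_review", label)."""
    label = max(posteriors, key=posteriors.get)
    if posteriors[label] >= threshold:
        return ("auto", label)
    return ("manual_review", label)


# Hypothetical classifier outputs P(c | d) for two documents.
print(route({"spam": 0.97, "ham": 0.03}))
print(route({"spam": 0.55, "ham": 0.45}))
```

The threshold trades reviewer workload against the error rate of the automatic decisions and would need tuning on held-out data.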

The Real World
• Gee, I’m building a text classifier for real, now!
• What should I do?

The Real World
• Write your own classifier code.
• Tools:
● Apache Mahout (Java)
● NLTK (Python)
● LingPipe
● Stanford Classifier …
• APIs:
● OpenCalais
● AlchemyAPI
● UIUC CCG …
