Text Classification
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
What is the subject of this article?
• Management/MBA
• admission
• arts
• exam preparation
• nursing
• technology
• …
Subject category: ?
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
Text Classification: definition
• Input:
• a document d
• a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$
• Output: a predicted class $c \in C$
Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
• spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
• If rules are carefully refined by an expert
• But building and maintaining these rules is expensive
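As a rough illustration of the spam rule above, here is a minimal Python sketch. The blacklist contents, the function name `is_spam`, and the example addresses are hypothetical; only the rule's shape (blacklisted address OR both phrases present) comes from the slide.

```python
# Hypothetical blacklist; these addresses are placeholders for illustration only.
BLACKLIST = {"promo@spam.example", "winner@lottery.example"}


def is_spam(sender: str, body: str) -> bool:
    """Apply the hand-coded rule:
    black-list-address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    on_blacklist = sender.lower() in BLACKLIST
    has_phrases = "dollars" in text and "have been selected" in text
    return on_blacklist or has_phrases


flagged = is_spam("friend@mail.example",
                  "You have been selected to win a million dollars")
clean = is_spam("friend@mail.example", "Lunch tomorrow?")
print(flagged, clean)
```

Every new spam pattern needs another hand-written condition, which is where the maintenance cost mentioned above comes from.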
Classification Methods: Supervised Machine Learning
• Input:
• a document d
• a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$
• A training set of m hand-labeled documents $(d_1, c_1), \ldots, (d_m, c_m)$
• Output:
• a learned classifier $\gamma: d \rightarrow c$
Classification Methods: Supervised Machine Learning
• Any kind of classifier
• Naïve Bayes
• Logistic regression
• Support-vector machines
• Maximum entropy model
• Generative vs. discriminative
• …
Naïve Bayes Intuition
• Simple (“naïve”) classification method based on Bayes’ rule
• Relies on a very simple representation of the document
• Bag of words
The bag of words representation
γ(
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.
) = c
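A bag of words can be sketched in a few lines of Python. The tokenizer below (lowercasing plus a simple regular expression) is an assumption made for illustration; the slides do not specify how the text is split into words.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count them.
    Word order is discarded; only the counts survive."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


review = ("I love this movie! It's sweet, but with satirical humor. "
          "I've seen it several times, and I'm always happy to see it again.")
bag = bag_of_words(review)
print(bag["it"], bag["love"])
```

Because only counts are kept, two documents containing the same words in a different order map to the same bag.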
The bag of words representation: using a subset of words
γ(
x love xxxxxxxxxxxxxxxx sweet
xxxxxxx satirical xxxxxxxxxx
xxxxxxxxxxx great xxxxxxx
xxxxxxxxxxxxxxxxxxx fun xxxx
xxxxxxxxxxxxx whimsical xxxx
romantic xxxx laughing
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxx recommend xxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xx several xxxxxxxxxxxxxxxxx
xxxxx happy xxxxxxxxx again
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx
) = c
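Restricting the bag to a subset of words amounts to filtering tokens against a vocabulary. In this sketch the vocabulary is simply the set of words left unmasked on the slide; the function name `restricted_bag` and the sample token list are hypothetical.

```python
from collections import Counter

# Vocabulary taken from the unmasked words on the slide.
VOCAB = {"love", "sweet", "satirical", "great", "fun", "whimsical",
         "romantic", "laughing", "recommend", "several", "happy", "again"}


def restricted_bag(tokens, vocab=VOCAB) -> Counter:
    """Count only the tokens that belong to the chosen vocabulary."""
    return Counter(t for t in tokens if t in vocab)


tokens = "i love this movie it is sweet and fun and i would recommend it again".split()
subset = restricted_bag(tokens)
print(sorted(subset))
```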
Bag of words for document classification
[Figure: a test document is compared against the word bags of five classes.]
• Test document: parser, language, label, translation, … → class ?
• Machine Learning: learning, training, algorithm, shrinkage, network...
• NLP: parser, tag, training, translation, language...
• Garbage Collection: garbage, collection, memory, optimization, region...
• Planning: planning, temporal, reasoning, plan, language...
• GUI: ...
Bayes’ Rule Applied to Documents and Classes
• For a document d and a class c
$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$
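As a quick numeric check of Bayes' rule for one document and two classes, the sketch below plugs in made-up probabilities. None of the numbers come from the slides; P(d) is obtained by summing P(d|c)P(c) over both classes.

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' rule: P(c|d) = P(d|c) * P(c) / P(d)."""
    return likelihood * prior / evidence


# Made-up values for illustration only.
p_d_given_pos, p_pos = 0.02, 0.5
p_d_given_neg, p_neg = 0.005, 0.5

# P(d) by the law of total probability over the two classes.
p_d = p_d_given_pos * p_pos + p_d_given_neg * p_neg

post_pos = posterior(p_d_given_pos, p_pos, p_d)
post_neg = posterior(p_d_given_neg, p_neg, p_d)
print(post_pos, post_neg)
```

The two posteriors sum to 1 because P(d) acts purely as a normalizer.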
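Naïve Bayes Classifier (I)
MAP is “maximum a posteriori” = most likely class
$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(c \mid d) \\
&= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} && \text{(Bayes' rule)} \\
&= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) && \text{(dropping the denominator)}
\end{aligned}
$$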
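The denominator can be dropped because P(d) is the same for every class, so it cannot change which class wins the argmax. A small sketch with made-up likelihoods and priors (both function names are hypothetical) shows the two versions agreeing:

```python
def map_class(likelihoods, priors):
    """argmax over classes of P(d|c) * P(c); P(d) is deliberately omitted."""
    return max(priors, key=lambda c: likelihoods[c] * priors[c])


def map_class_with_evidence(likelihoods, priors):
    """Same argmax, but with each score divided by P(d)."""
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return max(priors, key=lambda c: likelihoods[c] * priors[c] / evidence)


# Made-up values for illustration only.
likelihoods = {"pos": 0.02, "neg": 0.005}
priors = {"pos": 0.3, "neg": 0.7}

best = map_class(likelihoods, priors)
best_normalized = map_class_with_evidence(likelihoods, priors)
print(best, best_normalized)
```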
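Naïve Bayes Classifier (II)
Document d represented as features $x_1, \ldots, x_n$
$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) \\
&= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)
\end{aligned}
$$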
Naïve Bayes Classifier (IV)
$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
• $P(c)$: How often does this class occur? We can just count the relative frequencies in a corpus.
• $P(x_1, x_2, \ldots, x_n \mid c)$: $O(|X|^n \cdot |C|)$ parameters. Could only be estimated if a very, very large number of training examples was available.
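To get a feel for why the full joint distribution over features is impractical to estimate, the sketch below treats the O(|X|^n · |C|) expression as an exact count. The vocabulary size, document length, and number of classes are made-up values, and `full_joint_params` is a hypothetical helper.

```python
def full_joint_params(vocab_size: int, doc_len: int, num_classes: int) -> int:
    """Number of table entries if every sequence of doc_len features
    gets its own probability for every class: |X|^n * |C|."""
    return vocab_size ** doc_len * num_classes


# Made-up sizes: 10,000 word types, 20-word documents, 2 classes.
n_params = full_joint_params(10_000, 20, 2)
print(len(str(n_params)), "digits")
```

No realistic corpus contains enough documents to estimate that many values by counting.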
Multinomial Naïve Bayes Independence Assumptions
• Bag of Words assumption: Assume position doesn’t matter
• Conditional Independence: Assume the feature probabilities $P(x_i \mid c_j)$ are independent given the class c.
$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdots P(x_n \mid c)$$
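Under the conditional independence assumption, the likelihood of a document is just a product of per-feature probabilities. A minimal sketch, with made-up probabilities for a single class and a hypothetical function name:

```python
import math


def likelihood(features, cond_prob):
    """P(x1, ..., xn | c) = P(x1|c) * P(x2|c) * ... * P(xn|c)."""
    return math.prod(cond_prob[x] for x in features)


# Made-up P(x | c) values for one class, for illustration only.
cond_prob_pos = {"love": 0.1, "fun": 0.05, "boring": 0.01}

p = likelihood(["love", "fun"], cond_prob_pos)
p_reordered = likelihood(["fun", "love"], cond_prob_pos)
print(p, p_reordered)
```

Reordering the features leaves the product unchanged, which is the bag of words assumption at work.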
Multinomial Naïve Bayes Classifier
$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
$$c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$$
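The decision rule can be sketched as follows. Summing logarithms instead of multiplying probabilities is a common numerical-stability substitution that is not taken from the slides; it yields the same argmax because the logarithm is monotonic. All probabilities are made up, and every word in the example has a non-zero probability in every class.

```python
import math


def classify(tokens, priors, cond_probs):
    """Return the class maximizing log P(c) + sum of log P(x|c)."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in tokens:
            score += math.log(cond_probs[c][w])
        scores[c] = score
    return max(scores, key=scores.get), scores


# Made-up parameters for illustration only.
priors = {"pos": 0.5, "neg": 0.5}
cond_probs = {
    "pos": {"great": 0.3, "boring": 0.1, "movie": 0.6},
    "neg": {"great": 0.1, "boring": 0.4, "movie": 0.5},
}

label_a, _ = classify(["great", "movie"], priors, cond_probs)
label_b, _ = classify(["boring", "movie"], priors, cond_probs)
print(label_a, label_b)
```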
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
• simply use the frequencies in the data
$$\hat{P}(c_j) = \frac{\mathrm{doccount}(C = c_j)}{N_{doc}}$$
$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$
Parameter estimation
$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$
= fraction of times word $w_i$ appears among all words in documents of topic $c_j$
• Create mega-document for topic j by concatenating all docs in this topic
• Use frequency of w in mega-document
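The maximum likelihood estimates can be sketched by counting. The toy training set and the name `train_mle` are hypothetical; the per-class `Counter` plays the role of the mega-document, since adding every document's tokens to it is equivalent to concatenating the documents and counting.

```python
from collections import Counter, defaultdict


def train_mle(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs.
    Returns (priors, cond) holding the relative-frequency estimates."""
    doc_counts = Counter()
    mega = defaultdict(Counter)  # one word-count "mega-document" per class
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        mega[label].update(tokens)

    n_docs = sum(doc_counts.values())
    priors = {c: doc_counts[c] / n_docs for c in doc_counts}

    cond = {}
    for c, counts in mega.items():
        total = sum(counts.values())  # all word tokens in class c
        cond[c] = {w: n / total for w, n in counts.items()}
    return priors, cond


# Toy training data, made up for illustration.
train = [
    (["great", "fun", "great"], "pos"),
    (["fun", "romantic"], "pos"),
    (["boring", "pathetic"], "neg"),
]
priors, cond = train_mle(train)
print(priors["pos"], cond["pos"]["great"])
```

A word that never occurs with a class gets no entry here, which corresponds to an estimated probability of zero under this first-attempt scheme.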
Summary: Naïve Bayes is Not So Naïve
• Very fast, low storage requirements
• Robust to irrelevant features: irrelevant features cancel each other out without affecting results
• Very good in domains with many equally important features: decision trees suffer from fragmentation in such cases, especially with little data
• Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem
• A good, dependable baseline for text classification
Real-world systems generally combine:
• Automatic classification
• Manual review of uncertain/difficult/“new” cases
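One way such a combination might look is a confidence threshold: confident predictions are accepted automatically and the rest go to a human. The 0.9 threshold, the queue names, and the function `route` are assumptions for illustration, not something the slides prescribe.

```python
def route(posteriors, threshold=0.9):
    """Accept the top class automatically if its posterior is high enough;
    otherwise send the document to manual review.
    The threshold value is a hypothetical choice."""
    label = max(posteriors, key=posteriors.get)
    if posteriors[label] >= threshold:
        return ("auto", label)
    return ("manual_review", label)


confident = route({"pos": 0.97, "neg": 0.03})
uncertain = route({"pos": 0.55, "neg": 0.45})
print(confident, uncertain)
```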
The Real World
• Gee, I’m building a text classifier for real, now!
• What should I do?
The Real World
• Write your own classifier code.
• Tools:
● Apache Mahout (Java)
● NLTK (Python)
● LingPipe
● Stanford Classifier …
• APIs:
● OpenCalais
● AlchemyAPI
● UIUC CCG …


Introduction to text classification using naive bayes
