Naïve Bayes Classifier
Palacode Narayana Iyer Anantharaman
23 Aug 2017
Copyright 2016 JNResearch, All Rights Reserved
Classification Problems with Naïve Bayes
• A two-step process:
• Build the model – the training step, where we estimate the model parameters
• Use the model – the prediction step, where we predict the output class given the inputs
Copyright 2016 JNResearch, All Rights Reserved
[Flow diagram: Training Data → Train → Model; Input to be classified + Model → Predict → Prediction]
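A minimal sketch of this two-step flow, assuming scikit-learn is available; the tiny one-feature dataset is invented for illustration, and any Naïve Bayes variant would do here (Gaussian NB, covered later in the deck, is used only because it accepts raw numbers):

```python
# Minimal sketch of the train/predict flow; toy data invented for illustration.
from sklearn.naive_bayes import GaussianNB

X_train = [[1.0], [1.2], [0.9], [3.8], [4.1], [4.0]]
y_train = ["A", "A", "A", "B", "B", "B"]

model = GaussianNB()
model.fit(X_train, y_train)        # step 1: build the model (estimate its parameters)
print(model.predict([[1.1]]))      # step 2: use the model -> ['A']
```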
When to use Naïve Bayes: Three Scenarios
• You have come up with a neat solution to a ML problem.
Your manager wants you to do a quick demo to your CEO
in the next couple of hours.
• You are assigned a classification problem similar to spam
filtering. Your manager says: We need this feature in our
next release, a less accurate model is okay.
• You have come up with a sophisticated deep-learning-based model. You submitted it for review and were asked to benchmark your results against standard approaches.
Copyright 2016 JNResearch, All Rights Reserved
Naïve Bayes Classifier
A simple classifier model that is:
• Based on the Bayes theorem
• Uses Supervised Learning
• Easy to build
• Faster to train than most other models
• Often used as a baseline classifier for benchmarking
Copyright 2016 JNResearch, All Rights Reserved
Foundation: Bayes Theorem
From Bayes' theorem, we have: $P(Y \mid X) = \dfrac{P(X \mid Y)\,P(Y)}{P(X)}$

Suppose Y represents the class variable and $X_1, X_2, \dots, X_n$ are the inputs:

$P(Y \mid X_1, \dots, X_n) = \dfrac{P(X_1, \dots, X_n \mid Y)\,P(Y)}{P(X_1, \dots, X_n)}$

Assuming $X_i \perp X_j$ given Y for all $i \ne j$ (the naïve assumption), we may write the above equation as:

$P(Y \mid X_1, \dots, X_n) = \dfrac{P(X_1 \mid Y)\,P(X_2 \mid Y) \cdots P(X_n \mid Y)\,P(Y)}{P(X_1, \dots, X_n)}$

$P(Y \mid X_1, \dots, X_n) \propto P(Y) \prod_{i=1}^{n} P(X_i \mid Y)$

$\hat{Y} = \arg\max_{y} \; P(Y = y) \prod_{i=1}^{n} P(X_i \mid Y = y)$

Copyright 2016 JNResearch, All Rights Reserved
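A tiny numeric illustration of the rule; every probability below is invented for illustration, with "offer" standing in for observing a single word in a mail:

```python
# Hypothetical numbers: P(spam) = 0.2, P("offer" | spam) = 0.3, P("offer" | ham) = 0.05
p_spam, p_ham = 0.2, 0.8
p_offer_given_spam, p_offer_given_ham = 0.30, 0.05

# Bayes theorem: P(spam | "offer") = P("offer" | spam) * P(spam) / P("offer")
p_offer = p_offer_given_spam * p_spam + p_offer_given_ham * p_ham  # total probability
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer

print(p_spam_given_offer)  # ~0.6 -- seeing "offer" triples the prior belief of 0.2
```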
What is Naïve about Naïve Bayes?
• In many applications, treating each element of the input (Xi) as independent of every other element (Xj for all j), and thereby ignoring word order, is quite a strong assumption
• Why?
• The sentence “day great is today a” is a jumbled form of “Today is a great day”. This suggests that word order matters for semantic interpretation, yet the NB classifier treats each word as independent and hence ignores the order, an assumption that in many cases will not hold.
• Take a selfie; the picture looks great! What if we randomly shuffle the pixels throughout the image? Though all pixels are still present in the modified image, their order is severely altered.
• But:
• Despite the naïve assumption, the NB classifier still works and produces accurate results for a number of applications!
• Consider the problem of search using keywords. Does the word order matter?
Copyright 2016 JNResearch, All Rights Reserved
Estimating the model parameters
• Naïve Bayes Model: $\hat{Y} = \arg\max_{y} P(Y = y) \prod_{i=1}^{n} P(X_i \mid Y = y)$
• The model parameters are: 𝑃(𝑌) and 𝑃(𝑋𝑖|𝑌) for all values of Xi
• Given a dataset that has several training examples, where each example has an
input (𝑋1, … , 𝑋 𝑛) and the expected target output (Y), we need to “learn” the
model
• We can perform maximum likelihood estimates in order to determine model
parameters
Copyright 2016 JNResearch, All Rights Reserved
Naïve Bayes Classifier for real-world use cases (Ref: Kaggle)
Naïve Bayes Case Study (Ref: Kaggle)
Copyright 2016 JNResearch, All Rights Reserved
Document Classification With Naïve Bayes
• Document (or text in our discussion today) classification assigns a class label to
the given document. Formally:
• Given the input as document d and a set of classes 𝐶 = {𝑐1, 𝑐2, … , 𝑐 𝑛}, predict a class 𝑐 ∈ 𝐶
• Example:
• Gmail categorizes incoming mails into Primary, Social, Promotions, Junk – we can define:
𝐶 = {𝑃𝑟𝑖𝑚𝑎𝑟𝑦, 𝑆𝑜𝑐𝑖𝑎𝑙, 𝑃𝑟𝑜𝑚𝑜𝑡𝑖𝑜𝑛𝑠, 𝐽𝑢𝑛𝑘}
• Assign a class label 𝑐 ∈ 𝐶 for every incoming mail d
• The term document could refer to plain text or even a compound multimedia document. We use the term document to refer to any entity that can be classified.
Copyright 2016 JNResearch, All Rights Reserved
Can we build rule based models to do this?
• For instance, in email categorization, one may apply a set of if-then-else rules to determine the class of an input mail
• With well-written rules, in general, one can get high precision but often low recall
• Drafting a set of comprehensive rules is difficult and expensive, as they need expert knowledge
• Example: Suppose I receive a mail from Flipkart; should that be classified as a promotion or primary?
• It depends!
Copyright 2016 JNResearch, All Rights Reserved
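A sketch of what such hand-written rules might look like; the rules, senders and keywords below are entirely hypothetical, and the example is only meant to show why rule sets tend to be precise but brittle:

```python
# Hypothetical if-then-else rules for email categorization.
# Rules like these can be high precision but typically low recall: mails that
# match no rule fall through to a default bucket.
def classify_mail(sender: str, subject: str) -> str:
    subject = subject.lower()
    if "unsubscribe" in subject or sender.endswith("@deals.example.com"):
        return "Promotions"
    if "invoice" in subject or "meeting" in subject:
        return "Primary"
    if sender.endswith("@social.example.com"):
        return "Social"
    return "Primary"   # fallback: everything unmatched lands in Primary

print(classify_mail("offers@deals.example.com", "Big sale - unsubscribe anytime"))  # Promotions
```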
Supervised Machine Learning
• Input
• A document d, consisting of word tokens $w_1, w_2, \dots, w_j$
• A finite set of classes: $C = \{c_1, c_2, \dots, c_n\}$
• The training dataset: $D_{train} = \{(d_1, c_1), (d_2, c_2), \dots, (d_m, c_m)\}$
• Output
• A model M such that $M: d \to c$
Copyright 2016 JNResearch, All Rights Reserved
Bag of words representation
• A document can be considered to be an ordered
sequence of words
• The Naïve Bayes classifier ignores word order and correlations, so we may represent a document d as a bag of words (unigrams)
• In a typical English sentence, many words are there only for grammatical purposes and may not contribute to the classification decision
• We can do some pre-processing and remove such words before sending the document to the classifier
Copyright 2016 JNResearch, All Rights Reserved
𝑀 "I love my Samsung Galaxy Grand 2" = 𝑐
𝑀 "love Samsung Galaxy Grand" = 𝑐
Text Classification with Multinomial Naïve Bayes
• Recall: $\hat{Y} = \arg\max_{y} P(Y = y) \prod_{i=1}^{n} P(X_i \mid Y = y)$, where the model parameters are $P(Y)$ and $P(X_i \mid Y)$ for all values of $X_i$
• For the document classification problem with the bag of words model, $X_i$ is a word in the document and Y is the document class
• Estimate the model parameters as below and save the model as a table T:
• For each class $c \in C$, determine the MLE prior: $P(C = c) = \dfrac{\text{Number of documents labelled } c}{\text{Total number of documents}}$
• For each word $w_i \in V$ and class $c_j \in C$, the MLE is: $P(w_i \mid c_j) = \dfrac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$
• Prediction: Given a new input, generate the word tokens using the same procedure used for training. Retrieve the model values from the table T and compute $\hat{Y}$
Copyright 2016 JNResearch, All Rights Reserved
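A sketch of these MLE estimates on a toy labelled corpus; the documents and labels are invented, and tokenization is plain whitespace splitting:

```python
# Count-based MLE estimates for the multinomial model; no smoothing yet,
# so an unseen word would zero out a class score (next slide).
from collections import Counter, defaultdict

train = [("I love my Samsung Galaxy Grand 2", "pos"),
         ("great phone love it",              "pos"),
         ("battery is terrible",              "neg"),
         ("terrible screen I hate it",        "neg")]

prior = Counter(c for _, c in train)        # document counts per class
word_counts = defaultdict(Counter)          # word counts per class
for doc, c in train:
    word_counts[c].update(doc.lower().split())

n_docs = len(train)
def p_class(c):   return prior[c] / n_docs                                    # P(C = c)
def p_word(w, c): return word_counts[c][w] / sum(word_counts[c].values())    # P(w | c)

print(p_class("pos"), p_word("love", "pos"))   # 0.5 and 2/11 ≈ 0.18
```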
Are we done? Not yet 
• What happens if $count(w_i, c_j) = 0$?
• This can happen if the new unseen input has a word $w_i$ that was not encountered in the training data.
• Example: The training data contains “I love my Samsung Galaxy Grand
2” but doesn’t have the word, say, “adore”. If the unseen input is: “I
adore my Samsung Galaxy Grand 2”, the entire probability
computation will be zero!
Copyright 2016 JNResearch, All Rights Reserved
Laplace Smoothing (Add 1)
• Assume any word in the vocabulary has occurred at least once
• This assumption results in the estimation:
$P(w_i \mid c_j) = \dfrac{count(w_i, c_j) + 1}{\left(\sum_{w \in V} count(w, c_j)\right) + |V|}$
• The above ensures that the probabilities don’t go to zero
• What happens when you encounter a word that is not in the vocabulary?
Copyright 2016 JNResearch, All Rights Reserved
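A sketch of the add-1 estimate, written against count tables like the ones in the previous sketch (`word_counts[c]` is assumed to be a Counter of word counts for class c, and `vocab` the training vocabulary). A word outside V altogether is often simply dropped at prediction time, though that choice is left to the implementation:

```python
# Add-1 (Laplace) smoothed estimate of P(w | c); count tables as in the earlier sketch.
def p_word_smoothed(w, c, word_counts, vocab):
    numerator = word_counts[c][w] + 1                          # count(w, c) + 1
    denominator = sum(word_counts[c].values()) + len(vocab)    # sum_w count(w, c) + |V|
    return numerator / denominator

# A word never seen with class c now gets 1 / (tokens in c + |V|) instead of zero,
# so a single unseen word no longer zeroes out the whole product.
```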
Variants of Naïve Bayes
• In the previous slides we showed the MLE probability computation to be based on counts of
words in each document. This is called the multinomial model.
• Multinomial is a natural fit when we solve topic-classification kinds of problems
• E.g. consider the problem of classifying a given article into Scientific, Business or Sports.
• Sometimes, just the presence or absence of a given word in a document is adequate in order to classify. We may choose a Binarized Naïve Bayes in such cases.
• E.g. consider the problem of sentiment analysis. If the word “fantastic” is present, it doesn’t need to be repeated in the same document for us to conclude the polarity of the sentiment.
• A number of applications may involve features that are real valued. We can use a Gaussian (or
some other) variant for these.
Copyright 2016 JNResearch, All Rights Reserved
Multinomial Naïve Bayes
• In a multinomial classification model, the frequency of occurrence of each word
in the document is taken in to account (instead of presence/absence)
• Compute prior for classes using maximum likelihood estimates
• The algorithm to compute $P(w_k \mid c_j)$ is:
• Concatenate all documents that have the class $c_j$; call the result $text_j$
• Let n be the number of tokens in $text_j$ and $\alpha$ be the constant used for smoothing
• For each word $w_k$ in the vocabulary, let $n_k$ be the number of occurrences of $w_k$ in $text_j$

$P(w_k \mid c_j) = \dfrac{n_k + \alpha}{n + \alpha|V|}$
Copyright 2016 JNResearch, All Rights Reserved
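The same multinomial model is available off the shelf; a sketch using scikit-learn (assuming it is installed; the four snippets of text are invented), where `alpha` is the smoothing constant above:

```python
# Multinomial NB on raw word counts via scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs   = ["stock markets rallied today", "team wins the championship game",
          "quarterly earnings beat estimates", "player scores twice in final"]
labels = ["business", "sports", "business", "sports"]

clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(docs, labels)
print(clf.predict(["earnings report for the quarter"]))   # likely ['business']
```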
Binarized Multinomial Naïve Bayes
• In a binarized multinomial classification model, we count only the presence or absence of a given word (or feature) in a document, as opposed to using its frequency. That is, we clamp the count of a word w within each document to 1
• Compute prior for classes using maximum likelihood estimates
• The algorithm to compute $P(w_k \mid c_j)$ is:
• In each document d, keep only one instance of a given word w (remove duplicates)
• Concatenate all documents that have the class $c_j$; call the result $text_j$
• Let n be the number of tokens in $text_j$ and $\alpha$ be the constant used for smoothing
• For each word $w_k$ in the vocabulary, let $n_k$ be the number of occurrences of $w_k$ in $text_j$

$P(w_k \mid c_j) = \dfrac{n_k + \alpha}{n + \alpha|V|}$
Copyright 2016 JNResearch, All Rights Reserved
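A sketch of the binarized variant under the same scikit-learn assumption: clamping per-document counts to 1 can be done by a binary vectorizer before the multinomial step (the reviews below are invented):

```python
# Binarized multinomial NB: binary=True keeps only presence/absence per document,
# so repeating "fantastic" adds no extra weight.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["fantastic fantastic phone", "really fantastic value",
           "terrible battery terrible screen", "awful and terrible"]
labels  = ["pos", "pos", "neg", "neg"]

clf = make_pipeline(CountVectorizer(binary=True), MultinomialNB(alpha=1.0))
clf.fit(reviews, labels)
print(clf.predict(["fantastic screen"]))   # likely ['pos']
```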
Gaussian Naïve Bayes
• So far, we have looked at text and dealt with word occurrence counts that are discrete
values
• What happens when the features are continuous valued, or even vectors of continuous values, e.g. images with RGB values?
• Gaussian Naïve Bayes is useful to classify such inputs
$P(X_i = x_i \mid Y = y) = \dfrac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\dfrac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$

• Estimate the parameters $\sigma_y$ and $\mu_y$ using maximum likelihood
Copyright 2016 JNResearch, All Rights Reserved
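A sketch of this per-feature likelihood, with the class-conditional mean and variance estimated by maximum likelihood from a handful of invented screen-size values:

```python
# Gaussian class-conditional likelihood of a single continuous feature.
import math

def gaussian_likelihood(x, mu, sigma2):
    """Density of x under a Gaussian with class-conditional mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# e.g. screen sizes (inches) observed for one class in the training data (invented)
screens = [6.1, 6.4, 5.9, 6.2]
mu = sum(screens) / len(screens)
sigma2 = sum((s - mu) ** 2 for s in screens) / len(screens)   # MLE variance (divide by n)

print(gaussian_likelihood(6.0, mu, sigma2))   # density of a 6.0" screen under this class
```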
Estimating Parameters (Ref: T. Mitchell)
Copyright 2016 JNResearch, All Rights Reserved
How many parameters must we estimate for Gaussian Naïve Bayes if Y has k possible values and $X = \langle X_1, X_2, \dots, X_n \rangle$?
Gaussian Naïve Bayes: Example
• Suppose we are required to predict the price range (high_end, mid_range, low_end) of a
mobile phone given its specifications.
• We observe that some elements in the specification (e.g. screen size) are continuous variables.
• We can either discretize these elements and use a discrete NB classifier, or we can directly use a Gaussian NB
Copyright 2016 JNResearch, All Rights Reserved
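The phone example as a scikit-learn sketch; every specification value below is made up purely to illustrate the setup:

```python
# Gaussian NB on continuous phone specifications (invented data).
from sklearn.naive_bayes import GaussianNB

# features: [screen size (in), RAM (GB), battery (thousands of mAh)]
X = [[6.4, 8, 4.5], [6.7, 12, 5.0],    # high_end
     [6.1, 4, 4.0], [6.3, 6, 4.5],     # mid_range
     [5.5, 2, 3.0], [5.8, 3, 3.5]]     # low_end
y = ["high_end", "high_end", "mid_range", "mid_range", "low_end", "low_end"]

model = GaussianNB().fit(X, y)
print(model.predict([[6.5, 10, 4.8]]))   # likely ['high_end']
```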
Practical Considerations
• Probability computations in a joint distribution involve multiplying many terms that are small fractions. This can sometimes cause underflow errors. Use log probabilities to avoid this issue
• Use the distribution that is natural to the problem at hand
• The choice of distribution is your decision. There is no rule that says you should use Gaussian all the time!
• You can discretize continuous variables so that you can use Binarized, Bernoulli or Multinomial discrete Naïve Bayes. But you might lose fidelity due to discretization.
• Exercise judgement while choosing the features; you can minimize the data required by removing redundant features
Copyright 2016 JNResearch, All Rights Reserved
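A sketch of the log-probability trick, written against smoothed count tables like the ones in the earlier sketches (the per-class Counter and the vocabulary size are assumed to exist):

```python
import math

# Score a tokenized document against one class by summing logs; summing
# avoids the underflow that multiplying hundreds of tiny fractions would cause.
def log_score(doc_tokens, class_prior, class_word_counts, vocab_size):
    """class_prior: P(C=c); class_word_counts: Counter of word counts for class c."""
    score = math.log(class_prior)                              # log P(C = c)
    total = sum(class_word_counts.values())
    for w in doc_tokens:
        score += math.log((class_word_counts[w] + 1) / (total + vocab_size))  # add-1 smoothed
    return score

# Prediction: compute log_score for every class and take the argmax;
# the log is monotonic, so the argmax is the same as with raw products.
```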
Case Study: Accurate Searching on Twitter
• Twitter’s search API allows keyword-based search
• Suppose we search Twitter with the keyword “congress”; we might end up getting tweets that pertain to the Indian National Congress, Mobile World Congress, the American Congress, the Science Congress and so on.
• A narrow search using an “exact” search phrase would improve precision but will miss many relevant and interesting tweets
• Is there a way to search Twitter such that we get precise matches without missing interesting and relevant tweets?
Copyright 2016 JNResearch, All Rights Reserved
Summary
• Despite the naïve assumptions, the Naïve Bayes classifier is pretty useful. Do not skip it in favour of complex models without evaluating it for your application. You may be in for a surprise!
• There are many variants of the Naïve Bayes classifier; what they have in common is that all are based on Bayes' theorem and make the same naïve independence assumption.
• Choose the binarized model if the number of occurrences of a given word does not contribute to the classification decision.
• If the features are continuous variables, use Gaussian Naïve Bayes, or perform discretization if that gives you good accuracy
Copyright 2016 JNResearch, All Rights Reserved
Code Walkthrough
Copyright 2016 JNResearch, All Rights Reserved
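The walkthrough code itself is not part of this extract; as a stand-in, a compact end-to-end sketch with scikit-learn on an invented mini-corpus:

```python
# Stand-in walkthrough: vectorize, train multinomial NB, evaluate on a held-out split.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

docs = ["win a free prize now", "meeting moved to 3pm", "claim your free voucher",
        "lunch tomorrow?", "free entry in a prize draw", "project status update"]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.33, random_state=0)

model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(X_train, y_train)          # build the model
pred = model.predict(X_test)         # use the model
print(accuracy_score(y_test, pred), list(pred))
```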