MLP Week 6 NaiveBayesImplementation - Ipynb - Colaboratory

This document demonstrates how to perform text classification using a Naive Bayes classifier. It uses the 20 newsgroups dataset to train a Multinomial Naive Bayes model with TF-IDF preprocessing on 4 categories of documents. The model is evaluated on a test set, showing it can accurately classify documents into the categories of comp.graphics and sci.space while sometimes confusing soc.religion.christian and talk.religion.misc.

Uploaded by

Meer Hassan


Text classification with Naive Bayes classifier

In this colab, we will use a Naive Bayes classifier to classify text.

Naive Bayes classifiers are commonly used for text classification and spam detection tasks.

Here is an example of how to perform text classification with a Naive Bayes classifier.

# Data loading
from sklearn.datasets import fetch_20newsgroups

# Preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

# Model/estimator
from sklearn.naive_bayes import MultinomialNB

# Pipeline utility
from sklearn.pipeline import make_pipeline

# Model evaluation
from sklearn.metrics import ConfusionMatrixDisplay

# Plotting library
import matplotlib.pyplot as plt

Exercise: Read about the TfidfVectorizer API.

Dataset
We will be using the 20 newsgroups dataset for classification.

As a first step, let's download the 20 newsgroups dataset with the fetch_20newsgroups API.

data = fetch_20newsgroups()

Let's look at the size of the dataset.

data.filenames.shape

data.target.shape

(11314,)

The documents themselves are stored in data.data, so that is what the vectorizer must be fitted on. (Calling fit_transform on filenames.data raises a NameError, because no variable named filenames is defined.)

vectorizer = TfidfVectorizer()

vectors = vectorizer.fit_transform(data.data)

vectors.shape

Let's look at the names of the classes.

data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

There are 20 categories in the dataset. For simplicity, we will select 4 of these categories and download the corresponding training and test sets.

categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']

train = fetch_20newsgroups(subset='train', categories=categories)

test = fetch_20newsgroups(subset='test', categories=categories)

Let's look at a sample training document:

print(train.data[6])

From: [email protected] (Don McGee)

Subject: Federal Hearing

Originator: dmcgee@uluhe

Organization: School of Ocean and Earth Science and Technology

Distribution: usa

Lines: 10

Fact or rumor....? Madalyn Murray O'Hare an atheist who eliminated the

use of the bible reading and prayer in public schools 15 years ago is now

going to appear before the FCC with a petition to stop the reading of the

Gospel on the airways of America. And she is also campaigning to remove

Christmas programs, songs, etc from the public schools. If it is true

then mail to Federal Communications Commission 1919 H Street Washington DC

20054 expressing your opposition to her request. Reference Petition number

2493.

This data is different from what we have seen so far: here, the training data consists of documents in text form.

Data preprocessing and modeling

As mentioned in the first week of the machine learning techniques course, we need to convert the text data to numeric form.

TfidfVectorizer is one such API that converts text input into a vector of numerical values.

We will use TfidfVectorizer as a preprocessing step to obtain the feature vector corresponding to each text document.

We will use a multinomial Naive Bayes classifier to categorize documents from the 20 newsgroups corpus.
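To make the classifier less of a black box, the following sketch spells out the multinomial Naive Bayes decision rule with Laplace smoothing and compares it against MultinomialNB. The count matrix X_counts, the labels y_toy and all other variable names are hypothetical. It uses raw counts for readability, whereas this colab feeds TF-IDF values to the same estimator.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 4 documents x 3 vocabulary terms.
X_counts = np.array([[3, 0, 1],
                     [2, 1, 0],
                     [0, 4, 1],
                     [1, 3, 2]])
y_toy = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing; also the default in MultinomialNB

classes = np.unique(y_toy)
# Log prior: fraction of training documents in each class.
log_prior = np.log(np.array([(y_toy == c).mean() for c in classes]))
# Smoothed per-class term totals, normalized into log likelihoods.
term_counts = np.array([X_counts[y_toy == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(term_counts / term_counts.sum(axis=1, keepdims=True))

# Class score per document: log prior + sum over terms of count * log likelihood.
scores = X_counts @ log_likelihood.T + log_prior
manual_pred = classes[scores.argmax(axis=1)]

sk_model = MultinomialNB(alpha=alpha).fit(X_counts, y_toy)
sk_pred = sk_model.predict(X_counts)
print(manual_pred, sk_pred)
```

The "naive" part is the sum over terms: each term contributes to the score independently of the others, given the class.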

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

Let's train the model.

model.fit(train.data, train.target)

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),

('multinomialnb', MultinomialNB())])
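The pipeline chains the two steps so that a single fit or predict call runs both. As a rough illustration on made-up data (toy_texts, toy_labels and new_texts are hypothetical), the sketch below checks that the pipeline gives the same predictions as calling the vectorizer and the classifier by hand.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, used only for illustration.
toy_texts = ["rocket launch orbit", "pixel shader image",
             "orbit around the moon", "render the image"]
toy_labels = [0, 1, 0, 1]
new_texts = ["launch to the moon", "shader for image rendering"]

# Pipeline: one object that vectorizes and then classifies.
toy_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)
pipeline_pred = toy_pipeline.predict(new_texts)

# The same two steps written out by hand.
vec = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(toy_texts), toy_labels)
# New text is only transformed; the vocabulary is not refitted.
two_step_pred = clf.predict(vec.transform(new_texts))

print(pipeline_pred, two_step_pred)
print(list(toy_pipeline.named_steps))
```

Note that make_pipeline names each step after its lowercased class name, which is where 'tfidfvectorizer' and 'multinomialnb' in the output above come from.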

Model evaluation

Let's predict the labels for the test set and plot the resulting confusion matrix. ConfusionMatrixDisplay.from_estimator does both in a single call.

ConfusionMatrixDisplay.from_estimator(
    model, test.data, test.target,
    display_labels=test.target_names,
    xticks_rotation='vertical')

plt.show()
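If the numbers are needed rather than the plot, confusion_matrix and accuracy_score from sklearn.metrics return them directly. The sketch below uses made-up labels (y_true and y_pred are hypothetical) so that it runs on its own. With the objects from this colab, the analogous call would presumably be confusion_matrix(test.target, model.predict(test.data)).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 1 2]]

# Accuracy is the sum of the diagonal divided by the number of samples.
print(accuracy_score(y_true, y_pred))  # 5 correct out of 7
```

Off-diagonal entries count confusions: here one class-1 document was predicted as class 2, and one class-2 document as class 1.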

Observe that:

There is some confusion between documents of the classes soc.religion.christian and talk.religion.misc, which is expected because both groups discuss religion.

The classes comp.graphics and sci.space are well separated, even by such a simple classifier.

We now have a tool to classify statements into one of these four classes.

We can call the predict function on the pipeline to predict the category of a test string.

def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

Using this function for prediction:

predict_category('sending a payload to the ISS')

'sci.space'

predict_category('discussing islam vs atheism')

'soc.religion.christian'
predict_category('determining the screen resolution')

'comp.graphics'
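A prediction alone does not say how sure the model is. As a possible extension, the sketch below returns the estimated class probability along with the label, using predict_proba on the pipeline. The toy data and the helper predict_with_confidence are hypothetical and stand in for the newsgroup model so that the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the newsgroup documents.
toy_texts = ["rocket launch orbit", "satellite in orbit",
             "pixel shader image", "render the image"]
toy_names = ["space", "graphics"]
toy_labels = [0, 0, 1, 1]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

def predict_with_confidence(s, model=toy_model, names=toy_names):
    """Return the predicted class name and its estimated probability."""
    proba = model.predict_proba([s])[0]  # one probability per class
    best = proba.argmax()
    return names[best], float(proba[best])

print(predict_with_confidence("launch a satellite"))
```

Naive Bayes probabilities tend to be poorly calibrated, so they are better read as a rough ranking signal than as exact confidences.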
