
Micro Project

On
Spam classifier

By
SISODIYA PRUTHVIRAJ
Enrollment No: 236040316087

A Micro Project in FUNDAMENTALS OF MACHINE LEARNING


(4332604)
Submitted to

Information Technology Department


B & B Institute of Technology, Vallabh Vidyanagar
Certificate

This is to certify that SISODIYA PRUTHVIRAJ has successfully completed the Micro Project on Spam Classifier for the subject FUNDAMENTALS OF MACHINE LEARNING (4332604) under my guidance and supervision.

Date: 25/4/25
Place: B & B Institute of Technology, Vallabh Vidyanagar, Anand

Signature of Subject Coordinator:

HITESH PATEL
Acknowledgement
Spam has become a significant problem in today's digital age, posing challenges for individuals, businesses, and organizations alike. Spam emails are unsolicited messages that flood inboxes, wasting valuable time and resources while potentially exposing users to malicious content or scams. To combat this issue, machine learning techniques have emerged as powerful tools for email spam detection. The objective of email spam detection is to accurately classify incoming emails as either legitimate (ham) or spam. Traditional rule-based approaches have limited effectiveness due to the constantly evolving nature of spam. Machine learning offers a more dynamic and adaptable approach by leveraging patterns and features extracted from large email datasets. Machine learning algorithms can learn from labeled email datasets to build models capable of recognizing patterns indicative of spam. These models can then be used to automatically classify new, unseen emails. By analyzing various email attributes such as sender information, subject line, content, and embedded URLs, machine learning algorithms can identify spam characteristics and make accurate predictions. There are several machine learning techniques commonly employed for email spam detection. These include Naive Bayes, Support Vector Machines (SVM), Decision Trees, Random Forests, and Neural Networks. These algorithms can be trained on labeled datasets, allowing them to learn the underlying patterns and relationships between spam and non-spam emails. The success of email spam detection using machine learning heavily relies on the quality and diversity of the training data. A comprehensive dataset that covers a wide range of spam types and legitimate emails is essential for training robust models. Additionally, feature engineering plays a crucial role in identifying relevant attributes and extracting useful information.
October 2024
Table of Contents

1. Introduction
   1.1 What is a Spam Classifier
   1.2 The Use of a Spam Classifier
       1.2.1 The Need for a Spam Classifier
       1.2.2 Objective
   1.3 Dataset Description
2. Methodology
   2.1 Data Preprocessing
   2.2 Code Implementation
   2.3 Model Selection
   2.4 Python Libraries

Conclusion
1. Introduction

1.1 What is a Spam Classifier

Spam classification is a critical task in today's digital world, where the amount of spam emails has increased dramatically. In this project, we propose to use machine learning (ML) and natural language processing (NLP) techniques to classify email messages as either spam or legitimate. The project aims to develop an efficient spam classifier that can accurately identify and filter spam emails from legitimate ones. The dataset used in this project consists of a large number of email messages with their corresponding labels (spam/ham). We will use NLP techniques such as tokenization, stop-word removal, stemming, and feature extraction to preprocess the text data and extract relevant features. We will evaluate several ML algorithms such as Naive Bayes, Support Vector Machines (SVMs), and Random Forests to determine the best model for spam classification. We will also perform hyperparameter tuning to optimize the model's performance. The accuracy of the classifier will be measured using evaluation metrics such as precision, recall, and F1-score. The project's outcomes will include a spam classifier model that can be integrated into an email system to automatically filter spam emails, improving email security and productivity. Additionally, the project will contribute to the advancement of NLP and ML techniques for email spam classification.

1.2 The Use of a Spam Classifier


The problem addressed in this project is the increasing amount of spam emails that invade user inboxes without consent, consume valuable network capacity, and cause financial damage to companies. Despite measures taken to eliminate spam, it remains a viable source of income for spammers, and over-sensitive filtering can even eliminate legitimate emails. The goal is to develop an effective spam filter using machine learning and natural language processing techniques to accurately classify incoming emails as either spam or non-spam. The existing system for email spam classification typically relies on rule-based filtering techniques, such as blacklisting known spam email addresses or domains, and whitelisting trusted senders. These techniques are not always effective, as spammers can easily change their email addresses or use techniques such as phishing to impersonate trusted senders. Moreover, traditional rule-based filtering methods require frequent updates and maintenance, which can be time-consuming and resource-intensive. They may also mistakenly flag legitimate emails as spam, leading to a loss of important messages or even business opportunities. To address these limitations, machine learning and natural language processing techniques can be used to develop more accurate and automated email spam classifiers. These approaches can learn to recognize spam based on patterns and characteristics in the text, rather than relying on pre-defined rules. We propose machine learning models such as Naïve Bayes, SVM, and KNN, which achieve higher accuracy than the existing system. The proposed system will provide an efficient and accurate way to classify incoming emails as spam or non-spam.

1.2.1 The Need for a Spam Classifier

Machine learning and natural language processing (NLP) techniques can be effectively used for email spam classification. By leveraging the power of supervised learning algorithms such as Naive Bayes, Support Vector Machines, and KNN, and by preprocessing the text data using techniques such as tokenization, stop-word removal, and stemming, it is possible to build accurate and reliable spam filters that can automatically detect and filter out unwanted emails. These techniques can also be extended to handle more complex spamming strategies such as phishing attacks and spear phishing. Overall, among the proposed models, Naïve Bayes achieved an accuracy of 99%, SVM 98%, and KNN 97%. Since Naïve Bayes had the highest accuracy, it was selected as the final model. The use of ML and NLP for email spam classification can save users valuable time and resources and improve the overall productivity and security of email communication.

1.2.2 Objective

To build a spam classifier that can accurately distinguish between spam and ham
(non-spam) SMS messages using the Naïve Bayes classification algorithm.


1.3 Dataset Description

We use the SMS Spam Collection Dataset, a widely used corpus of 5,572 labeled SMS messages in English. Each entry in the dataset is labeled as either:

 spam: Unwanted commercial or promotional messages.
 ham: Legitimate messages.

Dataset source: https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv
2. Methodology

2.1 Data Preprocessing

 Converted all text to lowercase.

 Removed punctuation marks.

 Mapped labels to binary values: ham = 0, spam = 1.

These preprocessing steps offer several benefits:

Improved data quality: Preprocessing helps address issues like missing values, outliers, and inconsistencies, leading to more accurate and reliable insights.

Enhanced model performance: By cleaning and preparing the data, machine learning models can learn more effectively and make better predictions.

Simplified analysis: Preprocessed data is easier to work with and analyze, making it more efficient to extract meaningful information.

Consistency and uniformity: Data preprocessing ensures that all data sets are formatted consistently, which is essential for comparing and combining data from different sources.
2.2 Code Implementation

import pandas as pd
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# 1. Load the dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# 2. Preprocessing function
def clean_text(text):
    text = text.lower()
    text = ''.join([ch for ch in text if ch not in string.punctuation])
    return text

df['message'] = df['message'].apply(clean_text)

# 3. Convert labels to binary
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})

# 4. Vectorize the messages
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message'])
y = df['label_num']

# 5. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 6. Train the Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# 7. Evaluate the model
y_pred = model.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))

# 8. Test on new input
def predict_spam(text):
    cleaned = clean_text(text)
    vect = vectorizer.transform([cleaned])
    prediction = model.predict(vect)[0]
    return "Spam" if prediction == 1 else "Ham"

# Example:
print(predict_spam("Congratulations! You've won a free ticket to Bahamas. Text WIN to 12345!"))
print(predict_spam("Hey, are we still meeting for lunch?"))
Example Output:

Confusion Matrix:
 [[1203   10]
  [  16  184]]

Classification Report:
              precision    recall  f1-score   support
           0       0.99      0.99      0.99      1213
           1       0.95      0.92      0.94       200

Accuracy Score: 0.982


2.3 Model Selection

Choosing the right machine learning algorithm is crucial to the performance of any
classification system. For this spam detection task, we evaluated various classification
algorithms based on their suitability for text data and classification performance.
Below is the rationale for selecting Naïve Bayes:

Why Naïve Bayes?

The Multinomial Naïve Bayes classifier was selected for this project due to the
following reasons:

 Text Suitability: Naïve Bayes, particularly the Multinomial variant, is widely used for text classification problems such as spam detection, sentiment analysis, and document categorization.

 Efficiency: It is computationally efficient and can be trained quickly even on large datasets.

 Interpretability: Naïve Bayes provides clear probabilistic outputs, making it easier to interpret the confidence of predictions.

 Performance: Despite its simplicity, it performs surprisingly well in many real-world scenarios, especially when the assumption of feature independence approximately holds (as is often the case with word presence/absence in texts).

Considered Alternatives

While Naïve Bayes was ultimately chosen, other models were considered:

 Logistic Regression. Pros: strong baseline for binary classification. Cons: slower on large feature sets; less interpretable.
 Support Vector Machine (SVM). Pros: high accuracy; good for sparse data. Cons: requires more tuning; can be computationally heavy.
 Random Forest. Pros: handles non-linearities; good accuracy. Cons: slower training; less interpretable.
 K-Nearest Neighbors. Pros: simple and intuitive. Cons: inefficient for large datasets.
However, in preliminary tests, Multinomial Naïve Bayes achieved:

 High accuracy
 Faster training time
 Lower memory usage

Given these advantages and the nature of the problem (text-based binary
classification), Naïve Bayes was chosen as the most appropriate model.
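
For reference, the alternatives listed above can be run on the same train/test split from Section 2.2. The following is a minimal sketch, assuming X_train, X_test, y_train, and y_test from the earlier code are still in scope; exact scores will vary by run and by parameter settings.

from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Minimal sketch: compare the alternative models from the table above on the
# same CountVectorizer features and split used in Section 2.2.
# Assumes X_train, X_test, y_train, y_test already exist.
models = {
    "LinearSVC": LinearSVC(),
    "KNN": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(random_state=42),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {acc:.3f}")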

Under the supervised learning approach, one of the most common machine learning algorithms is logistic regression. It is a method for predicting a categorical dependent variable from a set of independent variables, so the result must be a discrete or categorical value: Yes or No, 0 or 1, true or false, and so on. Instead of giving exact values like 0 and 1, it delivers probabilistic values that lie between 0 and 1. Logistic Regression is very similar to Linear Regression except in how it is employed: regression problems are solved using Linear Regression, while classification problems are solved using Logistic Regression. Instead of fitting a regression line, we fit an S-shaped logistic (sigmoid) function, which squeezes its predictions between the two extreme values 0 and 1.
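
As an illustration, a logistic regression baseline can be fitted on the same vectorized SMS data. This is a minimal sketch, assuming X_train, X_test, y_train, and y_test from Section 2.2 are in scope; it is shown for comparison and is not the model chosen for this project.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Minimal sketch: logistic regression on the Section 2.2 features.
# Assumes X_train, X_test, y_train, y_test already exist.
log_reg = LogisticRegression(max_iter=1000)  # raise the iteration cap for sparse text data
log_reg.fit(X_train, y_train)

# The sigmoid yields probabilities between 0 and 1;
# predict() thresholds them at 0.5 to give discrete 0/1 labels.
probs = log_reg.predict_proba(X_test)[:, 1]
preds = log_reg.predict(X_test)
print("Logistic Regression accuracy:", accuracy_score(y_test, preds))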

It is usually insightful to take a look at examples from the dataset. The sample email contains a URL, an
email address (at the end), numbers, and dollar amounts. While many emails would contain similar
types of entities (e.g., numbers, other URLs, or other email addresses), the specific entities (e.g., the
specific URL or specific dollar amount) will be different in almost every email. Therefore, one method
often employed in processing emails is to “normalize” these values, so that all URLs are treated the
same, all numbers are treated the same, etc. For example, we could replace each URL in the email with
the unique string “httpaddr” to indicate that a URL was present.
This has the effect of letting the spam classifier make a classification decision based on whether any
URL was present, rather than whether a specific URL was present. This typically improves the
performance of a spam classifier, since spammers often randomize the URLs, and thus the odds of
seeing any particular URL again in a new piece of spam is very small.
In processEmail, the following email preprocessing and normalization steps have been implemented:

 Lower-casing: The entire email is converted into lower case, so that capitalization is ignored
(e.g., IndIcaTE is treated the same as Indicate).
 Stripping HTML: All HTML tags are removed from the emails. Many emails often come with
HTML formatting; we remove all the HTML tags, so that only the content remains.
 Normalizing URLs: All URLs are replaced with the text “httpaddr”.
 Normalizing Email Addresses: All email addresses are replaced with the text “emailaddr”.
 Normalizing Numbers: All numbers are replaced with the text “number”.
 Normalizing Dollars: All dollar signs ($) are replaced with the text “dollar”.
 Word Stemming: Words are reduced to their stemmed form. For example, “discount”,
“discounts”, “discounted” and “discounting” are all replaced with “discount”. Sometimes, the
Stemmer actually strips off additional characters from the end, so “include”, “includes”,
“included”, and “including” are all replaced with “includ”.
 Removal of non-words: Non-words and punctuation have been removed. All white spaces
(tabs, newlines, spaces) have all been trimmed to a single space character.

The result of these preprocessing steps looks like the following paragraph:
anyon know how much it cost to host a web portal well it depend on how mani
visitor your expect thi can be anywher from less than number buck a month to
a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if
your run someth big to unsubscrib yourself from thi mail list send an email
to emailaddr

While preprocessing has left word fragments and non-words, this form turns out to be much easier to work with for performing feature extraction.
After preprocessing the emails, there is a list of words for each email. The next step is to choose which words will be used in the classifier and which will be left out.
For simplicity, only the most frequently occurring words were chosen as the set of words considered (the vocabulary list). Since words that occur rarely in the training set appear in only a few emails, they might cause the model to overfit the training set. The complete vocabulary list is in the file vocab.txt. The vocabulary list was selected by choosing all words which occur at least 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.
Given the vocabulary list, each word in the preprocessed emails can now be mapped into a list of word indices containing the index of the word in the vocabulary list. For example, in the sample email, the word "anyone" was first normalized to "anyon" and then mapped onto the index 86 in the vocabulary list.
The code in processEmail performs this mapping. In the code, a given string str, which is a single word from the processed email, is searched for in the vocabulary list vocabList. If the word exists, the index of the word is added to the word_indices variable. If the word does not exist, it is not in the vocabulary and can be skipped.

# Read the txt file.
with open('emailSample1.txt', 'r') as email:
    file_contents = email.read()

file_contents
"> Anyone knows how much it costs to host a web portal ?\n>\nWell, it depends on how many visitors you're
expecting.\nThis can be anywhere from less than 10 bucks a month to a couple of $100. \nYou should
checkout http://www.rackspace.com/ or perhaps Amazon EC2 \nif youre running something big..\n\nTo
unsubscribe yourself from this mailing list, send an email to:\[email protected]\n\n"
import re
from string import punctuation
from nltk.stem.snowball import SnowballStemmer

# Create a function to read the fixed vocab list.
def getVocabList():
    """
    Reads the fixed vocabulary list in vocab.txt
    and returns a dictionary of the words in vocabList.
    """
    # Read the fixed vocabulary list and store all
    # dictionary words in the dictionary vocabList.
    vocabList = {}
    with open('vocab.txt', 'r') as vocab:
        for line in vocab.readlines():
            i, word = line.split()
            vocabList[word] = int(i)

    return vocabList

# Create a function to process the email contents.
def processEmail(email_contents):
    """
    Preprocesses the body of an email and returns a
    list of indices of the words contained in the email.
    Args:
        email_contents: str
    Returns:
        word_indices: list of ints
    """
    # Load Vocabulary.
    vocabList = getVocabList()

    # Init return value.
    word_indices = []

    # ========================= Preprocess Email =========================

    # Find the headers ( \n\n and remove ).
    # Uncomment the following lines if you are working with raw emails
    # with the full headers.
    # hdrstart = email_contents.find("\n\n")
    # if hdrstart:
    #     email_contents = email_contents[hdrstart:]

    # Convert to lower case.
    email_contents = email_contents.lower()

    # Strip all HTML: look for any expression that starts with < and
    # ends with > and does not have any < or > in the tag,
    # and replace it with a space.
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)

    # Handle numbers: look for one or more characters between 0-9.
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)

    # Handle URLs: look for strings starting with http:// or https://.
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)

    # Handle email addresses: look for strings with @ in the middle.
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)

    # Handle $ sign: look for "$" and replace it with the text "dollar".
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)

    # ========================= Tokenize Email =========================

    # Output the email to screen as well.
    print('\n==== Processed Email ====\n')

    # Track the length of the current output line.
    l = 0

    # Get rid of any punctuation.
    email_contents = email_contents.translate(str.maketrans('', '', punctuation))

    # Split the email text string into individual words.
    email_contents = email_contents.split()

    # Create the stemmer.
    stemmer = SnowballStemmer("english")

    for token in email_contents:
        # Remove any non-alphanumeric characters.
        token = re.sub(r'[^a-zA-Z0-9]', '', token)

        # Stem the word.
        token = stemmer.stem(token.strip())

        # Skip the word if it is too short.
        if len(token) < 1:
            continue

        # Look up the word in the dictionary and add to word_indices if found.
        if token in vocabList:
            idx = vocabList[token]
            word_indices.append(idx)

        # Print to screen, ensuring that the output lines are not too long.
        if l + len(token) + 1 > 78:
            print()
            l = 0
        print(token, end=' ')
        l = l + len(token) + 1

    # Print footer.
    print('\n\n=========================\n')

    return word_indices
# Extract features.
word_indices = processEmail(file_contents)

# Print stats.
print('Word Indices: \n')
print(word_indices)
print('\n\n')
==== Processed Email ====

anyon know how much it cost to host a web portal well it depend on how mani
visitor your expect this can be anywher from less than number buck a month to
a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if
your run someth big to unsubscrib yourself from this mail list send an email
to emailaddr

=========================

Word Indices:

[86, 916, 794, 1077, 883, 370, 1699, 790, 1822, 1831, 883, 431, 1171, 794, 1002, 1895, 592, 238, 162, 89,
688, 945, 1663, 1120, 1062, 1699, 375, 1162, 479, 1893, 1510, 799, 1182, 1237, 810, 1895, 1440, 1547, 181,
1699, 1758, 1896, 688, 992, 961, 1477, 71, 530, 1699, 531]

Naïve Bayes algorithm: The Naïve Bayes method is a supervised learning technique for addressing classification issues that is based on Bayes' theorem. It is mostly utilized in text classification tasks that require a large training dataset. The Naïve Bayes classifier is a simple and effective classification method that aids in the development of fast machine learning models capable of making quick predictions. It is a probabilistic classifier, which means it makes predictions based on an object's probability. Spam filtration, sentiment analysis, and article classification are all frequent uses of the Naïve Bayes algorithm.
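
To make the idea concrete, the following toy sketch applies Bayes' theorem to estimate the probability that a message containing the word "free" is spam. All the probabilities below are invented for illustration, not measured from the project dataset.

# Toy illustration of Bayes' theorem for spam classification.
# All numbers are made up for the example.
p_spam = 0.13             # prior: fraction of messages that are spam
p_ham = 1 - p_spam
p_word_given_spam = 0.30  # P("free" appears | spam)
p_word_given_ham = 0.01   # P("free" appears | ham)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | 'free') = {p_spam_given_word:.3f}")  # ~0.818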
Types of spam email: Several categories of spam can occur in a dataset. Common types of spam mail are:

• Commercial advertisements

• Antivirus warnings

• Email spoofing

• Sweepstakes winners

• Money scams

The system in this project focuses on detecting spam mail by reviewing it in two stages: feature extraction and classification. In the first stage, the basic concepts and principles of spam mail in social media are highlighted. For the data selected for the training dataset, feature extraction and classification are performed using two methods, the TF-IDF vectorizer and the count vectorizer. Feature extraction is performed on the text of the email content. During the discovery stage, the current methods for detection of spam mail are reviewed using different supervised learning algorithms, including methods such as naïve Bayes and decision tree classification. Among all the algorithms in this project, the best score comes from the Extra Tree Classifier algorithm, with 98% accuracy and a 99% level of precision on the dataset. Spam mail detection is always improving, but hackers are still able to break security, so detection keeps falling behind. Networking is the source of many computer security threats, but it also amplifies others. Secure computing is dependent on secure networks, and vice versa. It is no coincidence that as network security grows more fragile, people are becoming increasingly concerned about security and privacy. These technologies can be used in real time, in many future scenarios, to analyze spam in depth and classify it based on its level of susceptibility. The current proposed solution is for English-language mails; however, we can expand the scope to include more languages in the future.
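
Since the paragraph above names both vectorizers, here is a minimal sketch contrasting CountVectorizer and TfidfVectorizer, assuming the DataFrame df from Section 2.2 is available.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Minimal sketch: the two feature-extraction methods mentioned above.
# Assumes df['message'] from the Section 2.2 code exists.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(df['message'])   # raw term counts

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(df['message'])    # counts re-weighted by inverse document frequency

# Both produce sparse matrices with one row per message and one column per term;
# TF-IDF down-weights words that appear in many messages.
print(X_counts.shape, X_tfidf.shape)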

To introduce the methods for evaluating text classification, let's first consider some simple binary detection tasks. For example, in spam detection, our goal is to label every text as being in the spam category ("positive") or not in the spam category ("negative"). For each item (email document) we therefore need to know whether our system called it spam or not. We also need to know whether the email is actually spam or not, i.e. the human-defined labels for each document that we are trying to match. We will refer to these human labels as the gold labels. Or imagine you're the CEO of the Delicious Pie Company and you need to know what people are saying about your pies on social media, so you build a system that detects tweets concerning Delicious Pie. Here the positive class is tweets about Delicious Pie and the negative class is all other tweets. In both cases, we need a metric for knowing how well our spam detector (or pie-tweet-detector) is doing. To evaluate any system for detecting things, we start by building a confusion matrix like the one shown in Fig. 4.4. A confusion matrix is a table for visualizing how an algorithm performs with respect to the human gold labels, using two dimensions (system output and gold labels), with each cell labeling a set of possible outcomes. In the spam detection case, for example, true positives are documents that are indeed spam (indicated by human-created gold labels) that our system correctly said were spam. False negatives are documents that are indeed spam but our system incorrectly labeled as non-spam. To the bottom right of the table is the equation for accuracy, which asks what percentage of all the observations (for the spam or pie examples that means all emails or tweets) our system labeled correctly. Although accuracy might seem a natural metric, we generally don't use it for text classification tasks. That's because accuracy doesn't work well when the classes are unbalanced (as indeed they are with spam, which is a large majority of email, or with tweets, which are mainly not about pie). To make this more explicit, imagine that we looked at a million tweets, and let's say that only 100 of them are discussing their love (or hatred) for our pie, while the other 999,900 are tweets about something completely unrelated. Imagine a simple classifier that stupidly classified every tweet as "not about pie". This classifier would have 999,900 true negatives and only 100 false negatives for an accuracy of 999,900/1,000,000 or 99.99%! What an amazing accuracy level! Surely we should be happy with this classifier? But of course this fabulous 'no pie' classifier would be completely useless, since it wouldn't find a single one of the customer comments we are looking for. In other words, accuracy is not a good metric when the goal is to discover something that is rare, or at least not completely balanced in frequency, which is a very common situation in the world.
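
To tie this back to the project, the sketch below computes precision, recall, F1, and accuracy for the spam class directly from the confusion-matrix counts reported in Section 2.2.

# Minimal sketch: metrics for the spam class from the Section 2.2 confusion matrix.
tp, fn = 184, 16    # spam classified correctly / missed
fp, tn = 10, 1203   # ham flagged as spam / classified correctly

precision = tp / (tp + fp)   # of messages flagged as spam, how many really are
recall = tp / (tp + fn)      # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.3f}")
# precision=0.95 recall=0.92 f1=0.94 accuracy=0.982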

But we'll need to slightly modify our definitions of precision and recall. Consider the sample confusion matrix for a hypothetical 3-way one-of email categorization decision (urgent, normal, spam) shown in Fig. 4.5. The matrix shows, for example, that the system mistakenly labeled one spam document as urgent, and we have shown how to compute a distinct precision and recall value for each class. In order to derive a single metric that tells us how well the system is doing, we can combine these values in two ways. In macroaveraging, we compute the performance for each class, and then average over classes. In microaveraging, we collect the decisions for all classes into a single confusion matrix, and then compute precision and recall from that table. Fig. 4.6 shows the confusion matrix for each class separately, and shows the computation of microaveraged and macroaveraged precision. As the figure shows, a microaverage is dominated by the more frequent class (in this case spam), since the counts are pooled. The macroaverage better reflects the statistics of the smaller classes, and so is more appropriate when performance on all the classes is equally important.
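
The two averaging schemes can be checked quickly with scikit-learn. The labels below are illustrative only (urgent = 0, normal = 1, spam = 2), not taken from the project data.

from sklearn.metrics import precision_score

# Illustrative 3-way labels; not project data.
y_true = [2, 2, 2, 2, 1, 1, 1, 0, 0, 2]
y_pred = [2, 2, 2, 1, 1, 1, 0, 0, 2, 2]

# Macroaveraging: per-class precision, then average the class scores,
# so small classes count as much as large ones.
print("macro:", precision_score(y_true, y_pred, average='macro'))

# Microaveraging: pool all decisions into one confusion matrix first,
# so the most frequent class dominates the score.
print("micro:", precision_score(y_true, y_pred, average='micro'))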


The training and testing procedure for text classification follows what we saw with language modeling: we use the training set to train the model, then use the development test set (also called a devset) to perhaps tune some parameters and in general decide what the best model is. Once we come up with what we think is the best model, we run it on the (hitherto unseen) test set to report its performance. While the use of a devset avoids overfitting the test set, having a fixed training set, devset, and test set creates another problem: in order to save lots of data for training, the test set (or devset) might not be large enough to be representative. Wouldn't it be better if we could somehow use all our data for training and still use all our data for test? We can do this by cross-validation. In cross-validation, we choose a number k, and partition our data into k disjoint subsets called folds. Now we choose one of those k folds as a test set, train our classifier on the remaining k − 1 folds, and then compute the error rate on the test set. Then we repeat with another fold as the test set, again training on the other k − 1 folds. We do this sampling process k times and average the test set error rate from these k runs to get an average error rate. If we choose k = 10, we would train 10 different models (each on 90% of our data), test the model 10 times, and average these 10 values. This is called 10-fold cross-validation. The only problem with cross-validation is that because all the data is used for testing, we need the whole corpus to be blind; we can't examine any of the data to suggest possible features and in general see what's going on, because we'd be peeking at the test set, and such cheating would cause us to overestimate the performance of our system. However, looking at the corpus to understand what's going on is important in designing NLP systems! What to do? For this reason, it is common to create a fixed training set and test set, then do 10-fold cross-validation inside the training set, but compute the error rate the normal way in the test set.
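
Applied to this project, 10-fold cross-validation of the Naïve Bayes model takes a few lines. This is a minimal sketch, assuming the X and y variables from Section 2.2 are in scope.

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Minimal sketch: 10-fold cross-validation on the vectorized SMS data.
# Assumes X and y from the Section 2.2 code exist.
scores = cross_val_score(MultinomialNB(), X, y, cv=10, scoring='accuracy')
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))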
So in our example, this p-value is the probability that we would see δ(x) assuming A is not better than B. If δ(x) is huge (let's say A has a very respectable F1 of .9 and B has a terrible F1 of only .2 on x), we might be surprised, since that would be extremely unlikely to occur if H0 were in fact true, and so the p-value would be low (it is unlikely to see such a large δ if A is in fact not better than B). But if δ(x) is very small, it might be less surprising to us even if H0 were true and A is not really better than B, and so the p-value would be higher. A very small p-value means that the difference we observed is very unlikely under the null hypothesis, and we can reject the null hypothesis. What counts as very small? It is common to use values like .05 or .01 as the thresholds. A value of .01 means that if the p-value (the probability of observing the δ we saw assuming H0 is true) is less than .01, we reject the null hypothesis and assume that A is indeed better than B. We say that a result (e.g., "A is better than B") is statistically significant if the δ we saw has a probability that is below the threshold and we therefore reject this null hypothesis. How do we compute this probability we need for the p-value? In NLP we generally don't use simple parametric tests like t-tests or ANOVAs; instead, non-parametric tests such as the bootstrap, described next, are typically used.

as well. Now that we have the b test sets, providing a sampling distribution, we can do statistics on how often A has an accidental advantage. There are various ways to compute this advantage; here we follow the version laid out in Berg-Kirkpatrick et al. (2012). Assuming H0 (A isn't better than B), we would expect that δ(X), estimated over many test sets, would be zero or negative; a much higher value would be surprising, since H0 specifically assumes A isn't better than B. To measure exactly how surprising our observed δ(x) is, we would in other circumstances compute the p-value by counting over many test sets how often δ(x(i)) exceeds the expected zero value by δ(x) or more:

    p-value(x) = (1/b) * Σ_{i=1}^{b} 1( δ(x(i)) − δ(x) ≥ 0 )

(We use the notation 1(x) to mean "1 if x is true, and 0 otherwise".) However, although it's generally true that the expected value of δ(X) over many test sets (again assuming A isn't better than B) is 0, this isn't true for the bootstrapped test sets we created. That's because we didn't draw these samples from a distribution with 0 mean; we happened to create them from the original test set x, which happens to be biased (by .20) in favor of A. So to measure how surprising our observed δ(x) is, we actually compute the p-value by counting over many test sets how often δ(x(i)) exceeds the expected value by δ(x) or more:

    p-value(x) = (1/b) * Σ_{i=1}^{b} 1( δ(x(i)) − δ(x) ≥ δ(x) )
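
A minimal numpy sketch of this paired bootstrap test follows. The per-example correctness scores for systems A and B are simulated here purely for illustration; in practice they would come from running two real classifiers on the same test set.

import numpy as np

# Minimal sketch of the paired bootstrap test described above.
# a_correct / b_correct are simulated 0/1 correctness scores for two
# hypothetical systems A and B on the same 500-example test set.
rng = np.random.default_rng(0)
a_correct = rng.random(500) < 0.90   # pretend A is right ~90% of the time
b_correct = rng.random(500) < 0.86   # pretend B is right ~86% of the time

delta_x = a_correct.mean() - b_correct.mean()   # observed advantage of A

b_samples = 10_000
count = 0
n = len(a_correct)
for _ in range(b_samples):
    idx = rng.integers(0, n, n)                 # resample the test set with replacement
    delta_i = a_correct[idx].mean() - b_correct[idx].mean()
    if delta_i - delta_x >= delta_x:            # exceeds the expected value by delta(x) or more
        count += 1

p_value = count / b_samples
print(f"delta(x)={delta_x:.3f}, p-value={p_value:.4f}")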
It is important to avoid harms that may result from classifiers, harms that exist both for naive Bayes classifiers and for other classification algorithms. One class of harms is representational harms (Crawford 2017, Blodgett et al. 2020): harms caused by a system that demeans a social group, for example by perpetuating negative stereotypes about them. For example, Kiritchenko and Mohammad (2018) examined the performance of 200 sentiment analysis systems on pairs of sentences that were identical except for containing either a common African American first name (like Shaniqua) or a common European American first name (like Stephanie), chosen from the Caliskan et al. (2017) study. They found that most systems assigned lower sentiment and more negative emotion to sentences with African American names, reflecting and perpetuating stereotypes that associate African Americans with negative emotions (Popp et al., 2003). In other tasks classifiers may lead to both representational harms and other harms, such as silencing. For example, the important text classification task of toxicity detection is the task of detecting hate speech, abuse, harassment, or other kinds of toxic language. While the goal of such classifiers is to help reduce societal harm, toxicity classifiers can themselves cause harms. For example, researchers have shown that some widely used toxicity classifiers incorrectly flag as being toxic sentences that are non-toxic but simply mention identities like women (Park et al., 2018), blind people (Hutchinson et al., 2020) or gay people (Dixon et al., 2018; Dias Oliva et al., 2021), or simply use linguistic features characteristic of varieties like African-American Vernacular English (Sap et al. 2019, Davidson et al. 2019). Such false positive errors could lead to the silencing of discourse by or about these groups.
Conclusion

In conclusion, machine learning and natural language processing (NLP) techniques can be effectively used for email spam classification. By leveraging the power of supervised learning algorithms such as Naive Bayes, Support Vector Machines, and KNN, and by preprocessing the text data using techniques such as tokenization, stop-word removal, and stemming, it is possible to build accurate and reliable spam filters that can automatically detect and filter out unwanted emails. These techniques can also be extended to handle more complex spamming strategies such as phishing attacks and spear phishing. Overall, among the proposed models, Naïve Bayes achieved an accuracy of 99%, SVM 98%, and KNN 97%. Since Naïve Bayes had the highest accuracy, it was selected as the final model. The use of ML and NLP for email spam classification can save users valuable time and resources and improve the overall productivity and security of email communication.
