A Micro Project Report
On
Spam Classifier
By
SISODIYA PRUTHIVRAJ
Enrollment No: 236040316087
Date: 25/4/25
Place: B&B INSTITUTE OF TECHNOLOGY, V.V. NAGAR, ANAND
1 Introduction
1.1 What is a Spam Classifier
1.2 What is the Use of a Spam Classifier
1.2.1 What is the Need for a Spam Classifier
1.2.2 Objective
1.3 Dataset Description
2 Methodology
2.1 Data Preprocessing
2.2 Code Implementation
2.3 Model Selection
2.4 Python Libraries
Conclusion
1. Introduction
Spam classification is a critical task in today's digital world, where the amount of spam email has increased dramatically. In this project, we propose to use machine learning (ML) and natural language processing (NLP) techniques to classify email messages as either spam or legitimate. The project aims to develop an efficient spam classifier that can accurately identify and filter spam emails from legitimate ones. The dataset used in this project consists of a large number of email messages with their corresponding labels (spam/ham). We use NLP techniques such as tokenization, stop word removal, stemming, and feature extraction to preprocess the text data and extract relevant features. We evaluate several ML algorithms, such as Naive Bayes, Support Vector Machines (SVMs), and Random Forests, to determine the best model for spam classification. We also perform hyperparameter tuning to optimize the model's performance. The accuracy of the classifier is measured using evaluation metrics such as precision, recall, and F1-score. The project's outcomes include a spam classifier model that can be integrated into an email system to automatically filter spam emails, improving email security and productivity. Additionally, the project contributes to the advancement of NLP and ML techniques for email spam classification.
1.2.2 Objective
To build a spam classifier that can accurately distinguish between spam and ham
(non-spam) SMS messages using the Naïve Bayes classification algorithm.
2. Methodology
2.1 Data Preprocessing
Data preprocessing offers several benefits:
• Improved model performance: By cleaning and preparing the data, machine learning models can learn more effectively and make better predictions.
• Simplified analysis: Preprocessed data is easier to work with and analyze, making it more efficient to extract meaningful information.
• Consistent formatting: Data preprocessing ensures that all data sets are formatted consistently, which is essential for comparing and combining data from different sources.
2.2 Code Implementation
The implementation in Python is given below. (The dataset file name and column names are placeholders for the labelled SMS dataset described earlier.)

import pandas as pd
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# 1. Load the dataset (the file name and column layout of the labelled SMS CSV are assumed)
df = pd.read_csv('spam.csv', encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'message']

# 2. Preprocessing function: lower-case the text and strip punctuation
def clean_text(text):
    text = text.lower()
    text = ''.join([ch for ch in text if ch not in string.punctuation])
    return text
df['message'] = df['message'].apply(clean_text)

# 3. Vectorize the messages, split the data, and train Multinomial Naive Bayes
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message'])
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = MultinomialNB().fit(X_train, y_train)

# 4. Classify a new message
def predict_spam(message):
    return model.predict(vectorizer.transform([clean_text(message)]))[0]

# Example:
print(predict_spam("Congratulations! You've won a free ticket to Bahamas. Text WIN to 12345!"))
print(predict_spam("Hey, are we still meeting for lunch?"))
Example Output:
Confusion Matrix:
[[1203 10]
[ 16 184]]
Classification Report:
2.3 Model Selection
Choosing the right machine learning algorithm is crucial to the performance of any classification system. For this spam detection task, we evaluated various classification algorithms based on their suitability for text data and their classification performance. Below is the rationale for selecting Naïve Bayes.
The Multinomial Naïve Bayes classifier was selected for this project for the following reasons:
• High accuracy
• Faster training time
• Lower memory usage
Given these advantages and the nature of the problem (text-based binary classification), Naïve Bayes was chosen as the most appropriate model.
Considered Alternatives
While Naïve Bayes was ultimately chosen, other models were considered. One of them is logistic regression:
Under the supervised learning approach, one of the most common machine learning algorithms is logistic regression. It is a method for predicting a categorical dependent variable from a set of independent variables, so the output must be a discrete or categorical value: Yes or No, 0 or 1, true or false, and so on. Instead of giving exact values like 0 and 1, it delivers probabilistic values that lie between 0 and 1. Except for how they are employed, logistic regression is very similar to linear regression: regression problems are solved using linear regression, while classification problems are solved using logistic regression. Instead of fitting a regression line, logistic regression fits an "S"-shaped logistic function, which predicts the two extreme values (0 and 1).
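As an illustrative sketch only (not the model finally used in this project), a logistic regression baseline could be trained on the same bag-of-words features; the variables X_train, X_test, y_train, and y_test are assumed from the split in Section 2.2.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit a logistic regression baseline on the same count features used for Naive Bayes.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print('Logistic Regression accuracy:', accuracy_score(y_test, logreg.predict(X_test)))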
Before extracting features, it is usually insightful to take a look at examples from the dataset. The sample email contains a URL, an
email address (at the end), numbers, and dollar amounts. While many emails would contain similar
types of entities (e.g., numbers, other URLs, or other email addresses), the specific entities (e.g., the
specific URL or specific dollar amount) will be different in almost every email. Therefore, one method
often employed in processing emails is to “normalize” these values, so that all URLs are treated the
same, all numbers are treated the same, etc. For example, we could replace each URL in the email with
the unique string “httpaddr” to indicate that a URL was present.
This has the effect of letting the spam classifier make a classification decision based on whether any
URL was present, rather than whether a specific URL was present. This typically improves the
performance of a spam classifier, since spammers often randomize the URLs, and thus the odds of
seeing any particular URL again in a new piece of spam is very small.
In processEmail, the following email preprocessing and normalization steps have been implemented (a short sketch of the HTML-stripping and stemming steps follows this list):
• Lower-casing: The entire email is converted into lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).
• Stripping HTML: All HTML tags are removed from the emails. Many emails come with HTML formatting; we remove all the HTML tags so that only the content remains.
• Normalizing URLs: All URLs are replaced with the text "httpaddr".
• Normalizing Email Addresses: All email addresses are replaced with the text "emailaddr".
• Normalizing Numbers: All numbers are replaced with the text "number".
• Normalizing Dollars: All dollar signs ($) are replaced with the text "dollar".
• Word Stemming: Words are reduced to their stemmed form. For example, "discount", "discounts", "discounted" and "discounting" are all replaced with "discount". Sometimes the stemmer strips off additional characters from the end, so "include", "includes", "included", and "including" are all replaced with "includ".
• Removal of non-words: Non-words and punctuation have been removed. All white spaces (tabs, newlines, spaces) have been trimmed to a single space character.
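A minimal sketch of the HTML-stripping and stemming steps, using NLTK's SnowballStemmer (the regular expressions and example text are illustrative):

import re
from nltk.stem.snowball import SnowballStemmer

def strip_html_and_stem(email_contents):
    # Strip HTML: remove anything that looks like a tag and replace it with a space.
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    # Stem each remaining word with the Snowball (Porter2) stemmer.
    stemmer = SnowballStemmer('english')
    tokens = re.split(r'[^a-zA-Z0-9]+', email_contents.lower())
    return ' '.join(stemmer.stem(t) for t in tokens if t)

print(strip_html_and_stem('<p>Huge discounts included!</p>'))   # -> huge discount includ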
The result of these preprocessing steps looks like the following paragraph:
anyon know how much it cost to host a web portal well it depend on how mani
visitor your expect thi can be anywher from less than number buck a month to
a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if
your run someth big to unsubscrib yourself from thi mail list send an email
to emailaddr
While preprocessing has left word fragments and non-words, this form turns out to be much easier to work with when performing feature extraction.
After preprocessing the emails, there is a list of words for each email. The next step is to choose which words will be used in the classifier and which will be left out.
For simplicity, only the most frequently occurring words have been chosen as the set of words considered (the vocabulary list). Since words that occur rarely in the training set appear in only a few emails, they might cause the model to overfit the training set. The complete vocabulary list is in the file vocab.txt. The vocabulary list was selected by choosing all words which occur at least 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.
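A minimal sketch of this vocabulary-selection step (it assumes the corpus has already been preprocessed into space-separated stemmed tokens, and mirrors the 1-based indices of vocab.txt):

from collections import Counter

def build_vocab(processed_emails, min_count=100):
    # Count tokens across the preprocessed spam corpus.
    counts = Counter(token for email in processed_emails for token in email.split())
    # Keep only the words that occur at least min_count times, in sorted order.
    vocab = sorted(word for word, c in counts.items() if c >= min_count)
    # Map each word to a 1-based index, as in vocab.txt.
    return {word: i + 1 for i, word in enumerate(vocab)}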
Given the vocabulary list, each word in the preprocessed emails can now be mapped into a list of word indices that contains the index of the word in the vocabulary list. For example, in the sample email, the word "anyone" was first normalized to "anyon" and then mapped onto the index 86 in the vocabulary list.
The code in processEmail performs this mapping. In the code, a given string, which is a single word from the processed email, is looked up in the vocabulary list vocabList. If the word exists, its index is added to the word_indices variable. If the word does not exist, and is therefore not in the vocabulary, the word is skipped.
The sample email is stored in the variable file_contents:
"> Anyone knows how much it costs to host a web portal ?\n>\nWell, it depends on how many visitors you're
expecting.\nThis can be anywhere from less than 10 bucks a month to a couple of $100. \nYou should
checkout http://www.rackspace.com/ or perhaps Amazon EC2 \nif youre running something big..\n\nTo
unsubscribe yourself from this mailing list, send an email to:\[email protected]\n\n"
The relevant part of processEmail is reconstructed below (the format of vocab.txt, one "<index> <word>" pair per line, is assumed):

import re
from nltk.stem.snowball import SnowballStemmer

def getVocabList():
    # Load the fixed vocabulary (each line of vocab.txt is assumed to be "<index>\t<word>").
    vocabList = {}
    with open('vocab.txt') as f:
        for line in f:
            idx, word = line.split()
            vocabList[word] = int(idx)
    return vocabList

def processEmail(email_contents):
    vocabList = getVocabList()
    word_indices = []

    # (Optional) strip the email headers, which end at the first blank line.
    # hdrstart = email_contents.find("\n\n")
    # if hdrstart:
    #     email_contents = email_contents[hdrstart:]

    # Convert to lower case.
    email_contents = email_contents.lower()
    # Strip HTML: remove anything that looks like a tag.
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    # Handle numbers: look for one or more characters between 0-9.
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)
    # Handle URLs: look for strings starting with http:// or https://.
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    # Handle email addresses: look for strings containing "@".
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    # Handle $ sign: look for "$" and replace it with the text "dollar".
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)

    # Process file: split into tokens, stem each word, and look it up in the vocabulary.
    stemmer = SnowballStemmer('english')
    print('==== Processed Email ====\n')
    l = 0
    for token in re.split(r'[^a-zA-Z0-9]+', email_contents):
        token = stemmer.stem(token.strip())
        if len(token) < 1:
            continue
        # If the word exists in the vocabulary, record its index; otherwise skip it.
        if token in vocabList:
            word_indices.append(vocabList[token])
        # Print to screen, ensuring that the output lines are not too long.
        if l + len(token) + 1 > 78:
            print()
            l = 0
        print(token, end=' ')
        l = l + len(token) + 1

    # Print footer.
    print('\n\n=========================\n')
    return word_indices

# Extract features.
word_indices = processEmail(file_contents)

# Print stats.
print('Word Indices: \n')
print(word_indices)
print('\n\n')
==== Processed Email ====
anyon know how much it cost to host a web portal well it depend on how mani
visitor your expect this can be anywher from less than number buck a month to
a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if
your run someth big to unsubscrib yourself from this mail list send an email
to emailaddr
=========================
Word Indices:
[86, 916, 794, 1077, 883, 370, 1699, 790, 1822, 1831, 883, 431, 1171, 794, 1002, 1895, 592, 238, 162, 89,
688, 945, 1663, 1120, 1062, 1699, 375, 1162, 479, 1893, 1510, 799, 1182, 1237, 810, 1895, 1440, 1547, 181,
1699, 1758, 1896, 688, 992, 961, 1477, 71, 530, 1699, 531]
Naïve Bayes algorithm: The Naïve Bayes method is a supervised learning technique for addressing classification problems, based on Bayes' theorem. It is mostly used in text classification tasks that involve a large training dataset. The Naïve Bayes classifier is a simple and effective classification method that aids in the development of fast machine learning models capable of making quick predictions. It is a probabilistic classifier, which means it makes predictions based on an object's probability. Spam filtration, sentiment analysis, and article classification are all frequent uses of the Naïve Bayes algorithm.
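Concretely, for a message made up of words w_1, ..., w_n, the classifier applies Bayes' theorem under a conditional-independence ("naïve") assumption and predicts the class with the higher posterior probability; shown here for the spam class:

P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})

The Multinomial variant estimates each P(w_i | spam) from word counts in the training messages.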
Types of spam email: The following are five common spam email categories that can occur in a dataset:
• Commercial advertisements
• Antivirus warnings
• Email spoofing
• Sweepstakes winners
• Money scams
The system in this project focuses on detecting spam mail by reviewing it in two stages: feature extraction and classification. In the first stage, the basic concepts and principles of spam mail in social media are highlighted. For the data selected for the training dataset, feature extraction and classification are performed using two methods, the TF-IDF vectorizer and the count vectorizer. Feature extraction is performed on the text of the email content. During the discovery stage, the current methods for detecting spam mail are reviewed using different supervised learning algorithms, including widely used methods such as Naïve Bayes and decision tree classification. Among all the algorithms in this project, the best score comes from the Extra Tree Classifier algorithm, with 98% accuracy and a 99% level of precision on the dataset. Spam mail detection is always improving, but hackers are still able to break security, so detection is falling behind. Networking is the source of many computer security threats, but it also amplifies others; secure computing is dependent on secure networks, and vice versa. It is no coincidence that as network security grows more fragile, people are becoming increasingly concerned about security and privacy. These techniques could be used in real time, in numerous future scenarios, to analyze spam in depth and classify it based on its level of susceptibility. The current proposed solution is for English-language mail; however, we can expand the scope to include more languages in the future.
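A minimal sketch of the two feature-extraction options mentioned above, run on a two-message toy corpus (the real system fits the vectorizer on the training messages):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['win a free prize now', 'are we still meeting for lunch']

# CountVectorizer: raw term counts per message.
count_vec = CountVectorizer()
print(count_vec.fit_transform(corpus).toarray())
print(count_vec.get_feature_names_out())

# TfidfVectorizer: terms that are rare across messages receive higher weights.
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(corpus).toarray().round(2))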
To introduce the methods for evaluating text classification, let's first consider some simple binary detection tasks. For example, in spam detection, our goal is to label every text as being in the spam category ("positive") or not in the spam category ("negative"). For each item (email document) we therefore need to know whether our system called it spam or not. We also need to know whether the email is actually spam or not, i.e. the human-defined label for each document that we are trying to match. We will refer to these human labels as the gold labels. Or imagine you're the CEO of the Delicious Pie Company and you need to know what people are saying about your pies on social media, so you build a system that detects tweets concerning Delicious Pie. Here the positive class is tweets about Delicious Pie and the negative class is all other tweets. In both cases, we need a metric for knowing how well our spam detector (or pie-tweet-detector) is doing.
To evaluate any system for detecting things, we start by building a confusion matrix like the one shown in Fig. 4.4. A confusion matrix is a table for visualizing how an algorithm performs with respect to the human gold labels, using two dimensions (system output and gold labels), with each cell labeling a set of possible outcomes. In the spam detection case, for example, true positives are documents that are indeed spam (indicated by human-created gold labels) that our system correctly said were spam. False negatives are documents that are indeed spam but our system incorrectly labeled as non-spam. To the bottom right of the table is the equation for accuracy, which asks what percentage of all the observations (for the spam or pie examples that means all emails or tweets) our system labeled correctly.
Although accuracy might seem a natural metric, we generally don't use it for text classification tasks. That's because accuracy doesn't work well when the classes are unbalanced (as indeed they are with spam, which is a large majority of email, or with tweets, which are mainly not about pie). To make this more explicit, imagine that we looked at a million tweets, and let's say that only 100 of them are discussing their love (or hatred) for our pie, while the other 999,900 are tweets about something completely unrelated. Imagine a simple classifier that stupidly classified every tweet as "not about pie". This classifier would have 999,900 true negatives and only 100 false negatives for an accuracy of 999,900/1,000,000 or 99.99%! What an amazing accuracy level! Surely we should be happy with this classifier? But of course this fabulous 'no pie' classifier would be completely useless, since it wouldn't find a single one of the customer comments we are looking for. In other words, accuracy is not a good metric when the goal is to discover something that is rare, or at least not completely balanced in frequency, which is a very common situation in the world. Instead, we use precision (the fraction of items the system labeled as positive that really are positive) and recall (the fraction of actual positives that the system found), together with their harmonic mean, the F1-score.
But we'll need to slightly modify our definitions of precision and recall. Consider the sample confusion matrix for a hypothetical 3-way one-of email categorization decision (urgent, normal, spam) shown in Fig. 4.5. The matrix shows, for example, that the system mistakenly labeled one spam document as urgent, and we have shown how to compute a distinct precision and recall value for each class. In order to derive a single metric that tells us how well the system is doing, we can combine these values in two ways. In macroaveraging, we compute the performance for each class, and then average over classes. In microaveraging, we collect the decisions for all classes into a single confusion matrix, and then compute precision and recall from that table. Fig. 4.6 shows the confusion matrix for each class separately, and shows the computation of microaveraged and macroaveraged precision. As the figure shows, a microaverage is dominated by the more frequent class (in this case spam), since the counts are pooled. The macroaverage better reflects the statistics of the smaller classes, and so is more appropriate when performance on all the classes is equally important.
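A minimal sketch of the two averaging schemes with scikit-learn, on a toy set of gold and predicted labels (not from this project's data):

from sklearn.metrics import precision_score

gold = ['spam', 'spam', 'normal', 'urgent', 'spam', 'normal']
pred = ['spam', 'urgent', 'normal', 'urgent', 'spam', 'spam']

# Macroaveraging: compute precision per class, then average the per-class values.
print('Macro precision:', precision_score(gold, pred, average='macro', zero_division=0))
# Microaveraging: pool all decisions into a single table, then compute precision once.
print('Micro precision:', precision_score(gold, pred, average='micro', zero_division=0))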
The training and testing procedure for text classification follows what we saw with language modeling: we use the training set to train the model, then use the development test set (also called a devset) to perhaps tune some parameters, and in general decide what the best model is. Once we come up with what we think is the best model, we run it on the (hitherto unseen) test set to report its performance. While the use of a devset avoids overfitting the test set, having a fixed training set, devset, and test set creates another problem: in order to save lots of data for training, the test set (or devset) might not be large enough to be representative. Wouldn't it be better if we could somehow use all our data for training and still use all our data for testing? We can do this by cross-validation.
In cross-validation, we choose a number k, and partition our data into k disjoint subsets called folds. Now we choose one of those k folds as a test set, train our classifier on the remaining k − 1 folds, and then compute the error rate on the test set. Then we repeat with another fold as the test set, again training on the other k − 1 folds. We do this sampling process k times and average the test set error rate from these k runs to get an average error rate. If we choose k = 10, we would train 10 different models (each on 90% of our data), test the model 10 times, and average these 10 values. This is called 10-fold cross-validation.
The only problem with cross-validation is that because all the data is used for testing, we need the whole corpus to be blind; we can't examine any of the data to suggest possible features and in general see what's going on, because we'd be peeking at the test set, and such cheating would cause us to overestimate the performance of our system. However, looking at the corpus to understand what's going on is important in designing NLP systems! What to do? For this reason, it is common to create a fixed training set and test set, then do 10-fold cross-validation inside the training set, but compute the error rate the normal way on the test set.
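A minimal sketch of 10-fold cross-validation with scikit-learn (X and y are assumed to be the vectorized messages and labels from Section 2.2):

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Train and evaluate on 10 different folds, then report per-fold and average accuracy.
scores = cross_val_score(MultinomialNB(), X, y, cv=10, scoring='accuracy')
print('Per-fold accuracy:', scores.round(3))
print('Mean 10-fold accuracy:', scores.mean().round(3))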
So in our example, this p-value is the probability that we would see δ(x) assuming A is not better than B. If δ(x) is huge (let's say A has a very respectable F1 of .9 and B has a terrible F1 of only .2 on x), we might be surprised, since that would be extremely unlikely to occur if H0 were in fact true, and so the p-value would be low (it is unlikely to see such a large δ if A is in fact not better than B). But if δ(x) is very small, it might be less surprising to us even if H0 were true and A is not really better than B, and so the p-value would be higher. A very small p-value means that the difference we observed is very unlikely under the null hypothesis, and we can reject the null hypothesis. What counts as very small? It is common to use values like .05 or .01 as the thresholds. A value of .01 means that if the p-value (the probability of observing the δ we saw assuming H0 is true) is less than .01, we reject the null hypothesis and assume that A is indeed better than B. We say that a result (e.g., "A is better than B") is statistically significant if the δ we saw has a probability that is below the threshold and we therefore reject this null hypothesis. How do we compute this probability we need for the p-value? In NLP we generally don't use the simple parametric tests, like t-tests or ANOVAs, that you might be familiar with; instead, a paired bootstrap test over the test set is commonly used, as described next.
Now that we have the b bootstrap test sets, providing a sampling distribution, we can do statistics on how often A has an accidental advantage. There are various ways to compute this advantage; here we follow the version laid out in Berg-Kirkpatrick et al. (2012). Assuming H0 (A isn't better than B), we would expect that δ(X), estimated over many test sets, would be zero or negative; a much higher value would be surprising, since H0 specifically assumes A isn't better than B. To measure exactly how surprising our observed δ(x) is, we would in other circumstances compute the p-value by counting over many test sets how often δ(x(i)) exceeds the expected zero value by δ(x) or more:

\text{p-value}(x) = \frac{1}{b}\sum_{i=1}^{b} \mathbb{1}\left(\delta(x^{(i)}) - \delta(x) \ge 0\right)

(We use the notation 1(x) to mean "1 if x is true, and 0 otherwise".) However, although it's generally true that the expected value of δ(X) over many test sets (again assuming A isn't better than B) is 0, this isn't true for the bootstrapped test sets we created. That's because we didn't draw these samples from a distribution with 0 mean; we happened to create them from the original test set x, which happens to be biased (by .20) in favor of A. So, to measure how surprising our observed δ(x) is, we actually compute the p-value by counting, over the b bootstrap test sets, how often δ(x(i)) exceeds the expected value by δ(x) or more.
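A minimal sketch of this paired bootstrap test, using accuracy as the score for simplicity (the same scheme works for F1); the boolean arrays indicating which test items each system got right are assumed as inputs:

import numpy as np

def paired_bootstrap_pvalue(correct_a, correct_b, b=10000, seed=0):
    # correct_a / correct_b: one boolean per test item, True if system A / B got it right.
    rng = np.random.default_rng(seed)
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    n = len(correct_a)
    delta_x = correct_a.mean() - correct_b.mean()      # observed advantage of A on test set x
    count = 0
    for _ in range(b):
        idx = rng.integers(0, n, size=n)               # bootstrap test set x(i): n items drawn with replacement
        delta_i = correct_a[idx].mean() - correct_b[idx].mean()
        if delta_i - delta_x >= delta_x:               # delta(x(i)) exceeds delta(x) by delta(x) or more
            count += 1
    return count / b                                   # a small p-value lets us reject H0 ("A isn't better than B")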
It is important to avoid harms that may result from classifiers, harms that exist both for naive Bayes classifiers and for other classification algorithms. One class of harms is representational harms (Crawford 2017, Blodgett et al. 2020): harms caused by a system that demeans a social group, for example by perpetuating negative stereotypes about them. For example, Kiritchenko and Mohammad (2018) examined the performance of 200 sentiment analysis systems on pairs of sentences that were identical except for containing either a common African American first name (like Shaniqua) or a common European American first name (like Stephanie), chosen from the Caliskan et al. (2017) study. They found that most systems assigned lower sentiment and more negative emotion to sentences with African American names, reflecting and perpetuating stereotypes that associate African Americans with negative emotions (Popp et al., 2003). In other tasks, classifiers may lead to both representational harms and other harms, such as silencing. For example, the important text classification task of toxicity detection is the task of detecting hate speech, abuse, harassment, or other kinds of toxic language. While the goal of such classifiers is to help reduce societal harm, toxicity classifiers can themselves cause harms. For example, researchers have shown that some widely used toxicity classifiers incorrectly flag as toxic sentences that are non-toxic but simply mention identities like women (Park et al., 2018), blind people (Hutchinson et al., 2020) or gay people (Dixon et al., 2018; Dias Oliva et al., 2021), or that simply use linguistic features characteristic of varieties like African-American Vernacular English (Sap et al. 2019, Davidson et al. 2019). Such false positive errors could lead to the silencing of discourse by or about these groups.
Conclusion
In conclusion, machine learning and natural language processing (NLP) techniques can be effectively used for email spam classification. By leveraging the power of supervised learning algorithms such as Naive Bayes, Support Vector Machines, and KNN, and by preprocessing the text data using techniques such as tokenization, stop-word removal, and stemming, it is possible to build accurate and reliable spam filters that can automatically detect and filter out unwanted emails. These techniques can also be extended to handle more complex spamming strategies such as phishing attacks and spear phishing. Overall, among the proposed models, Naïve Bayes achieved an accuracy of 99%, SVM 98%, and KNN 97%. Since Naïve Bayes has the highest accuracy, we select the Naïve Bayes model. The use of ML and NLP for email spam classification can save users valuable time and resources and improve the overall productivity and security of email communication.