A Micro Project Report
On
Spam Classifier
By
SISODIYA PRUTHIVRAJ
Enrollment No: 236040316087
Date: 25/4/25
Place: B&B INSTITUTE OF TECHNOLOGY, V.V. NAGAR, ANAND
1 Introduction
1.1 What is a Spam Classifier
1.2 What is the Use of a Spam Classifier
1.2.1 What is the Need for a Spam Classifier
1.2.2 Objective
1.3 Dataset Description
2 Methodology
2.1 Data Preprocessing
2.2 Code Implementation
2.3 Model Selection
2.4 Python Libraries
Conclusion
1. Introduction
Spam classification is a critical task in today's digital world, where the amount of spam email has increased dramatically. In this project, we propose to use machine learning (ML) and natural language processing (NLP) techniques to classify email messages as either spam or legitimate. The project aims to develop an efficient spam classifier that can accurately identify and filter spam emails from legitimate ones. The dataset used in this project consists of a large number of email messages with their corresponding labels (spam/ham). We use NLP techniques such as tokenization, stop word removal, stemming, and feature extraction to preprocess the text data and extract relevant features. We evaluate several ML algorithms, such as Naive Bayes, Support Vector Machines (SVMs), and Random Forests, to determine the best model for spam classification. We also perform hyperparameter tuning to optimize the model's performance. The accuracy of the classifier is measured using evaluation metrics such as precision, recall, and F1-score. The project's outcomes include a spam classifier model that can be integrated into an email system to automatically filter spam emails, improving email security and productivity. Additionally, the project contributes to the advancement of NLP and ML techniques for email spam classification.
1.2.2 Objective
To build a spam classifier that can accurately distinguish between spam and ham
(non-spam) SMS messages using the Naïve Bayes classification algorithm.
2. Methodology
2.1 Data Preprocessing
Data preprocessing offers several benefits:
• Improved model performance: By cleaning and preparing the data, machine learning models can learn more effectively and make better predictions.
• Simplified analysis: Preprocessed data is easier to work with and analyze, making it more efficient to extract meaningful information.
• Consistent formatting: Data preprocessing ensures that all data sets are formatted consistently, which is essential for comparing and combining data from different sources.
2.2 Code Implementation
The implementation in Python is given below. (The dataset file name and column names are placeholders for the labelled SMS dataset described earlier.)

import pandas as pd
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# 1. Load the dataset (the file name and column layout of the labelled SMS CSV are assumed)
df = pd.read_csv('spam.csv', encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'message']

# 2. Preprocessing function: lower-case the text and strip punctuation
def clean_text(text):
    text = text.lower()
    text = ''.join([ch for ch in text if ch not in string.punctuation])
    return text
df['message'] = df['message'].apply(clean_text)

# 3. Vectorize the messages, split the data, and train Multinomial Naive Bayes
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message'])
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = MultinomialNB().fit(X_train, y_train)

# 4. Classify a new message
def predict_spam(message):
    return model.predict(vectorizer.transform([clean_text(message)]))[0]

# Example:
print(predict_spam("Congratulations! You've won a free ticket to Bahamas. Text WIN to 12345!"))
print(predict_spam("Hey, are we still meeting for lunch?"))
Example Output:
Confusion Matrix:
[[1203 10]
[ 16 184]]
Classification Report:
2.3 Model Selection
Choosing the right machine learning algorithm is crucial to the performance of any classification system. For this spam detection task, we evaluated various classification algorithms based on their suitability for text data and their classification performance. Below is the rationale for selecting Naïve Bayes.
The Multinomial Naïve Bayes classifier was selected for this project for the following reasons:
• High accuracy
• Faster training time
• Lower memory usage
Given these advantages and the nature of the problem (text-based binary classification), Naïve Bayes was chosen as the most appropriate model.
Considered Alternatives
While Naïve Bayes was ultimately chosen, other models were considered. One of them is logistic regression:
Under the supervised learning approach, one of the most common machine learning algorithms is logistic regression. It is a method for predicting a categorical dependent variable from a set of independent variables, so the output must be a discrete or categorical value: Yes or No, 0 or 1, true or false, and so on. Instead of giving exact values like 0 and 1, it delivers probabilistic values that lie between 0 and 1. Except for how they are employed, logistic regression is very similar to linear regression: regression problems are solved using linear regression, while classification problems are solved using logistic regression. Instead of fitting a regression line, logistic regression fits an "S"-shaped logistic function, which predicts the two extreme values (0 and 1).
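As an illustrative sketch only (not the model finally used in this project), a logistic regression baseline could be trained on the same bag-of-words features; the variables X_train, X_test, y_train, and y_test are assumed from the split in Section 2.2.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit a logistic regression baseline on the same count features used for Naive Bayes.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print('Logistic Regression accuracy:', accuracy_score(y_test, logreg.predict(X_test)))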
Before extracting features, it is usually insightful to take a look at examples from the dataset. The sample email contains a URL, an
email address (at the end), numbers, and dollar amounts. While many emails would contain similar
types of entities (e.g., numbers, other URLs, or other email addresses), the specific entities (e.g., the
specific URL or specific dollar amount) will be different in almost every email. Therefore, one method
often employed in processing emails is to “normalize” these values, so that all URLs are treated the
same, all numbers are treated the same, etc. For example, we could replace each URL in the email with
the unique string “httpaddr” to indicate that a URL was present.
This has the effect of letting the spam classifier make a classification decision based on whether any
URL was present, rather than whether a specific URL was present. This typically improves the
performance of a spam classifier, since spammers often randomize the URLs, and thus the odds of
seeing any particular URL again in a new piece of spam is very small.
In processEmail, the following email preprocessing and normalization steps have been implemented (a short sketch of the HTML-stripping and stemming steps follows this list):
• Lower-casing: The entire email is converted into lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).
• Stripping HTML: All HTML tags are removed from the emails. Many emails come with HTML formatting; we remove all the HTML tags so that only the content remains.
• Normalizing URLs: All URLs are replaced with the text "httpaddr".
• Normalizing Email Addresses: All email addresses are replaced with the text "emailaddr".
• Normalizing Numbers: All numbers are replaced with the text "number".
• Normalizing Dollars: All dollar signs ($) are replaced with the text "dollar".
• Word Stemming: Words are reduced to their stemmed form. For example, "discount", "discounts", "discounted" and "discounting" are all replaced with "discount". Sometimes the stemmer strips off additional characters from the end, so "include", "includes", "included", and "including" are all replaced with "includ".
• Removal of non-words: Non-words and punctuation have been removed. All white spaces (tabs, newlines, spaces) have been trimmed to a single space character.
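A minimal sketch of the HTML-stripping and stemming steps, using NLTK's SnowballStemmer (the regular expressions and example text are illustrative):

import re
from nltk.stem.snowball import SnowballStemmer

def strip_html_and_stem(email_contents):
    # Strip HTML: remove anything that looks like a tag and replace it with a space.
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    # Stem each remaining word with the Snowball (Porter2) stemmer.
    stemmer = SnowballStemmer('english')
    tokens = re.split(r'[^a-zA-Z0-9]+', email_contents.lower())
    return ' '.join(stemmer.stem(t) for t in tokens if t)

print(strip_html_and_stem('<p>Huge discounts included!</p>'))   # -> huge discount includ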
The result of these preprocessing steps looks like the following paragraph:
anyon know how much it cost to host a web portal well it depend on how mani
visitor your expect thi can be anywher from less than number buck a month to
a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if
your run someth big to unsubscrib yourself from thi mail list send an email
to emailaddr
While preprocessing has left word fragments and non-words, this form turns out to be much easier to work with when performing feature extraction.
After preprocessing the emails, there is a list of words for each email. The next step is to choose which words will be used in the classifier and which will be left out.
For simplicity, only the most frequently occurring words have been chosen as the set of words considered (the vocabulary list). Since words that occur rarely in the training set appear in only a few emails, they might cause the model to overfit the training set. The complete vocabulary list is in the file vocab.txt. The vocabulary list was selected by choosing all words which occur at least 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.
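A minimal sketch of this vocabulary-selection step (it assumes the corpus has already been preprocessed into space-separated stemmed tokens, and mirrors the 1-based indices of vocab.txt):

from collections import Counter

def build_vocab(processed_emails, min_count=100):
    # Count tokens across the preprocessed spam corpus.
    counts = Counter(token for email in processed_emails for token in email.split())
    # Keep only the words that occur at least min_count times, in sorted order.
    vocab = sorted(word for word, c in counts.items() if c >= min_count)
    # Map each word to a 1-based index, as in vocab.txt.
    return {word: i + 1 for i, word in enumerate(vocab)}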
Given the vocabulary list, each word in the preprocessed emails can now be mapped into a list of word indices that contains the index of the word in the vocabulary list. For example, in the sample email, the word "anyone" was first normalized to "anyon" and then mapped onto the index 86 in the vocabulary list.
The code in processEmail performs this mapping. In the code, a given string, which is a single word from the processed email, is looked up in the vocabulary list vocabList. If the word exists, its index is added to the word_indices variable. If the word does not exist, and is therefore not in the vocabulary, the word is skipped.
The sample email is stored in the variable file_contents:
"> Anyone knows how much it costs to host a web portal ?\n>\nWell, it depends on how many visitors you're
expecting.\nThis can be anywhere from less than 10 bucks a month to a couple of $100. \nYou should
checkout http://www.rackspace.com/ or perhaps Amazon EC2 \nif youre running something big..\n\nTo
unsubscribe yourself from this mailing list, send an email to:\[email protected]\n\n"
The relevant part of processEmail is reconstructed below (the format of vocab.txt, one "<index> <word>" pair per line, is assumed):

import re
from nltk.stem.snowball import SnowballStemmer

def getVocabList():
    # Load the fixed vocabulary (each line of vocab.txt is assumed to be "<index>\t<word>").
    vocabList = {}
    with open('vocab.txt') as f:
        for line in f:
            idx, word = line.split()
            vocabList[word] = int(idx)
    return vocabList

def processEmail(email_contents):
    vocabList = getVocabList()
    word_indices = []

    # (Optional) strip the email headers, which end at the first blank line.
    # hdrstart = email_contents.find("\n\n")
    # if hdrstart:
    #     email_contents = email_contents[hdrstart:]

    # Convert to lower case.
    email_contents = email_contents.lower()
    # Strip HTML: remove anything that looks like a tag.
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    # Handle numbers: look for one or more characters between 0-9.
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)
    # Handle URLs: look for strings starting with http:// or https://.
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    # Handle email addresses: look for strings containing "@".
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    # Handle $ sign: look for "$" and replace it with the text "dollar".
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)

    # Process file: split into tokens, stem each word, and look it up in the vocabulary.
    stemmer = SnowballStemmer('english')
    print('==== Processed Email ====\n')
    l = 0
    for token in re.split(r'[^a-zA-Z0-9]+', email_contents):
        token = stemmer.stem(token.strip())
        if len(token) < 1:
            continue
        # If the word exists in the vocabulary, record its index; otherwise skip it.
        if token in vocabList:
            word_indices.append(vocabList[token])
        # Print to screen, ensuring that the output lines are not too long.
        if l + len(token) + 1 > 78:
            print()
            l = 0
        print(token, end=' ')
        l = l + len(token) + 1

    # Print footer.
    print('\n\n=========================\n')
    return word_indices

# Extract features.
word_indices = processEmail(file_contents)

# Print stats.
print('Word Indices: \n')
print(word_indices)
print('\n\n')
==== Processed Email ====
anyon know how much it cost to host a web portal well it depend on how mani
visitor your expect this can be anywher from less than number buck a month to
a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if
your run someth big to unsubscrib yourself from this mail list send an email
to emailaddr
=========================
Word Indices:
[86, 916, 794, 1077, 883, 370, 1699, 790, 1822, 1831, 883, 431, 1171, 794, 1002, 1895, 592, 238, 162, 89,
688, 945, 1663, 1120, 1062, 1699, 375, 1162, 479, 1893, 1510, 799, 1182, 1237, 810, 1895, 1440, 1547, 181,
1699, 1758, 1896, 688, 992, 961, 1477, 71, 530, 1699, 531]
Naïve Bayes algorithm: The Naïve Bayes method is a supervised learning technique for addressing classification problems, based on Bayes' theorem. It is mostly used in text classification tasks that involve a large training dataset. The Naïve Bayes classifier is a simple and effective classification method that aids in the development of fast machine learning models capable of making quick predictions. It is a probabilistic classifier, which means it makes predictions based on an object's probability. Spam filtration, sentiment analysis, and article classification are all frequent uses of the Naïve Bayes algorithm.
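Concretely, for a message made up of words w_1, ..., w_n, the classifier applies Bayes' theorem under a conditional-independence ("naïve") assumption and predicts the class with the higher posterior probability; shown here for the spam class:

P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})

The Multinomial variant estimates each P(w_i | spam) from word counts in the training messages.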
Types of spam email: The following are five common spam email categories that can occur in a dataset:
• Commercial advertisements
• Antivirus warnings
• Email spoofing
• Sweepstakes winners
• Money scams
The system in this project focuses on detecting spam mail by reviewing it in two stages: feature extraction and classification. In the first stage, the basic concepts and principles of spam mail in social media are highlighted. For the data selected for the training dataset, feature extraction and classification are performed using two methods, the TF-IDF vectorizer and the count vectorizer. Feature extraction is performed on the text of the email content. During the discovery stage, the current methods for detecting spam mail are reviewed using different supervised learning algorithms, including widely used methods such as Naïve Bayes and decision tree classification. Among all the algorithms in this project, the best score comes from the Extra Tree Classifier algorithm, with 98% accuracy and a 99% level of precision on the dataset. Spam mail detection is always improving, but hackers are still able to break security, so detection is falling behind. Networking is the source of many computer security threats, but it also amplifies others; secure computing is dependent on secure networks, and vice versa. It is no coincidence that as network security grows more fragile, people are becoming increasingly concerned about security and privacy. These techniques could be used in real time, in numerous future scenarios, to analyze spam in depth and classify it based on its level of susceptibility. The current proposed solution is for English-language mail; however, we can expand the scope to include more languages in the future.
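A minimal sketch of the two feature-extraction options mentioned above, run on a two-message toy corpus (the real system fits the vectorizer on the training messages):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['win a free prize now', 'are we still meeting for lunch']

# CountVectorizer: raw term counts per message.
count_vec = CountVectorizer()
print(count_vec.fit_transform(corpus).toarray())
print(count_vec.get_feature_names_out())

# TfidfVectorizer: terms that are rare across messages receive higher weights.
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(corpus).toarray().round(2))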
To introduce the methods for evaluating text classification, let's first consider some simple binary detection tasks. For example, in spam detection, our goal is to label every text as being in the spam category ("positive") or not in the spam category ("negative"). For each item (email document) we therefore need to know whether our system called it spam or not. We also need to know whether the email is actually spam or not, i.e. the human-defined label for each document that we are trying to match. We will refer to these human labels as the gold labels. Or imagine you're the CEO of the Delicious Pie Company and you need to know what people are saying about your pies on social media, so you build a system that detects tweets concerning Delicious Pie. Here the positive class is tweets about Delicious Pie and the negative class is all other tweets. In both cases, we need a metric for knowing how well our spam detector (or pie-tweet-detector) is doing.
To evaluate any system for detecting things, we start by building a confusion matrix like the one shown in Fig. 4.4. A confusion matrix is a table for visualizing how an algorithm performs with respect to the human gold labels, using two dimensions (system output and gold labels), with each cell labeling a set of possible outcomes. In the spam detection case, for example, true positives are documents that are indeed spam (indicated by human-created gold labels) that our system correctly said were spam. False negatives are documents that are indeed spam but our system incorrectly labeled as non-spam. To the bottom right of the table is the equation for accuracy, which asks what percentage of all the observations (for the spam or pie examples that means all emails or tweets) our system labeled correctly.
Although accuracy might seem a natural metric, we generally don't use it for text classification tasks. That's because accuracy doesn't work well when the classes are unbalanced (as indeed they are with spam, which is a large majority of email, or with tweets, which are mainly not about pie). To make this more explicit, imagine that we looked at a million tweets, and let's say that only 100 of them are discussing their love (or hatred) for our pie, while the other 999,900 are tweets about something completely unrelated. Imagine a simple classifier that stupidly classified every tweet as "not about pie". This classifier would have 999,900 true negatives and only 100 false negatives for an accuracy of 999,900/1,000,000 or 99.99%! What an amazing accuracy level! Surely we should be happy with this classifier? But of course this fabulous 'no pie' classifier would be completely useless, since it wouldn't find a single one of the customer comments we are looking for. In other words, accuracy is not a good metric when the goal is to discover something that is rare, or at least not completely balanced in frequency, which is a very common situation in the world. Instead, we use precision (the fraction of items the system labeled as positive that really are positive) and recall (the fraction of actual positives that the system found), together with their harmonic mean, the F1-score.
But we'll need to slightly modify our definitions of precision and recall. Consider the sample confusion matrix for a hypothetical 3-way one-of email categorization decision (urgent, normal, spam) shown in Fig. 4.5. The matrix shows, for example, that the system mistakenly labeled one spam document as urgent, and we have shown how to compute a distinct precision and recall value for each class. In order to derive a single metric that tells us how well the system is doing, we can combine these values in two ways. In macroaveraging, we compute the performance for each class, and then average over classes. In microaveraging, we collect the decisions for all classes into a single confusion matrix, and then compute precision and recall from that table. Fig. 4.6 shows the confusion matrix for each class separately, and shows the computation of microaveraged and macroaveraged precision. As the figure shows, a microaverage is dominated by the more frequent class (in this case spam), since the counts are pooled. The macroaverage better reflects the statistics of the smaller classes, and so is more appropriate when performance on all the classes is equally important.
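A minimal sketch of the two averaging schemes with scikit-learn, on a toy set of gold and predicted labels (not from this project's data):

from sklearn.metrics import precision_score

gold = ['spam', 'spam', 'normal', 'urgent', 'spam', 'normal']
pred = ['spam', 'urgent', 'normal', 'urgent', 'spam', 'spam']

# Macroaveraging: compute precision per class, then average the per-class values.
print('Macro precision:', precision_score(gold, pred, average='macro', zero_division=0))
# Microaveraging: pool all decisions into a single table, then compute precision once.
print('Micro precision:', precision_score(gold, pred, average='micro', zero_division=0))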
The training and testing procedure for text classification follows what we saw with language modeling: we use the training set to train the model, then use the development test set (also called a devset) to perhaps tune some parameters, and in general decide what the best model is. Once we come up with what we think is the best model, we run it on the (hitherto unseen) test set to report its performance. While the use of a devset avoids overfitting the test set, having a fixed training set, devset, and test set creates another problem: in order to save lots of data for training, the test set (or devset) might not be large enough to be representative. Wouldn't it be better if we could somehow use all our data for training and still use all our data for testing? We can do this by cross-validation.
In cross-validation, we choose a number k, and partition our data into k disjoint subsets called folds. Now we choose one of those k folds as a test set, train our classifier on the remaining k − 1 folds, and then compute the error rate on the test set. Then we repeat with another fold as the test set, again training on the other k − 1 folds. We do this sampling process k times and average the test set error rate from these k runs to get an average error rate. If we choose k = 10, we would train 10 different models (each on 90% of our data), test the model 10 times, and average these 10 values. This is called 10-fold cross-validation.
The only problem with cross-validation is that because all the data is used for testing, we need the whole corpus to be blind; we can't examine any of the data to suggest possible features and in general see what's going on, because we'd be peeking at the test set, and such cheating would cause us to overestimate the performance of our system. However, looking at the corpus to understand what's going on is important in designing NLP systems! What to do? For this reason, it is common to create a fixed training set and test set, then do 10-fold cross-validation inside the training set, but compute the error rate the normal way on the test set.
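A minimal sketch of 10-fold cross-validation with scikit-learn (X and y are assumed to be the vectorized messages and labels from Section 2.2):

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Train and evaluate on 10 different folds, then report per-fold and average accuracy.
scores = cross_val_score(MultinomialNB(), X, y, cv=10, scoring='accuracy')
print('Per-fold accuracy:', scores.round(3))
print('Mean 10-fold accuracy:', scores.mean().round(3))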
So in our example, this p-value is the probability that we would see δ(x) assuming A is not better than B. If δ(x) is huge (let's say A has a very respectable F1 of .9 and B has a terrible F1 of only .2 on x), we might be surprised, since that would be extremely unlikely to occur if H0 were in fact true, and so the p-value would be low (it is unlikely to see such a large δ if A is in fact not better than B). But if δ(x) is very small, it might be less surprising to us even if H0 were true and A is not really better than B, and so the p-value would be higher. A very small p-value means that the difference we observed is very unlikely under the null hypothesis, and we can reject the null hypothesis. What counts as very small? It is common to use values like .05 or .01 as the thresholds. A value of .01 means that if the p-value (the probability of observing the δ we saw assuming H0 is true) is less than .01, we reject the null hypothesis and assume that A is indeed better than B. We say that a result (e.g., "A is better than B") is statistically significant if the δ we saw has a probability that is below the threshold and we therefore reject this null hypothesis. How do we compute this probability we need for the p-value? In NLP we generally don't use the simple parametric tests, like t-tests or ANOVAs, that you might be familiar with; instead, a paired bootstrap test over the test set is commonly used, as described next.
Now that we have the b bootstrap test sets, providing a sampling distribution, we can do statistics on how often A has an accidental advantage. There are various ways to compute this advantage; here we follow the version laid out in Berg-Kirkpatrick et al. (2012). Assuming H0 (A isn't better than B), we would expect that δ(X), estimated over many test sets, would be zero or negative; a much higher value would be surprising, since H0 specifically assumes A isn't better than B. To measure exactly how surprising our observed δ(x) is, we would in other circumstances compute the p-value by counting over many test sets how often δ(x(i)) exceeds the expected zero value by δ(x) or more:

\text{p-value}(x) = \frac{1}{b}\sum_{i=1}^{b} \mathbb{1}\left(\delta(x^{(i)}) - \delta(x) \ge 0\right)

(We use the notation 1(x) to mean "1 if x is true, and 0 otherwise".) However, although it's generally true that the expected value of δ(X) over many test sets (again assuming A isn't better than B) is 0, this isn't true for the bootstrapped test sets we created. That's because we didn't draw these samples from a distribution with 0 mean; we happened to create them from the original test set x, which happens to be biased (by .20) in favor of A. So, to measure how surprising our observed δ(x) is, we actually compute the p-value by counting, over the b bootstrap test sets, how often δ(x(i)) exceeds the expected value by δ(x) or more.
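A minimal sketch of this paired bootstrap test, using accuracy as the score for simplicity (the same scheme works for F1); the boolean arrays indicating which test items each system got right are assumed as inputs:

import numpy as np

def paired_bootstrap_pvalue(correct_a, correct_b, b=10000, seed=0):
    # correct_a / correct_b: one boolean per test item, True if system A / B got it right.
    rng = np.random.default_rng(seed)
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    n = len(correct_a)
    delta_x = correct_a.mean() - correct_b.mean()      # observed advantage of A on test set x
    count = 0
    for _ in range(b):
        idx = rng.integers(0, n, size=n)               # bootstrap test set x(i): n items drawn with replacement
        delta_i = correct_a[idx].mean() - correct_b[idx].mean()
        if delta_i - delta_x >= delta_x:               # delta(x(i)) exceeds delta(x) by delta(x) or more
            count += 1
    return count / b                                   # a small p-value lets us reject H0 ("A isn't better than B")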
It is important to avoid harms that may result from classifiers, harms that exist both for naive Bayes classifiers and for other classification algorithms. One class of harms is representational harms (Crawford 2017, Blodgett et al. 2020): harms caused by a system that demeans a social group, for example by perpetuating negative stereotypes about them. For example, Kiritchenko and Mohammad (2018) examined the performance of 200 sentiment analysis systems on pairs of sentences that were identical except for containing either a common African American first name (like Shaniqua) or a common European American first name (like Stephanie), chosen from the Caliskan et al. (2017) study. They found that most systems assigned lower sentiment and more negative emotion to sentences with African American names, reflecting and perpetuating stereotypes that associate African Americans with negative emotions (Popp et al., 2003). In other tasks, classifiers may lead to both representational harms and other harms, such as silencing. For example, the important text classification task of toxicity detection is the task of detecting hate speech, abuse, harassment, or other kinds of toxic language. While the goal of such classifiers is to help reduce societal harm, toxicity classifiers can themselves cause harms. For example, researchers have shown that some widely used toxicity classifiers incorrectly flag as toxic sentences that are non-toxic but simply mention identities like women (Park et al., 2018), blind people (Hutchinson et al., 2020) or gay people (Dixon et al., 2018; Dias Oliva et al., 2021), or that simply use linguistic features characteristic of varieties like African-American Vernacular English (Sap et al. 2019, Davidson et al. 2019). Such false positive errors could lead to the silencing of discourse by or about these groups.
Conclusion
In conclusion, machine learning and natural language processing (NLP) techniques can be effectively used for email spam classification. By leveraging the power of supervised learning algorithms such as Naive Bayes, Support Vector Machines, and KNN, and by preprocessing the text data using techniques such as tokenization, stop-word removal, and stemming, it is possible to build accurate and reliable spam filters that can automatically detect and filter out unwanted emails. These techniques can also be extended to handle more complex spamming strategies such as phishing attacks and spear phishing. Overall, among the proposed models, Naïve Bayes achieved an accuracy of 99%, SVM 98%, and KNN 97%. Since Naïve Bayes has the highest accuracy, we select the Naïve Bayes model. The use of ML and NLP for email spam classification can save users valuable time and resources and improve the overall productivity and security of email communication.