1. Spam Mail Detection
Using Naïve Bayes Classifier
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING
ADITYA INSTITUTE OF TECHNOLOGY AND MANAGEMENT
(AUTONOMOUS)
K. KOTTURU, TEKKALI - 532201
Submitted by
G. KISHORE
23A51D5802
MTECH (CSE)
Under the supervision of
Sri T. Chalapathi Rao, M.Tech, Ph.D
(Sr. Assistant Professor, Dept. of CSE)
3. ABSTRACT
• This project addresses a major problem faced by Gmail users: spam mail. We classify each mail as
spam or not spam based on its text content, using different methods.
• Spam detection means identifying spam messages or emails from their text content, so that
you receive notifications only about the messages or emails that matter to you.
• If spam messages are found, they are automatically moved to a spam folder and you are not
notified about them. This improves the user experience, as frequent spam alerts bother
many users.
• Naïve Bayes classifiers are a popular statistical technique for e-mail filtering. They typically use
bag-of-words features to identify spam e-mail, an approach commonly used in text classification.
• OBJECTIVE: The main objective of this project is to build a spam detection system that detects spam
in emails and messages, a capability that every big tech company tries to improve for its customers
in applications such as Gmail.
• OUTCOME: The output of this project is a prediction of whether a given text message is spam or not spam.
• We propose a framework to detect spam mails quickly and efficiently. It consists of feature
extraction with the CountVectorizer model and a machine learning technique for natural language
processing (NLP).
• We use a dataset from Kaggle and explore how machine learning algorithms can be used to
find patterns in data. We apply ML models to predict the target class, plot the sparse matrix
for all classifier models, and calculate the accuracy score.
• Finally, we create a web page using Streamlit and Python. We enter a text message in the
placeholder and click the Process button. After the button is clicked, the page speaks whether the
given message is spam or ham, using an artificial voice.
• It also displays the result, showing whether the message is spam or ham.
5. INTRODUCTION
• Whenever you submit your email address or contact number on any platform, it becomes
easy for that platform to market its products by sending emails or messages directly
to you.
• This results in lots of spam alerts and notifications in your inbox. This is where the task
of spam detection comes in.
• Detecting spam alerts in emails and messages is one of the main applications that every
big tech company tries to improve for its customers.
• Apple’s official messaging app and Google’s Gmail are great examples of such
applications where spam detection works well to protect users from spam alerts.
• Spam email can be dangerous. It can include malicious links that can infect your computer
with malware.
• Cyberattacks are also carried out through spam mails, by sending Trojan links.
• The presence of spam content in social media is increasing tremendously, and
therefore the detection of spam has become vital.
• Spam content increases as people make extensive use of social media and related channels,
e.g., Facebook, Twitter, YouTube, and e-mail.
• The time people spend on social media is growing rapidly, especially during the
pandemic.
• Users get a lot of text messages through social media, and they cannot recognize
the spam content in these messages.
• Spam messages contain malicious links, apps, fake accounts, fake news, reviews,
rumors, etc.
• To improve social media security, the detection and control of spam text are
essential. In this project we present a detailed survey on spam text detection and
classification in social media using machine learning in Python.
7. Implementation
This is the procedure that our model follows:
Dataset: The Spam.csv dataset, taken from Kaggle, is used.
Data Preprocessing: Data preprocessing refers to the manipulation of data before it is used.
It is divided into 4 stages: 1) data cleaning 2) data integration 3) data reduction 4) data transformation.
Feature Extraction: We use CountVectorizer.
CountVectorizer: A tool provided by the scikit-learn library in Python. It transforms a given text
into a vector on the basis of the frequency (count) of each word that occurs in the text.
This is helpful when we have multiple such texts and wish to convert each text into a vector of
word counts (for use in further text analysis).
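The counting idea behind this step can be sketched in plain Python. This is only an illustration of the bag-of-words technique, not scikit-learn's implementation; the helper names (tokenize, fit_vocabulary, transform), the tokenization rule, and the two sample messages are assumptions made for this sketch.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of two or more letters or digits
    # (an assumed tokenization rule chosen for this sketch).
    return re.findall(r"[a-z0-9]{2,}", text.lower())

def fit_vocabulary(texts):
    # Sorted list of every distinct token seen across all texts.
    return sorted({tok for text in texts for tok in tokenize(text)})

def transform(texts, vocabulary):
    # One row per text; each column holds the count of one vocabulary word.
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts[word] for word in vocabulary])
    return rows

# Hypothetical sample messages, not taken from Spam.csv.
messages = ["Free prize, claim your free prize now", "Are we meeting now"]
vocab = fit_vocabulary(messages)
vectors = transform(messages, vocab)

print(vocab)
for row in vectors:
    print(row)
```

Each message becomes one row of word counts over a shared vocabulary, which is the form of input a classifier can be trained on.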
Machine learning model: Naïve Bayes Classifier
• A simple probabilistic classifier that calculates a set of probabilities by counting the frequency and
combination of values in a given dataset.
• Each message is represented as a vector of feature values.
• Formula for calculating the score of each class:
normal = P(N) * P(W1|N) * P(W2|N) * ... * P(Wn|N)
spam = P(S) * P(W1|S) * P(W2|S) * ... * P(Wn|S)
P(N) = probability of normal messages
P(S) = probability of spam messages
P(W1|N) = probability of word W1 in normal messages
P(W1|S) = probability of word W1 in spam messages
The message is assigned to the class with the larger score.
• It is very useful for classifying e-mails correctly.
• The precision and recall of this method are known to be very good.
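The formula above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the training messages are invented, add-one (Laplace) smoothing is added so that unseen words do not produce a zero probability, and logarithms are summed instead of multiplying raw probabilities to avoid numerical underflow. Neither of those two choices appears in the slides.

```python
import math
from collections import Counter

def train(messages, labels):
    # Count messages per class and word occurrences per class.
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def log_score(text, label, model):
    class_counts, word_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    # log P(class), e.g. log P(S) for the spam class.
    score = math.log(class_counts[label] / total_messages)
    for word in text.lower().split():
        # log P(word | class) with add-one smoothing (an assumption of this sketch).
        count = word_counts[label][word]
        score += math.log((count + 1) / (total_words + len(vocabulary)))
    return score

def predict(text, model):
    # Pick the class whose score is largest.
    return max(model[0], key=lambda label: log_score(text, label, model))

# Hypothetical training data, not taken from Spam.csv.
train_messages = [
    "win free prize now",
    "free cash offer",
    "meeting at noon",
    "lunch at noon tomorrow",
]
train_labels = ["spam", "spam", "normal", "normal"]
model = train(train_messages, train_labels)

print(predict("free prize offer", model))
print(predict("meeting tomorrow at noon", model))
```

Comparing log scores gives the same decision as comparing the products in the formula, because the logarithm preserves ordering.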
NAÏVE BAYES CLASSIFIER
10. Literature review
• A literature review is a survey of scholarly sources on a specific topic. It provides an overview
of current knowledge, allowing you to identify relevant theories, methods, and gaps in the
existing research.
• Writing a literature review involves finding relevant publications (such as books and journal
articles), critically analyzing them, and explaining what you found.
• There are five key steps:
1. Search for relevant literature
2. Evaluate sources
3. Identify themes, debates and gaps
4. Outline the structure
5. Write your literature review
11. Literature review (Table – 1)
S.No | Dataset name | Description | Reference | Web link
1 | SpamAssassin | 1,897 spam and 4,150 ham messages | (Méndez et al., 2006) | https://spamassassin.apache.org/old/publiccorpus/
2 | Princeton Spam Image Benchmark | 1,071 spam images | (Biggio et al., 2011) | https://www.cs.princeton.edu/cass/spam/
3 | Dredze Image Spam Dataset | 3,927 spam and 2,006 spam images | (Almeida & Yamakami, 2012) | https://www.cs.jhu.edu/~mdredze/datasets/image_spam/
4 | ZH1–Chinese email spam dataset | 1,205 spam and 428 ham text emails | (Zhang, Zhu & Yao, 2004) | https://archive.ics.uci.edu/ml/datasets/spambase
5 | Enron-Spam | 13,496 spam and 16,545 non spam email text | (Koprinska et al., 2007) | http://www2.aueb.gr/users/ion/data/enron-spam/
12. OUTPUT
• Any incoming external email can be checked and classified as spam, so
users will be aware of such emails.
• Mails are classified into spam and not spam.
• From the classified data we calculated the accuracy as 98.29%.
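As a rough illustration of how accuracy, precision and recall can be computed from classified data, the sketch below compares true labels with predicted labels. The function name evaluate and the eight toy labels are hypothetical; they are not the project's data and do not reproduce the accuracy figure reported above.

```python
def evaluate(y_true, y_pred, positive="spam"):
    pairs = list(zip(y_true, y_pred))
    # True positives: spam correctly flagged as spam.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: ham wrongly flagged as spam.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: spam that was missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical labels for eight messages (illustration only).
y_true = ["spam", "ham", "ham", "spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]
accuracy, precision, recall = evaluate(y_true, y_pred)

print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
```

Accuracy is the share of all messages classified correctly, while precision and recall focus on the spam class only, which matters when spam and ham are not equally frequent.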
15. References
1. A comparative performance study of feature selection methods for the anti-spam filtering domain.
2. A survey and experimental evaluation of image spam filtering techniques.
3. Advances in spam filtering techniques.
4. An evaluation of statistical spam filtering techniques.
5. Learning to classify e-mail.