Topic Models Based Personalized Spam Filter
Sudarsun S, Director - R&D, Checktronix India Pvt Ltd, Chennai
Venkatesh Prabhu G, Research Associate, Checktronix India Pvt Ltd, Chennai
Valarmathi B, Professor, SKP Engineering College, Thiruvannamalai
What is Spam?
- Unsolicited, unwanted email
What is Spam Filtering?
- Detection/filtering of unsolicited content
What is Personalized Spam Filtering?
- The definition of "unsolicited" becomes personal
Approaches
- Origin-based filtering [generic]
- Content-based filtering [personalized]
Content-Based Filtering
- What does the message contain? Images, text, URLs
- Is it "irrelevant" to my preferences? How do we define relevancy?
- How does the system understand relevancy?
  - Supervised learning: teach the system what I like and what I don't
  - Unsupervised learning: decisions made using latent patterns
Content-Based Filtering: Methods
- Bayesian spam filtering
  - Simplest design / low computation cost
  - Based on keyword distribution
  - Cannot work on contexts; accuracy is around 60%
- Topic-model-based text mining
  - Based on the distribution of n-grams (key phrases)
  - Addresses synonymy and polysemy
  - Low run-time computation cost; unsupervised technique
- Rule-based filtering
  - Supervised technique based on hand-written rules
  - Best accuracy for known cases, but cannot adapt to new patterns
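A minimal sketch of such a keyword-distribution Bayesian filter (naive Bayes with add-one smoothing); the tokenizer and toy training mails are illustrative assumptions, not the deck's implementation.

```python
# Naive Bayes spam filtering over keyword distributions (illustrative sketch).
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, label) pairs with label in {"spam", "ham"}."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for tokens, label in docs:
        counts[label].update(tokens)
        totals[label] += 1
    return counts, totals

def classify(tokens, counts, totals):
    vocab = set(counts["spam"]) | set(counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        # log prior + sum of log likelihoods with add-one smoothing
        score = math.log(totals[label] / sum(totals.values()))
        denom = sum(counts[label].values()) + len(vocab)
        for t in tokens:
            score += math.log((counts[label][t] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

docs = [("cheap pills buy now".split(), "spam"),
        ("meeting agenda for tomorrow".split(), "ham")]
counts, totals = train(docs)
print(classify("buy cheap pills".split(), counts, totals))  # -> spam
```

Because the decision rests only on per-keyword counts, the model cannot capture context, which is the limitation noted above.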
Topic Models
- Treat every word as a feature
- Represent the corpus as a distribution in a high-dimensional space
- SVD: decomposes the high-dimensional data into a small reduced subspace containing only the dominant feature vectors
- PLSA: documents can be understood as a mixture of topics
Rule-Based Approaches
- N-grams, a language-model approach
- The more n-grams two texts share, the closer their patterns are
LSA Model, in Brief
- Describes the underlying structure of text and computes similarities between texts
- Represents documents in a high-dimensional semantic space (the term-document matrix, TDM)
- The high-dimensional space is approximated by a low-dimensional one using Singular Value Decomposition (SVD)
- SVD decomposes the TDM into U, S, V matrices:
  - U: left singular vectors (reduced word vectors)
  - V: right singular vectors (reduced document vectors)
  - S: array of singular values (variances or scaling factors)
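To make the decomposition concrete, here is a minimal LSA sketch in NumPy; the toy corpus, the retained rank k, and the standard folding-in formula for projecting a new mail are illustrative assumptions.

```python
# LSA: build a term-document matrix, truncate its SVD to rank k,
# and fold a new document into the reduced semantic space.
import numpy as np

corpus = ["cheap pills cheap offer",
          "project meeting agenda",
          "offer pills discount"]
vocab = sorted({w for doc in corpus for w in doc.split()})
# Term-document matrix: rows = terms, columns = documents (raw counts).
A = np.array([[doc.split().count(w) for doc in corpus] for w in vocab], float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                      # retained rank (assumed)
Uk, Sk = U[:, :k], S[:k]                   # reduced word vectors + scaling

# Fold a new mail into the k-dimensional space: v = q^T U_k S_k^{-1}
q = np.array([["cheap pills offer".split().count(w)] for w in vocab], float)
v = (q.T @ Uk) / Sk
print(v)                                   # 1 x k pseudo-document vector
```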
PLSA Model
- In the PLSA model, a document is a mixture of topics, and topics generate words.
- The probabilistic latent factor model can be described by the following generative process:
  1. Select a document d_i from D with probability Pr(d_i).
  2. Pick a latent factor z_k with probability Pr(z_k | d_i).
  3. Generate a word w_j from W with probability Pr(w_j | z_k).
- The resulting joint distribution is Pr(d_i, w_j) = Pr(d_i) * sum_k Pr(z_k | d_i) Pr(w_j | z_k).
- The aspect model parameters are computed with the EM algorithm.
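A minimal sketch of the EM updates for this aspect model, following Hofmann's PLSA [9]; the toy word-document count matrix N and the aspect count K = 2 are assumptions for illustration.

```python
# PLSA parameter estimation via EM over a word-document count matrix.
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(0, 5, size=(6, 4)).astype(float)  # N[w, d]: count of word w in doc d
W, D, K = N.shape[0], N.shape[1], 2

Pz_d = rng.dirichlet(np.ones(K), size=D).T   # Pr(z|d), shape K x D
Pw_z = rng.dirichlet(np.ones(W), size=K).T   # Pr(w|z), shape W x K

for _ in range(50):
    # E-step: Pr(z | d, w) proportional to Pr(w|z) Pr(z|d)
    post = Pw_z[:, :, None] * Pz_d[None, :, :]        # W x K x D
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate parameters from expected counts
    C = N[:, None, :] * post                          # expected counts, W x K x D
    Pw_z = C.sum(axis=2); Pw_z /= Pw_z.sum(axis=0, keepdims=True)
    Pz_d = C.sum(axis=0); Pz_d /= Pz_d.sum(axis=0, keepdims=True)

print(Pz_d.T)  # topic mixture of each document
```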
N-Gram Approach
- A language-model approach that looks for repeated patterns
- Each word depends probabilistically on the n-1 preceding words
- Classification works by calculating and comparing n-gram profiles
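A minimal sketch of n-gram profile construction and comparison; the word bigrams, the profile size, and the "out-of-place" rank distance are assumed choices in the spirit of Damashek [13], not necessarily the deck's exact method.

```python
# Build ranked n-gram profiles and compare them by rank distance.
from collections import Counter

def ngram_profile(text, n=2, top=50):
    """Most frequent word n-grams of a text, ranked by frequency."""
    words = text.lower().split()
    grams = Counter(tuple(words[i:i+n]) for i in range(len(words) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def profile_distance(p1, p2):
    """Sum of rank differences ("out-of-place"); smaller = closer patterns."""
    r2 = {g: i for i, g in enumerate(p2)}
    return sum(abs(i - r2.get(g, len(p2))) for i, g in enumerate(p1))

p_spam = ngram_profile("buy cheap pills buy cheap pills now")
p_mail = ngram_profile("cheap pills buy now cheap pills")
print(profile_distance(p_mail, p_spam))  # small distance -> similar profiles
```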
Overall System Architecture
[Diagram: training mails pass through the preprocessor into the LSA model, PLSA model, N-gram model, and other classifiers; a combiner merges their predictions into the final result. A test mail follows the same path.]
Preprocessing
- Feature extraction: tokenizing
- Feature selection: pruning, stemming, weighting
- Feature representation: term-document matrix generation
- Sub-spacing: LSA / PLSA model projection
- Feature reduction: Principal Component Analysis
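A minimal sketch of this preprocessing chain; the toy suffix-stripping stemmer stands in for a real stemmer (e.g. Porter), and TF-IDF is an assumed choice of weighting.

```python
# Preprocessing: tokenize, stem, prune rare terms, weight with TF-IDF,
# and emit a term-document matrix (rows = terms, columns = documents).
import math
from collections import Counter

def stem(w):
    """Toy suffix stripper; a stand-in for a real stemmer."""
    for suf in ("ing", "ed", "s"):
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return w[: -len(suf)]
    return w

def preprocess(corpus, min_df=1):
    docs = [[stem(w) for w in d.lower().split()] for d in corpus]
    df = Counter(t for d in docs for t in set(d))
    vocab = sorted(t for t, c in df.items() if c >= min_df)  # pruning
    tdm = []
    for t in vocab:
        idf = math.log(len(docs) / df[t])                    # TF-IDF weighting
        tdm.append([d.count(t) * idf for d in docs])
    return vocab, tdm

vocab, tdm = preprocess(["cheap pills offered", "meeting agenda", "cheap pills"])
print(vocab)
```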
Principal Component Analysis (PCA)
- Data reduction: ignore the features of lesser significance
- Given N data vectors in k dimensions, find c <= k orthogonal vectors that best represent the data
- The original data set is reduced to N data vectors over c principal components (reduced dimensions)
- Detects structure in the relationships between variables, which is then used to classify the data
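A minimal PCA sketch matching the description above: center the data, take the covariance eigenvectors with the c largest eigenvalues, and project onto them.

```python
# PCA: project N k-dimensional vectors onto their top-c principal components.
import numpy as np

def pca(X, c):
    """X: N x k data matrix; returns the N x c reduced data and the components."""
    Xc = X - X.mean(axis=0)                       # center the data
    cov = np.cov(Xc, rowvar=False)                # k x k covariance matrix
    vals, vecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    comps = vecs[:, np.argsort(vals)[::-1][:c]]   # top-c principal components
    return Xc @ comps, comps

X = np.random.default_rng(0).normal(size=(10, 5))
Xr, comps = pca(X, c=2)
print(Xr.shape)  # (10, 2)
```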
LSA Classification
[Pipeline: input mail -> token list -> LSA model (M x R projection, M = vocab size, R = rank) -> 1 x R vector -> PCA (R x R', R = input variable size, R' = output variable size) -> 1 x R' vector -> BPN -> classification score]
PLSA Classification
[Pipeline: input mail -> token list -> PLSA model (M x Z projection, M = vocab size, Z = aspect count) -> 1 x Z vector -> PCA (Z x Z', Z = input variable size, Z' = output variable size) -> 1 x Z' vector -> BPN -> classification score]
(P)LSA Classification
Model training
1. Build the global (P)LSA model using the training mails.
2. Vectorize the training mails using the LSA/PLSA model.
3. Reduce the dimensionality of the matrix of pseudo-vectors of the training documents using PCA.
4. Feed the reduced matrix into the neural network for learning.
Model testing
1. Each test mail is fed to the (P)LSA model for vectorization.
2. The vector is reduced using the PCA model.
3. The reduced vector is fed into the BPN neural network.
4. The BPN network emits its prediction with a confidence score.
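A minimal end-to-end sketch of this pipeline, using scikit-learn components as stand-ins: TruncatedSVD for the LSA projection, PCA for the reduction, and MLPClassifier for the back-propagation network (BPN). The toy mails, labels, and hyperparameters are illustrative assumptions.

```python
# (P)LSA classification pipeline: counts -> LSA -> PCA -> neural network.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

mails = ["cheap pills buy now", "meeting agenda attached",
         "discount offer pills", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(
    CountVectorizer(),             # term-document counts
    TruncatedSVD(n_components=2),  # LSA projection (rank R = 2, assumed)
    PCA(n_components=2),           # further reduction of the pseudo-vectors
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)
clf.fit(mails, labels)
print(clf.predict(["buy cheap pills"]))        # predicted label
print(clf.predict_proba(["buy cheap pills"]))  # confidence scores
```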
N-Gram Method
- Construct an n-gram tree from the training docs
  - Documents make the leaves; nodes are the n-grams identified in the docs
- Weight of an n-gram = number of children; a higher-order n-gram implies more weight
- Weight update: Wt <- Wt * S / (S + L), where
  - P: total number of docs sharing an n-gram
  - S: number of SPAM docs sharing the n-gram
  - L: P - S
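A minimal sketch of this weighting rule; treating "higher order implies more weight" as a multiplicative factor on the child count is an assumed interpretation, not spelled out in the deck.

```python
# N-gram node weight: child count, boosted by order, scaled by spam share.
def ngram_weight(spam_docs, ham_docs, order):
    """spam_docs (S) / ham_docs (L): docs sharing the n-gram; P = S + L."""
    children = spam_docs + ham_docs       # leaves hanging off this node
    wt = children * order                 # assumed order boost (higher n = heavier)
    return wt * spam_docs / (spam_docs + ham_docs)   # Wt <- Wt * S / (S + L)

print(ngram_weight(spam_docs=3, ham_docs=1, order=2))  # 4 * 2 * 3/4 = 6.0
```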
An Example N-Gram Tree
[Figure: documents T1-T5 as leaves, linked by identified n-gram nodes N1-N4 of 1st, 2nd, and 3rd order.]
Combiner: Mixture of Experts
- Get predictions from all the experts
- Use the maximum common prediction
- Use the prediction with the maximum confidence score
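A minimal sketch of one plausible combiner: majority vote over the experts' predictions, falling back to the single most confident expert when there is no majority. Combining the two listed strategies this way is an assumption.

```python
# Mixture-of-experts combiner: majority vote with a confidence tie-break.
from collections import Counter

def combine(predictions):
    """predictions: list of (label, confidence) pairs, one per expert."""
    votes = Counter(label for label, _ in predictions)
    (top, n), *rest = votes.most_common()
    if not rest or n > rest[0][1]:
        return top                                    # maximum common prediction
    return max(predictions, key=lambda p: p[1])[0]    # maximum confidence score

print(combine([("spam", 0.9), ("ham", 0.6), ("spam", 0.7)]))  # -> spam
```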
Conclusion
- The objective is to filter mail messages based on an individual's preferences.
- Classification performance increases with increased (incremental) training.
- Initial learning is not necessary for LSA, PLSA, and N-gram; the system performs unsupervised filtering.
- Prediction is fast, although background training is a relatively slow process.
References
[1] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos, "An Evaluation of Naïve Bayesian Anti-Spam Filtering", Proc. of the Workshop on Machine Learning in the New Information Age, 2000.
[2] W. Cohen, "Learning Rules that Classify E-mail", AAAI Spring Symposium on Machine Learning in Information Access, 1996.
[3] W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bosch, "TiMBL: Tilburg Memory-Based Learner, Version 4.0 Reference Guide", 2001.
[4] H. Drucker, D. Wu, and V. N. Vapnik, "Support Vector Machines for Spam Categorization", IEEE Trans. on Neural Networks, 1999.
[5] D. Mertz, "Spam Filtering Techniques: Six Approaches to Eliminating Unwanted E-mail", Gnosis Software Inc., September 2002.
[6] M. Vinther, "Junk Detection Using Neural Networks", MeeSoft Technical Report, June 2002. Available: http://logicnet.dk/reports/JunkDetection/JunkDetection.htm
[7] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, 41, 391-407, 1990.
[8] Sudarsun Santhiappan, Venkatesh Prabhu Gopalan, and Sathish Kumar Veeraswamy, "Role of Weighting on TDM in Improvising Performance of LSA on Text Data", Proceedings of IEEE INDICON 2006.
[9] Thomas Hofmann, "Probabilistic Latent Semantic Indexing", Proc. 22nd Int'l SIGIR Conf. on Research and Development in Information Retrieval, 1999.
[10] Sudarsun Santhiappan, Dalou Kalaivendhan, and Venkateswarlu Malapatti, "Unsupervised Contextual Keyword Relevance Learning and Measurement using PLSA", Proceedings of IEEE INDICON 2006.
[11] T. K. Landauer, P. W. Foltz, and D. Laham, "Introduction to Latent Semantic Analysis", Discourse Processes, 25, 259-284, 1998.
[12] G. Furnas, S. Deerwester, S. Dumais, T. Landauer, R. Harshman, L. Streeter, and K. Lochbaum, "Information Retrieval Using a Singular Value Decomposition Model of Latent Semantic Structure", Proc. 11th International Conference on Research and Development in Information Retrieval, Grenoble, France: ACM Press, pp. 465-480, 1988.
[13] M. Damashek, "Gauging Similarity via N-Grams: Language-Independent Sorting, Categorization and Retrieval of Text", Science, 267, 843-848.
[14] Shlomo Hershkop and Salvatore J. Stolfo, "Combining Email Models for False Positive Reduction", KDD'05, August 2005.
Any queries? You can post your queries to [email_address]