
Spam Filtering
From the Lab to the Real World
Joshua Goodman
Microsoft Research

Outline
From the Lab to the Real World
Step 1: Inventions
Step 2: Picking the right technique
• Digression on evaluating spam filters
Step 3: Shipping it
• Lots and lots of practical issues
Step 4: The future

Invention
If you have a hammer, everything looks like a nail
• Machine Learning and Applied Statistics group full of Bayesians, so we used a Bayesian machine learning approach
Microsoft Research started work on spam filtering in 1997
• Lots and lots of people involved. I’m not the inventor; I helped transfer it to product.

How it Works
Look at various features of the message
• Words in the message
  • "Wet" is spam-like
  • "Weather" is not spam-like
  • "Wet weather" is not spam-like
• Special features
  • Example: what time the message was sent
  • Spam is more likely to be sent in the middle of the night
Machine learning system, similar to a probabilistic neural net
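
To make the feature idea concrete, here is a minimal sketch of scoring a message from word features, word-pair features, and a time-of-day feature. It is not the filter described in these slides: the weights, the bias, the 06:00 cutoff for "middle of the night", and the function names are all invented for illustration, and a plain logistic (sigmoid) score over weighted features stands in for the actual learned model.

```python
import math

# Hypothetical weights, invented for this example. A real filter would
# learn them from labeled mail. Positive values push toward "spam".
WEIGHTS = {
    "word:wet": 1.5,           # "wet" alone is spam-like
    "word:weather": -1.0,      # "weather" is not
    "pair:wet weather": -2.0,  # the pair outweighs "wet" on its own
    "sent_at_night": 0.8,      # special (non-word) feature
}
BIAS = -0.5  # hypothetical prior leaning toward "not spam"


def extract_features(text, hour_sent):
    """Return the set of feature names present in a message."""
    words = text.lower().split()  # naive whitespace tokenization
    features = {"word:" + w for w in words}
    features |= {"pair:" + a + " " + b for a, b in zip(words, words[1:])}
    if 0 <= hour_sent < 6:  # assumption: night means before 06:00
        features.add("sent_at_night")
    return features


def spam_probability(text, hour_sent):
    """Sum the weights of the present features and squash to (0, 1)."""
    features = extract_features(text, hour_sent)
    score = BIAS + sum(WEIGHTS.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    print(spam_probability("wet", 3))           # spam-like word, sent at night
    print(spam_probability("wet weather", 14))  # pair feature pulls it down
```

With these made-up weights, "wet" sent at 3 a.m. scores above 0.5 and "wet weather" sent mid-afternoon scores below it, which mirrors the word, word-pair, and send-time examples above.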

Picking the Right Technique
Evaluation is key
• Otherwise, you don’t know what is working or which techniques to use.
Evaluating spam filters is a lot harder than it sounds
What measures do you use?