Tuesday, 08/20/02

Here's an interesting error I made in my Bayesian spam filtering experiments: I created a non-spam base set from an archival mailbox of my personal mail, because it was just about the same size as the spam mailbox I'd assembled. Because the cutoff date for my mail archive was the end of last year, the filters attached enormous spam value to the token "2002".
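A rough sketch of why that happens, assuming a per-token probability computed from how often a token shows up in each corpus (one common way to do Bayesian filtering, not necessarily the exact formula used here). The function name, the clamping bounds, and all the counts below are made up for illustration.

```python
from collections import Counter

def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how spammy a token is from its frequency in each corpus."""
    spam_freq = min(1.0, spam_counts[token] / n_spam)
    ham_freq = min(1.0, ham_counts[token] / n_ham)
    if spam_freq + ham_freq == 0:
        return 0.5  # never seen anywhere: no evidence either way
    p = spam_freq / (spam_freq + ham_freq)
    # Clamp so that a single token is never treated as absolute proof.
    return max(0.01, min(0.99, p))

# Hypothetical counts: number of messages in each corpus containing the token.
# The non-spam archive stops at the end of 2001, so "2002" never appears in it.
spam_counts = Counter({"2002": 180, "the": 195})
ham_counts = Counter({"2002": 0, "the": 190})
n_spam = n_ham = 200

p_2002 = token_spam_probability("2002", spam_counts, ham_counts, n_spam, n_ham)
p_the = token_spam_probability("the", spam_counts, ham_counts, n_spam, n_ham)
print(p_2002, p_the)
```

With a non-spam corpus that contains no mail from this year, "2002" lands at the upper clamp, while a word common to both corpora stays near 0.5.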

Looking for false positives, I started by testing against a few dozen current messages. In one case a message containing the unusual email address of an old girlfriend (repeated three times) and a rarely used nickname for myself was determined to be spam nonetheless, because it also prominently featured the date several times.
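To see how a few strongly spammy tokens can outvote strongly personal ones, here is a minimal sketch of the usual naive Bayes combination of per-token probabilities. The per-token values are invented, and treating each occurrence of the date as separate evidence is an assumption; the actual filter may count repeats differently.

```python
import math

def combined_spam_probability(token_probs):
    """Naive Bayes combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam_product = math.prod(token_probs)
    ham_product = math.prod(1.0 - p for p in token_probs)
    return spam_product / (spam_product + ham_product)

# Hypothetical per-token values, clamped to [0.01, 0.99].
personal_evidence = [0.01, 0.01]     # the unusual address and the nickname
date_evidence = [0.99, 0.99, 0.99]   # the date token, counted once per occurrence

score = combined_spam_probability(personal_evidence + date_evidence)
print(score)
```

Two strong non-spam tokens cancel two of the date tokens, and the one left over is enough to push the message over the line.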

The other term from that message which came up big for a spam rating was "complexes". I wouldn't have noticed, but it appears frequently in stock scams, usually within the phrase "mega complexes" or "mega entertainment complexes". I have 18 such messages in my collection. It's that sort of observation that has me all a-twitter about this approach. I'd never have built a spam rule that gave any weight to "complexes".

The funniest spam-significant word from my initial tests is "bumpyslidebooks". If you send me a message containing the word "bumpyslidebooks", the filters are going to take a long look. 10:42PM

