Wednesday, 06/01/05
Spam tools in my life have gotten so good over the last couple of years that I don't consider spam to be a serious problem anymore. Yes, I get hundreds of junk messages a day, but unless I'm compelled to use dialup or go offline for a couple of days, volume isn't an obstacle either.
The core of my system is Michael Tsai's SpamSieve, which continues to be uncannily effective. I especially appreciate that it records statistics on its accuracy: going back to the first of the year I've had an average of 357 spam messages per day, or 53,949 in total, of which a grand total of 28 got through, plus three false positives. All three were marketing messages that really were borderline spam, and which I did not care about at all. Overall accuracy is reported as 99.9%, but that's rounding down -- it's actually 99.94%. That sounds like the definition of a non-problem.
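For anyone who wants to check that arithmetic, here is a minimal sketch in Python. It assumes accuracy is simply one minus the error rate over the 53,949 spam messages; whether the filter also counts good messages in the denominator isn't stated here, and the function name is made up for illustration.

```python
def accuracy_percent(total_messages, false_negatives, false_positives):
    """Percentage of messages classified correctly."""
    errors = false_negatives + false_positives
    return 100.0 * (1 - errors / total_messages)

# Figures from the post. Using the spam count as the denominator
# is an assumption, not the filter's documented formula.
acc = accuracy_percent(53_949, false_negatives=28, false_positives=3)

print(f"{acc:.1f}%")  # one decimal place:  99.9%
print(f"{acc:.2f}%")  # two decimal places: 99.94%
```

If good messages were included in the total, the figure would come out slightly higher, so 99.94% is best read as a floor under this assumption.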
That said, my spam corpus is a year and a half old, which is pushing the limits -- the profiles of spam change over time, and an old corpus keeps fighting the last war. This was driven home to me this morning when SpamSieve flagged its first unmistakably non-spam message in memory. Fortunately, the message's mitigating factors left it with a very low score and it was reported by Growl, so I caught it immediately.
Looking in the SpamSieve log, I see my corpus considers "ajax" a 0.999 probability spam term. That's what happens when the community dreams up new jargon overnight, I suppose. The other spam-associated terms from the message were:
skyscraper(0.998)
ect(0.998) (misspelling of "etc.")
toothpicks?(0.998)
reused(0.995)
R:^dsl^speakeasy^net(0.995)
stearns(0.995) (a correspondent's name)
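To see how a handful of near-certain spam terms can still leave a message with a low score, here is a rough sketch of the naive Bayes combination popularized by Paul Graham's "A Plan for Spam". It is an assumption that the filter combines per-term probabilities this way, and the ten "good" terms at 0.01 are hypothetical stand-ins for the mitigating factors; only the seven spam probabilities come from the log.

```python
import math

def combined_spam_score(probabilities):
    """Combine per-term spam probabilities into a single score,
    treating the terms as independent (the naive Bayes assumption)."""
    spam = math.prod(probabilities)
    ham = math.prod(1 - p for p in probabilities)
    return spam / (spam + ham)

# Per-term probabilities from the log: "ajax" plus the six terms listed.
spam_terms = [0.999, 0.998, 0.998, 0.998, 0.995, 0.995, 0.995]

# Hypothetical: ten terms the corpus strongly associates with good mail.
good_terms = [0.01] * 10

spam_only = combined_spam_score(spam_terms)
with_good = combined_spam_score(spam_terms + good_terms)

print(f"spam terms alone:     {spam_only:.4f}")
print(f"with good terms too:  {with_good:.4f}")
```

On their own the seven terms push the score to essentially 1, but enough strongly good terms drag it back down. A stale corpus hurts because new vocabulary such as "ajax" has only ever been seen in spam.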
Time to rebuild that corpus.
12:15PM