Wednesday, 08/21/02
Continuing to track Bayesian false positives: I added all my personal mail from the year to date to my good base set. Then I spot-checked the 168 messages recent enough that they're still sitting in my shell account. Two came back false positive for spam -- an automatically generated message from TiVo, and an automatically generated message from Apple.
TiVo's message is more interesting. Here's the terms list:
0.99 - 4at2
0.99 - serif
0.01 - okay
0.99 - 4at1
0.99 - _blank
0.99 - serif
0.99 - ff9900
0.99 - 4at2
0.99 - sans
0.99 - email_domains
0.99 - email_domains
0.99 - 2002
0.99 - 4at2
0.99 - 4at2
0.99 - 4at1
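For a rough sense of why one friendly token ("okay") can't rescue this message, here's a minimal sketch. It assumes the per-token probabilities get combined with the usual naive-Bayes product rule; that rule is an assumption for illustration, not necessarily what my filter does internally, and the `combine` function name is made up.

```python
from math import prod

def combine(probs):
    """Combine per-token spam probabilities with a naive-Bayes product rule.

    Assumed formula (hypothetical, for illustration):
        P = prod(p) / (prod(p) + prod(1 - p))
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# The 15 token scores from the TiVo message: fourteen 0.99s and one 0.01.
tivo = [0.99, 0.99, 0.01, 0.99, 0.99, 0.99, 0.99, 0.99,
        0.99, 0.99, 0.99, 0.99, 0.99, 0.99, 0.99]

score = combine(tivo)
print(f"combined spam probability: {score:.6f}")
```

Under that assumption, a 0.01 token only cancels a single 0.99 token, so the other thirteen spammy tokens push the combined score to within a hair of 1.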
It's HTML email, which damns it from the get-go, since the only real person who ever sends me any HTML email is Clemma. But the email_domains token -- what the hell is that? Turns out it appears almost 200 times in my spam set, all in newslettery messages with text/plain parts formatted just like TiVo's missives, which all contain a bunch of URLs pointing at 4at1.com (you can see above that "4at1" and "4at2" didn't help). Unsettling to think of TiVo using the same bulk mailing/personalization service as spammers.
The Apple message is also HTML email with a text/plain part.
0.99 - 3d
0.99 - 3d
0.99 - 3d
0.99 - 3d
0.99 - 3d
0.01 - v10
0.01 - sherlock
0.01 - jaguar
0.99 - 3d0
0.01 - v10
0.01 - v10
0.99 - 3d28
0.99 - 3d
0.01 - sherlock
0.99 - 3d
The "3d" is a lowercased 3D, which is all too common, since "=3D" is how quoted-printable encodes the equals sign. (3D is hexadecimal for 61, the ASCII number for "=".) That means "3d" appears in most quoted-printable HTML messages at least once for every hyperlink. The filters can't tell the difference between an HTML message with a lot of hyperlinks and a rare hypothetical message from a chum who's thrilled to have won a copy of Maya as a door prize -- I might as well have received a non-spam message about a great new herbal Viagra. Maybe I need to see how well things work with a minimum token size of three characters.
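Here's a small sketch of where those tokens come from and what a three-character minimum would do to them. The tokenizer is a guess (split on anything that isn't a letter, digit, or underscore, then lowercase), and `tokenize`, `decode_qp`, and the sample markup are all hypothetical. Decoding quoted-printable before tokenizing is shown only as a contrast; it's not something I'm doing now.

```python
import quopri
import re

# 0x3D is 61 decimal, the ASCII code for "=".
assert 0x3D == 61 and chr(0x3D) == "="

# Hypothetical tokenizer: runs of letters, digits, and underscores.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")

def tokenize(text, min_len=1):
    """Lowercase tokens, dropping any shorter than min_len characters."""
    return [t.lower() for t in TOKEN_RE.findall(text) if len(t) >= min_len]

def decode_qp(text):
    """Undo quoted-printable encoding using the standard library."""
    return quopri.decodestring(text.encode("ascii")).decode("ascii")

# Made-up sample of quoted-printable HTML as it sits in the raw message.
raw = '<a href=3D"http://example.com/" border=3D0>'

raw_tokens = tokenize(raw)                # includes '3d' and '3d0'
long_tokens = tokenize(raw, min_len=3)    # '3d' dropped, '3d0' survives
decoded_tokens = tokenize(decode_qp(raw)) # no '3d' variants at all

print(raw_tokens)
print(long_tokens)
print(decoded_tokens)
```

If the tokenizer behaves anything like this guess, a three-character minimum would drop "3d" but leave tokens like "3d0" and "3d28" from the Apple list in play.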
I could go on and on, of course, but I need to set this topic down for a bit and finish a book proposal. 12:08PM