Wednesday, 08/21/02

Continuing to track Bayesian false positives: I added all my personal mail from the year to date to my good base set. Then I spot-checked the 168 messages recent enough that they're still sitting in my shell account. Two came back false positive for spam -- an automatically generated message from TiVo, and an automatically generated message from Apple.

TiVo's message is more interesting. Here's the terms list:

	0.99 - 4at2
	0.99 - serif
	0.01 - okay
	0.99 - 4at1
	0.99 - _blank
	0.99 - serif
	0.99 - ff9900
	0.99 - 4at2
	0.99 - sans
	0.99 - email_domains
	0.99 - email_domains
	0.99 - 2002
	0.99 - 4at2
	0.99 - 4at2
	0.99 - 4at1

It's HTML email, which damns it from the get-go, since the only real person who ever sends me any HTML email is Clemma. But the email_domains token -- what the hell is that? Turns out it appears almost 200 times in my spam set, all in newslettery messages with text/plain parts formatted just like TiVo's missives, which all contain a bunch of URLs pointing at 4at1.com (you can see above that "4at1" and "4at2" didn't help). Unsettling to think of TiVo using the same bulk mailing/personalization service as spammers.

The Apple message is also HTML email with a text/plain part. Its terms list:

	0.99 - 3d
	0.99 - 3d
	0.99 - 3d
	0.99 - 3d
	0.99 - 3d
	0.01 - v10
	0.01 - sherlock
	0.01 - jaguar
	0.99 - 3d0
	0.01 - v10
	0.01 - v10
	0.99 - 3d28
	0.99 - 3d
	0.01 - sherlock
	0.99 - 3d
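
To see why a handful of 0.01 tokens can't rescue either message, here is a minimal sketch of how fifteen token probabilities might be combined. It assumes the naive Bayes combination described in Paul Graham's "A Plan for Spam", prod(p) / (prod(p) + prod(1 - p)); this entry doesn't say which formula the filter actually uses, and the function name is made up for illustration.

```python
from math import prod

def combined_spam_probability(token_probs):
    """Assumed combination rule: prod(p) / (prod(p) + prod(1 - p))."""
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)

# TiVo list above: fourteen tokens at 0.99, one ("okay") at 0.01.
tivo = [0.99] * 14 + [0.01]

# Apple list above: nine tokens at 0.99, six (v10, sherlock, jaguar) at 0.01.
apple = [0.99] * 9 + [0.01] * 6

tivo_score = combined_spam_probability(tivo)
apple_score = combined_spam_probability(apple)
print(tivo_score, apple_score)
```

Under this rule each 0.01 token cancels exactly one 0.99 token, so the Apple message is left with three unopposed 0.99s and the TiVo message with thirteen. Either is far more than enough to push the combined score to within a hair of 1.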

The "3d" is a lowercased 3D, which is all too common, since =3D is the quoted-printable encoding of the equals sign. (3D is hexadecimal for 61, the ASCII code for "=".) That means "3d" appears in most quoted-printable HTML messages at least once for every hyperlink, since every href attribute carries an equals sign. The filters can't tell the difference between an HTML message with a lot of hyperlinks and a rare hypothetical message from a chum who's thrilled to have won a copy of Maya as a door prize -- I might as well have received a non-spam message about a great new herbal Viagra. Maybe I need to see how well things work with a minimum token size of three characters.
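
A rough sketch of where the "3d" tokens come from, and of two ways to get rid of them. The tokenizer here is hypothetical (runs of letters, digits, and underscores, lowercased), the sample markup is invented, and decoding the quoted-printable body before tokenizing is an alternative not tried above; only the minimum-length idea comes from this entry.

```python
import quopri
import re

# Hypothetical tokenizer: runs of letters, digits, and underscores, lowercased.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")

def tokenize(text, min_len=1):
    """Return lowercased tokens that are at least min_len characters long."""
    return [t.lower() for t in TOKEN_RE.findall(text) if len(t) >= min_len]

# Invented quoted-printable HTML fragment: each "=" in the markup is sent as "=3D".
raw = b'<a href=3D"http://example.com/" target=3D"_blank">Maya 3D</a>'

# Tokenizing the raw body: every encoded equals sign leaves a "3d" behind.
raw_tokens = tokenize(raw.decode("ascii"))

# Option 1: decode quoted-printable first, so only the genuine "3D" survives.
decoded = quopri.decodestring(raw).decode("ascii")
decoded_tokens = tokenize(decoded)

# Option 2: keep the raw body but drop tokens shorter than three characters.
long_tokens = tokenize(raw.decode("ascii"), min_len=3)

print(raw_tokens.count("3d"), decoded_tokens.count("3d"), long_tokens.count("3d"))
```

The tradeoff: a three-character minimum throws out the chum's legitimate "3D" along with the encoding debris, and it would still let through longer fragments like the "3d0" and "3d28" in the Apple list. Decoding first keeps the real "3D" and removes only the artifacts.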

I could go on and on, of course, but I need to set this topic down for a bit and finish a book proposal. 12:08PM

