
Thursday, 08/29/02

Paul Graham, who wrote the Bayesian spam filtering article and sparked a lot of interest in the subject, is back with an op-ed of sorts: Spam is Different. In the preface he says, "I wrote this partly for computer people, to explain why spam doesn't have to be protected as free speech, and partly for direct marketers, a few of whom aren't yet quite clear about the difference between email and other forms of advertising."

In a nutshell, he's trying to explain why it's okay to make spam illegal (setting aside the logistical difficulties of prosecuting offenders beyond national borders).

Here's what he says early on:

Some people say the difference with spam is that the cost of email is shared between the sender and the recipient. The problem with spam, this argument goes, is that it's like sending a letter postage due.

I don't think this is the real problem. If spammers did reimburse you the cost of the resources they used, would spam stop bothering you?

After answering himself in the negative, here's what he says near the end:

What saves print catalogs, ironically, is their own cost. There are plenty of companies that would send a courier to interrupt you with an offer to refinance your mortgage if they could afford to. (The same ones that now send you spam.) But the response rate wouldn't justify the cost.

Ultimately, it's the low cost of spam that's the root of the problem.

In other words, if spammers did reimburse me for the cost of the resources they use, spam would stop bothering me -- not because I would personally receive adequate compensation, but because spammers couldn't afford to send spam.

There's no reason to resort to the spooky claim that spam is somehow a new class of unprotected speech, like obscenity, because the conventional economic argument that spam improperly consumes the receiver's resources gets the job done. This, not annoyance, is the predicate for banning unsolicited faxes.

I like Graham's point in the Bayes article about nomenclature. He argues successfully that classifying spam as "unsolicited commercial email" is insufficiently precise for several reasons, and makes a good case for using "unsolicited automated email" in its place. I bring this up because I previously thought UCE was a good technical definition of spam, and he convinced me otherwise.

His nomenclature argument undermines his legal argument -- if a politician mass-mailed "Vote for me!" messages to people in his district, that's bloody well protected speech, and it's also spam. We need a way to squash one without squashing the other. A legal method, whether it constricts the First Amendment or is more narrowly targeted along the lines of Washington State's spam laws, is never going to exhibit the finesse of a well-crafted technical method, like the one Graham himself popularized. 01:38PM «

Friday, 08/23/02

Trademark Blog: New York Court Embraces Order To Chill 12:19PM «

Wednesday, 08/21/02

Continuing to track Bayesian false positives: I added all my personal mail from the year to date to my good base set. Then I spot-checked the 168 messages recent enough that they're still sitting in my shell account. Two came back false positive for spam -- an automatically generated message from TiVo, and an automatically generated message from Apple.

TiVo's message is more interesting. Here's the terms list:

	0.99 - 4at2
	0.99 - serif
	0.01 - okay
	0.99 - 4at1
	0.99 - _blank
	0.99 - serif
	0.99 - ff9900
	0.99 - 4at2
	0.99 - sans
	0.99 - email_domains
	0.99 - email_domains
	0.99 - 2002
	0.99 - 4at2
	0.99 - 4at2
	0.99 - 4at1

It's HTML email, which damns it from the get-go, since the only real person who ever sends me any HTML email is Clemma. But the email_domains token -- what the hell is that? Turns out it appears almost 200 times in my spam set, all in newslettery messages with text/plain parts formatted just like TiVo's missives, which all contain a bunch of URLs pointing at 4at1.com (you can see above that "4at1" and "4at2" didn't help). Unsettling to think of TiVo using the same bulk mailing/personalization service as spammers.
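To see why one friendly token can't rescue a message like this, here is a minimal sketch of the arithmetic. It assumes the filter combines the fifteen most interesting token scores the way Graham's article describes (the product of the scores, normalized against the product of their complements). The function name is mine, not anything from a real library.

```python
from math import prod  # Python 3.8+

def combined_spam_probability(token_probs):
    """Combine per-token spam scores as in Graham's article (assumed here)."""
    spam = prod(token_probs)                  # evidence the message is spam
    ham = prod(1.0 - p for p in token_probs)  # evidence the message is not spam
    return spam / (spam + ham)

# The fifteen scores from the TiVo terms list: fourteen at 0.99, one at 0.01.
tivo_scores = [0.99] * 14 + [0.01]
tivo_result = combined_spam_probability(tivo_scores)

# A message with one damning token and one exonerating token lands in the middle.
balanced_result = combined_spam_probability([0.99, 0.01])

print(f"TiVo: {tivo_result:.6f}  balanced: {balanced_result:.2f}")
```

With fourteen scores pinned at 0.99, the lone 0.01 for "okay" cancels exactly one of them and the other thirteen carry the day.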

The Apple message is also HTML email with a text/plain part.

	0.99 - 3d
	0.99 - 3d
	0.99 - 3d
	0.99 - 3d
	0.99 - 3d
	0.01 - v10
	0.01 - sherlock
	0.01 - jaguar
	0.99 - 3d0
	0.01 - v10
	0.01 - v10
	0.99 - 3d28
	0.99 - 3d
	0.01 - sherlock
	0.99 - 3d

The "3d" is a lowercased 3D, which is all too common, since "=3D" is how quoted-printable encodes the equals sign. (3D is hexadecimal for 61, the ASCII number for "=".) That means "3d" appears in most HTML messages at least once for every hyperlink. The filters can't tell the difference between an HTML message with a lot of hyperlinks and a rare hypothetical message from a chum who's thrilled to have won a copy of Maya as a door prize -- I might as well have received a non-spam message about a great new herbal Viagra. Maybe I need to see how well things work with a minimum token size of three characters.
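Here is a small sketch of the "3d" effect using Python's standard quopri module. The tokenizer is a hypothetical stand-in (runs of letters and digits, lowercased), not the one my filter actually uses, and the sample line is made up. It compares three options: tokenizing the raw body, decoding the quoted-printable first (an alternative I haven't tried), and the three-character minimum.

```python
import quopri
import re

# A made-up quoted-printable fragment: one hyperlink plus a genuine "3D".
raw = b'<a href=3D"http://example.com/">Maya in 3D</a>'

# Decoding turns the "=3D" escape back into a plain "=".
decoded = quopri.decodestring(raw).decode("ascii")

def tokenize(text, min_len=1):
    """Hypothetical tokenizer: lowercase alphanumeric runs of at least min_len."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= min_len]

raw_tokens = tokenize(raw.decode("ascii"))               # "3d" from the link and from the text
decoded_tokens = tokenize(decoded)                       # only the genuine "3d" survives
long_tokens = tokenize(raw.decode("ascii"), min_len=3)   # "3d" is dropped entirely

print(raw_tokens.count("3d"), decoded_tokens.count("3d"), long_tokens.count("3d"))
```

Decoding first keeps the chum's Maya message distinguishable from a pile of hyperlinks. The three-character minimum throws away both kinds of "3d", along with every other short token.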

I could go on and on, of course, but I need to set this topic down for a bit and finish a book proposal. 12:08PM «

Tuesday, 08/20/02

Here's an interesting error I made in my Bayesian spam filtering experiments: I created a non-spam base set from an archival mailbox of my personal mail, because it was just about the same size as the spam mailbox I'd assembled. Because the cutoff date for my mail archive was the end of last year, the filters attached enormous spam value to the token "2002".
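The "2002" blunder falls straight out of the per-token formula. This sketch follows the formula in Graham's article as I read it: good counts doubled, tokens seen fewer than five times ignored, result clamped between 0.01 and 0.99. The counts in the example are invented for illustration, not taken from my mailboxes.

```python
def token_spam_probability(good_count, bad_count, n_good, n_bad):
    """Per-token spam score, following the formula in Graham's article.

    good_count / bad_count: occurrences of the token in each corpus.
    n_good / n_bad: number of messages in each corpus.
    Returns None when the token is too rare to score.
    """
    g = 2 * good_count  # good counts are doubled to bias against false positives
    b = bad_count
    if g + b < 5:
        return None
    good_freq = min(1.0, g / n_good)
    bad_freq = min(1.0, b / n_bad)
    return max(0.01, min(0.99, bad_freq / (good_freq + bad_freq)))

# Hypothetical counts: the good archive ends last year, so "2002" never appears in it.
stale_archive = token_spam_probability(good_count=0, bad_count=900, n_good=1000, n_bad=1000)

# Hypothetical counts after adding this year's personal mail to the good set.
fresh_archive = token_spam_probability(good_count=400, bad_count=900, n_good=1000, n_bad=1000)

print(stale_archive, round(fresh_archive, 3))
```

Any token that shows up only in the spam corpus gets pinned at 0.99, whether it's "bumpyslidebooks" or just the current year.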

Looking for false positives, I initially started testing against a few dozen current messages. In one case a message containing the unusual email address of an old girlfriend (repeated three times) and a rarely used nickname for myself was determined to be spam nonetheless, because it also prominently featured the date several times.

The other term from that message which came up big for a spam rating was "complexes". I wouldn't have noticed, but it appears frequently in stock scams, usually within the phrase "mega complexes" or "mega entertainment complexes". I have 18 such messages in my collection. It's that sort of observation that has me all a-twitter about this approach. I'd never have built a spam rule that gave any weight to "complexes".

The funniest spam-significant word from my initial tests is "bumpyslidebooks". If you send me a message containing the word "bumpyslidebooks", the filters are going to take a long look. 10:42PM «

I just got back from a daylong internet outage to be greeted by about 40 pieces of spam. Buried among the spam were a promo offer from Apple, a TidBITS issue, and an automated message I get whenever Lyle Lovett updates his tour schedule.

Did you know that Lyle's leg was crushed in March by his pet bull as he saved his uncle from certain death? And that he subsequently continued to meet and add tour dates? It was news to me. Seems the very definition of a stupendous badass. 12:48PM «

Saturday, 08/17/02

Wesley Felter points to this immensely clever Paul Graham essay on Bayesian approaches to spam management:

But it doesn't mean much to be able to filter out most present-day spam, because spam evolves. Indeed, most antispam techniques so far have been like pesticides that do nothing more than create a new, resistant strain of bugs.

I'm more hopeful about Bayesian filters, because they evolve with the spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.

That'd fit nicely into Mail::Audit. The remarkable thing, which Graham doesn't address, is that this approach should work just as well for all email filtering. You start to file mail from your Aunt Judy into a particular mailbox, the filters notice. You sign up to a new mailing list and put those messages in a particular place, the filters notice. The filters don't know the significance of the Aunt Judy corpus and the spam corpus, but they know the difference, and they take different action.
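A rough sketch of what that could look like with more than two mailboxes. This is an ordinary multinomial naive Bayes classifier with add-one smoothing, not Graham's two-corpus formula, and the class name, folder names, and sample messages are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

class FolderClassifier:
    """Toy multinomial naive Bayes over mail folders (illustrative only)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # folder -> token -> count
        self.message_counts = Counter()           # folder -> messages filed

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, folder, text):
        """Record that a message was filed into the given folder."""
        self.message_counts[folder] += 1
        self.token_counts[folder].update(self.tokenize(text))

    def classify(self, text):
        """Return the most likely folder, or None if nothing has been trained."""
        vocab = set()
        for counts in self.token_counts.values():
            vocab.update(counts)
        total_messages = sum(self.message_counts.values())
        tokens = self.tokenize(text)
        best_folder, best_score = None, -math.inf
        for folder, counts in self.token_counts.items():
            # Log prior: how much mail lands in this folder overall.
            score = math.log(self.message_counts[folder] / total_messages)
            folder_total = sum(counts.values())
            for token in tokens:
                # Add-one smoothing so unseen tokens don't zero out a folder.
                score += math.log((counts[token] + 1) / (folder_total + len(vocab)))
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder

clf = FolderClassifier()
clf.train("aunt-judy", "Love from Aunt Judy, the garden tomatoes are ripe")
clf.train("aunt-judy", "Aunt Judy here, recipe attached for the tomatoes")
clf.train("spam", "Refinance your mortgage now, lowest rates guaranteed")
clf.train("spam", "Mega entertainment complexes stock alert, buy now")
clf.train("list", "[perl-list] patch for Mail::Audit filter rules")

judy_folder = clf.classify("Judy says the tomatoes need water")
scam_folder = clf.classify("buy stock now")
print(judy_folder, scam_folder)
```

The classifier never learns what "aunt-judy" means. It only learns that the folder's vocabulary differs from the others, which is all a delivery rule needs.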

If this works for spam filtering, it's going to put every other email filtering strategy right out of business. 08:55PM «

Tuesday, 08/13/02

Lisa Belkin's cover story for the NYT Magazine about coincidence and human pattern-finding tendencies in a troubling time is wonderful for many reasons, not least this snippet:

Among other things, Tversky disproved the "hot hand" theory of basketball, the belief that a player who has made his last few baskets will more likely than not make his next. After examining thousands of shots by the Philadelphia 76ers, he proved that the odds of a successful shot cannot be predicted by the shots that came before.

I'd never heard of Tversky or this finding before, but I've been fascinated for years by the statistical apparatus of Major League Baseball. The depth of information they collect and the speed and ease with which the most off-the-wall announcers' musings can be backed up by hard numbers never cease to amaze.

(I watched a game early this season where Ichiro was toying with a pitcher, fouling off ball after ball after ball with two strikes, waiting for something he could hit, when sometime after the fourteenth pitch one of the announcers wondered out loud how often Ichiro connects with the ball. This can't be calculated from hits/walks/strikeouts/hit-by-pitch, because you can swing any number of times and hit foul balls before striking out, or not swing at all and get called strikes. Less than three minutes later the announcer had his statistic, some hard number in excess of 92%, tallying the number of times Ichiro had swung the bat and connected in his debut season in major league baseball. That's the mark of a well-designed data model.)

A couple of months ago I was wondering aloud to my pal Ted about what could be called the hot hand theory of baseball. Every time a batter steps up, you're going to see and hear that player's performance in the game to date. Does it mean anything?

It's not as cold a calculation as basketball, because I think there really are pitcher-batter relationships in which one of them just has the other's number, and I would expect that in the large, one will find better batter performance the second and third times through the lineup batting against the same starting pitcher within a game. But I've always wondered just how good a predictor one's intra-game performance really is. Someday I'm going to have to find out. 11:23AM «

Friday, 08/09/02

Automaton's drummer? The guy who turned me down for the job? He has a solo project called Plan B, and now KEXP is playing that. 04:41PM «

Thursday, 08/08/02

It's difficult to understand how The West Wing has filled three seasons without squeezing in Stephin Merritt's adorable anthem, "Washington, D.C.". 10:58AM «

Monday, 08/05/02

"This is Florida," he said. The real payoff's in the second to last paragraph. 10:41AM «

Thursday, 08/01/02

KEXP, which I have playing in the background most of the day, has been on a singleminded Automaton push for the last few weeks. The drummer for Automaton was the guy who turned me down in June for the job on which I'd allowed myself to fixate madly. Most recently, an Automaton song came on while I was actually composing a cover letter for another job.

Hopefully KEXP will follow its usual short-attention-span pattern and start focusing on something else after Automaton's show at I-Spy later this week (focusing on something like Hem, for instance, or the forthcoming Carissa's Wierd album. Yum).

I've been a bad blogger of late. I'll fix that presently. 02:23PM «

