I'm dragging a discussion we were having off the list back on.

I think I might have come up with a satisfactory way to scan for -all- RFC 822 addresses in free text (unfortunately, you might tend to pick up other stuff that also looks like an address). The important trick is to match a subset of RFC 822 which still gives you a deliverable address. While:

    This is valid <foo@bar.com>

might be a complete address, you can get away with just grabbing 'foo@bar.com'. This is the "global address" (addr-spec in the BNF). It's still complicated to match, but it's something that's fairly unique in plain text.

On Mon, Jan 31, 2000 at 09:38:40PM -0500, Bennett Todd wrote:
> 2000-01-31-21:26:18 Michael G Schwern:
> > 2000-01-31-21:07:11 Bennett Todd:
> > >
> > >     ouch@[wow isn't this fun?].the.pain.net
> > >
> > > I'll grant you that the BNF you quoted would seem to permit
> > > the above address, but is there any email system following
> > > RFC82[12] where anything other than user@domainname or
> > > user@[ip.addr.in.quads] can actually be delivered?
> >
> > Any good MTA should understand it. I'm sure sendmail wouldn't
> > have a problem. Whether or not anyone actually -uses- it is
> > another question.
>
> Let me cast the question another way.
>
>     [wow isn't this fun?].the.pain.net
>
> is not a valid domain name. An MTA can't do an MX lookup on it.
>
> It also doesn't match the pattern
>
>     [ip.addr.in.quads]
>
> which bypasses the DNS lookup and routes directly (via SMTP) to the
> named IP address.
>
> So even if it's valid according to the BNF, how is it supposed to
> be handled? If it's not deliverable under any circumstances on any
> machine anywhere, then I'd say the fact that the BNF accepts it is a
> bizarre curiosity, rather than a practical problem, no?

Most of RFC822 is a bizarre curiosity.
Frighteningly enough, it follows the same 90/90 pattern that parsing English does (nailing 90% of the cases takes 90% of the time, getting the last 10% takes the other 90% of the time).

You're right, most of this is just intellectual wanking. You can nail most email addresses with a simple regex, but there's -always- someone that takes advantage of that. In that simple regex I posted on FWP earlier, I added a specific case to nail *@qz.to (Eli the Bearded) just because I knew it was out there and that most email scanners missed it.

I'd say the cases you really have to worry about are:

    simple@email.address.com
    "quoted name"@foo.com
    "quoted name with \" character"@foo.com
    *@qz.to    (and other valid special characters)
    foo@tld    (top level domains)

You can scan for those with something like:

    # Build up basic RFC 822 BNF definitions.
    # These first four are fragments meant to sit inside a [...] class,
    # so the backslash and brackets in $specials are escaped.
    $specials = '()<>@,;:\\\\".\\[\\]';
    $space    = '\040';
    $char     = '\000-\177';
    $ctl      = '\000-\037\177';

    # qtext is a single character; the * in quoted-string repeats it.
    $qtext_re          = qr/[^"\\\r]/;
    $qpair_re          = qr/\\[$char]/;
    $quoted_string_re  = qr/"(?:$qtext_re|$qpair_re)*"/;
    $atom_re           = qr/[^$ctl$space$specials]+/;
    $domain_ref_re     = $atom_re;
    $dtext_re          = qr/[^\[\]\\\r]/;
    $domain_literal_re = qr/\[(?:$dtext_re|$qpair_re)*\]/;
    $sub_domain_re     = qr/(?:$domain_ref_re|$domain_literal_re)/;
    $domain_re         = qr/$sub_domain_re(?:\.$sub_domain_re)*/;
    $word_re           = qr/(?:$atom_re|$quoted_string_re)/;
    $local_part_re     = qr/$word_re(?:\.$word_re)*/;

    # The only capturing group is the whole addr-spec.
    $addr_spec_re      = qr/($local_part_re\@$domain_re)/;

You can then use $addr_spec_re to scan a text document for a subset of RFC 822 and feed each match into Email::Valid (not entirely sure that's necessary; my regex might be complete enough).

Of course, none of that is optimized. I'm sure I could tighten it up, but... whatever. Good enough for 3am. I believe that will nail them all with a minimum of fuss, even the domain-literals. Of course, I'm not sure what else it will pick up.

Of course, there is also the problem of accidentally picking up message IDs. I have -no- idea what to do about that.
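For what it's worth, here is a rough sketch of how the scanning step might look. The sub name scan_addresses and the sample text are made up for illustration. The definitions are repeated, with the ranges wrapped in character classes and the brackets escaped, so the snippet runs on its own. I've left out the Email::Valid pass so it needs nothing from CPAN.

```perl
use strict;
use warnings;

# Fragments meant to be interpolated inside a [...] character class.
my $specials = '()<>@,;:\\\\".\\[\\]';   # backslash and brackets escaped
my $space    = '\040';
my $char     = '\000-\177';
my $ctl      = '\000-\037\177';

# RFC 822 building blocks; only $addr_spec_re has a capturing group.
my $qtext_re          = qr/[^"\\\r]/;
my $qpair_re          = qr/\\[$char]/;
my $quoted_string_re  = qr/"(?:$qtext_re|$qpair_re)*"/;
my $atom_re           = qr/[^$ctl$space$specials]+/;
my $dtext_re          = qr/[^\[\]\\\r]/;
my $domain_literal_re = qr/\[(?:$dtext_re|$qpair_re)*\]/;
my $sub_domain_re     = qr/(?:$atom_re|$domain_literal_re)/;
my $domain_re         = qr/$sub_domain_re(?:\.$sub_domain_re)*/;
my $word_re           = qr/(?:$atom_re|$quoted_string_re)/;
my $local_part_re     = qr/$word_re(?:\.$word_re)*/;
my $addr_spec_re      = qr/($local_part_re\@$domain_re)/;

# Hypothetical helper: return every addr-spec found in a chunk of text.
# In list context, m//g returns the captured group of each match in turn.
sub scan_addresses {
    my ($text) = @_;
    return $text =~ /$addr_spec_re/g;
}

# Made-up sample text covering a few of the cases listed above.
my $text = <<'END';
Reach me at simple@email.address.com or "quoted name"@foo.com.
Eli is *@qz.to, and ouch@[10.0.0.1] is a domain-literal.
END

print "$_\n" for scan_addresses($text);
# prints:
#   simple@email.address.com
#   "quoted name"@foo.com
#   *@qz.to
#   ouch@[10.0.0.1]
```

Note that trailing sentence punctuation is dropped, because a "." only counts as part of the domain when another sub-domain follows it. The sketch does nothing about message IDs; it will happily pick those up too.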
Now, I could modularize this... but would I just be making a scanning library for spammers?

-- 
Michael G Schwern schwern@pobox.com http://www.pobox.com/~schwern
/(?:(?:(1)[.-]?)?\(?(\d{3})\)?[.-]?)?(\d{3})[.-]?(\d{4})(x\d+)?/i

==== Want to unsubscribe from Fun With Perl? Well, if you insist...
==== Send email to <fwp-request@technofile.org> with message _body_
====   unsubscribe