
Re: [FWP] R/E Question



I'm dragging a discussion we were having off the list back on.  I
think I might have come up with a satisfactory way to scan for -all-
RFC 822 addresses in free text (unfortunately, you might tend to pick
up other stuff that also looks like an address).

The important trick is to match a subset of RFC 822 which still gives
you a deliverable address.  While:

        This is valid <foo@bar.com>

might be a complete address, you can get away with just grabbing
'foo@bar.com'.  This is the "global address" (addr-spec in the BNF).
It's still complicated to match, but it's something that's fairly
unique in plain text.


On Mon, Jan 31, 2000 at 09:38:40PM -0500, Bennett Todd wrote:
> 2000-01-31-21:26:18 Michael G Schwern:
> > 2000-01-31-21:07:11 Bennett Todd:
> > > > > > ouch@[wow isn't this fun?].the.pain.net
> > > 
> > > I'll grant you that the BNF you quoted would seem to permit
> > > the above address, but is there any email system following
> > > RFC82[12] where anything other than user@domainname or
> > > user@[ip.addr.in.quads] can actually be delivered?
> >
> > Any good MTA should understand it.  I'm sure sendmail wouldn't
> > have a problem.  Whether or not anyone actually -uses- it is
> > another question.
> 
> Let me cast the question another way.
> 
> 	[wow isn't this fun?].the.pain.net
> 
> is not a valid domain name. An MTA can't do an MX lookup on it.
> 
> It also doesn't match the pattern
> 
> 	[ip.addr.in.quads]
> 
> which bypasses the DNS lookup and routes directly (via SMTP) to the
> named IP address.
> 
> So even if it's valid according to the BNF, how is it supposed to
> be handled? If it's not deliverable under any circumstances on any
> machine anywhere, then I'd say the fact that the BNF accepts it is a
> bizarre curiosity, rather than a practical problem, no?

Most of RFC822 is a bizarre curiosity.  Frighteningly enough it
follows the same 90/90 pattern that parsing English does (nailing 90%
of the cases takes 90% of the time, getting the last 10% takes the
other 90% of the time).

You're right, most of this is just intellectual wanking.  You can nail
most emails with a simple regex, but there's -always- someone that
takes advantage of that.  To that simple regex I posted on FWP earlier,
I added a specific case to nail *@qz.to (Eli the Bearded) just because
I knew it was out there and that most email scanners missed it.

I'd say the cases you really have to worry about are:
        simple@email.address.com
        "quoted name"@foo.com
        "quoted name with \" character"@foo.com
        *@qz.to     (and other valid special characters)
        foo@tld     (top level domains)

You can scan for those with something like:
        # Build up basic RFC 822 BNF definitions.
        # These strings are spliced into [...] classes, so the backslash
        # and both brackets in $specials have to be escaped.
        my $specials = '()<>@,;:\\\\".\\[\\]';
        my $space    = '\040';
        my $char     = '\000-\177';
        my $ctl      = '\000-\037\177';

        # qtext is a single char; the repetition lives in quoted-string.
        my $qtext_re = qr/[^"\\\r]/;
        my $qpair_re = qr/\\[$char]/;
        my $quoted_string_re = qr/"(?:$qtext_re|$qpair_re)*"/;

        my $atom_re  = qr/[^$ctl$space$specials]+/;
        my $domain_ref_re = $atom_re;
        my $dtext_re = qr/[^\[\]\\\r]/;
        my $domain_literal_re = qr/\[(?:$dtext_re|$qpair_re)*\]/;
        my $sub_domain_re = qr/(?:$domain_ref_re|$domain_literal_re)/;
        my $domain_re = qr/$sub_domain_re(?:\.$sub_domain_re)*/;

        my $word_re = qr/(?:$atom_re|$quoted_string_re)/;
        my $local_part_re = qr/$word_re(?:\.$word_re)*/;

        # The only capture group left is the whole addr-spec.
        my $addr_spec_re = qr/($local_part_re\@$domain_re)/;

You can then use $addr_spec_re to scan a text document for a subset of
RFC 822 and feed each match into Email::Valid (not entirely sure
that's necessary, my regex might be complete enough.)  Of course, none
of that is optimized, I'm sure I could tighten it up, but... whatever.
Good enough for 3am.
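
For what it's worth, here is a minimal sketch of that scanning step.
The sub name scan_addresses and the sample text are made up for
illustration, and the pattern is a compacted variant of the
definitions above with the brackets and backslash escaped, so treat it
as an assumption about what was intended rather than a tested RFC 822
matcher.  It does not call Email::Valid; each hit could be handed to
it afterwards.

```perl
use strict;
use warnings;

# Character-class fragments from the RFC 822 BNF.  The backslash and
# both brackets are escaped because these get spliced into [...].
my $specials = '()<>@,;:\\\\".\\[\\]';
my $ctl      = '\000-\037\177';
my $space    = '\040';

my $atom    = qr/[^$ctl$space$specials]+/;
my $qpair   = qr/\\[\000-\177]/;                # backslash + any ASCII char
my $qtext   = qr/[^"\\\r]/;
my $quoted  = qr/"(?:$qtext|$qpair)*"/;
my $dtext   = qr/[^\[\]\\\r]/;
my $literal = qr/\[(?:$dtext|$qpair)*\]/;       # e.g. [10.0.0.1]
my $sub_dom = qr/(?:$atom|$literal)/;
my $domain  = qr/$sub_dom(?:\.$sub_dom)*/;
my $word    = qr/(?:$atom|$quoted)/;
my $local   = qr/$word(?:\.$word)*/;
my $addr_spec = qr/$local\@$domain/;

# Hypothetical helper: return every addr-spec-looking substring.
sub scan_addresses {
    my ($text) = @_;
    my @found = $text =~ /($addr_spec)/g;
    return @found;
}

my $sample = 'Mail simple@email.address.com, "quoted name"@foo.com, '
           . '*@qz.to or foo@tld today.';
print "$_\n" for scan_addresses($sample);
```

Note that the same sub will happily return the inside of something
like <12345@host.example>, which is the message ID problem in a
nutshell.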

I believe that will nail them all with a minimum of fuss, even the
domain-literals.  Of course, I'm not sure what else it will pick up.
 
Of course, there is also the problem of accidentally picking up
message IDs.  I have -no- idea what to do about that.

Now, I could modularize this... but would I just be making a scanning
library for spammers?

-- 

Michael G Schwern                                           schwern@pobox.com
                    http://www.pobox.com/~schwern
     /(?:(?:(1)[.-]?)?\(?(\d{3})\)?[.-]?)?(\d{3})[.-]?(\d{4})(x\d+)?/i
