[Date Prev][Date Next][Thread Prev][Thread Next] [Search] [Date Index] [Thread Index]

Re: [FWP] R/E Question

To: Bill Jones <bill@fccj.org>
Subject: Re: [FWP] R/E Question
From: Michael G Schwern <schwern@pobox.com>
Date: Sat, 29 Jan 2000 22:23:03 -0500
Cc: fwp@technofile.org
In-Reply-To: <B4B8C836.AD2D%bill@fccj.org>; from bill@fccj.org on Sat, Jan 29, 2000 at 04:39:18PM -0500
References: <B4B3486F.A88A%bill@fccj.org> <B4B8C836.AD2D%bill@fccj.org>

On Sat, Jan 29, 2000 at 04:39:18PM -0500, Bill Jones wrote:
> ([Ee]-?mail:)?[a-zA-Z]+([._][a-zA-Z]+)*@[^ ]+
> 
> (ftp://|http://)[^ ]+

I presume you're trying to discover URLs in arbitrary text?  I have
this bit of code which has congealed over the past couple of years.
The goal:  Find -all- possible URLs (not just common ones) in large
blocks of arbitrary text quickly.  Also attempt to find schemeless
URLs.

The solution:  Use a (relatively) simple regex to scan for a superset
of all URLs, then toss what we find through URI::URL to determine if
it is in fact a URL.  There's also some heuristical whatnots in there
to clean things up a little.

This specific routine, morphurls() is primarily for finding URLs in
text and wrapping them in an HTML link, but the basic technique can be
adapted to do other things.  I've used something very similar to scan
large amounts of text (such as the Perl documentation) for URLs and
checking to make sure they're still valid.

Its not the prettiest thing in the universe, but its -almost- 100%.
It might pick up a few silly things, especially when scanning text
talking about File::Find and HTTP::Daemon.  It might consider those
strings to be URLs.  However, AFAIK it will never miss a valid URL.

require URI;

# From RFC 2396
use constant ALPHA_SET         	=> 'A-Za-z';
use constant DIGIT_SET         	=> '0-9';
use constant ALPHA_NUM_SET     	=> ALPHA_SET . DIGIT_SET;
use constant SCHEME_SET        	=> ALPHA_NUM_SET . '\+\-\.';
use constant RESERVED_SET      	=> $URI::reserved;
# We want to escape &, ; and =, so remove them from the reserved set
use constant RESERVED_SET_CHEAT => '\/\?\:\@\+\$\,';
use constant ESCAPED_SET       	=> '\%' . DIGIT_SET . 'A-Fa-f';
use constant MARK_SET          	=> $URI::mark;
use constant UNRESERVED_SET    	=> ALPHA_NUM_SET . MARK_SET;
use constant URIC_NO_SLASH_SET 	=> UNRESERVED_SET . ESCAPED_SET .
                                    '\;\?\:\@\&\=\+\$\,';
use constant URIC_SET          	=> RESERVED_SET_CHEAT . UNRESERVED_SET . 
                                   ESCAPED_SET;
use constant SCHEME_SET_RE		=> qr/$URI::scheme_re/;

my($schemeRe) = SCHEME_SET_RE;
my($uricSet)  = URIC_SET;
# Look for schemed URLs or simply www.
my($uriRe)    = qr/(?:$schemeRe:|www\.)[$uricSet]+/;

# $num_urlsmorphed = morphurls(\$text, \&callback);
sub morphurls {
    my($text, $callback) = @_;

    my $urlsfound = 0;

    # Don't assume http.
    URI::URL::strict(1);

    # Yes, evil.  Basically, look for something vaguely resembling a URL,
    # then hand it off to URI::URL for examination.  If it passes, throw
    # it to a callback and put the result in its place.
    local $SIG{__DIE__} = 'DEFAULT';
    my $uri_cand;
    my $uri;
    $$text =~ s/($uriRe)/
      $uri_cand = $1;

      # A heruristic.  Often you'll see things like:
      # "I saw this site, http:\/\/www.foo.com, and its really neat!"
      # or "Foo Industries (at http:\/\/www.foo.com)"
      # We want to avoid picking up the trailing paren, period or comma.
      # Of course, this might wreck a perfectly valid URI, more often than
      # not it corrects a parse mistake.
      $uri_cand =~ s|[\)\,\.]+$||;

      # Another cheat.  Add http:\/\/ to schemeless URIs that start with www.
      $uri = $uri_cand;
      $uri =~ s|^www\.|http:\/\/www\.|;

      eval {
	  $uri = URI::URL->new($uri);
      };

      if($@ || !defined $uri) {	# leave everything untouched, its not a URI.
	  $1;
      }
      else {			# Its a URI, run th callback on it.
	  $urlsfound++;
	  $callback->($uri_cand, $uri);
      }
    /egx;

    return $urlsfound;
}

Similar principles might be applied to finding email addresses in
text, but the problem is that almost -anything- can be considered a
valid RFC 822 address.  If you can settle for a subset, this regex
isn't too horrible.

        qr/[\w\.\*\-\+]+\@(?:[\w.-]+\.)+[\w-]+/;

It has the virtues of being able to pick up things like:

        msdt+@andrew.cmu.edu
        *@qz.to            (Yes, this is perfectly valid)
        joe12345@aol.com
        Yarrow-Hock@foo.com

But there's a whole class of things it will miss.  Fortunately,
they're relatively uncommon.

-- 

Michael G Schwern                                           schwern@pobox.com
                    http://www.pobox.com/~schwern
     /(?:(?:(1)[.-]?)?\(?(\d{3})\)?[.-]?)?(\d{3})[.-]?(\d{4})(x\d+)?/i

==== Want to unsubscribe from Fun With Perl?  Well, if you insist...
==== Send email to <fwp-request@technofile.org> with message _body_
====   unsubscribe

Follow-Ups:
- Re: [FWP] R/E Question
  - From: Adam Sampson <azz@gnu.org>
- Re: [FWP] R/E Question
  - From: Uri Guttman <uri@sysarch.com>
- Re: [FWP] R/E Question
  - From: Jeff Pinyan <jeffp@crusoe.net>

References:
- Re: [FWP] Small fun with Perl
  - From: Bill Jones <bill@fccj.org>
- [FWP] R/E Question
  - From: Bill Jones <bill@fccj.org>

Prev by Date: Re: [FWP] R/E Question
Next by Date: Re: [FWP] R/E Question
Prev by thread: Re: [FWP] R/E Question
Next by thread: Re: [FWP] R/E Question
Navigation: Date Index | Thread Index | Search | Other lists at bumppo.net