On Sat, Jan 29, 2000 at 04:39:18PM -0500, Bill Jones wrote: > ([Ee]-?mail:)?[a-zA-Z]+([._][a-zA-Z]+)*@[^ ]+ > > (ftp://|http://)[^ ]+ I presume you're trying to discover URLs in arbitrary text? I have this bit of code which has congealed over the past couple of years. The goal: Find -all- possible URLs (not just common ones) in large blocks of arbitrary text quickly. Also attempt to find schemeless URLs. The solution: Use a (relatively) simple regex to scan for a superset of all URLs, then toss what we find through URI::URL to determine if it is in fact a URL. There's also some heuristical whatnots in there to clean things up a little. This specific routine, morphurls() is primarily for finding URLs in text and wrapping them in an HTML link, but the basic technique can be adapted to do other things. I've used something very similar to scan large amounts of text (such as the Perl documentation) for URLs and checking to make sure they're still valid. Its not the prettiest thing in the universe, but its -almost- 100%. It might pick up a few silly things, especially when scanning text talking about File::Find and HTTP::Daemon. It might consider those strings to be URLs. However, AFAIK it will never miss a valid URL. require URI; # From RFC 2396 use constant ALPHA_SET => 'A-Za-z'; use constant DIGIT_SET => '0-9'; use constant ALPHA_NUM_SET => ALPHA_SET . DIGIT_SET; use constant SCHEME_SET => ALPHA_NUM_SET . '\+\-\.'; use constant RESERVED_SET => $URI::reserved; # We want to escape &, ; and =, so remove them from the reserved set use constant RESERVED_SET_CHEAT => '\/\?\:\@\+\$\,'; use constant ESCAPED_SET => '\%' . DIGIT_SET . 'A-Fa-f'; use constant MARK_SET => $URI::mark; use constant UNRESERVED_SET => ALPHA_NUM_SET . MARK_SET; use constant URIC_NO_SLASH_SET => UNRESERVED_SET . ESCAPED_SET . '\;\?\:\@\&\=\+\$\,'; use constant URIC_SET => RESERVED_SET_CHEAT . UNRESERVED_SET . ESCAPED_SET; use constant SCHEME_SET_RE => qr/$URI::scheme_re/; my($schemeRe) = SCHEME_SET_RE; my($uricSet) = URIC_SET; # Look for schemed URLs or simply www. my($uriRe) = qr/(?:$schemeRe:|www\.)[$uricSet]+/; # $num_urlsmorphed = morphurls(\$text, \&callback); sub morphurls { my($text, $callback) = @_; my $urlsfound = 0; # Don't assume http. URI::URL::strict(1); # Yes, evil. Basically, look for something vaguely resembling a URL, # then hand it off to URI::URL for examination. If it passes, throw # it to a callback and put the result in its place. local $SIG{__DIE__} = 'DEFAULT'; my $uri_cand; my $uri; $$text =~ s/($uriRe)/ $uri_cand = $1; # A heruristic. Often you'll see things like: # "I saw this site, http:\/\/www.foo.com, and its really neat!" # or "Foo Industries (at http:\/\/www.foo.com)" # We want to avoid picking up the trailing paren, period or comma. # Of course, this might wreck a perfectly valid URI, more often than # not it corrects a parse mistake. $uri_cand =~ s|[\)\,\.]+$||; # Another cheat. Add http:\/\/ to schemeless URIs that start with www. $uri = $uri_cand; $uri =~ s|^www\.|http:\/\/www\.|; eval { $uri = URI::URL->new($uri); }; if($@ || !defined $uri) { # leave everything untouched, its not a URI. $1; } else { # Its a URI, run th callback on it. $urlsfound++; $callback->($uri_cand, $uri); } /egx; return $urlsfound; } Similar principles might be applied to finding email addresses in text, but the problem is that almost -anything- can be considered a valid RFC 822 address. If you can settle for a subset, this regex isn't too horrible. qr/[\w\.\*\-\+]+\@(?:[\w.-]+\.)+[\w-]+/; It has the virtues of being able to pick up things like: msdt+@andrew.cmu.edu *@qz.to (Yes, this is perfectly valid) joe12345@aol.com Yarrow-Hock@foo.com But there's a whole class of things it will miss. Fortunately, they're relatively uncommon. -- Michael G Schwern schwern@pobox.com http://www.pobox.com/~schwern /(?:(?:(1)[.-]?)?\(?(\d{3})\)?[.-]?)?(\d{3})[.-]?(\d{4})(x\d+)?/i ==== Want to unsubscribe from Fun With Perl? Well, if you insist... ==== Send email to <fwp-request@technofile.org> with message _body_ ==== unsubscribe