
Re: [FWP] R/E Question



>>>>> "Bill" == Bill Jones <bill@fccj.org> writes:

    Bill> (ftp://|http://)[^ ]+
    Bill> Now for the question: Your thoughts?

For my IRC bot, I scan for things that resemble URLs, without being
too strict.  Since I'm not dealing with a large number of URLs at
once, I shoot off a HEAD request to each URL candidate I find, and
I list it only if I get a valid response from the server.
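The HEAD check described above could look roughly like the sketch below. It is not the bot's actual code: the helper name url_responds is made up for illustration, and it uses the core HTTP::Tiny module as a stand-in for whatever client the bot really uses. HTTP::Tiny only speaks http and https (https additionally needs IO::Socket::SSL), so ftp candidates would simply come back as failures here.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper, not the bot's actual code: true if a HEAD request succeeds.
# $http is optional and may be any object with an HTTP::Tiny-style head() method.
sub url_responds {
    my ( $url, $http ) = @_;
    $http ||= HTTP::Tiny->new( timeout => 5 );

    # Bare names such as www.example.com need a scheme before they can be fetched.
    $url = "http://$url" unless $url =~ m{^[a-z][a-z0-9+.-]*://}i;

    my $response = $http->head($url);    # hashref with a boolean 'success' key
    return $response->{success} ? 1 : 0;
}
```

A caller would then keep only the candidates for which url_responds( $candidate ) returns true. The short timeout is a guess at what suits an IRC bot that should not stall on a dead host.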

Here is the main URL sniffing code:

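  # Characters allowed in a URL candidate: anything except whitespace and
  # common delimiters or punctuation.
  my $validchar = '[^\s<>"#{}|\\^\[\]\`@,]';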

  for ( split ( /\s+/, $msg ))
  {
    next unless tr/.//;  # What's a URL without dots?

    if ( m{
      (
       # explicit scheme
       (?:https?|ftp)://$validchar+
       # ftp or www, then a dotted name
       | (?:ftp|www)$validchar*\.$validchar+
       # name ending in a common TLD, optional path
       | $validchar+[^.]\.(?:com|net|edu|org|us|uk|ca|de|se|au|jp|no|fr|nl|dk|tw)/?$validchar*
       )
        }xoi )
    {
      my $URL = $1;
      found_url ( $nick, $host, $chan, $URL );
    }
  }

Things of note:

Finding http://, https:// or ftp:// is a sure sign someone is giving a URL.
Otherwise, ftp or www followed by something dotted smells like a URL.
Looking for common TLDs can find bare names like sneaky.ca; obviously
my list isn't complete.
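To see which of the three rules catches a given word, here is a small sketch that splits the alternation into separate tests. The helper name classify_token and its labels are made up for illustration. It checks the rules in priority order, which is a simplification of the single combined regex (that one takes the leftmost match instead).

```perl
use strict;
use warnings;

my $validchar = '[^\s<>"#{}|\\^\[\]\`@,]';    # same class as the sniffing code

# Hypothetical helper: report which rule a token trips, or undef if none.
sub classify_token {
    my ($token) = @_;

    # Rule 1: an explicit scheme.
    return 'scheme' if $token =~ m{(?:https?|ftp)://$validchar+}i;

    # Rule 2: ftp or www followed by a dotted name.
    return 'prefix' if $token =~ m{(?:ftp|www)$validchar*\.$validchar+}i;

    # Rule 3: a name ending in one of the listed TLDs.
    return 'tld'
      if $token =~ m{$validchar+[^.]\.(?:com|net|edu|org|us|uk|ca|de|se|au|jp|no|fr|nl|dk|tw)/?$validchar*}i;

    return;
}
```

With this, http://example.com/page is classified by the scheme rule, www.example.org by the prefix rule, and sneaky.ca by the TLD rule, while a word like "hello." trips none of them.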

found_url does a little more cleaning with the URL:

use URI::Heuristic qw(uf_uristr);

sub found_url ( $$$$ )
{
  my ( $nick, $host, $chan, $url ) = @_;

  $url =~ s/^[.,()<>{}?!]+//;  # Strip extra characters from the front of the url -CGH
  $url =~ s/[.,()<>{}?!]+$//;  # Ditto back.
  $url = uf_uristr( $url );    # Expand shorthand such as www.foo.com into a full URL string
  ...
}
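As an illustration of what the two substitutions do to a URL that was typed inside a sentence, here is the stripping step on its own. The helper name trim_url_punctuation is made up for this sketch, and the uf_uristr step is left out so the example needs no non-core modules.

```perl
use strict;
use warnings;

# Hypothetical helper: the same two substitutions as in found_url, isolated.
sub trim_url_punctuation {
    my ($url) = @_;
    $url =~ s/^[.,()<>{}?!]+//;    # leading punctuation, e.g. an opening paren
    $url =~ s/[.,()<>{}?!]+$//;    # trailing punctuation, e.g. a sentence-ending dot
    return $url;
}

# A candidate as it might appear mid-sentence: "(www.example.com)."
print trim_url_punctuation('(www.example.com).'), "\n";
```

One trade-off of this approach is that a URL which legitimately ends in one of these characters, such as a closing parenthesis, gets trimmed as well.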

This works pretty well for my own purposes.  I'd love to hear
suggestions on how to make this better.

Thanks,

- Robert
