
Re: [FWP] R/E Question

To: Bill Jones <bill@fccj.org>
Subject: Re: [FWP] R/E Question
From: J Robert Ray <jrray@home.com>
Date: 31 Jan 2000 05:00:34 -0800
Cc: <fwp@technofile.org>
In-Reply-To: Bill Jones's message of "Sat, 29 Jan 2000 16:39:18 -0500"
References: <B4B8C836.AD2D%bill@fccj.org>

>>>>> "Bill" == Bill Jones <bill@fccj.org> writes:

    Bill> (ftp://|http://)[^ ]+
    Bill> Now for the question: Your thoughts?

For my IRC bot, I scan for things that resemble URLs, without being
too strict.  Since I'm not dealing with a large number of URLs at
once, I shoot off a HEAD request to each URL candidate I find, and
list it only if I get a valid response from the server.
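
A minimal sketch of the HEAD check described above. The post does not
say which HTTP client the bot uses; the core HTTP::Tiny module is an
assumption made here for illustration, and url_is_live is a
hypothetical name. The optional second argument accepts any object
with a compatible head method.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: true if a HEAD request to $url succeeds.
# $ua is optional; any object whose head($url) method returns a
# hashref with a 'success' key (as HTTP::Tiny does) will work.
sub url_is_live {
    my ( $url, $ua ) = @_;
    $ua ||= HTTP::Tiny->new( timeout => 5 );   # 5 s timeout is an assumption

    my $res = $ua->head($url);
    return $res->{success} ? 1 : 0;            # success means a 2xx status
}
```

A candidate would then be listed only when url_is_live($candidate)
returns true. Candidates without a scheme (www.example.com) would need
one added before the request.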

Here is the main URL sniffing code:

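  # In a single-quoted string '\\^' collapses to \^, which excludes only
  # the caret; '\\\\^' excludes both the backslash and the caret.
  my $validchar = '[^\s<>"#{}|\\\\^\[\]\`@,]';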

  for ( split ( /\s+/, $msg ))
  {
    next unless tr/.//;  # What's a URL without dots?

    if ( m{
      (
       (?:https?|ftp)://$validchar+
       | (?:ftp|www)$validchar*\.$validchar+
       | $validchar+[^.]\.(?:com|net|edu|org|us|uk|ca|de|se|au|jp|no|fr|nl|dk|tw)/?$validchar*
       )
        }xoi )
    {
      my $URL = $1;
      found_url ( $nick, $host, $chan, $URL );
    }
  }

Things of note:

Finding http://, https:// or ftp:// is a sure sign someone is giving a URL.
Otherwise, a word starting with ftp or www smells like a URL.
Looking for common TLDs can find URLs like sneaky.ca; obviously my
list isn't complete.
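
The three cases above can be tried on a sample message with a sketch
like the following. The helper name sniff_urls, the shortened TLD
list, and the sample text are assumptions for illustration; qr// is
used in place of the /o flag.

```perl
use strict;
use warnings;

# Character class modelled on the one in the post; the backslash is
# doubled so that both \ and ^ are excluded.
my $validchar = '[^\s<>"#{}|\\\\^\[\]\`@,]';

my $url_re = qr{
    (
      (?:https?|ftp)://$validchar+
      | (?:ftp|www)$validchar*\.$validchar+
      | $validchar+[^.]\.(?:com|net|org|ca)/?$validchar*
    )
}xi;

# Hypothetical helper: return every URL-like token found in a message.
sub sniff_urls {
    my ($msg) = @_;
    my @found;
    for my $word ( split /\s+/, $msg ) {
        next unless $word =~ tr/.//;          # no dots, not a URL
        push @found, $1 if $word =~ $url_re;
    }
    return @found;
}

my @urls = sniff_urls(
    'see http://www.perl.com/ or www.cpan.org and sneaky.ca but not foo.bar'
);
print "$_\n" for @urls;
```

Here http://www.perl.com/ is caught by the scheme branch, www.cpan.org
by the ftp/www branch, and sneaky.ca by the TLD branch; foo.bar has a
dot but matches none of them.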

found_url does a little more cleaning with the URL:

sub found_url ( $$$$ )
{
  my ( $nick, $host, $chan, $url ) = @_;

  $url =~ s/^[.,()<>{}?!]+//;  # Strip extra characters from the front of the url -CGH
  $url =~ s/[.,()<>{}?!]+$//;  # Ditto back.
  $url = uf_uristr( $url );    # needs: use URI::Heuristic qw(uf_uristr);
  ...
}
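
To see what the two substitutions do in isolation, here is a small
sketch. trim_url and guess_scheme are hypothetical names, and
guess_scheme is only a simplified stand-in for uf_uristr (it is not
the URI::Heuristic algorithm), so the example runs without that
module.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the two substitutions in found_url.
sub trim_url {
    my ($url) = @_;
    $url =~ s/^[.,()<>{}?!]+//;   # leading punctuation
    $url =~ s/[.,()<>{}?!]+$//;   # trailing punctuation
    return $url;
}

# Simplified stand-in for uf_uristr (an assumption, not the real
# heuristic): add a scheme only when the candidate lacks one.
sub guess_scheme {
    my ($url) = @_;
    return $url if $url =~ m{^[a-z][a-z0-9+.-]*://}i;
    return ( $url =~ /^ftp\./i ? 'ftp://' : 'http://' ) . $url;
}

for my $raw ( '(www.perl.com).', 'ftp.cpan.org,', 'http://sneaky.ca/?' ) {
    my $clean = guess_scheme( trim_url($raw) );
    print "$raw -> $clean\n";
}
```

Note that the trailing strip also removes a final ? or ) that was part
of the URL itself, which is usually harmless for chat text but worth
keeping in mind.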

This works pretty well for my own purposes.  I'd love to hear
suggestions on how to make this better.

Thanks,

- Robert

