
Re: [MacPerl] HTML link checker?



>On Mon, 10 Feb 1997, Chris Hammond-Thrasher wrote:
>
>> Vicki's request reminds me of something that I really need in the next week
>> or so, a Perl/MacPerl script that goes beyond a link checker to acting as a
>> simple web robot. What I need is an app that follows all the links on a
>> single web page and locally saves all of the linked documents. Any help
>> would be appreciated.
>
>Pick up the "Web Tondeuse" program from info-mac. It's a Java program that
>can be run with the Mac Java runtime. It works great for this kind of thing.
>
> --- Joe M.

I'm just writing a simple script that takes a file containing a list of URLs and
downloads them all to the hard disk. It also creates all the necessary
directories and fetches all local files and images referenced by the requested
document. Unfortunately, I'm still working on the OS-independent directory
creation, which may take a couple of days.
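
The idea is roughly the following; this is only a minimal sketch, assuming
File::Spec and File::Path are available, and the helper name is made up for
illustration (the new version doesn't necessarily look like this):

use File::Spec;
use File::Path;
use URI::URL;

# Build and create a local directory for a URL from its host name and
# path segments, joined with whatever separator the local OS uses.
sub local_dir_for {
        my( $url ) = @_;
        my @segments = split m{/}, $url->path;   # e.g. ('', 'docs', 'img', 'logo.gif')
        pop @segments;                            # drop the file name itself
        shift @segments if @segments and $segments[0] eq '';
        my $dir = File::Spec->catdir( $url->host, @segments );
        mkpath( $dir ) unless -d $dir;            # creates intermediate dirs too
        return $dir;
}

# e.g. local_dir_for( new URI::URL 'http://some.host/docs/img/logo.gif' )
# creates and returns the local equivalent of some.host/docs/img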

I've included an older version below; it may help you even though it's not
exactly what you need. Email me if you're interested in the new version
(it's free, of course).

Greetings,
            Erich Rast.

---------------------------------------
h0444zkf@rz.hu-berlin.de
http://www2.rz.hu-berlin.de/~h0444zkf/
---------------------------------------

#! /usr/bin/perl

use LWP::UserAgent;
use HTTP::Request;
use File::Basename;
use URI::URL;


=head1 NAME

B<Blowjob> - a simple http sucker

=head1 SYNOPSIS

I<Usage:> blowjob [<filename>]

        <filename> is a file containing a list of URLs, one per line.
        The default <filename> is 'blowjob.job'.

=head1 DESCRIPTION

Reads a number of HTTP documents from web servers into a local
default folder. The URLs to read are listed in the input file given
on the command line, one URL per line; lines beginning with # are
ignored. Requires the libwww-perl 5.03 library.

=cut
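
# An example job file looks like this -- one URL per line, lines starting
# with '#' are skipped (the URLs here are placeholders only):
#
#       # pages to fetch
#       http://www.foo.org/index.html
#       http://www.foo.org/docs/faq.html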

### startup

$version = "0.23";
print "Blowjob/$version (c) 1996 by E. Rast\n\n";

### filename and path & misc

# default input file; may be overridden on the command line
$file = 'blowjob.job';

if ( $#ARGV > 0 ) { die "Too many arguments.\n"; }
if ( @ARGV )      { $file = $ARGV[0]; }



# split the input file name into its base name and path
($infile, $inpath, $suffix) = fileparse( $file, '' );

### create user agent

$ua = new LWP::UserAgent;
$ua->agent("Blowjob/$version");

### main loop

open( IN, "$inpath$infile" ) || die "Cannot open jobfile '$file': $!\n";
binmode IN;

JOB:
while ( $job = <IN> ) {
        next JOB if ( $job =~ /^#/ );   # skip comment lines
        $job =~ s/\s+$//;               # strip newline and trailing whitespace
        next JOB unless length $job;    # skip blank lines
        $url = new URI::URL $job;
        if ( $url->host ) {
                $count++;
                $filename = "job-$count.html";
                $res = &get_url( $url, $filename );
                if ( $res->is_success ) {
                        print "Job-$count read: <$job>\n";
                } else {
                        print "Couldn't get <$job> as '$filename'\n";
                }
        } else {
                print "Not a valid URL: '$job'\n";
        }
}

close IN;
print "Done.\n";

### get a url

sub get_url {
        my( $url, $file ) = @_;
        my $req = new HTTP::Request GET => $url;
        # the second argument makes LWP save the response body to $file
        my $res = $ua->request( $req, $file );
        return $res;
}