Re: [Fun With Perl] index.html

To: John Carter <john@dwaf-hri.pwv.gov.za>
Subject: Re: [Fun With Perl] index.html
From: Michael G Schwern <schwern@pobox.com>
Date: Fri, 11 Jun 1999 14:30:28 -0400
Cc: fwp@technofile.org
In-Reply-To: <Pine.LNX.4.05.9906110618310.3447-100000@models.iwqs.pwv.gov.za>; from John Carter on Fri, Jun 11, 1999 at 07:33:40AM +0000
References: <Pine.LNX.4.05.9906110618310.3447-100000@models.iwqs.pwv.gov.za>

On Fri, Jun 11, 1999 at 07:33:40AM +0000, John Carter wrote:
> Suppose you come across a web page with hundreds of pictures or mp3's
> that you would like to download. Why tediously go click-click like
> a Windows user when you can use Hrvoje Niksic's utility "wget" and perl?

I'd say, why drag perl into this at all?

wget -r -l2 -t5 -L -A mp3,mpg3,mpeg3,mpg,mpeg http://whatever.com/mp3s/

Recursive web suck (2 levels down, the original file is 1, the mp3s
are 2), retry 5 times, follow only relative links, accept files with
the suffix "mp3", "mpg3", "mpeg3", etc...

Know thine toolset.

> A standard script fails since no two web authors link to their content
> in the same manner. A custom-crafted one-liner works every time...
> 
> First wget the page you want... (wget is available as a Debian
> GNU/Linux package)
> 
> wget http://www.whatever.com/~someone
> 
> If you don't have wget, 
> 
> perl -MLWP::Simple -e 'mirror( "http://www.whatever.com/~someone",
>      "index.html")'
> 
> will do the same thing.

GET http://www.whatever.com/~someone > index.html

...is another way.

GET, POST and HEAD are neat little command-line wrappers that come
with LWP (they are aliases for its lwp-request script).  Most people
aren't aware of their existence.  I forget how I stumbled onto them.

> This will produce a file "index.html". Inspect index.html to
> work out the shape of the links you want to follow then
> 
> perl -nle 'print "wget http://www.whatever.com/~someone/$1" 
>   if /href="([^"\.]+\.mp3)"/' < index.html | bash -ex
> 
> I tried doing this using the LWP::Simple module but it came out about
> the same length...
> perl -MLWP::Simple -ne 'mirror( "http://www.whatever.com/~someone/$1",$1) 
>   if /href="([^\."]+\.mp3)"/' < index.html

Interesting, I wouldn't have thought of using mirror!

Here is a perhaps more reliable, but slightly more complex, way of
doing this:

use HTML::LinkExtor;
use LWP::Simple;

my $URI = shift;
# HTML::LinkExtor resolves relative links against this base, so it
# has to be the URL of the page itself.
my $URI_Base = $URI;

# Comma separated list of suffixes to match against.
my @Thingys = map {quotemeta} split /,\s*/, shift;

sub getthingys {
    my($tag, %attribs) = @_;

    # I don't think we need to only use A tags, so don't bother
    # checking $tag.

    my $href;
    if(exists $attribs{href}) { $href = $attribs{href} }
    else { return } # No link.  (Can't "next" out of a sub.)

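    if( grep { $href =~ /$_$/ } @Thingys ) { 
        # Capture the whole last path component as the local file name.
        my($file) = $href =~ m|/([^/]+)$|;
        # mirror() returns the HTTP status code, which is always true,
        # so test it with is_error() (exported by LWP::Simple).
        warn "Can't get $href!"
            if !defined $file or is_error(mirror($href, $file));
    }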
    }
}

HTML::LinkExtor->new(\&getthingys, $URI_Base)->parse(get($URI));

This is pretty much generic, but it's still not as
powerful/complete as the wget line above.  But I usually forget all
the wget flags, so something like this is nice to have lying in your
$HOME/bin.
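
To show the suffix-matching step on its own, here is a minimal sketch
that needs no network access.  The helper names suffix_regex and
wanted_links and the sample links are made up for illustration.  Unlike
the script above, it assumes you want a literal dot before the suffix
and a case-insensitive match.

```perl
use strict;
use warnings;

# Build one anchored regex from a comma separated suffix list,
# so "mp3, mpg3" matches names ending in ".mp3" or ".mpg3".
sub suffix_regex {
    my ($list) = @_;
    my $alternation = join '|', map { quotemeta } split /,\s*/, $list;
    return qr/\.(?:$alternation)$/i;
}

# Return only the links that end in one of the wanted suffixes.
sub wanted_links {
    my ($list, @links) = @_;
    my $re = suffix_regex($list);
    return grep { $_ =~ $re } @links;
}

# Hypothetical links, for illustration only.
my @links = qw(
    http://whatever.com/mp3s/song.mp3
    http://whatever.com/mp3s/index.html
    http://whatever.com/mp3s/clip.MPEG
);
print "$_\n" for wanted_links('mp3, mpg, mpeg', @links);
```

Compiling the list into a single regex up front avoids looping over
every suffix for every link, which is what the grep in the script does.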

As long as I'm babbling about automated file extraction... a utility
I've always wanted was one that could iterate through a sequence of
URLs.  Say the docs for your favorite program are ONLY available as
HTML from some web site.  And to make matters worse, they're ONLY
available as SPLIT HTML files (doc01.html, doc02.html...).  SUCK!

So you want something that, given a specially formatted URL, can just
iterate through the sequence, sucking down as it goes.  AFAIK, no such
thing exists.

Well, here it is (this is adapted from MacPerl, so the way it takes
arguments is a little odd):

use strict;

use constant USAGE=> <<"EOU";
USAGE:  $0 <url> <begin #> <end #> [<dir> <referer> <user> <pass>]
EOU

die USAGE unless @ARGV >=3;

use File::Spec;

# A URL formatted sprintf style with one %d. 
# Like:  http://www.foo.com/docs/docs%02d.html to get docXX.html
my($URL_fmt) = shift;
my($Begin, $End) = (shift, shift);  # Number to start and end on. 
my($Dir) = shift || File::Spec->curdir;  # Directory to throw all this stuff.
                                          # Default is Cwd.
my($Referer, $User, $Pass) = @ARGV;  # Whatever optional arguments are left.

use LWP::UserAgent;
my $ua = LWP::UserAgent->new;

for my $num ($Begin..$End) {
  my $url = sprintf($URL_fmt, $num);
  print STDERR "Grabbing $url...";

  my $req = HTTP::Request->new('GET', $url);
  $req->referer($Referer) if defined $Referer;
  $req->authorization_basic($User, $Pass) if defined $User;

  my $filename;
  ($filename) = $url =~ m|/([^/]+)$|;
  # mirror() takes a URL, not a request object, so send the request
  # with request() and have it save the body to a file instead.
  my $res = $ua->request($req, File::Spec->catfile($Dir,$filename));
  if($res->is_success) {
    print STDERR "got it!";
  }
  else {
    print STDERR "failed:  ".$res->status_line;
  }
  print STDERR "\n";
}

It needs a lot of work, but it's basically useful.
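
A quick way to check a URL format before fetching anything is to
expand the sequence without touching the network.  This is only a
sketch: url_sequence is a hypothetical helper, not part of the script
above, and the example URL and the "docs" directory are placeholders.

```perl
use strict;
use warnings;
use File::Spec;

# Expand a sprintf-style URL format over a numeric range and pair each
# URL with the local path it would be saved to.
sub url_sequence {
    my ($fmt, $begin, $end, $dir) = @_;
    $dir = File::Spec->curdir unless defined $dir;
    my @pairs;
    for my $num ($begin .. $end) {
        my $url = sprintf $fmt, $num;
        # The local name is the last path component of the URL.
        my ($name) = $url =~ m|/([^/]+)$|
            or die "No file name in $url\n";
        push @pairs, [ $url, File::Spec->catfile($dir, $name) ];
    }
    return @pairs;
}

# Print "URL -> local path" for doc01.html through doc03.html.
printf "%s -> %s\n", @$_
    for url_sequence('http://www.foo.com/docs/doc%02d.html', 1, 3, 'docs');
```

Separating the expansion from the download makes it easy to eyeball
the list first, which helps when the zero padding in the format is a
guess.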

-- 

Michael G Schwern                                           schwern@pobox.com
                    http://www.pobox.com/~schwern
     /(?:(?:(1)[.-]?)?\(?(\d{3})\)?[.-]?)?(\d{3})[.-]?(\d{4})(x\d+)?/i


