On Fri, Jun 11, 1999 at 07:33:40AM +0000, John Carter wrote:
> Suppose you come across a web page with hundreds of pictures or mp3's
> that you would like to download. Why tediously go click-click like
> a Windows user when you can use Hrvoje Niksic's utility "wget" and perl?

I'd say, why drag perl into this at all?

    wget -r -l2 -t5 -L -A mp3,mpg3,mpeg3,mpg,mpeg http://whatever.com/mp3s/

Recursive web suck (2 levels down: the original file is 1, the mp3s
are 2), retry 5 times, follow only relative links, accept files with
the suffix "mp3", "mpg3", "mpeg3", etc...

Know thine toolset.

> A standard script fails since no two web authors link to their content
> in the same manner. A custom crafted one liner works every time...
>
> First wget the page you want... (wget is available as a debian
> GNU/Linux package)
>
> wget http://www.whatever.com/~someone
>
> If you don't have wget,
>
> perl -MLWP::Simple -e 'mirror( "http://www.whatever.com/~someone",
>   "index.html")'
>
> will do the same thing.

    GET http://www.whatever.com/~someone > index.html

...is another way. GET, POST and HEAD are neat little wrapper scripts
that come with LWP. Most people aren't aware of their existence. I
forget how I stumbled onto them.

> This will produce a file "index.html". Inspect index.html to
> work out the shape of the links you want to follow then
>
> perl -nle 'print "wget http://www.whatever.com/~someone/$1"
>   if /href="([^"\.]+\.mp3)"/' < index.html | bash -ex
>
> I tried doing this using the LWP::Simple module but it came out about
> the same length...
> perl -MLWP::Simple -ne 'mirror( "http://www.whatever.com/~someone/$1",$1)
>   if /href="([^\."]+\.mp3)"/' < index.html

Interesting, I wouldn't have thought of using mirror!

Perhaps a more reliable, but slightly more complex, way of doing this:

    use HTML::LinkExtor;
    use LWP::Simple;

    my $URI = shift;

    # Base URL that HTML::LinkExtor uses to make relative links absolute.
    my $URI_Base = $URI;

    # Comma separated list of suffixes to match against.
    my @Thingys = map {quotemeta} split /,\s*/, shift;

    sub getthingys {
        my($tag, %attribs) = @_;

        # I don't think we need to only use A tags, so don't bother
        # checking $tag.
        my $href;
        if(exists $attribs{href}) { $href = $attribs{href} }
        else                      { return }    # No link.

        if( grep { $href =~ /$_$/ } @Thingys ) {
            # Save under the last path component of the URL.
            my($file) = $href =~ m|/([^/]+)$|;

            # mirror() returns the HTTP status code, which is always
            # true, so test it with is_error() instead.
            my $rc = mirror($href, $file);
            warn "Can't get $href! (status $rc)\n" if is_error($rc);
        }
    }

    my $html = get($URI);
    die "Can't get $URI!\n" unless defined $html;
    HTML::LinkExtor->new(\&getthingys, $URI_Base)->parse($html);

This is pretty much generic, but it's still not as powerful/complete
as the wget line above. But I usually forget all the wget flags, so
something like this is nice to have lying in your $HOME/bin.

As long as I'm babbling about automated file extraction... a utility
I've always wanted was one that could iterate through a sequence of
URLs. Say the docs for your favorite program are ONLY available as
HTML from some web site. And to make matters worse, they're ONLY
available as SPLIT HTML files (doc01.html, doc02.html...). SUCK! So
you want something that, given a specially formatted URL, can just
iterate through the sequence, sucking down as it goes. AFAIK, no such
thing exists.

Well, here (this is adapted from MacPerl, so the way it takes
arguments is a little odd):

    use strict;
    use constant USAGE =>
        "USAGE: $0 <url> <begin #> <end #> [<dir> <referer> <user> <pass>]\n";

    die USAGE unless @ARGV >= 3;

    use File::Spec;

    # A URL formatted sprintf style with one %d.
    # Like: http://www.foo.com/docs/doc%02d.html to get docXX.html
    my($URL_fmt) = shift;
    my($Begin, $End) = (shift, shift);       # Number to start and end on.
    my($Dir) = shift || File::Spec->curdir;  # Directory to throw all this stuff.
                                             # Default is Cwd.
    my($Referer, $User, $Pass) = (shift, shift, shift);   # All optional.

    use LWP::UserAgent;
    my $ua = LWP::UserAgent->new;

    for my $num ($Begin..$End) {
        my $url = sprintf($URL_fmt, $num);

        print STDERR "Grabbing $url...";

        my $req = HTTP::Request->new('GET', $url);
        $req->referer($Referer) if defined $Referer;
        $req->authorization_basic($User, $Pass) if defined $User;

        # Local file name is the last path component of the URL.
        my($filename) = $url =~ m|/([^/]+)$|;

        # request() with a file name as its second argument writes the
        # response body straight to that file.
        my $res = $ua->request($req, File::Spec->catfile($Dir, $filename));

        if($res->is_success) { print STDERR "got it!"; }
        else                 { print STDERR "failed: ".$res->status_line; }
        print STDERR "\n";
    }

It needs a lot of work, but it's basically useful.

--
Michael G Schwern      schwern@pobox.com      http://www.pobox.com/~schwern
/(?:(?:(1)[.-]?)?\(?(\d{3})\)?[.-]?)?(\d{3})[.-]?(\d{4})(x\d+)?/i

==== Want to unsubscribe from this list? (Don't you love us anymore?)
==== Well, if you insist... Send mail with body "unsubscribe" to
==== fwp-request@technofile.org
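
P.S. Before pointing the sequence grabber at a real site, it may help
to see what a sprintf-style URL template expands to. The following is
only a rough dry-run sketch: expand_urls() and local_name() are
made-up helper names (not part of LWP), and the URL is the
illustrative one from the script's comment. It prints each URL and the
file name it would be saved under, and never touches the network.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expand a sprintf-style URL template (containing one %d) over an
# inclusive numeric range. Hypothetical helper, for illustration only.
sub expand_urls {
    my ($fmt, $begin, $end) = @_;
    return map { sprintf($fmt, $_) } $begin .. $end;
}

# Local file name: everything after the last slash of the URL.
# Also a hypothetical helper.
sub local_name {
    my ($url) = @_;
    my ($name) = $url =~ m|/([^/]+)$|;
    return $name;
}

# Dry run: show what would be fetched and where it would be saved.
for my $url (expand_urls('http://www.foo.com/docs/doc%02d.html', 1, 3)) {
    printf "%s -> %s\n", $url, local_name($url);
}
```

With %02d the numbers come out zero-padded (doc01.html, doc02.html,
doc03.html), which is the whole reason for using sprintf rather than
plain string concatenation.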