[MacPerl-WebCGI] Paying the cookie troll
Hi Folks,
APOLOGIES if this is a dup post. I sent it originally during our
recent "down time" and haven't seen it or responses, so am
resending now...
I've been writing perl in a lib-less unix perl 4 world for over
six years and only recently moved to a Mac. When I downloaded
MacPerl, I was stunned by the huge number of modules, libraries, etc.
I have realized that there is probably an easy answer to my question
that I'm just not recognizing. I really have tried to find it (and
perl.src is its own channel in my Sherlock2), but, as I said, I
probably wouldn't recognize the solution if I stumbled on it.
I'm trying to find an automatic way of pulling records out of:
http://gpo.osti.gov:901/dds/advanced.html
Ideally, I need perl to run about 500 queries, parse the results
to extract the 10000-20000 relevant records, retrieve each of those
URLs, and manipulate that text into a tab-delimited format that can
be read into a FileMaker database. YES, I have confirmed that there
aren't any copyright issues with this.
If you visit the query link
(http://gpo.osti.gov:901/dds/advanced.html), you'll see that any
search is greeted with a "you must allow cookies" error, even if you
DO allow cookies:
Error: You must allow cookies to be set in your browser to use this
web site!<P>
Please set your browser accordingly and return <a
href="http://www.doe.gov/bridge">home</a> to correct this problem
Clicking on the "Home" button (or listed link) and
then returning to the advanced search clears this up (but I don't
know how to click on buttons from within perl).
The URL sequence is:
http://gpo.osti.gov:901/cgi-bin/entry.pl
http://gpo.osti.gov:901/dds/advanced.html
type a random word into top field
click search -> http://gpo.osti.gov:901/cgi-bin/startsearch
The results contain at most 50 links like:
http://gpo.osti.gov:901/cgi-bin/displaybib?QS=735~7fTW~3b:0~3b~3e~2b3&bcode=6~3f6~7fH~3d5392743&bib=3706ymlglz5~3c72553244&wrap=373~3bymlglz5~3c72552431&file=373~3bymlglz5~3c72552406
I manually generated an output file from a search and have written
code to "screen scrape" it, parsing out the URL for each reference.
When I try to retrieve those files via perl
(i.e., "&getstore($pdf, $pdffile) or die;"), I encounter the cookie
error.
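(For what it's worth, the extraction itself is roughly the sketch
below; it assumes every reference in the saved results page appears as
an href pointing at cgi-bin/displaybib, like the example above.)

open( RESULTS, $infile ) or die "can't open $infile: $!";
my @refs;
while (<RESULTS>) {
    # collect every displaybib link on the line
    push @refs, $1
        while m{href="(http://gpo\.osti\.gov:901/cgi-bin/displaybib\?[^"]+)"}gi;
}
close RESULTS;
print scalar(@refs), " references found\n";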
I've read everything I can find on cookie setting, but it all
seems to be approached from the webmaster perspective.
When I run the following code snippet, I *see* a cookie, but I
haven't a clue what to do with it. And each "pdf" I
"retrieve" is merely a two line text file containing the
cookie error noted above.
--Some code snippets--
require "GUSI.ph";
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;

local $infile = "";
$infile = MacPerl::Choose( &GUSI::AF_FILE, 0, "",
    &GUSI::pack_sa_constr_file("OBJ ", "TEXT"))
    unless -f "$infile";

# You have to visit this page to set the cookie
$ua = new LWP::UserAgent;               # create the user agent that makes all the requests
$request = new HTTP::Request( 'GET',
    'http://gpo.osti.gov:901/cgi-bin/entry.pl' );
$response = $ua->request($request, 'tmp');
print $response->as_string(), "\n";     # a cookie has been set somewhere

# Try to retrieve some files.
# The URLs parsed from the query results look like garbage, but pasting
# one into the location field does download a real pdf or bring up a
# real record.
$pdf = 'http://gpo.osti.gov:901/cgi-bin/dds_upload.pl?doc=6546ytcwa~3f(lwn`aj|-~3d1C,D~3b2635552~2bO:~3c264132-u`d~23pkg4$s~7frb~3b~25gg~3f';
getstore($pdf, 'testpdf') or die;       # note: getstore() returns the HTTP status code, so this "or die" isn't a reliable error check
$table = 'http://gpo.osti.gov:901/cgi-bin/displaybib?QS=735~7fTW~3b:0~3b~3e~2b3&bcode=6~3f6~7fH~3d5392743&bib=3706ymlglz5~3c72553244&wrap=373~3bymlglz5~3c72552431&file=373~3bymlglz5~3c72552406';
getstore($table, 'testtable') or die;
--End code snippets--
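What I *think* I need is for every request to go through one user
agent that holds the cookie. Below is a minimal sketch along those
lines, assuming the site sets an ordinary HTTP cookie when entry.pl is
visited; LWP::Simple's getstore() knows nothing about cookies, so it
can't be used for the downloads:

use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request;

$ua = new LWP::UserAgent;
$ua->cookie_jar( new HTTP::Cookies );   # in-memory jar, replayed on every request to the site

# Visit entry.pl so the server can hand out its cookie.
$response = $ua->request( new HTTP::Request( 'GET',
    'http://gpo.osti.gov:901/cgi-bin/entry.pl' ) );
die "entry.pl failed: ", $response->status_line unless $response->is_success;

# Fetch a document through the same user agent; the second argument to
# request() saves the body to a file, much like getstore() would.
$response = $ua->request( new HTTP::Request( 'GET', $pdf ), 'testpdf' );
die "download failed: ", $response->status_line unless $response->is_success;

Is that the right direction, or am I missing something about how the
cookie gets set?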
I read a bit about *sharing* cookies with Netscape, and would be
willing to set up a kludge in which I click the #!$*&#$& Home button
before I run the script (though I'd certainly prefer a more elegant
solution).
The secondary question is: once I solve this cookie problem, how do I
dynamically generate/post the form containing my query content and
retrieve the results, so that I don't have to run all 500 queries
manually?
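My guess at the shape of the answer is something like the sketch
below, using HTTP::Request::Common through the same cookie-carrying
$ua as above. The field name 'searchterm' is purely a placeholder; the
real field names (and whether the form GETs or POSTs) would have to
come from the form in advanced.html:

use HTTP::Request::Common qw(POST);

# 'searchterm' is a made-up field name; the real ones come from the
# <form> in advanced.html.
$request  = POST 'http://gpo.osti.gov:901/cgi-bin/startsearch',
                 [ searchterm => 'neutron' ];
$response = $ua->request($request);
die "search failed: ", $response->status_line unless $response->is_success;
$results  = $response->content;   # hand this to the screen-scraping code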
And a mac-web-but-non-perl question:
When I download these PDFs (thus far manually), Expander 5.5
opens and asks me where I want to store something that it never
creates. The filetype is "TEXT" and the creator is
"SITx". I can manually open them in Acrobat, and then they
"learn" what they really are--PDF files. I wrote a script
to fix the filetypes, so I'm not too worried about it, BUT, it seems
to me that there's something wrong. Netscape's garbage collector does
a wonderful job of deleting all of the newly downloaded files when
you quit, so forgetting to run the script (which also moves the
files) can be costly. My guess is that it's the MIME types not being
set correctly on the server. Is this an artifact of my having Stuffit
Expander as my default helper? Does anyone else out there encounter
this same behavior? When I asked the GPO web support folk about it,
they said that their site doesn't really support Macs. Humphhhh. My
mac-sensibilities are a bit offended....
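(For completeness, my fix-up script boils down to roughly this;
'CARO' is Acrobat's creator code, and @downloaded stands in for the
list of freshly fetched file paths:)

# Stamp the downloaded files as PDFs owned by Acrobat so Expander and
# Netscape's cleanup leave them alone.
foreach $file (@downloaded) {
    MacPerl::SetFileInfo( 'CARO', 'PDF ', $file );
}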
Any ideas????
Thanks in advance for any suggestions.
---Shelly
--
Shelly Spearing
Systems Engineer
ESS Program Office, Accelerator Transmutation of Waste Project
shellys@lanl.gov, MS K575, 505-665-0587