
[MacPerl-WebCGI] Paying the cookie troll



Hi Folks,

APOLOGIES if this is a dup post. I sent it originally during our recent "down time" and haven't seen it or responses, so am resending now...


I've been writing perl in a lib-less unix perl 4 world for over six years and only recently moved to a Mac. When I downloaded MacPerl, I was stunned by the huge number of modules, libraries, etc. I realize there is probably an easy answer to my question that I'm just not recognizing. I really have tried to find it (and perl.src is its own channel in my Sherlock 2), but, as I said, I probably wouldn't recognize the solution if I stumbled on it.

I'm trying to find an automatic way of pulling records out of:
http://gpo.osti.gov:901/dds/advanced.html
Ideally, I need perl to run about 500 queries, parse the results to extract the 10000-20000 relevant records, retrieve each of those URLs, and manipulate that text into a tab-delimited format that can be read into a FileMaker database. YES, I have confirmed that there aren't any copyright issues with this.
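(For concreteness, the very last step -- the tab-delimited file FileMaker reads -- should presumably be nothing more than a join("\t") per record. The field names below are invented for illustration; the real ones will depend on what the record pages contain.)

--Sketch (untested)--
# Write one tab-delimited line per record for import into FileMaker.
# @records would come from the parsing step; one made-up record here so this runs standalone.
my @records = (
    { title => "Sample report", author => "Doe, J.", report_number => "XX-0000", url => "http://example.invalid/1" },
);
open(OUT, ">records.tab") or die "can't write records.tab: $!";
foreach my $rec (@records) {
    print OUT join("\t", @$rec{qw(title author report_number url)}), "\n";
}
close OUT;
--End sketch--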

If you visit the query link (http://gpo.osti.gov:901/dds/advanced.html), you'll realize that any search is greeted with a "you must allow cookies" error, even if you DO allow cookies:

Error: You must allow cookies to be set in your browser to use this web site!<P>
Please set your browser accordingly and return <a href="http://www.doe.gov/bridge">home</a> to correct this problem

Clicking on the "Home" button (or listed link) and then returning to the advanced search clears this up (but I don't know how to click on buttons from within perl).

The URL sequence would be (a rough LWP translation follows the list):
http://gpo.osti.gov:901/cgi-bin/entry.pl
http://gpo.osti.gov:901/dds/advanced.html
type a random word into the top field
click Search -> http://gpo.osti.gov:901/cgi-bin/startsearch
Results contain up to 50 links like:
http://gpo.osti.gov:901/cgi-bin/displaybib?QS=735~7fTW~3b:0~3b~3e~2b3&bcode=6~3f6~7fH~3d5392743&bib=3706ymlglz5~3c72553244&wrap=373~3bymlglz5~3c72552431&file=373~3bymlglz5~3c72552406
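Here is my rough, completely untested guess at that sequence translated into LWP calls. I'm assuming HTTP::Cookies (part of libwww-perl) is the right tool, and the form field name 'query' is a placeholder, since I haven't dug the real field names out of advanced.html:

--Sketch (untested)--
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common qw(GET POST);

my $ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new);      # remember whatever cookie entry.pl sets

# 1. Visit entry.pl so the server can set its cookie
my $resp = $ua->request(GET 'http://gpo.osti.gov:901/cgi-bin/entry.pl');
die "entry.pl: ", $resp->status_line unless $resp->is_success;

# 2. Submit the search form ('query' is a made-up field name)
$resp = $ua->request(POST 'http://gpo.osti.gov:901/cgi-bin/startsearch',
                     [ query => 'neutron' ]);
die "startsearch: ", $resp->status_line unless $resp->is_success;

# 3. The content should now be the results page with up to 50 displaybib links
print $resp->content;
--End sketch--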


I manually generated an outfile from a search and have written code to "screen scrape it" (parsing it to extract the URL for each reference). When I try to retrieve these files via perl (i.e., "&getstore ($pdf,$pdffile) or die;"), I encounter the cookie error.
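The scraping side is roughly the following; the pattern just matches what I see in my saved results page, so take it as a sketch rather than gospel:

--Sketch--
# Pull every displaybib URL out of a saved results page
my $infile = "results.html";    # the manually saved search output (adjust)
open(IN, $infile) or die "can't open $infile: $!";
my @urls;
while (<IN>) {
    # collect href values that point at the displaybib CGI
    while ( m{href="(http://gpo\.osti\.gov:901/cgi-bin/displaybib\?[^"]+)"}gi ) {
        push @urls, $1;
    }
}
close IN;
print scalar(@urls), " record links found\n";
--End sketch--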

I've read everything I can find on cookie setting, but it all seems to be approached from the webmaster's perspective.
When I run the following code snippet, I *see* a cookie, but I haven't a clue what to do with it. And each "pdf" I "retrieve" is merely a two-line text file containing the cookie error noted above.

--Some code snippets--

require "GUSI.ph";
use LWP::Simple;
use LWP::UserAgent;

local $infile = "";
# Put up a file-picker dialog for the saved results file
$infile = MacPerl::Choose( &GUSI::AF_FILE, 0, "", &GUSI::pack_sa_constr_file("OBJ ", "TEXT")) unless -f "$infile";

#You have to visit this page to set the cookie
$ua = new LWP::UserAgent;   # create the user agent first
$request = new HTTP::Request( 'GET', 'http://gpo.osti.gov:901/cgi-bin/entry.pl' );
$response = $ua->request($request, 'tmp');   # second argument saves the response body to the file 'tmp'
print $response->as_string(), "\n"; #A cookie has been set somewhere

#Try to retrieve some files:
#The URLs parsed from query results look like garbage but pasting this into
#the location field does actually download a real pdf or bring up a real record.
$pdf = 'http://gpo.osti.gov:901/cgi-bin/dds_upload.pl?doc=6546ytcwa~3f(lwn`aj|-~3d1C,D~3b2635552~2bO:~3c264132-u`d~23pkg4$s~7frb~3b~25gg~3f';

is_success( getstore($pdf, 'testpdf') ) or die "couldn't fetch $pdf";   # getstore() returns a status code, so test it with is_success()
$table = 'http://gpo.osti.gov:901/cgi-bin/displaybib?QS=735~7fTW~3b:0~3b~3e~2b3&bcode=6~3f6~7fH~3d5392743&bib=3706ymlglz5~3c72553244&wrap=373~3bymlglz5~3c72552431&file=373~3bymlglz5~3c72552406';
is_success( getstore($table, 'testtable') ) or die "couldn't fetch $table";
--End code snippets--
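My current best guess at what's biting me: getstore() comes from LWP::Simple, which uses its own internal user agent, so even though $ua saw a Set-Cookie header, the getstore() calls never present the cookie. I *think* the cure is to hang an HTTP::Cookies jar on $ua and fetch the files through that same $ua -- an untested sketch:

--Sketch (untested)--
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new);   # the jar replays Set-Cookie headers on later requests

# Pay the troll: visit entry.pl once so the cookie lands in the jar
my $resp = $ua->request(HTTP::Request->new(GET => 'http://gpo.osti.gov:901/cgi-bin/entry.pl'));
die "entry.pl: ", $resp->status_line unless $resp->is_success;

# Fetch the PDF through the SAME $ua; the second argument saves the body to a file
my $pdf = 'http://gpo.osti.gov:901/cgi-bin/dds_upload.pl?doc=...';   # real doc= string goes here
$resp = $ua->request(HTTP::Request->new(GET => $pdf), 'testpdf');
die "pdf: ", $resp->status_line unless $resp->is_success;
--End sketch--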

I read a bit about *sharing* cookies with Netscape, and would be willing to set up a kludge in which I need to click on the #!$*&#$& Home button before I run the script (though I'd certainly prefer a more elegant solution).
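(If riding on Netscape's cookie really is the path of least resistance, libwww-perl has an HTTP::Cookies::Netscape class for reading a Netscape-format cookie file. Whether it understands the Mac "MagicCookie" file, and where that file lives, are assumptions on my part:)

--Sketch (untested)--
use LWP::UserAgent;
use HTTP::Cookies;
eval { require HTTP::Cookies::Netscape };   # newer libwww-perl ships this as its own file

# GUESS at where classic Mac OS Netscape keeps its cookie file -- adjust to taste
my $cookie_file = "Macintosh HD:System Folder:Preferences:Netscape Users:Default:MagicCookie";

my $ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies::Netscape->new(file => $cookie_file));
# Requests through $ua should now carry whatever cookie Netscape already negotiated
--End sketch--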

The secondary question is: once I solve this cookie problem, how do I dynamically generate/post the form containing my query content and retrieve the results, so that I don't have to run all 500 queries manually?
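My working assumption there is HTTP::Request::Common's POST(), driven by a list of search terms. Again, 'query' is a stand-in for whatever advanced.html really names its fields, and searchterms.txt is a hypothetical one-term-per-line file:

--Sketch (untested)--
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common qw(GET POST);

my $ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new);
$ua->request(GET 'http://gpo.osti.gov:901/cgi-bin/entry.pl');   # pay the cookie troll once per run

open(TERMS, "searchterms.txt") or die "no search-term list: $!";
while (my $term = <TERMS>) {
    chomp $term;
    my $resp = $ua->request(POST 'http://gpo.osti.gov:901/cgi-bin/startsearch',
                            [ query => $term ]);          # 'query' is a placeholder field name
    unless ($resp->is_success) { warn "search for '$term' failed\n"; next; }
    # ...extract the displaybib links from $resp->content and fetch each one here...
}
close TERMS;
--End sketch--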



And a mac-web-but-non-perl question:
When I download these PDFs (thus far manually), StuffIt Expander 5.5 opens and asks me where I want to store something that it never creates. The file type is "TEXT" and the creator is "SITx". I can manually open them in Acrobat and then they "learn" what they really are--PDF files. I wrote a script to fix the file types (sketched below), so I'm not too worried about it, BUT it seems to me that there's something wrong. Netscape's garbage collector does a wonderful job of deleting all of the newly downloaded files when you quit, so forgetting to run the script (which also moves the files) can be costly. My guess is that the MIME types aren't being set correctly on the server. Is this an artifact of my having StuffIt Expander as my default helper? Does anyone else out there encounter this same behavior? When I asked the GPO web support folk about it, they said that their site doesn't really support Macs. Humphhhh. My Mac sensibilities are a bit offended....
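(For what it's worth, the fix-the-filetypes script boils down to one MacPerl::SetFileInfo call per file; 'CARO'/'PDF ' should be Acrobat's creator and type codes, though double-check those before relying on them:)

--Sketch--
#!perl -w
# Re-stamp downloaded files as Acrobat PDFs so the Finder and Acrobat recognize them.
# 'CARO' is Acrobat's creator code, 'PDF ' (trailing space) the file type.
foreach my $file (@ARGV) {
    MacPerl::SetFileInfo("CARO", "PDF ", $file);
    my ($creator, $type) = MacPerl::GetFileInfo($file);
    print "$file is now $creator/$type\n";
}
--End sketch--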

Any ideas????
Thanks in advance for any suggestions.
---Shelly
--

Shelly Spearing
Systems Engineer
ESS Program Office, Accelerator Transmutation of Waste Project
shellys@lanl.gov, MS K575, 505-665-0587