On Fri, 19 Nov 1999 18:35:04 GMT, bart.lateur@skynet.be (Bart Lateur) wrote:

>On Thu, 18 Nov 1999 01:34:09 -0600 (CST), Matthew Langford wrote:
>
>>To get the whole deal I had to use more than 55 MB of memory. (I think
>>this is what I came back to; for a while I had to crank out some virtual
>>memory and set it higher.) The hash, after I finally got it, takes up 4.6
>>MB. If I read in the file, I don't have too many memory problems.
>>Clearly, there are serious memory leaks here in the "getting" stage.
>
>That doesn't prove a thing. I tried HTML::TreeBuilder on a 15k HTML
>file, on a PC (DOS Perl), and I got an out of memory error, with many
>megabytes available for Perl.
>
>This HTML parsing seems to be very memory hungry. My guess is that this
>is because every HTML tag is a not-so-small object in memory. A lot of
>those objects, and you're quickly out of memory.
>
>Just give an acceptable proof that there IS indeed a memory leak, e.g.
>by parsing the same file over and over again, clearing memory (=
>releasing objects) every time, and watching it run out of memory after
>many loops.

True. I remember proving exactly that a year or two ago and reporting it to the list. As far as I remember, there was not much of a reaction, though.

What I tried to do then was to loop over a bunch of files, strip all the HTML code from them (HTML::FormatText) and dump the plain text into a single text file. The point is that I was looping over files, which should mean that I start from zero on every new loop, yet I still lost several hundred kB on each iteration. I deliberately undef'ed all the variables and closed the filehandles at the end of each iteration, but to no avail.

The solution I came up with, and which I am still using, is an extremely ugly one: an AppleScript that calls the droplet containing the Perl script with a list of about 10 files per iteration, quits MacPerl, restarts it and starts the next iteration. It's ridiculous, but it has worked for months, every day.

I don't know who is to blame, MacPerl or the HTML module(s) or both, but there is no doubt that there have been huge memory leaks in this area for years.

[later] Oops, after rereading the related pod, it seems I finally found a hint in Gisle Aas' HTML::Element:

>BUGS
>
>If you want to free the memory associated with a tree built of
>HTML::Element nodes then you will have to delete it explicitly. The
>reason for this is that perl currently has no proper garbage collector,
>but depends on reference counts in the objects. This scheme fails because
>the parse tree contains circular references (parents have references to
>their children and children have a reference to their parent).

(Maybe this note was not there in earlier versions of the module... [shrug])

There still remains a practical problem: what does this mean in relation to HTML::FormatText? Regarding the following snippet...

    foreach $file (@ARGV) {
        if (-f $file && $file !~ /index\.htm./) {
            open (OUT, ">>$outfile") or die $^E;
            print OUT basename($file);
            $html      = parse_htmlfile($file);
            $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 100000);
            $text      = $formatter->format($html);
            undef $html;
            undef $formatter;
            # further GREP-wise cleaning-up snipped
            print OUT $text, "÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷÷\n\n";
            undef $text;
            close OUT;
        }
    }

...I don't see how I could possibly free up more memory. Am I doing something wrong here? Ideas, anybody?

I am in digest mode, so please cc me.
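PS: If I read the BUGS note correctly, the explicit deletion it talks about should be the delete() method that HTML::Element documents, which breaks the parent/child links by hand. So my guess (untested here, and I don't know whether the formatter keeps its own reference to the tree) is that the loop above is missing one call before the undefs, roughly:

    $html = parse_htmlfile($file);
    $text = $formatter->format($html);
    $html->delete;    # explicitly break the circular parent/child references
    undef $html;      # only now can the reference counts actually drop to zero
    undef $formatter;

The same delete() would presumably also belong in the parse-the-same-file-over-and-over test Bart suggests, to check whether memory still grows once the circular references are broken by hand.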
__Peter Hartmann ________
mailto:hphartmann@arcormail.de