At 14:34 -0700 2000.09.30, Todd Richmond wrote:

>I'm trying to optimize a script that does lookups. Here's the
>problem: each day the script processes between 1,000 and 100,000+ new
>records. I have to compare each record's unique identifier (a 5-8
>digit integer) to a master list of over a million IDs. If I haven't
>seen the ID before and the record meets my criteria, I process it and
>add the ID to the master list. Right now, the way I do this is to
>read the master list of IDs into a hash, and then check to see if the
>key exists as I'm working my way through the new records. This works
>fairly quickly: 10-15 seconds to load the hash and then ~10 minutes
>to process 100,000 records (depending on how many meet the criteria).
>The problem with this, of course, is that I have to allocate a huge
>amount of memory to MacPerl to load this into memory (>120 MB). I'd
>like to do this more efficiently, especially since I can foresee a
>time in the not-so-distant future when the master list will no longer
>fit in memory. The question is, what can I do to reduce the amount of
>memory required, but still maintain the speed? I tried using
>Tie::SubstrHash (by forcing all the IDs to be 8-digit integers). It
>definitely used less memory (~1/3 as much as before), but took almost
>5 minutes to load the hash (and then gave me an error when I tried to
>check for the existence of a key...). Am I going to have to go to a
>database solution? If so, does anyone have any suggestions? I would
>imagine that querying a database for every ID is going to be
>significantly slower than checking for the existence of a hash key.
>True?

Why not just use DB_File?

#!perl -w
use strict;
use DB_File;

my $master_list = "path:to:master:list";

tie my %hash, 'DB_File', $master_list, O_RDWR|O_CREAT
    or die "Cannot tie $master_list: $!";

my @nums = (12345, 9876543);   # note: a leading zero (012345) would be parsed as an octal literal
for my $n (@nums) {
    if (! exists $hash{$n}) {
        $hash{$n}++;   # save in hash for future use
    }
}
__END__

DB_File has some issues on certain kinds of volumes, though. On two of
my volumes, DB_File just pukes (one is a 5 GB HFS volume, the other a
10 GB HFS+ volume). On another, my Mac OS X volume (1.88 GB HFS+), it
works just fine. I think it is a block-size issue that has yet to be
resolved.

--
Chris Nandor                      pudge@pobox.com    http://pudge.net/
Open Source Development Network   pudge@osdn.com     http://osdn.com/
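
For context, here is a minimal sketch of how the daily processing loop
described in the question might look once the in-memory master-list hash
is replaced by the DB_File tie shown above. The file path, the @records
sample data, and the meets_criteria() and process_record() routines are
made-up stand-ins for Todd's actual code, not anything from the original
posts:

#!perl -w
use strict;
use DB_File;

my $master_list = "path:to:master:list";    # placeholder path

# Tie the master list once; existence checks then hit the disk file
# instead of a >120 MB in-memory hash.
tie my %seen, 'DB_File', $master_list, O_RDWR|O_CREAT
    or die "Cannot tie $master_list: $!";

# Stand-ins for the day's new records and for the real filter/processing.
my @records = ( { id => 12345 }, { id => 9876543 } );
sub meets_criteria { 1 }
sub process_record { }

for my $record (@records) {
    my $id = $record->{id};                 # the 5-8 digit identifier
    next if exists $seen{$id};              # already in the master list
    next unless meets_criteria($record);    # placeholder filter
    process_record($record);                # placeholder processing
    $seen{$id} = 1;                         # add the new ID to the on-disk list
}

untie %seen;
__END__

Each exists check and store now goes through Berkeley DB on disk, so it
will be somewhat slower per record than a pure in-memory hash, but the
memory footprint stays small and the master list can keep growing past
what fits in RAM.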