[MacPerl-AnyPerl] Lookups and efficiency

Hi all,

I'm trying to optimize a script that does lookups. Here's the 
problem: each day the script processes between 1,000 and 100,000+ new 
records. I have to compare the record's unique identifier (a 5-8 
digit integer) to a master list of over a million IDs. If I haven't 
seen the ID before and the record meets my criteria I process it and 
add the ID to the master list. Right now, the way I do this is to 
read the master list of IDs into a hash, and then check to see if the 
key exists as I'm working my way through the new records. This works 
fairly quickly: 10-15 seconds to load the hash and then ~10 minutes 
to process 100,000 records (depending on how many meet the criteria). 
The problem with this, of course, is that I have to allocate a huge 
amount of memory to MacPerl to load this into memory (>120 MB). I'd 
like to do this more efficiently, especially since I can foresee a 
time in the not-so-distant future when the master list will no longer 
fit in memory. The question is, what can I do to reduce the amount of 
memory required, but still maintain the speed? I tried using 
Tie::SubstrHash (by padding every ID to a fixed 8-digit integer). It 
definitely used less memory (about 1/3 as much as before), but took almost 5 
minutes to load the hash (and then gave me an error when I tried to 
check for the existence of a key...). Am I going to have to go to a 
database solution? If so, does anyone have any suggestions? I would 
imagine that querying a database for every ID is going to be 
significantly slower than checking for the existence of a hash key. 
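
For reference, the current in-memory approach looks roughly like the sketch 
below. This is a simplified illustration, not the actual script: the sub 
names (load_master, filter_new), the one-ID-per-line master file, the 
record layout (a hash ref with an "id" field), and the criteria callback 
are all assumptions made for the example. It also uses three-argument open 
and lexical filehandles, which assumes a reasonably recent Perl.

```perl
use strict;
use warnings;

# Load the master list into a hash used as a set.
# Assumption: the file holds one integer ID per line.
sub load_master {
    my ($path) = @_;
    my %seen;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $id = <$fh>) {
        chomp $id;
        $seen{$id} = 1 if length $id;   # skip blank lines
    }
    close $fh;
    return \%seen;
}

# Walk the new records and return the IDs that were accepted.
# A record is accepted only if its ID is not already in the set
# and the caller-supplied criteria check returns true.
# Accepted IDs are added to the set so later duplicates are skipped.
sub filter_new {
    my ($seen, $records, $meets_criteria) = @_;
    my @accepted;
    for my $rec (@$records) {
        my $id = $rec->{id};                   # hypothetical record layout
        next if exists $seen->{$id};           # already in the master list
        next unless $meets_criteria->($rec);   # hypothetical criteria check
        $seen->{$id} = 1;
        push @accepted, $id;
    }
    return \@accepted;
}
```

The exists test is a constant-time lookup on average, which is why the 
per-record check is fast; the cost is that every ID is stored as a 
separate hash key, which is where the memory goes.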

Thanks, Todd

Dr. Todd Richmond
Carnegie Institution of Washington
260 Panama Street
Stanford, CA 94305
Email: todd@andrew2.stanford.edu  Homepage: http://cellwall.stanford.edu/todd
