At 2:34 PM -0700 9/30/00, Todd Richmond wrote:
>Hi all,
>
>I'm trying to optimize a script that does lookups. Here's the
>problem: each day the script processes between 1000 to 100,000+ new
>records. I have to compare the record's unique identifier (a 5-8
>digit integer) to a master list of over a million IDs. If I haven't
>seen the ID before and the record meets my criteria I process it and
>add the ID to the master list. Right now, the way I do this is to
>read the master list of IDs into a hash, and then check to see if
>the key exists as I'm working my way through the new records. This
>works fairly quickly: 10-15 seconds to load the hash and then ~10
>minutes to process 100,000 records (depending on how many meet the
>criteria). The problem with this, of course, is that I have to
>allocate a huge amount of memory to MacPerl to load this into memory
>(>120 MB). I'd like to do this more efficiently, especially since I
>can foresee a time in the not-so-distant future when the master list
>will no longer fit in memory. The question is, what can I do to
>reduce the amount of memory required, but still maintain the speed?
>I tried using Tie::SubstrHash (by forcing all the IDs to be an 8
>digit integer). It definitely used less memory (~1/3 as before), but
>took almost 5 minutes to load the hash (and then gave me an error
>when I tried to check for the existence of a key...). Am I going to
>have to go to a database solution? If so, anyone have any
>suggestions? I would imagine that querying a database for every ID
>is going to be significantly slower than checking for the existence
>of a hash key. True?

The size of your dataspace is sufficiently big that I would have started with a database, myself. The issues you're dealing with are exactly the things databases do well and were designed for.

Database recommendation is a platform specific question (or, if you are going for a cross-platform database, that's also something to know).
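One middle ground between an in-memory hash and a full database is a hash tied to a DBM file on disk, so the existing "check whether the key exists" logic stays the same while the master list lives outside the script's memory. What follows is only a minimal sketch under assumptions: it uses SDBM_File, which is bundled with standard Perl distributions but has not been checked under MacPerl here, and the names process_new_ids, $db_path, and $meets_criteria are hypothetical. Expect each lookup to be slower than with a purely in-memory hash.

```perl
use strict;
use warnings;
use Fcntl;        # supplies O_RDWR and O_CREAT
use SDBM_File;    # simple on-disk DBM bundled with Perl

# Hypothetical helper: returns the IDs that were not yet in the on-disk
# master list and that pass the caller's criteria, recording them as seen.
sub process_new_ids {
    my ($db_path, $meets_criteria, @ids) = @_;

    # Tie the hash to a DBM file, so lookups go to disk instead of
    # holding the whole master list in memory.
    my %seen;
    tie(%seen, 'SDBM_File', $db_path, O_RDWR | O_CREAT, 0666)
        or die "Cannot tie $db_path: $!";

    my @fresh;
    for my $id (@ids) {
        next if exists $seen{$id};            # already in the master list
        next unless $meets_criteria->($id);   # record fails the criteria
        $seen{$id} = 1;                       # add the ID to the master list
        push @fresh, $id;                     # caller processes this record
    }

    untie %seen;
    return @fresh;
}
```

A call might look like `my @new = process_new_ids('master_ids', sub { 1 }, @todays_ids);`, where 'master_ids' and @todays_ids are placeholders. If SDBM_File is not available, another DBM module such as DB_File can be tied the same way, assuming it is installed.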
Anyway, I don't work with databases much, so I can't make a recommendation. That aside, I still think I have something worth contributing.

I expect that if you move to a database, you won't have to query it for every ID. You would query the database for the existence of a specific ID, and continue from there. A good DB solution would load reasonably fast, and not actually require the entire table to be put into your script's memory.

-Jeff Lowrey