
Re: [MacPerl-AnyPerl] Lookups and efficiency



At 2:34 PM -0700 9/30/00, Todd Richmond wrote:
>Hi all,
>
>I'm trying to optimize a script that does lookups. Here's the 
>problem: each day the script processes between 1,000 and 100,000+ new 
>records. I have to compare the record's unique identifier (a 5-8 
>digit integer) to a master list of over a million IDs. If I haven't 
>seen the ID before and the record meets my criteria I process it and 
>add the ID to the master list. Right now, the way I do this is to 
>read the master list of IDs into a hash, and then check to see if 
>the key exists as I'm working my way through the new records. This 
>works fairly quickly: 10-15 seconds to load the hash and then ~10 
>minutes to process 100,000 records (depending on how many meet the 
>criteria). The problem with this, of course, is that I have to 
>allocate a huge amount of memory to MacPerl to load this into memory 
>(>120 MB). I'd like to do this more efficiently, especially since I 
>can foresee a time in the not-so-distant future when the master list 
>will no longer fit in memory. The question is, what can I do to 
>reduce the amount of memory required, but still maintain the speed? 
>I tried using Tie::SubstrHash (by forcing all the IDs to be an 8 
>digit integer). It definitely used less memory (~1/3 as before), but 
>took almost 5 minutes to load the hash (and then gave me an error 
>when I tried to check for the existence of a key...). Am I going to 
>have to go to a database solution? If so, anyone have any 
>suggestions? I would imagine that querying a database for every ID 
>is going to be significantly slower than checking for the existence 
>of a hash key. True?

Your data set is big enough that I would have started with a database 
myself.  The issues you're dealing with are exactly the things 
databases do well and were designed for.

Which database to recommend is a platform-specific question (or, if 
you are going for a cross-platform database, that's also something to 
know).  I don't work with databases much, so I can't make a 
recommendation.

That aside, I still think I have something worth contributing.  I 
expect that if you move to a database, you won't have to load every 
ID from it.  You would query the database for the existence of a 
specific ID, and continue from there.  A good DB solution would load 
reasonably fast, and not actually require the entire table to be put 
into your script's memory.
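
To make that concrete, here is a rough sketch of a per-ID existence 
check against an on-disk table.  It is not a database recommendation. 
It assumes SDBM_File (a DBM module that ships with standard Perl) is 
available in your build.  The file name "master_ids" and the helper 
is_new_id() are made up for illustration.

```perl
use strict;
use Fcntl;        # supplies O_RDWR and O_CREAT
use SDBM_File;    # assumption: this DBM module is available in your Perl

# Hypothetical helper: returns 1 if $id was not yet in the master list
# (and records it), 0 if it was already there.  Each call touches only
# the on-disk entry for that one key.
sub is_new_id {
    my ($master, $id) = @_;
    return 0 if exists $master->{$id};
    $master->{$id} = 1;
    return 1;
}

# Tie the hash to a DBM file so the IDs live on disk, not in memory.
# "master_ids" is a placeholder name.
my %master;
tie(%master, 'SDBM_File', 'master_ids', O_RDWR | O_CREAT, 0666)
    or die "Cannot open master_ids: $!";

# Stand-in for the day's new records.
for my $id (qw(12345 67890123 12345)) {
    if (is_new_id(\%master, $id)) {
        print "$id: new, process it\n";
    }
    else {
        print "$id: already seen, skip it\n";
    }
}

untie %master;
```

The rest of the script can keep using exists() as it does now, since 
the tied hash looks like an ordinary one.  Whether this keeps up with 
an in-memory hash on a million IDs is something you would have to 
measure.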

-Jeff Lowrey
