
Re: [MacPerl-AnyPerl] Lookups and efficiency



At 14:34 -0700 2000.09.30, Todd Richmond wrote:
>I'm trying to optimize a script that does lookups. Here's the
>problem: each day the script processes between 1,000 and 100,000+ new
>records. I have to compare the record's unique identifier (a 5-8
>digit integer) to a master list of over a million IDs. If I haven't
>seen the ID before and the record meets my criteria I process it and
>add the ID to the master list. Right now, the way I do this is to
>read the master list of IDs into a hash, and then check to see if the
>key exists as I'm working my way through the new records. This works
>fairly quickly: 10-15 seconds to load the hash and then ~10 minutes
>to process 100,000 records (depending on how many meet the criteria).
>The problem with this, of course, is that I have to allocate a huge
>amount of memory to MacPerl to load this into memory (>120 MB). I'd
>like to do this more efficiently, especially since I can foresee a
>time in the not-so-distant future when the master list will no longer
>fit in memory. The question is, what can I do to reduce the amount of
>memory required, but still maintain the speed? I tried using
>Tie::SubstrHash (by forcing all the IDs to be an 8 digit integer). It
>definitely used less memory (~1/3 as before), but took almost 5
>minutes to load the hash (and then gave me an error when I tried to
>check for the existence of a key...). Am I going to have to go to a
>database solution? If so, anyone have any suggestions? I would
>imagine that querying a database for every ID is going to be
>significantly slower than checking for the existence of a hash key.
>True?

Why not just use DB_File?

#!perl -w
use strict;
use DB_File;
use Fcntl;     # for O_RDWR and O_CREAT

my $master_list = "path:to:master:list";
tie my %hash, 'DB_File', $master_list, O_RDWR|O_CREAT, 0644
    or die "Can't tie $master_list: $!";

# note: no leading zeros here, or Perl would parse the literals as octal
my @nums = (12345, 9876543);
for my $n (@nums) {
    if (! exists $hash{$n}) {
        # process the record here, then ...
        $hash{$n}++;  # save in hash for future use
    }
}

untie %hash;

__END__

DB_File has some issues on certain kinds of volumes, though.  On two of my
volumes, DB_File just pukes (one is HFS 5GB, one is HFS+ 10GB).  On
another, my Mac OS X volume (HFS+ 1.88 GB), it works just fine.  I think it
is a block size issue that has yet to be resolved.
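Another option, if the IDs really never exceed 8 digits, is to skip the hash entirely and use a bit vector via vec(): all 10^8 possible IDs fit in 10**8 / 8 bytes, about 12.5 MB of RAM, no matter how many IDs you've actually seen, and lookups are O(1). A sketch (the seen_before/mark_seen names are just for illustration, and you'd still have to persist the string yourself, e.g. print it to a file and read() it back at startup):

```perl
#!perl -w
use strict;

# One bit per possible 8-digit ID: 10**8 bits = ~12.5 MB total.
my $seen = '';

sub seen_before {
    my $id = shift;
    return vec($seen, $id, 1);   # read the ID's bit
}

sub mark_seen {
    my $id = shift;
    vec($seen, $id, 1) = 1;      # set the ID's bit
}

for my $n (12345, 9876543, 12345) {
    if (seen_before($n)) {
        print "$n: already seen\n";
    } else {
        mark_seen($n);
        print "$n: new\n";
    }
}
```

The tradeoff versus DB_File is that the whole vector lives in memory and must be saved and reloaded explicitly, but 12.5 MB is a lot friendlier than 120 MB.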

-- 
Chris Nandor                      pudge@pobox.com    http://pudge.net/
Open Source Development Network    pudge@osdn.com     http://osdn.com/
