
Re: [MacPerl] RFC: How can I optimise bulk matching regex



At 18:27 +0900 07/20/1999, robinmcf@altern.org wrote:
>I'm currently working on a dictionary project, I'm using -
>@searchterms= qw (
>	term1
>	term2
>	term3
>	...);
>
>while (<>) {
>    foreach $item(@searchterms) {
>	if (/$item/go) {
>	    do stuff....;
>	}
>    }
>}
>
>-to parse the files I want to check - (the search terms array will get
>bigger as the user dictionary expands). In the regexp faq it says this kind
>of matching is very inefficient and suggest using  MAP to compile the
>regexes, though I have tried I can't adapt the example given for what I
>want to do, and the Camel is (unfortunately in this case) exactly like the
>FAQ .
>Any examples and/or pointers will go a long way to helping my insomnia

Assuming that you are literally talking about a dictionary that has 
entries and definitions (i.e., keys and values), why not just use a 
hash and test for the existence of a key that matches each search 
term in turn? That's very fast and something that hashes are well 
suited for (in fact, hashes are actually called dictionaries in 
Python). The downside is that a hash lookup only finds exact keys, 
so if you need fallback loose matching (returning an inexact match 
when the exact search term isn't found), you are stuck with scanning 
an array.
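
To make the hash suggestion concrete, here is a minimal sketch. The 
entries, the %dictionary name and the exact_matches helper are all 
made up for illustration; they are not taken from the original script.

```perl
use strict;
use warnings;

# Hypothetical dictionary: entry => definition (sample data only).
my %dictionary = (
    aleph  => 'first letter of the Hebrew alphabet',
    codex  => 'a manuscript in book form',
    scroll => 'a rolled manuscript',
);

my @searchterms = qw(codex papyrus scroll);

# Return only those terms that exist as exact keys in the dictionary.
# Each test is a single hash lookup, so no regex is run at all.
sub exact_matches {
    my ($dict, @terms) = @_;
    return grep { exists $dict->{$_} } @terms;
}

my @found = exact_matches(\%dictionary, @searchterms);
print "$_: $dictionary{$_}\n" for @found;
```

Note that the keys are case-sensitive here; lowercasing both the keys 
and the search terms with lc would be one way around that.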

I've got a rather large dictionary running at 
http://www.sbl-site.org/cgi-bin/SBL/sbl-loader.pl?EPubs/dictindex.html 
that uses arrays mainly because loose matching was desirable. There 
are over 3700 entries and the initial results are returned using a 
simple /^$term.*/i match against a single text file that uses tabs as 
the delimiter between the entry and the definition with one physical 
line per entry. The results links then use /^$term\t/ (and some 
progressively looser matching for unrelated reasons) to retrieve & 
display the actual text.
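
The two passes described above could be sketched roughly as follows. 
This is not the code from dbrowser.pl or loadDY.acgi: the sample 
records and the sub names are invented, the data sits in an array 
rather than a file so the example is self-contained, and the \Q...\E 
quoting of the term is my own addition so that punctuation typed by 
the user is not treated as regex syntax.

```perl
use strict;
use warnings;

# Hypothetical sample data: one entry per line, with a tab between
# the entry name and its definition.
my @lines = (
    "Codex\tA manuscript in book form.",
    "Codex Sinaiticus\tA fourth-century Greek manuscript.",
    "Scroll\tA rolled manuscript.",
);

# Loose pass: names of all entries that start with the term,
# ignoring case.
sub loose_matches {
    my ($term, @records) = @_;
    my $re = qr/^\Q$term\E/i;    # compiled once per search
    my @names;
    for my $record (@records) {
        next unless $record =~ $re;
        my ($entry) = split /\t/, $record, 2;
        push @names, $entry;
    }
    return @names;
}

# Exact pass: definition of the entry whose name is followed
# directly by the tab delimiter, or undef if there is none.
sub exact_definition {
    my ($term, @records) = @_;
    my $re = qr/^\Q$term\E\t/;
    for my $record (@records) {
        next unless $record =~ $re;
        my ($entry, $definition) = split /\t/, $record, 2;
        return $definition;
    }
    return undef;
}

print join(', ', loose_matches('codex', @lines)), "\n";
print exact_definition('Codex', @lines), "\n";
```

Building the pattern once with qr// before the loop means the term is 
interpolated and compiled a single time per search rather than once 
per line.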

This may not be the most efficient way to do this, but the 
performance is very good and it does not assume that the user already 
knows exactly what he's looking for. This project is documented at 
http://www.sbl-site.org/cgi-bin/SBL/sbl-loader.pl?EPubs/documentation.html 
and you can view scripts in your browser via links there (the 
ones you are interested in are dbrowser.pl which gets the initial 
results and loadDY.acgi which retrieves the individual entries). Help 
yourself if you find anything that may be of use to you.


Richard Gordon
--------------------
Gordon Consulting & Design
Database Design/Scripting Languages
mailto:richard@richardgordon.net
http://www.richardgordon.net
770.971.6887 (voice)
770.216.1829 (fax)
