[Date Prev][Date Next][Thread Prev][Thread Next] [Search] [Date Index] [Thread Index]

Re: [FWP] keyword parser for search engine



Andrew Pimlott <andrew@pimlott.ne.mediaone.net> writes:
> Actually, I think it would be very cool to have a series of x-to-re
> modules that convert various pattern languages into perl regular
> expressions.  All of the various glob styles are obvious candidates.

I was toying with something a couple of days ago that led me to this.  It
takes glob patterns as accepted by GNU locate(1) (complete with the logic
about whether or not to anchor the match) and gives you a qr// regex.  You
get undef for an invalid glob pattern.  The code's probably a little too
long, dull, and awkward to qualify as genuinely Fun, but the task has some
annoying little niggles that aren't entirely trivial.

sub locate_glob_to_regex {
    my $s = shift;

    # Anything backslash-escaped should be left as is.  A question-mark
    # should become a dot.  A star should become a dot-star.  A
    # bracket-expression should be left as-is, except that a bang used to
    # negate should become a caret.  Any other non-\w char should be
    # backslash-escaped.

    # Once we've done all that, we qr// it, *without* \Q, and anchor it if
    # necessary.

    my $rx = "";
    my $found_meta = 0;

    for (my $i = 0;  $i < length($s);  $i++) {
	my $c = substr($s, $i, 1);
	if ($c eq "*") {
	    $found_meta = 1;
	    $rx .= ".*";
	}
	elsif ($c eq "?") {
	    $found_meta = 1;
	    $rx .= ".";
	}
	elsif ($c eq "\\") {
	    return undef if ++$i >= length($s);
	    $rx .= $c . substr($s, $i, 1);
	}
	elsif ($c eq "[") {
	    $found_meta = 1;

	    # Swallow a bracket expression and add it to $rx.  Points to
	    # watch for: if the *first* character is a bang, make it a
	    # caret.  If the first non-bang, non-caret character is a
	    # close-bracket, it doesn't close the expression.  Thereafter,
	    # everything goes in unchanged, up to and including the first
	    # close-bracket.

	    $rx .= $c;

	    return undef if ++$i >= length($s);
	    $c = substr($s, $i, 1);

	    if ($c eq "!" or $c eq "^") {
		$rx .= "^";
		return undef if ++$i >= length($s);
		$c = substr($s, $i, 1);
	    }

	    # Whatever the first non-bang, non-caret char was, we add it
	    # to the rx.
	    $rx .= $c;

	    for (;;) {
		return undef if ++$i >= length($s);
		$c = substr($s, $i, 1);
		$rx .= $c;
		last if $c eq "]";
	    }
	}
	elsif ($c =~ /\w/) {
	    $rx .= $c;
	}
	else {		# non-\w, non-special
	    $rx .= "\\$c";
	}
    }

    $rx = "^$rx\$" if $found_meta;
    return qr/$rx/;
}

-- 
Aaron Crane   <aaron.crane@pobox.com>   <URL:http://pobox.com/~aaronc/>

==== Want to unsubscribe from Fun With Perl?  Well, if you insist...
==== Send email to <fwp-request@technofile.org> with message _body_
====   unsubscribe