[Date Prev][Date Next][Thread Prev][Thread Next] [Search] [Date Index] [Thread Index]

Re: [FWP] keyword parser for search engine



Aaron Crane wrote:
> 
> Andrew Pimlott <andrew@pimlott.ne.mediaone.net> writes:
> > Actually, I think it would be very cool to have a series of x-to-re
> > modules that convert various pattern languages into perl regular
> > expressions.  All of the various glob styles are obvious candidates.
> 
> I was toying with something a couple of days ago that led me to this.  It
> takes glob patterns as accepted by GNU locate(1) (complete with the logic
> about whether or not to anchor the match) and gives you a qr// regex.  You
> get undef for an invalid glob pattern.  The code's probably a little too
> long, dull, and awkward to qualify as genuinely Fun, but the task has some
> annoying little niggles that aren't entirely trivial.
> 
> sub locate_glob_to_regex {
>     my $s = shift;
> 
>     # Anything backslash-escaped should be left as is.  A question-mark
>     # should become a dot.  A star should become a dot-star.  A
>     # bracket-expression should be left as-is, except that a bang used to
>     # negate should become a caret.  Any other non-\w char should be
>     # backslash-escaped.
> 
>     # Once we've done all that, we qr// it, *without* \Q, and anchor it if
>     # necessary.
> 
>     my $rx = "";
>     my $found_meta = 0;
> 
>     for (my $i = 0;  $i < length($s);  $i++) {
>         my $c = substr($s, $i, 1);
>         if ($c eq "*") {
>             $found_meta = 1;
>             $rx .= ".*";
>         }
>         elsif ($c eq "?") {
>             $found_meta = 1;
>             $rx .= ".";
>         }
>         elsif ($c eq "\\") {
>             return undef if ++$i >= length($s);
>             $rx .= $c . substr($s, $i, 1);
>         }
>         elsif ($c eq "[") {
>             $found_meta = 1;
> 
>             # Swallow a bracket expression and add it to $rx.  Points to
>             # watch for: if the *first* character is a bang, make it a
>             # caret.  If the first non-bang, non-caret character is a
>             # close-bracket, it doesn't close the expression.  Thereafter,
>             # everything goes in unchanged, up to and including the first
>             # close-bracket.
> 
>             $rx .= $c;
> 
>             return undef if ++$i >= length($s);
>             $c = substr($s, $i, 1);
> 
>             if ($c eq "!" or $c eq "^") {
>                 $rx .= "^";
>                 return undef if ++$i >= length($s);
>                 $c = substr($s, $i, 1);
>             }
> 
>             # Whatever the first non-bang, non-caret char was, we add it
>             # to the rx.
>             $rx .= $c;
> 
>             for (;;) {
>                 return undef if ++$i >= length($s);
>                 $c = substr($s, $i, 1);
>                 $rx .= $c;
>                 last if $c eq "]";
>             }
>         }
>         elsif ($c =~ /\w/) {
>             $rx .= $c;
>         }
>         else {          # non-\w, non-special
>             $rx .= "\\$c";
>         }
>     }
> 
>     $rx = "^$rx\$" if $found_meta;
>     return qr/$rx/;
> }
> 

Change "too long, dull, and awkward" to "short, exciting, and ???????"
and you get "Fun with Perl"...

sub locate_glob_to_regex {
  local $_ = "\e" . shift;
  s/\e\*/.*\a\e/ or
    s/\e\?/.\a\e/ or
    /\e\\$/ and return(undef) or
    s/\e(\\.|\w+)/$1\e/ or
    /\e\[[\!\^]\][^\]]*$/ and return(undef) or
    s/\e\[([\^\!]?)(\].*?)\]/[@{[$1 && '^']}$2]\a\e/ or
    s/\e\[([\^\!]?)([^]]+)\]/[@{[$1 && '^']}$2]\a\e/ or
    s/\e([^[])/\\$1\e/ or return(undef)
    until /\e$/;
  $_ = "^$_\$" if tr/\e\a//d > 1;
  return qr/$_/;
}

This *was* Fun...
-- 
Rick

==== Want to unsubscribe from Fun With Perl?  Well, if you insist...
==== Send email to <fwp-request@technofile.org> with message _body_
====   unsubscribe