[Date Prev][Date Next][Thread Prev][Thread Next] [Search] [Date Index] [Thread Index]

[FWP] sexeger



No, this isn't a cleverly disguised porn spam, but rather a mini essay on
reversed regular expressions.

I heard on the P5P list (I'm quite sure) something about a regex modifier
/r, that would start matching at the END of a string.  Now, I'm not about
to say I know how to implement this on the source code side, but it would
be a matter of reversing the regex nodes (and reversing some of their
meanings). in the simplest cases.  In more complex cases, as shown next,
entire sections can be optimized out:

  #!/usr/bin/perl
  use Benchmark 'timethese';

  $rsmall = reverse ($small = 'aaaa' x 100);
  $rbig = reverse ($big = 'aaaa' x 1_000);

  timethese(-3, {
    'EOS small'        => q{ $small =~ /aaaa(?!.*aaaa)/ },
    'EOS big'          => q{ $big =~ /aaaa(?!.*aaaa)/ },  
    'BOS small'        => q{ (reverse $small) =~ /aaaa/ },
    'BOS big'          => q{ (reverse $big) =~ /aaaa/ },  
    'cached BOS small' => q{ $rsmall =~ /aaaa/ },
    'cached BOS big'   => q{ $rbig =~ /aaaa/ },  
  });

  __END__

Benchmark: running
  BOS big, BOS small, EOS big, EOS small,
  cached BOS big, cached BOS small,
each for at least 3 CPU seconds...

  BOS big:  3 wallclock secs
     ( 3.00 usr +  0.00 sys =  3.00 CPU) @ 1095.67/s (n=3287)
BOS small:  4 wallclock secs
     ( 3.02 usr +  0.00 sys =  3.02 CPU) @ 12560.93/s (n=37934)
  EOS big:  4 wallclock secs
     ( 4.69 usr +  0.00 sys =  4.69 CPU) @ 1.28/s (n=6)
EOS small:  4 wallclock secs
     ( 3.01 usr +  0.00 sys =  3.01 CPU) @ 89.70/s (n=270)
  cached BOS big:  3 wallclock secs
     ( 3.05 usr +  0.01 sys =  3.06 CPU) @ 194393.79/s (n=594845)
cached BOS small:  5 wallclock secs
     ( 3.04 usr +  0.00 sys =  3.04 CPU) @ 166147.04/s (n=505087)


In this simple example, we're matching the LAST occurrence of 'aaaa' in a
string.  In the standard approach, we'd use something like:

  /aaaa(?!.*aaaa)/

This could be optimized to

  /aaaa(?:[^a]+a{1,3})*$/

making the results:

Benchmark: running
  EOS big, EOS small,
each for at least 3 CPU seconds...

   EOS big:  4 wallclock secs
    ( 3.50 usr + -0.01 sys =  3.49 CPU) @ 20.34/s (n=71)
 EOS small:  3 wallclock secs
    ( 3.29 usr +  0.00 sys =  3.29 CPU) @ 207.90/s (n=684)

But in the case of matching a string from the end, we can simply match

  /aaaa/

because the LAST occurrence of 'aaaa' in a string is the FIRST occurrence
in the reverse of the string.  This is ok for string literals; and if we
use rindex() instead, we get good speeds, but not as fast the cached
beginning-of-string method:

  rindex big:  3 wallclock secs
  ( 2.99 usr +  0.01 sys =  3.00 CPU) @ 124290.00/s (n=372870)
rindex small:  3 wallclock secs
  ( 3.67 usr +  0.00 sys =  3.67 CPU) @ 127551.23/s (n=468113)

So what do you do for complex regular expressions?  Let's take the simple
grammar:

  $comment = qr{ ![^!]*! }x;
  $string = qr{ <[^\\<>](?:\\.[^\\<>]*)*> }x;
  $other = qr{ [^<>!]+ }x;

A language written as such could be matched by the following regex:

  $CODE =~ /\A(?: $comment | $string | $other )+\z/xo;

To match the LAST comment in the code, we'd have to do

  $CODE =~ /($comment)(?: | $string | $other )*\z/xo;

Here's a sample block of code:

========================================================================
! this is a sample program ! int x = 10;
! what a silly grammar ! str y = <cool " beans>;
if (len(y) GREATER_THAN x) { ! can't use < and >... heehee !
  ! empty comment coming up !
  print y, < is longer than >, x, < characters>;
  !!
  ! and more stuff to come soon !
  chop(y,x);
  print <I sliced 'y' down to >, x, < characters for you>;
}
========================================================================

So the last comment is "! and more stuff to come soon !".  (I tried this
code, and it was taking a very long time to match.  Is there a fallacy in
my regex somewhere?)

Let's use the reversal tactic.

  $REVcomment = qr{ ![^!]*! }x;
  $REVstring = qr{ > (?: [^<>\\]* .\\ )* [^<>\\]* < }x;
  $REVother = qr{ [^<>!]* }x;

  reverse $CODE =~ /\A (?: $REVstring | $REVother )* ($REVcomment) /xo;
  # you have to reverse $1, of course

When I ran my code through a Benchmark...

Benchmark: running
  back+cache, backwards,
each for at least 5 CPU seconds...

back+cache: 15 wallclock secs
  ( 5.36 usr +  0.00 sys =  5.36 CPU) @ 3183.77/s (n=17065)
 backwards: 10 wallclock secs
  ( 5.02 usr +  0.00 sys =  5.02 CPU) @ 2487.25/s (n=12486)

As you can see, the caching has a slight speed increase (of course this is
expected -- it takes time to reverse a string).  As I said, the forward
approach was taking FAR too long for me to sit and wait for an answer.

SO.

Where is this headed?  I'm not 100% sure.  It's not a totally simple task
to reverse the nodes of a regular expression.  Especially if you run into
the trap that (?=...) can match variable-width, while (?<=...) cannot:

  /(?=A(?:_\d)+Z)(...)/

reversed would be

  /(?<=Z(?:\d_)+)(\d_A)/

which isn't allowed currently.  The regex would have to be crafted in a
different fashion (which might be incompatible with complex regexes):

  /Z(?:\d_)+(\d_A)/

A problem arises.

So is regex reversal possible?  Not for all cases.  But it's definitely
something to look into, especially if look-behind can be modified to allow
variable-width patterns.

-- 
Jeff "japhy" Pinyan     japhy@pobox.com     http://www.pobox.com/~japhy/
PerlMonth - An Online Perl Magazine            http://www.perlmonth.com/
The Perl Archive - Articles, Forums, etc.    http://www.perlarchive.com/
CPAN - #1 Perl Resource  (my id:  PINYAN)        http://search.cpan.org/


==== Want to unsubscribe from Fun With Perl?  Well, if you insist...
==== Send email to <fwp-request@technofile.org> with message _body_
====   unsubscribe