No, this isn't a cleverly disguised porn spam, but rather a mini essay on reversed regular expressions. I heard on the P5P list (I'm quite sure) something about a regex modifier /r, that would start matching at the END of a string. Now, I'm not about to say I know how to implement this on the source code side, but it would be a matter of reversing the regex nodes (and reversing some of their meanings). in the simplest cases. In more complex cases, as shown next, entire sections can be optimized out: #!/usr/bin/perl use Benchmark 'timethese'; $rsmall = reverse ($small = 'aaaa' x 100); $rbig = reverse ($big = 'aaaa' x 1_000); timethese(-3, { 'EOS small' => q{ $small =~ /aaaa(?!.*aaaa)/ }, 'EOS big' => q{ $big =~ /aaaa(?!.*aaaa)/ }, 'BOS small' => q{ (reverse $small) =~ /aaaa/ }, 'BOS big' => q{ (reverse $big) =~ /aaaa/ }, 'cached BOS small' => q{ $rsmall =~ /aaaa/ }, 'cached BOS big' => q{ $rbig =~ /aaaa/ }, }); __END__ Benchmark: running BOS big, BOS small, EOS big, EOS small, cached BOS big, cached BOS small, each for at least 3 CPU seconds... BOS big: 3 wallclock secs ( 3.00 usr + 0.00 sys = 3.00 CPU) @ 1095.67/s (n=3287) BOS small: 4 wallclock secs ( 3.02 usr + 0.00 sys = 3.02 CPU) @ 12560.93/s (n=37934) EOS big: 4 wallclock secs ( 4.69 usr + 0.00 sys = 4.69 CPU) @ 1.28/s (n=6) EOS small: 4 wallclock secs ( 3.01 usr + 0.00 sys = 3.01 CPU) @ 89.70/s (n=270) cached BOS big: 3 wallclock secs ( 3.05 usr + 0.01 sys = 3.06 CPU) @ 194393.79/s (n=594845) cached BOS small: 5 wallclock secs ( 3.04 usr + 0.00 sys = 3.04 CPU) @ 166147.04/s (n=505087) In this simple example, we're matching the LAST occurrence of 'aaaa' in a string. In the standard approach, we'd use something like: /aaaa(?!.*aaaa)/ This could be optimized to /aaaa(?:[^a]+a{1,3})*$/ making the results: Benchmark: running EOS big, EOS small, each for at least 3 CPU seconds... EOS big: 4 wallclock secs ( 3.50 usr + -0.01 sys = 3.49 CPU) @ 20.34/s (n=71) EOS small: 3 wallclock secs ( 3.29 usr + 0.00 sys = 3.29 CPU) @ 207.90/s (n=684) But in the case of matching a string from the end, we can simply match /aaaa/ because the LAST occurrence of 'aaaa' in a string is the FIRST occurrence in the reverse of the string. This is ok for string literals; and if we use rindex() instead, we get good speeds, but not as fast the cached beginning-of-string method: rindex big: 3 wallclock secs ( 2.99 usr + 0.01 sys = 3.00 CPU) @ 124290.00/s (n=372870) rindex small: 3 wallclock secs ( 3.67 usr + 0.00 sys = 3.67 CPU) @ 127551.23/s (n=468113) So what do you do for complex regular expressions? Let's take the simple grammar: $comment = qr{ ![^!]*! }x; $string = qr{ <[^\\<>](?:\\.[^\\<>]*)*> }x; $other = qr{ [^<>!]+ }x; A language written as such could be matched by the following regex: $CODE =~ /\A(?: $comment | $string | $other )+\z/xo; To match the LAST comment in the code, we'd have to do $CODE =~ /($comment)(?: | $string | $other )*\z/xo; Here's a sample block of code: ======================================================================== ! this is a sample program ! int x = 10; ! what a silly grammar ! str y = <cool " beans>; if (len(y) GREATER_THAN x) { ! can't use < and >... heehee ! ! empty comment coming up ! print y, < is longer than >, x, < characters>; !! ! and more stuff to come soon ! chop(y,x); print <I sliced 'y' down to >, x, < characters for you>; } ======================================================================== So the last comment is "! and more stuff to come soon !". (I tried this code, and it was taking a very long time to match. Is there a fallacy in my regex somewhere?) Let's use the reversal tactic. $REVcomment = qr{ ![^!]*! }x; $REVstring = qr{ > (?: [^<>\\]* .\\ )* [^<>\\]* < }x; $REVother = qr{ [^<>!]* }x; reverse $CODE =~ /\A (?: $REVstring | $REVother )* ($REVcomment) /xo; # you have to reverse $1, of course When I ran my code through a Benchmark... Benchmark: running back+cache, backwards, each for at least 5 CPU seconds... back+cache: 15 wallclock secs ( 5.36 usr + 0.00 sys = 5.36 CPU) @ 3183.77/s (n=17065) backwards: 10 wallclock secs ( 5.02 usr + 0.00 sys = 5.02 CPU) @ 2487.25/s (n=12486) As you can see, the caching has a slight speed increase (of course this is expected -- it takes time to reverse a string). As I said, the forward approach was taking FAR too long for me to sit and wait for an answer. SO. Where is this headed? I'm not 100% sure. It's not a totally simple task to reverse the nodes of a regular expression. Especially if you run into the trap that (?=...) can match variable-width, while (?<=...) cannot: /(?=A(?:_\d)+Z)(...)/ reversed would be /(?<=Z(?:\d_)+)(\d_A)/ which isn't allowed currently. The regex would have to be crafted in a different fashion (which might be incompatible with complex regexes): /Z(?:\d_)+(\d_A)/ A problem arises. So is regex reversal possible? Not for all cases. But it's definitely something to look into, especially if look-behind can be modified to allow variable-width patterns. -- Jeff "japhy" Pinyan japhy@pobox.com http://www.pobox.com/~japhy/ PerlMonth - An Online Perl Magazine http://www.perlmonth.com/ The Perl Archive - Articles, Forums, etc. http://www.perlarchive.com/ CPAN - #1 Perl Resource (my id: PINYAN) http://search.cpan.org/ ==== Want to unsubscribe from Fun With Perl? Well, if you insist... ==== Send email to <fwp-request@technofile.org> with message _body_ ==== unsubscribe