[FWP] innermost first parsing



Given text like (indented for sanity's sake):

  BEGIN
    alpha
    BEGIN
      beta
      gamma
      BEGIN
        delta
      END
      epsilon
      BEGIN
        zeta
        eta
      END
      theta
    END
  END

If you want to find the first BEGIN ... END block that does NOT contain
another BEGIN ... END pair (that is, the first innermost block), you can
use the following "unrolling the loop" style regex:

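  my ($first) = $text =~ m{
    BEGIN
    (                    # capture the block's contents
      [^BE]*             # 'B' and 'E' are the first chars of the tags
      (?:
        (?:
          B+ (?! EGIN )  # match /B+/ only if it does NOT start 'BEGIN'
          |
          E+ (?! ND )    # match /E+/ only if it does NOT start 'END'
        )
        [^BE]*           # then more text that cannot start a tag
      )*
    )
    END
  }xs;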
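
To see what the pattern picks out, here is a minimal, self-contained sketch that runs it against the sample text above. The names $leaf and $shown are just illustrative, and the whitespace trimming is only for display; neither is part of the original regex.

```perl
use strict;
use warnings;

# Sample input from the post (the indentation is cosmetic).
my $text = <<'END_TEXT';
BEGIN
  alpha
  BEGIN
    beta
    gamma
    BEGIN
      delta
    END
    epsilon
    BEGIN
      zeta
      eta
    END
    theta
  END
END
END_TEXT

# The unrolled-loop pattern, precompiled so it can be reused.
# ($leaf is a hypothetical name chosen for this sketch.)
my $leaf = qr{
  BEGIN
  (
    [^BE]*
    (?:
      (?: B+ (?! EGIN ) | E+ (?! ND ) )
      [^BE]*
    )*
  )
  END
}x;

# List context returns the captured group, i.e. the block's contents.
my ($first) = $text =~ $leaf;

# Trim the surrounding whitespace purely for display.
(my $shown = $first) =~ s/^\s+|\s+$//g;
print "first innermost block: $shown\n";
```

The two outer BEGINs are skipped because the pattern cannot cross a nested BEGIN on its way to an END, so the first successful match is the block holding "delta".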

This has worked for the stress-testing I've done.
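
The post doesn't say how the stress-testing was done. One possible approach, sketched below, is to compare the unrolled pattern against a slower but more obviously correct reference on random input. The reference pattern (a negative lookahead checked before every character), the token list, the seed and the trial count are all assumptions made for this sketch, not something from the original message.

```perl
use strict;
use warnings;

# The unrolled-loop pattern from the post.
my $unrolled = qr{
  BEGIN ( [^BE]* (?: (?: B+ (?! EGIN ) | E+ (?! ND ) ) [^BE]* )* ) END
}x;

# Reference pattern (an assumption, not from the post): advance one
# character at a time, refusing to step onto the start of either tag.
my $reference = qr{ BEGIN ( (?: (?! BEGIN | END ) . )* ) END }xs;

srand(42);  # fixed seed so runs are repeatable

# Tokens chosen to produce many near-miss tags such as "BEG" or "EN".
my @pieces = ('BEGIN', 'END', 'B', 'E', 'BE', 'EN', 'BEG', 'x', ' ', "\n");
my ($trials, $mismatches) = (2000, 0);

for (1 .. $trials) {
    # Build a random string of 1 to 12 tokens.
    my $text = join '', map { $pieces[rand @pieces] } 1 .. 1 + int rand 12;

    my ($got)  = $text =~ $unrolled;
    my ($want) = $text =~ $reference;

    # Both must fail together, or both must capture the same contents.
    my $same = defined $got ? (defined $want && $got eq $want)
                            : !defined $want;
    $mismatches++ unless $same;
}

print "$mismatches mismatches in $trials trials\n";
```

If the two patterns really accept the same block contents, the mismatch count stays at zero. The appeal of the unrolled form is that it does the same job without a lookahead at every character.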

The application is:

  while ($text =~ s/REGEX/SOMETHING/) {
    $count++;
  }

so that the innermost matches are dealt with first.
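
As a concrete instance of that loop, the sketch below replaces each innermost block with a numbered placeholder and saves its contents. The placeholder format "<N>", the @blocks array and the one-line sample input are hypothetical choices made for illustration; the original leaves SOMETHING unspecified.

```perl
use strict;
use warnings;

# Compact sample input with three levels of nesting.
my $text = "BEGIN a BEGIN b BEGIN c END d BEGIN e END f END g END";

# The unrolled-loop pattern from the post, precompiled.
my $leaf = qr{
  BEGIN ( [^BE]* (?: (?: B+ (?! EGIN ) | E+ (?! ND ) ) [^BE]* )* ) END
}x;

my $count = 0;
my @blocks;  # contents of each block, in the order they were resolved

# Each pass rewrites the first block that contains no nested block.
# Once it has been replaced, its parent may itself become innermost.
while ($text =~ s/$leaf/<$count>/) {
    push @blocks, $1;   # contents captured by the successful match
    $count++;
}

print "resolved $count blocks; remaining text: $text\n";
printf "  <%d> = [%s]\n", $_, $blocks[$_] for 0 .. $#blocks;
```

Because the placeholder contains no 'B' or 'E', it cannot be mistaken for a tag on later passes. A replacement that could itself contain BEGIN or END would need more care.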

-- 
Jeff "japhy" Pinyan     japhy@pobox.com     http://www.pobox.com/~japhy/
PerlMonth - An Online Perl Magazine            http://www.perlmonth.com/
The Perl Archive - Articles, Forums, etc.    http://www.perlarchive.com/
CPAN - #1 Perl Resource  (my id:  PINYAN)        http://search.cpan.org/

