
Re: [MacPerl] cut



At 7:53 PM 12/13/00, Matthias Dorn wrote:
>Hello
>
>I am looking for a way to extract multiple occurrences of a substring 
>(in my case something in rectangular brackets, /^<.+$>/) in such a way 
>that the resultant list includes the brackets and their content.
>
>For example, if I do a split at ">", somehow it gets cut off?
>

Parsing marked-up elements out of text isn't trivial, especially if 
markup is allowed to nest within markup.

But it is possible to capture the split string: put parentheses around it.

# q{} rather than q//, because the text itself contains "/"
my $string = q{<P>This is <B>marked-up </B> text.</P>};

my @items = split /\s*(>)\s*/, $string;
# note parens         ^ ^
print join "\n", @items;
####
prints:

<P
>
This is <B
>
marked-up </B
>
text.</P
>

Any whitespace (\s*) around > is part of the split pattern, but isn't 
captured because it's not within the parens.

You can also split on a pattern that varies, and capture whatever 
happens to be the split in each instance.

# same $string
my @items1 = split /\s*(<\/?|>)\s*/, $string;
# note pattern         ^......^
print join "\n", @items1;
####
prints:

<
P
>
This is
<
B
>
marked-up
</
B
>
text.
</
P
>

That split pattern,
   /\s*(<\/?|>)\s*/
splits on <, </, and >.
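
If that pattern is hard to read, it can be spelled out with the /x 
modifier so that each piece carries a comment. This is only a sketch: 
the names $delim and @pieces are mine, and it behaves the same as the 
one-line version above.

```perl
use strict;
use warnings;

my $string = q{<P>This is <B>marked-up </B> text.</P>};

# With /x, whitespace and comments inside the pattern are ignored.
my $delim = qr{
    \s*         # whitespace before the delimiter: matched, not captured
    (           # capture, so split returns the delimiter as a field
        </?     # "<" or "</"
      |         # or
        >       # ">"
    )
    \s*         # whitespace after the delimiter: matched, not captured
}x;

my @pieces = split /$delim/, $string;
print join("\n", @pieces), "\n";
```

Using braces as the qr delimiters also means the "/" no longer needs 
a backslash.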

That split could be the basis of a parser. Another approach uses a 
slightly different split pattern:

# same $string
my @items2 = split /\s*(<[^>]*>)\s*/, $string;
# note pattern         ^.......^
print join "\n", @items2, "\n", '---', "\n";
####
prints:

<P>
This is
<B>
marked-up
</B>
text.
</P>

Or, remove the parens and whitespace from the split pattern:

# same $string
my @items3 = split /<[^>]*>/, $string;
# note NO parens & no whitespace (\s*)
print join "", @items3, "\n", '---', "\n";
# join with "" to preserve whitespace
####
prints (two spaces remain where </B> was removed):
This is marked-up  text.
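
If only the bracketed pieces are wanted, brackets included, a global 
match in list context is an alternative to split. This is a different 
technique from the splits above, and just a sketch: it assumes that no 
">" ever appears inside a tag, and the name @tags is mine.

```perl
use strict;
use warnings;

my $string = q{<P>This is <B>marked-up </B> text.</P>};

# In list context, m//g returns the captured text of every match.
my @tags = $string =~ /(<[^>]*>)/g;

print join("\n", @tags), "\n";
# prints:
# <P>
# <B>
# </B>
# </P>
```

Nothing is split here, so the text between the tags is simply skipped.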

Note that if the text starts with something that matches the split 
pattern, the list returned from split will have an empty ('') first 
element, ahead of any element holding a captured split string. That's 
why the second and third printouts above (the splits on <\/?|> and on 
<[^>]*>) start with a blank line. Their elements were joined with 
"\n", so even an element with '' gets its own line in the printout.
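
If that empty first element gets in the way, one way to drop it is to 
filter the list with grep. This is a minimal sketch rather than the 
only approach, and the names @fields and @nonempty are mine:

```perl
use strict;
use warnings;

my $string = q{<P>This is <B>marked-up </B> text.</P>};

# Capturing split; the first field is '' because the string
# begins with a tag.
my @fields = split /\s*(<[^>]*>)\s*/, $string;

# Keep only the fields that contain at least one character.
my @nonempty = grep { length } @fields;

print join("\n", @nonempty), "\n";
```

The same filter also removes any empty fields that turn up between 
two adjacent tags, which may or may not be what you want.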

These aren't bullet-proof patterns, and some folks have spent many 
hours on parsing algorithms for HTML and other texts, so you might 
want to search around and find which Perl parsing modules will 
readily do the heavy lifting for you.

Or, maybe you've got just the algorithm for your needs, and the above 
fun with split patterns might help you along. Here's all of the 
above, so you can fiddle with different strings and patterns:

#!perl -w

my $string = q{<P ALIGN='CENTER'>This is <B>marked-up</B> text.</P>};
print "## Text: ###", "\n", $string, "\n\n";

my @items1 = split /\s*(<\/?|>)\s*/, $string;
print join "\n", '## @items1 ###', @items1, "\n";

my @items2 = split /\s*(<[^>]*>)\s*/, $string;
print join "\n", '## @items2 ###', @items2, "\n";

my @items3 = split /<[^>]*>/, $string;
print join "", '## @items3 ###', "\n", @items3, "\n";

__END__

HTH

1;




- Bruce

__Bruce_Van_Allen___Santa_Cruz_CA__
