
Re: [MacPerl] Iterative Implementation of a Tail-Recursive $POST_MATCHed Split of a Slurped HTML Text



Hi,

Dave Babbitt wrote:
> 
> Hi Guys!
> 
> I'm having trouble extracting patterns from HTML. I am also confused with
> all the built-in $ stuff. All I want to do is create a key-value
> associative list with the anchor names as the key and the text between them
> as the value. Then I want to be able to search through the values and
> return the key if I have found what I am looking for. I can get the first
> "<a name=blah>" at the beginning of a $' by using $_ =~ /(<body[^>]*>[^<a
> ]*)/i; but how do I do the rest? The html would look like this:
> 
  [sample deleted]

> Can anybody help?

Hope so ...

What about this one:

   #!/usr/local/bin/perl -w
   use strict;

   ## read the whole file into one string
   open(my $fh, '<', "test.dat") or die "Can't open test.dat: $!\n";
   my $text = join('', <$fh>);    ## here $text contains the whole file
   close($fh);

   $text =~ s/^.*?<body[^>]*>//si;   ## remove anything from the
                                     ## beginning to '<body...>',
                                     ## if you insist on doing this

   ## $1 is the anchor name, $2 the text enclosed by the anchor
   while($text =~ m|<a name=(.+?)>(.*?)</a>|gsi)   ## that's all ...
   {
       print "name: $1\n";
       print "value: $2\n";
   }
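
To get from that loop to the key-value list you asked for, the matches can go into a hash instead of being printed. This is only a rough sketch: the sample string, %section and find_key() are names I made up for illustration, and it stores whatever the pattern captures, i.e. the text enclosed by each anchor.

```perl
#!/usr/local/bin/perl -w
use strict;

# Hypothetical sample input; in the real script $text is the slurped file.
my $text = <<'END';
<body>
<a name=intro>Welcome to
the page</a>
<A NAME=usage>How to run it</A>
</body>
END

# Collect anchor name => enclosed text.
my %section;
while ($text =~ m|<a name=(.+?)>(.*?)</a>|gsi) {
    $section{$1} = $2;
}

# Return the first key (in sorted order) whose value matches $pattern,
# or nothing if no value matches. find_key is a made-up helper name.
sub find_key {
    my ($hash, $pattern) = @_;
    foreach my $key (sort keys %$hash) {
        return $key if $hash->{$key} =~ /$pattern/s;
    }
    return;
}

my $key = find_key(\%section, 'run it');
print "found in: $key\n" if defined $key;
```

With the sample above this prints "found in: usage". If several values can match, collect the keys in a list instead of returning the first one.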

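Note some things here:
1) 'while' in combination with the "global" option ('/../g')
   iterates over the string: each pass resumes where the previous
   match ended, so you get all the matches, one per iteration.
2) You must use the single-line option 's', so that the dot '.'
   matches newlines '\n', too. Otherwise an anchor whose text spans
   several lines is not found.
3) The option 'i' is recommended here, as HTML tag and attribute
   names are case-insensitive.
4) The use of non-greedy search (note the '?'s in the patterns) is
   important here: a greedy '.*' would run on to the last '</a>' in
   the file and swallow all anchors into a single match.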
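
If you want to see the effect of the 's' option and of the '?'s for yourself, a small experiment along these lines may help. The two-anchor test string and the variable names are made up for the purpose; only the first pattern is the one used in the script.

```perl
#!/usr/local/bin/perl -w
use strict;

# Hypothetical two-anchor sample, with a newline inside the first anchor.
my $html = "<a name=one>first\ntext</a> <a name=two>second</a>";

# Non-greedy: each quantifier stops at the nearest possible terminator.
# In list context m//g returns all captured groups of all matches.
my @lazy = $html =~ m|<a name=(.+?)>(.*?)</a>|gsi;
print scalar(@lazy) / 2, " matches (non-greedy)\n";

# Greedy: the quantifiers run on as far as they can, up to the last '</a>'.
my @greedy = $html =~ m|<a name=(.+)>(.*)</a>|gsi;
print scalar(@greedy) / 2, " match (greedy)\n";

# Without 's' the dot stops at "\n", so the first anchor is missed.
my @no_s = $html =~ m|<a name=(.+?)>(.*?)</a>|gi;
print "without s: @no_s\n";
```

The non-greedy pattern finds both anchors. The greedy one finds a single match whose "name" is everything from 'one' up to 'two', and the version without 's' only reports the second anchor.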


Bye, Eike
-- 
======================================================================
 Eike Grote, Theoretical Physics IV, University of Bayreuth, Germany
----------------------------------------------------------------------
 e-mail -> eike.grote@theo.phy.uni-bayreuth.de
 WWW    -> http://www.phy.uni-bayreuth.de/theo/tp4/members/grote.html 
           http://www.phy.uni-bayreuth.de/~btpa25/
======================================================================
