
slurping [was Re: mmap was Re: [FWP] rewrite and simplify (out o



John Carter <john@dwaf-hri.pwv.gov.za> wrote:

> Take the output of MS-Word's doc-to-HTML converter, for
> example. Mostly it's broken HTML, so you can't really parse it
> with a standard HTML parser. (The point of the exercise is to
> fix and clean up the broken stuff...)
> 
> Now I would like to match tags using regexes, but the elements
> are spread across many lines, and unless you sluuurp (as Todd
> puts it) you can't match.

Manipulating HTML documents -- or more often, unfortunately, 
pseudo-HTML documents created by Word, FrontPage, Pagemill, 
Netscape Gold, and other garbage-generating tools -- is one of 
the things I use Perl for most often.  I always slurp the files 
and use regexen to match across line boundaries.  Any HTML file 
that's too large to fit into memory is far too large to put on 
the Web or do much else with.
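
For illustration, here is a minimal sketch of that approach. The 
helper names slurp and strip_font_tags are hypothetical, and 
stripping <font> tags is only an assumed example of the kind of 
cleanup involved. The pattern also assumes no literal ">" appears 
inside an attribute value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one scalar ("slurp").
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;               # undefined record separator: read to EOF
    my $html = <$fh>;
    close $fh;
    return $html;
}

# Hypothetical cleanup: remove <font ...> and </font> tags, even when
# a tag's attributes wrap onto the following lines.
sub strip_font_tags {
    my ($html) = @_;
    $html =~ s{</?font\b[^>]*>}{}gi;   # [^>] matches newlines too
    return $html;
}

# Usage: perl script.pl input.html > output.html
print strip_font_tags(slurp($ARGV[0])) if @ARGV;
```

Because the whole document sits in one string, the negated character 
class [^>] runs straight across line boundaries. A line-at-a-time 
loop would never see the full tag.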

-- 
Keith C. Ivey <kcivey@cpcug.org>
http://cpcug.org/user/kcivey/
Washington, DC
