
slurping [was Re: mmap was Re: [FWP] rewrite and simplify (out o

To: "fwp@technofile.org" <fwp@technofile.org>
Subject: slurping [was Re: mmap was Re: [FWP] rewrite and simplify (out o
From: "Keith Calvert Ivey" <kcivey@cpcug.org>
Date: Fri, 2 Jul 1999 07:54:39 -0400
Comments: Authenticated sender is <kcivey@cpcug.org>
In-reply-to: <Pine.LNX.4.05.9907020555200.10641-100000@models.iwqs.pwv.gov.za>
Organization: Capital PC User Group, Rockville, Maryland,
References: <377BB66E.28864B62@kasey.umkc.edu>

John Carter <john@dwaf-hri.pwv.gov.za> wrote:

> Take the output of MS-Word's doc-to-HTML
> converter, for example. Mostly it's broken HTML, so you
> can't really parse it with a standard HTML parser. (The point
> of the exercise is to fix and clean up the broken stuff...)
> 
> Now I would like to match tags using regexes, but the elements
> are spread across many lines and unless you sluuurp (as Todd
> puts it) you can't match.

Manipulating HTML documents -- or more often, unfortunately, 
pseudo-HTML documents created by Word, FrontPage, PageMill,
Netscape Gold, and other garbage-generating tools -- is one of 
the things I use Perl for most often.  I always slurp the files 
and use regexen to match across line boundaries.  Any HTML file 
that's too large to fit into memory is far too large to put on 
the Web or do much else with.
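
For anyone who wants the idea spelled out, here is a minimal sketch of
the slurp-then-match approach. It is only an illustration, not the code
I actually run: the names slurp_file, squeeze_whitespace, and
join_split_tags are made up for this example, and it assumes a Perl
recent enough to have three-argument open and lexical filehandles.

```perl
use strict;
use warnings;

# Read a whole file into one scalar. Undefining the input record
# separator $/ makes <$fh> return everything that is left in the file
# instead of a single line.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;                      # undef for the rest of this sub only
    my $text = <$fh>;
    close($fh);
    return defined $text ? $text : '';   # empty file yields ''
}

# Hypothetical helper: collapse each run of whitespace to one space.
sub squeeze_whitespace {
    my ($chunk) = @_;
    $chunk =~ s/\s+/ /g;
    return $chunk;
}

# Hypothetical cleanup step: pull tags that were split across several
# lines back onto one line. The negated class [^<>] matches newlines
# too, so this only works on slurped text, not line by line.
sub join_split_tags {
    my ($html) = @_;
    $html =~ s/(<[^<>]*>)/squeeze_whitespace($1)/ge;
    return $html;
}
```

Something like join_split_tags(slurp_file('page.html')) would then hand
back the document with every tag on a single line, where 'page.html' is
a placeholder file name. A regex this crude will also be fooled by a
stray < or > inside comments or attribute values, so treat it as a
starting point only.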

-- 
Keith C. Ivey <kcivey@cpcug.org>
http://cpcug.org/user/kcivey/
Washington, DC
