
Re: [MacPerl] how to prepare (mac) multi-line (text) file, then match ...

To: s.m.eastham@abdn.ac.uk (Stephen Eastham)
Subject: Re: [MacPerl] how to prepare (mac) multi-line (text) file, then match ...
From: Juergen Christoffel <Mac.Christoffel@gmd.de>
Date: Sun, 10 Aug 1997 12:38:46 +0200
Cc: mac-perl@iis.ee.ethz.ch
In-Reply-To: <v02130500b0133d5b802c@[139.133.41.2]>

At 10:08 +0000 10.08.97, Stephen Eastham wrote:
>hopefully, i'm asking the right question:
>
>imagine a file containing a list of words and another file containing
>a text which has these words in. [...]

Stephen,

If your list of words isn't too long, I'd put them all into one regexp.
Otherwise, if the list is long but doesn't change too often, I'd first
use MacPerl to generate a control structure of if-then-elses from the
list, to speed up the matches.
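
A minimal sketch of the one-regexp approach, assuming the words are
matched literally, case-insensitively and as whole words. The word list
and sample text here are hypothetical placeholders; in practice they
would come from the two files.

```perl
use strict;
use warnings;

# Hypothetical word list; in practice it would be read from the word file.
my @words = ('perl', 'regexp', 'mac');

# Quote each word so regexp metacharacters match literally, then join
# everything into one alternation anchored at word boundaries.
my $alternation = join '|', map { quotemeta } @words;
my $pattern     = qr/\b($alternation)\b/i;

# Hypothetical sample text standing in for the document.
my $text = "MacPerl runs Perl on a Mac; one regexp can cover the whole list.";

# In list context, /g returns every captured word in order of appearance.
my @found = $text =~ /$pattern/g;
print "$_\n" for @found;
```

Because of the \b anchors, "MacPerl" is not reported as a hit for "mac"
or "perl"; drop them if matches inside longer words are wanted.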

>the first approach seems to be either to read in the whole text document
>(which sounds dangerous - just how big could the document be before things
>go hay-wire?) or read in a paragraph at a time. thinking about
>the paragraph-oriented approach, how does a mac/macperl distinguish lines from
>paragraphs in a text file?

Reading in the whole file isn't dangerous as long as you set MacPerl's
partition size large enough or check the file's size first with something
like

die "File $file too large\n" if (-s $file > 100_000);

Set $/ = ''; to tell Perl to read paragraphs (i.e. blocks of lines
separated by one or more empty lines) and set $/ = undef; to tell Perl to
read in the whole text in one chunk. Check the documentation of $/ for
details.
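
The two settings can be compared side by side in a small sketch. It
reads from an in-memory filehandle so that it is self-contained, which
assumes Perl 5.8 or later; with a real file you would open the file name
instead. The sample text is a hypothetical placeholder.

```perl
use strict;
use warnings;

# Hypothetical sample text: two paragraphs separated by an empty line.
my $data = "First line.\nSecond line.\n\nNext paragraph.\n";

# Paragraph mode: each read returns one block of lines.
my @paragraphs;
{
    local $/ = '';
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    @paragraphs = <$fh>;
    close $fh;
}

# Slurp mode: a single read returns the whole text.
my $whole;
{
    local $/ = undef;
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    $whole = <$fh>;
    close $fh;
}

printf "%d paragraphs, %d characters in total\n",
    scalar(@paragraphs), length($whole);
```

Using local inside a block restores the previous value of $/ afterwards,
so other reads in the same script still work line by line.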

After reading paragraphs, all you have to do is split() them into
sentences. I'd expect that to be the hard part, because parsing text
into SENTENCES isn't trivial for general texts (the definition of
'sentence' isn't all that regular; just think about abbrevs like "Dr."
or "Mr."). I would do something like split() paragraphs at [!?.] and
then concatenate a "sentence" with the next one again if it ends in
something like "Dr|Mr|Mrs|...".
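
A rough sketch of that split-and-rejoin idea. The sub name
split_sentences and the abbreviation list are hypothetical, the list is
far from complete, and the lookbehind in the split pattern assumes a
Perl newer than the 1997 MacPerl releases.

```perl
use strict;
use warnings;

# Hypothetical abbreviation list; extend it for the texts at hand.
my $abbrev = qr/\b(?:Dr|Mr|Mrs|Ms|Prof)\.$/;

sub split_sentences {
    my ($paragraph) = @_;

    # Split on whitespace that follows ., ! or ?, keeping the
    # punctuation attached to the piece before it.
    my @pieces = split /(?<=[.!?])\s+/, $paragraph;

    my @sentences;
    my $pending = '';
    for my $piece (@pieces) {
        # Rejoin with a single space (original whitespace is not kept).
        $pending = length($pending) ? "$pending $piece" : $piece;
        next if $pending =~ $abbrev;   # ends in an abbreviation: keep joining
        push @sentences, $pending;
        $pending = '';
    }
    push @sentences, $pending if length($pending);
    return @sentences;
}

my @s = split_sentences("Dr. Smith met Mr. Jones. Did they talk? Yes!");
print "$_\n" for @s;
```

This still gets things wrong when an abbreviation really does end a
sentence, so treat it as a starting point only.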

	--jc

--
Ju:rgen Christoffel, GMD - Forschungszentrum Informationstechnik GmbH
E-Mail: christoffel@gmd.de or one of {ftp,news,web}master@gmd.de
