
[FWP] Substitution



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 22 Jun 1999 10:31:04 -0400, Chaim Frenkel wrote:

>Isn't this a written language problem? The final result has to be
>easily parsed by a human. Consider the problem of the replacement
>text colliding with a common word.

We have lots of acronyms that collide with common words. "Diario
Oficial de las Comunidades Europeas" => "DOCE". "doce" means "twelve".

>Is this different in Spanish? Is that why you need a regular
>expression?

No, in Spanish we use uppercase acronyms too. Rules are a bit strange
because in abbreviations we use dots (like "S.M." for "Su Majestad",
meaning "His (or Her) Majesty"), and duplicate the character to make
plurals (so "SS.MM." means "Their Majesties"). "CC.OO." is the
abbreviation of a big workers' union, "Comisiones Obreras" (both words
are plural).

The problem is that we do case-insensitive searches, because the input
texts have *lots* of typos(1): it is not unusual for an organization
name to be spelt in lowercase, mixed case, or whatever. So a search for
"DOCE" also hits "doce".
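
To make the collision concrete, here is a minimal sketch in Python (not
the code we actually run; the helper names find_ci and
abbreviation_pattern are made up for illustration). It assumes that a
dotted abbreviation may turn up with or without its dots, with an
optional space between groups, and in any letter case.

```python
import re

def find_ci(word, text):
    # Case-insensitive whole-word search; this is where "DOCE" and
    # "doce" become indistinguishable.
    return re.findall(r"\b" + re.escape(word) + r"\b", text, re.IGNORECASE)

def abbreviation_pattern(groups):
    # Hypothetical helper: ["CC", "OO"] builds a pattern that accepts
    # "CC.OO.", "cc.oo", "CcOo", "CC. OO." and similar variants.
    escaped = [re.escape(g) for g in groups]
    body = r"\.?\s?".join(escaped) + r"\.?"
    # (?!\w) keeps the match from ending in the middle of a longer word.
    return re.compile(r"\b" + body + r"(?!\w)", re.IGNORECASE)

text = "Publicado en el DOCE el doce de mayo"
hits = find_ci("DOCE", text)  # both the acronym and the numeral match

union = abbreviation_pattern(["CC", "OO"])
sample = "Los de CC.OO. y cc.oo y CcOo firmaron; coordinar no."
variants = [m.group(0) for m in union.finditer(sample)]
```

Telling the acronym from the common word after such a match still
needs context (or a human); the pattern alone cannot do it.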

Hope you find all that amusing/interesting,

                                                       /L/e/k/t/u


(1) So many typos that we not only search for a correctly accented
word like "séptimo" with /s[eé]ptimo/ (in case the accent was dropped),
but also guard against much rarer errors, such as using /s[eé]gundo/ to
find "segundo" (in case someone added an accent that doesn't belong).
When you process 5x10^5 documents (about 1GB of text) you can find just
about *anything* you can imagine, and worse :)
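
The idea behind those patterns can be sketched as follows, in Python
and under assumptions of my own: the helper tolerant_pattern and the
table VOWEL_CLASSES are hypothetical, and the sketch relaxes *every*
vowel of the word, which is more permissive than the hand-written
patterns above (they relax only one vowel). Input is assumed to use
precomposed (NFC) characters.

```python
import re
import unicodedata

# Assumed table: each plain vowel and its accented variants in Spanish.
VOWEL_CLASSES = {
    "a": "[aá]", "e": "[eé]", "i": "[ií]", "o": "[oó]", "u": "[uúü]",
}

def tolerant_pattern(word):
    # Build a regex that matches `word` whether or not its vowels
    # carry an accent, in any letter case.
    pieces = []
    for ch in word.lower():
        # NFD splits "é" into "e" plus a combining accent; keep the base.
        base = unicodedata.normalize("NFD", ch)[0]
        # Non-vowels (including "ñ") are kept literally.
        pieces.append(VOWEL_CLASSES.get(base, re.escape(ch)))
    return re.compile(r"\b" + "".join(pieces) + r"\b", re.IGNORECASE)

septimo = tolerant_pattern("séptimo")
segundo = tolerant_pattern("segundo")
```

With this, septimo.search("el septimo piso") and
segundo.search("el ségundo piso") both succeed, at the price of more
false positives than a pattern that relaxes a single vowel.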


-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 6.0.2i

iQA/AwUBN2+XD/4C0a0jUw5YEQL3FQCgkWeGLyfIU8y3eZrXiSL8GAZ27LcAoMXW
1kqABlvMUA7bw2n4SWgP4xaT
=32OE
-----END PGP SIGNATURE-----

