On 22 Jun 1999 10:31:04 -0400, Chaim Frenkel wrote:

>Isn't this a written language problem. The final result has to be
>easily parsed by a human. Consider the problem of the replacement
>text colliding with a common word.

We have lots of acronyms that collide with common words. "Diario
Oficial de las Comunidades Europeas" => "DOCE", and "doce" means
"twelve".

>Is this different in Spanish? Is that why you need a regular
>expression?

No, in Spanish we use uppercase acronyms too. The rules are a bit
strange, because in abbreviations we use dots (like "S.M." for "Su
Majestad", meaning "His (or Her) Majesty") and duplicate the letters
to make plurals (so "SS.MM." means "Their Majesties"). "CC.OO." is the
name of a big workers' union, "Comisiones Obreras" (both words are
plural).

The problem is that we're doing case-insensitive searches, because the
input texts have *lots* of typos(1), so it isn't infrequent for an
organization name to be spelt in lowercase, mixed case, or whatever.

Hope you find all that amusing/interesting,

/L/e/k/t/u

(1) So many typos that we not only search for a correct word like
"séptimo" with /s[eé]ptimo/ (supposing they've misspelt it), but also
cover much less frequent errors, like using /s[eé]gundo/ to find
"segundo" misspelt as "ségundo". When you process 5x10^5 documents
(about 1GB of text) you can find just about *anything* you can
imagine, and worse :)
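
The matching approach described in the message (case-insensitive searches, with each vowel allowed to appear with or without its accent) could be sketched in Perl roughly as follows. This is only an illustration under stated assumptions, not the code actually used on those documents: the helper name tolerant_pattern, the two lookup tables, and the sample strings are all made up for the example. It also assumes Perl 5.14 or later, for the /u modifier.

```perl
use v5.14;      # enables strict and Unicode string semantics
use warnings;
use utf8;       # this source file contains accented literals

# Hypothetical tables (assumptions, not from the post): each vowel maps to
# a class that accepts both its plain and its acute-accented form.
my %VOWEL_CLASS = (
    a => '[aá]', e => '[eé]', i => '[ií]', o => '[oó]', u => '[uú]',
);
my %PLAIN = ('á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u');

# Build a case-insensitive, accent-tolerant pattern for a word or acronym.
sub tolerant_pattern {
    my ($term) = @_;
    my $body = '';
    for my $ch (split //, lc $term) {
        # Reduce an accented vowel to its plain form before the lookup.
        my $plain = exists $PLAIN{$ch} ? $PLAIN{$ch} : $ch;
        $body .= exists $VOWEL_CLASS{$plain}
            ? $VOWEL_CLASS{$plain}
            : quotemeta $ch;          # dots in "CC.OO." become literal
    }
    # Lookarounds instead of \b, so that terms ending in "." still match.
    return qr/(?<!\w)${body}(?!\w)/iu;
}

my @terms = ('séptimo', 'CC.OO.', 'DOCE');
my @texts = (
    'el Septimo capítulo',
    'cc.oo. convocó la huelga',
    'publicado en el DOCE',
    'doce documentos',          # the common word, not the acronym
);

binmode STDOUT, ':encoding(UTF-8)';
for my $text (@texts) {
    my @hits = grep { $text =~ tolerant_pattern($_) } @terms;
    say "$text => @hits";
}
```

Note that the pattern built for "DOCE" also matches the ordinary word "doce" in the last sample text. That is exactly the collision mentioned above: once the search has to be case-insensitive, the acronym and the common word can no longer be told apart by the pattern alone.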