
Re: [MacPerl] Forms: parsing accented chars



At 14:26 11/04/96 -0400, "Stephane Jose" <jose.stephane@uqam.ca> wrote:

>I am setting up a cgi with MacPerl that allows me to return a web page of
>info on a particular city by browsing a text based database (tab separated
>fields, return separated records). I request the name of that city from a
>form. Nothing fancy. My script works fine when I request unaccented data
>(ie. 'Verdun' or 'Yamaska'). But when I submit a city name with accented
>chars the mess begins.
>
>Is there a way to deal consistently with accented data submitted from a
>form, independantly from the platform from which it was sent?

I've not seen any replies to this post, so I thought I'd jump in.

Perl does *not* have any troubles parsing accented characters.

The problem here is that only the standard ASCII set is platform
independent. That is: from space, chr(32), to chr(126).

Below that are the control characters. These are mostly portable (including
tab, chr(9)) with one important difference: line terminations, "\n".

This means different things for Unix: chr(10), Mac: chr(13), PC:
chr(13)+chr(10). But you can easily work around that.
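
One possible workaround for the line-termination difference, as a minimal sketch: convert any of the three conventions to the local "\n" before further processing. The sub name normalise_newlines is made up for this illustration; the octal escapes \015 and \012 are chr(13) and chr(10).

```perl
use strict;
use warnings;

# Hypothetical helper: turn PC (CR+LF), Mac (CR) and Unix (LF) line
# endings into whatever "\n" means on the platform running the script.
sub normalise_newlines {
    my ($text) = @_;
    # CR+LF is listed first so that a PC line ending counts as one newline.
    $text =~ s/\015\012|\015|\012/\n/g;
    return $text;
}

my $mixed = "Verdun\015\012Yamaska\015Montreal\012";
my $clean = normalise_newlines($mixed);
print $clean;
```

The order of the alternatives matters: if \015 came first, a PC line ending would be turned into two newlines.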

A bigger difference is those accented characters you're talking about. These
are *not* part of the standard ASCII set, and have codes between 128 and
255. I know of 4 platforms: Mac, PC DOS (OEM), PC Windows (ANSI), Unix
(probably ANSI as well). Each has its own "standard".

In fact, in Perl you can easily convert from one platform to another using a
single command like

        tr/\200-\377/ .... /;

where the .... 's are replaced by a list of 128 characters, the translation
table. If anyone's interested, I can post my tables for DOS>MAC and ANSI>MAC.
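
To give an idea of the shape of such a command, here is a deliberately partial sketch for the ANSI>MAC direction. It assumes "ANSI" means ISO 8859-1 byte values and covers only seven letters rather than the full 128-character table; the sub name ansi_to_mac is invented for the example.

```perl
use strict;
use warnings;

# Partial ANSI (ISO 8859-1) to Mac translation. The two lists are in the
# same order: a-grave, a-circumflex, c-cedilla, e-grave, e-acute,
# e-circumflex, o-circumflex. A real table would list all 128 codes.
sub ansi_to_mac {
    my ($text) = @_;
    $text =~ tr/\xE0\xE2\xE7\xE8\xE9\xEA\xF4/\x88\x89\x8D\x8F\x8E\x90\x99/;
    return $text;
}

# "Montreal" with an ANSI e-acute (0xE9) becomes the Mac e-acute (0x8E).
my $mac = ansi_to_mac("Montr\xE9al");
print unpack("H*", $mac), "\n";   # hex dump, to keep the output plain ASCII
```

Bytes that are not in the search list pass through unchanged, so a partial table like this one silently leaves the other accented characters wrong.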


But this isn't too relevant here. If I understand correctly, you want to
return data in a HTML form, to the user?

HTML has its own standard way of dealing with this. You need to use special
HTML code strings instead of accented characters. As an example, an "é"
(that's an "e" with an "accent aigu" on it) must be included as
"&eacute;". So you need a translation table, and a lot of lines like this:

        s/é/&eacute;/g;

You can probably get a table on the net, in documentation about html. That's
better than a table in a book, because you can simply incorporate it into
your script (maybe with a small Perl script to convert it).

I would think it's best to use a table like:

        %htmlised = ('é', '&eacute;', ...);
 and
 
        foreach $key (keys(%htmlised)) {
                s/$key/$htmlised{$key}/g;
        }

The table %htmlised could well be built from a text table, with lines for
every translation, key + value on one line, with a tab between them.
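
Building the table from text and applying it could look roughly like the sketch below. The table contents, the variable $table_text and the sub name htmlise are assumptions for the example, and the keys are written as Mac character-set bytes (0x8E is the e-acute) so that the script does not depend on how the accented letters in this message survive transport.

```perl
use strict;
use warnings;

# Assumed table text: one "character<TAB>entity" pair per line. The keys
# are Mac bytes: e-acute, e-grave, a-grave, c-cedilla.
my $table_text = "\x8E\t&eacute;\n\x8F\t&egrave;\n\x88\t&agrave;\n\x8D\t&ccedil;\n";

# Build the lookup hash from the text table.
my %htmlised;
foreach my $line (split /\n/, $table_text) {
    my ($char, $entity) = split /\t/, $line;
    $htmlised{$char} = $entity;
}

# Replace every listed character by its HTML code string.
sub htmlise {
    my ($text) = @_;
    foreach my $key (keys %htmlised) {
        # \Q...\E keeps a key from being read as a regex metacharacter.
        $text =~ s/\Q$key\E/$htmlised{$key}/g;
    }
    return $text;
}

my $result = htmlise("Montr\x8Eal");
print "$result\n";
```

In a real script the table text would be read from a file instead of a string; the loop that fills %htmlised stays the same.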

Now that you know *what* to do, the question is: when?

You could store the table of cities as it is now, and convert every line on
the fly, *every time* a html document is generated.

Or, my suggestion: create an html-ised version of the table once, and simply
incorporate the results into your html document without any further
conversion, as you create it.

The disadvantage is that you can only see the full table of cities as it
should look, using a web browser. So keep an "original" that you can edit,
and a html-ised version to be used by your client program.
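
The one-time conversion could be sketched as follows. This is only an illustration under a few assumptions: the two-entry mapping stands in for a full table, the keys are Mac character-set bytes, the sub name htmlise_table is made up, and the demonstration uses in-memory filehandles (a feature of Perl 5.8 and later) where a real run would open the "original" file and the html-ised file on disk.

```perl
use strict;
use warnings;

# Assumed mapping, keyed by Mac bytes (0x8E e-acute, 0x8F e-grave).
my %htmlised = ("\x8E" => '&eacute;', "\x8F" => '&egrave;');

# Copy records from $in to $out, replacing any high byte that is listed
# in the table and leaving every other byte as it is.
sub htmlise_table {
    my ($in, $out) = @_;
    while (my $record = <$in>) {
        $record =~ s/([\x80-\xFF])/exists $htmlised{$1} ? $htmlised{$1} : $1/ge;
        print {$out} $record;
    }
}

# Demonstration data: two tab-separated records.
my $original  = "Montr\x8Eal\tQu\x8Ebec\nVerdun\tQu\x8Ebec\n";
my $converted = '';
open(my $in,  '<', \$original)  or die "cannot read: $!";
open(my $out, '>', \$converted) or die "cannot write: $!";
htmlise_table($in, $out);
close($in);
close($out);
print $converted;
```

The tabs and line endings are untouched, so the client program can split the html-ised copy into fields exactly as it does with the original.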


Bart
--- Embracing the KIS principle: Keep It Simple