
Re: [MacPerl] Forms: parsing accented chars



At 12:29 27/04/96 -0400, "jose.stephane@uqam.ca" <Stephane.Jose@uqam.ca> wrote:

>The question was more how can I make sure the submitted request is going to
>be converted properly to html whether it has been posted from Netscape/Mac,
>or Netscape/Windoze or InternetExplorer, or whatever platform/software
>combination possible?
>
>For instance when I use Netscape 2.0 to post my request, an 'e-acute' is sent
>as '%E9' or directly as an 'e-acute' depending on the Document encoding option
>chosen (Menu: Options->Document encoding). Microsoft InternetExplorer (Mac
>Version) uses another value... I have not tried the Win95 version, but I'd
>guess it is another value too...
>
>If each browser uses a different way to encode accented chars in forms,
>it is a real nightmare (can't wait for Unicode ;)
>
>Do you think the cgi should detect what 'user-agent' is used to post
>the request and then apply a different conversion table for each?

I see several possible ways to solve this. I don't know how feasible they
are, so you'll have to judge that for yourself.

>>>>
Approach 1:
Let the CGI script check which browser the user is using. This can be
done with the environment variable HTTP_USER_AGENT, which is readable within
the CGI script. Provide a different conversion for each browser. If
you don't know how to handle a specific browser, assume ISO 8859-1 (Latin-1)
encoding, which, as someone pointed out, *should* be the character set for
all browsers. Note: for the accented characters (codes 160 and up) this is
the same as Windows/ANSI/Unix text.

I found a URL in a book where you might find a useful script for parsing
this environment variable. I haven't checked it out, but it's allegedly
written by a Doug Stevenson, at

        http://www.mps.ohio-state.edu/cgi-bin/clientinfo.pl
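
To make Approach 1 concrete, here is a minimal sketch of the dispatch, not a
tested solution. The rule list, the charset labels and the sub name
guess_charset are all made up for illustration, and the idea that "Mac" in
the agent string implies Mac encoding is only an assumption:

```perl
use strict;
use warnings;

# Hypothetical rules: a User-Agent pattern and the character set we ASSUME
# that browser posts in. These are illustrations, not facts about browsers.
my @agent_rules = (
    [ qr/Mac/i, 'macroman' ],   # assumption: Mac browsers mention "Mac"
);

# Return the assumed charset for a User-Agent string, defaulting to Latin-1.
sub guess_charset {
    my ($agent) = @_;
    $agent = '' unless defined $agent;
    for my $rule (@agent_rules) {
        my ($pattern, $charset) = @$rule;
        return $charset if $agent =~ $pattern;
    }
    return 'latin1';            # unknown browser: fall back to ISO 8859-1
}

my $charset = guess_charset($ENV{HTTP_USER_AGENT});
print "Assuming $charset\n";
```

The script would then pick its conversion table based on the returned label.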

>>>>
Approach 2:
Put the characters with codes 128 to 255 in a hidden field in your form. Let
the browser send this string back together with the request, split the result
up, and use it as a translation table to get back to your standard encoding.

I have absolutely no idea whether this can be done. It sounds nice though,
because it would be totally independent of the kind of browser the user is
using. Ah, wishful thinking...
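
Assuming a browser really does send the hidden field back, the server side
of Approach 2 might look roughly like this. It is an untested sketch: the sub
names build_table and to_latin1 are hypothetical, and $probe stands for the
already URL-decoded value of a hidden field that was written into the form
as the bytes 128 to 255 in ascending order.

```perl
use strict;
use warnings;

# Build a map from "byte as received" to "Latin-1 byte as authored".
# Position $i of the probe corresponds to Latin-1 code 128 + $i.
sub build_table {
    my ($probe) = @_;
    return unless length($probe) == 128;    # probe was mangled: give up
    my %to_latin1;
    for my $i (0 .. 127) {
        my $received = substr($probe, $i, 1);
        # If the browser folded two codes into one, keep the first mapping.
        $to_latin1{$received} = chr(128 + $i)
            unless exists $to_latin1{$received};
    }
    return \%to_latin1;
}

# Translate every high byte through the table; leave unknown bytes alone.
sub to_latin1 {
    my ($text, $table) = @_;
    $text =~ s/([\x80-\xFF])/exists $table->{$1} ? $table->{$1} : $1/ge;
    return $text;
}

# A browser that changes nothing returns the probe unchanged.
my $probe = join '', map { chr } 128 .. 255;
my $table = build_table($probe);
print to_latin1("caf\xE9", $table), "\n" if $table;
```

If the probe does not come back as exactly 128 characters, the sketch simply
refuses to build a table, and the script would have to fall back on one of
the other approaches.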

>>>>
Approach 3: A pragmatic approach. I've noticed that Mac and ANSI encodings
hardly overlap: ANSI mainly uses codes from 192 up, while the Mac mainly
sticks to codes below 160. In fact, codes above 218 are meaningless on the
Mac, and codes below 160 mean nothing in ANSI (well, they might be
considered duplicates of the control codes 0-31, actually).

The only overlap I've found in accented characters is in codes 203 through
207, and 216.

But see for yourself. Use this script to generate a test file, and
import it into text editors (or spreadsheets):

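        open(OUT, ">charset.txt") || die "Can't open charset.txt: $!";
        select(OUT);                 # print goes to the file from here on
        $, = "\t"; $\ = "\n";        # field and record separators
        for ($i = 128; $i < 256; $i++) {
           print $i, chr($i);        # the code, a tab, the character
        }
        close(OUT);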

For Perl 4, use pack("C",$i) instead of chr($i).
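
If the ranges above hold, a script could guess the encoding of a submitted
string by counting which range its high bytes fall in. This is only a
heuristic sketch based on my observation, and guess_encoding is a made-up
name:

```perl
use strict;
use warnings;

# Guess whether a string looks like Mac or ANSI text by counting bytes in
# the two ranges that (by the observation above) belong to only one of them.
# Returns 'mac', 'ansi' or 'unknown'.
sub guess_encoding {
    my ($text) = @_;
    my $low  = () = $text =~ /[\x80-\x9F]/g;    # 128..159: Mac only
    my $high = () = $text =~ /[\xDB-\xFF]/g;    # 219..255: ANSI only
    return 'mac'  if $low  > $high;
    return 'ansi' if $high > $low;
    return 'unknown';   # no high bytes, a tie, or only overlapping codes
}

print guess_encoding("caf\x8E"), "\n";   # e-acute is code 142 on the Mac
print guess_encoding("caf\xE9"), "\n";   # e-acute is code 233 in ANSI
```

Codes 160 through 218 are ignored on purpose, since that is where the two
sets can overlap.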


>>>>>>>>>>>>>>>>>>>>>>>>>>>>

>For instance when I use Netscape 2.0 to post my request, an 'e-acute' is sent
>as '%E9' or directly as an 'e-acute' depending on the Document encoding option
>chosen (Menu: Options->Document encoding).

Making your parsing independent of the type of document encoding isn't too
difficult. There are 3 cases:

1) as "&eacute;": use script statements like

        s/(&\w+;)/$character{$1}/g;

provided, of course, you have a translation table in %character.


2) as %E9 or %e9 : use

        s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

This should give you the Latin-1 (Ansi) characters.

3) as &#233; : use 

        s/&#(\d+);/chr($1)/ge;

also Latin-1.

That's the principle. The danger, of course, is that if you apply these
translations one after another, the output of one may look like one of the
other encodings, and you'll end up converting too much. Example:

        &amp;%23255;

would be converted to &%23255; by the first, &#255; by the second, and ÿ
by the third step. The solution is to combine them into one substitution:

        s/(&\w+;)|%([0-9A-Fa-f]{2})|&#(\d+);/$1 ? ($character{$1} || $1) : chr($3 || hex($2))/ge;
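
To see the single-pass idea at work on the example, here is a small
self-contained sketch. The two-entry %character table and the sub name
decode_once are only there for illustration; a real table would list all
the named entities you expect.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny entity table.
my %character = (
    '&amp;'    => '&',
    '&eacute;' => chr(233),
);

# Decode named entities, %XX escapes and &#NNN; references in ONE pass, so
# the output of one rule is never fed back into another rule.
sub decode_once {
    my ($text) = @_;
    $text =~ s{(&\w+;)|%([0-9A-Fa-f]{2})|&#(\d+);}{
        defined $1 ? (exists $character{$1} ? $character{$1} : $1)
      : defined $2 ? chr(hex($2))
      :              chr($3)
    }ge;
    return $text;
}

print decode_once('&amp;%23255;'), "\n";   # prints &#255;
```

Each piece of the input is decoded exactly once, so the result stays at
&#255; and is not turned into a character by a later step. Entities missing
from the table are left untouched.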


Hope this helps,
   
Bart Lateur,
Gent (Belgium)

--- Embracing the KIS principle: Keep It Simple