At 12:29 27/04/96 -0400, "jose.stephane@uqam.ca" <Stephane.Jose@uqam.ca> wrote:

>The question was more how can I make sure the submitted request is going
>to be converted properly to html whether it has been posted from
>Netscape/Mac, or Netscape/Windoze or InternetExplorer, or whatever
>platform/software combination possible?
>
>For instance when I use Netscape 2.0 to post my request, an 'e-acute' is
>sent as '%E9' or directly as an 'e-acute' depending on the Document
>encoding option chosen (Menu: Options->Document encoding). Microsoft
>InternetExplorer (Mac version) uses another value... I have not tried the
>Win95 version, but I'd guess it is another value too...
>
>If each browser uses a different way to encode accented chars in forms,
>it is a real nightmare (can't wait for Unicode ;)
>
>Do you think the cgi should detect what 'user-agent' is used to post
>the request and then apply a different conversion table for each?

I see several possible ways to solve this. I don't know how feasible they
are, so judge that for yourself.

>>>> Approach 1:

Let the CGI script check which browser the user is using. This can be done
with an environment variable that is readable within the CGI script:
HTTP_USER_AGENT. Provide a different conversion for each browser you
recognize. If you don't know how to handle a specific browser, assume
Latin-1 (ISO 8859-1) encoding, which, as someone pointed out, *should* be
the character set for all browsers. Note: this is the same as
Windows/ANSI/Unix text. (A rough sketch is in the P.S. at the end of this
message.)

I found a URL in a book where you might find a useful script for parsing
this environment variable. I haven't checked it out, but it's allegedly
written by one Doug Stevenson:

    http://www.mps.ohio-state.edu/cgi-bin/clientinfo.pl

>>>> Approach 2:

Put the characters with codes 128 to 255 in a hidden field in your form.
The browser will send this string back together with the request; split
the result up and use it as a translation table to get to your standard
encoding. I have absolutely no idea whether this can be done. It sounds
nice though, because it would be totally independent of the kind of
browser the user is using. Ah, wishful thinking... (See the P.P.S. for
what the server side might look like.)

>>>> Approach 3:

A pragmatic approach. I've noticed that the Mac and Ansi encodings hardly
overlap: Ansi uses mainly codes from 192 and up, while the Mac sticks
mainly to codes beneath 160. In fact, codes above 218 are meaningless on
the Mac, and codes beneath 160 mean nothing in Ansi (well, they might be
considered duplicates of the control codes 0-31, actually). The only
overlap in accented characters I've noticed is in codes 203 through 207,
and 216. But see for yourself. Use this script to generate a test file,
and import it into text editors (or spreadsheets):

    # Write every code from 128 to 255, one per line,
    # as "number <TAB> character".
    open(OUT, ">charset.txt");
    select(OUT);
    $, = "\t";          # separate print arguments with a tab
    $\ = "\n";          # end every print with a newline
    for ($i = 128; $i < 256; $i++) {
        print $i, chr($i);
    }

For Perl 4, use pack("c", $i) instead of chr($i).

>>>>>>>>>>>>>>>>>>>>>>>>>>>>

>For instance when I use Netscape 2.0 to post my request, an 'e-acute' is
>sent as '%E9' or directly as an 'e-acute' depending on the Document
>encoding option chosen (Menu: Options->Document encoding).

Making your parsing independent of the type of document encoding isn't too
difficult. There are 3 cases:

1) as "&eacute;": use script statements like

    s/(&\w+;)/$character{$1}/g;

provided, of course, you have a translation table in %character.

2) as %E9 or %e9: use

    s/%(\w\w)/chr(hex($1))/ge;

This should give you the Latin-1 (Ansi) characters.

3) as &#233;: use

    s/&#(\d+);/chr($1)/ge;

also Latin-1.

That's the principle.
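As a minimal sketch, here is what those three substitutions might look
like applied in sequence to one form value. Everything here is
illustrative: %character is a made-up table fragment, and $in{'name'}
stands for one field already split out of the submitted form data.

    # Hypothetical entity table; a real one would cover every
    # named entity your pages can produce, not just these two.
    %character = (
        '&eacute;' => "\351",    # e-acute, Latin-1 0xE9
        '&agrave;' => "\340",    # a-grave, Latin-1 0xE0
    );

    $value = $in{'name'};                  # one raw form field
    $value =~ s/(&\w+;)/$character{$1}/g;  # case 1: named entities
    $value =~ s/%(\w\w)/chr(hex($1))/ge;   # case 2: %XX escapes
    $value =~ s/&#(\d+);/chr($1)/ge;       # case 3: numeric entities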
The danger is, of course, that one substitution may produce something that
can be interpreted by one of the others, so you end up converting too
much. Example: &%23255; (a user who literally typed "&#255;", with the
"#" then escaped by the browser) is left alone by the first step,
converted to &#255; by the second, and then wrongly converted to ÿ
(y-diaeresis) by the third step. The solution is combining them into a
single pass, so that the output of one conversion is never rescanned by
the others:

    s/(&\w+;)|%(\w\w)|&#(\d+);/$character{$1}||chr($3||hex($2))/ge;

Hope this helps,

    Bart Lateur, Gent (Belgium)

---
Embracing the KIS principle: Keep It Simple
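P.S. A rough sketch of the user-agent check from approach 1. Everything in
it is an assumption on my part: the /Macintosh/ pattern is a guess at what
Mac browsers put in HTTP_USER_AGENT, and %mac_to_latin1 shows just two
entries of what would have to be a full table for codes 128-255.

    # Choose a conversion based on the browser; anything we don't
    # recognize is assumed to send Latin-1 already.
    $agent = $ENV{'HTTP_USER_AGENT'};

    if ($agent =~ /Macintosh/) {
        # Hypothetical Mac-to-Latin-1 table (two entries shown).
        %mac_to_latin1 = (
            "\216" => "\351",    # Mac e-acute -> Latin-1 0xE9
            "\210" => "\340",    # Mac a-grave -> Latin-1 0xE0
        );
        $value =~ s/([\200-\377])/$mac_to_latin1{$1} || $1/ge;
    }
    # else: leave $value alone and trust that it is Latin-1.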
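P.P.S. And a sketch of what the server side of approach 2 might look like,
assuming (and this is exactly the part I don't know) that the form carried
a hidden field, here called 'charset', containing the characters 128
through 255 in order, and that the browser echoes each of them back as a
single byte.

    # Build a translation table from the echoed hidden field:
    # position 0 of the echo is the browser's rendering of
    # Latin-1 code 128, position 1 of code 129, and so on.
    @echoed = split(//, $in{'charset'});
    for ($i = 128; $i < 256; $i++) {
        $to_latin1{$echoed[$i - 128]} = chr($i);
    }

    # Map every high character of the input back to Latin-1.
    $value =~ s/([\200-\377])/$to_latin1{$1} || $1/ge;

If the browser sends some characters back as entities or %XX escapes
rather than raw bytes, they would have to be reduced to single characters
first, or the split won't line up.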