On Sun, 10 Dec 2000 03:27:16 +0100, apeiros wrote:

>I need something like an "ASCII translation table" (I don't know how to call
>this). What I mean is a table with 4 columns. The first column with the
>chars with ASCII-Code from 0 to 255 (MacOS System), second column the
>according ASCII chars (which means simply an enumberation from 0 to 255),
>the third column the according ASCII chars in DOS/Windows systems and the
>last column the ASCII codes for UNIX systems.

>Has someone got something like this? I was glad if he or she could send it
>to me!

I'm sorry, but ASCII runs only from 0 to 127. Within that range, the only
difference between systems is that silly difference in the end-of-line
marker: CR on Mac, LF on Unix, CR+LF on PC. A translation table won't even
help for this, because the count of characters may differ.

But, whenever I need something like this, I take Unicode as a reference.
ISO-Latin-1 is a subset of this, characters 0 through 255, and ASCII is the
0 - 127 subset of this. Windows (CP-1252) is an extended version of
ISO-Latin-1, with some characters added in previously unused places, in the
range 128-159.

Now, after this intro, the nitty-gritty: you can get nice conversion tables
in ASCII text files, from the subdirectories in
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/>. For example, the ordinary
Windows character set is in
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT>,
and the ordinary Mac character set is in
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT>.

The format is:

 * comment lines start with "#"
 * table contents are in 3 columns, separated by tabs:
   - hex code of the character in the local character set (byte)
   - hex code of the equivalent Unicode character (16 bit int)
   - optionally, a comment (description) starting with "#".

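To make the format concrete, here is a minimal sketch that parses single lines of such a table. The sub name parse_mapping_line and the sample lines are only for illustration; in particular, the last sample line assumes that unused codes show up with an empty second column. A real script would read the lines from the downloaded file instead.

```perl
use strict;
use warnings;

# Parse one line of a mapping table. Returns (byte, code point) as numbers,
# or an empty list for comment lines, blank lines and entries without a
# Unicode value. (Hypothetical helper, not part of any module.)
sub parse_mapping_line {
    my ($line) = @_;
    $line =~ tr/\r\n//d;                  # strip any flavour of line ending
    $line =~ s/#.*//;                     # drop the comment, if any
    my ($char, $uni) = split ' ', $line;  # columns are whitespace separated
    return unless defined $uni;
    return (hex $char, hex $uni);
}

# Sample lines in the format described above (tab separated).
my @sample = (
    "# a comment line\n",
    "0xCA\t0x00A0\t# NO-BREAK SPACE\n",
    "0xD2\t0x201C\t# LEFT DOUBLE QUOTATION MARK\r\n",
    "0x81\t      \t#UNDEFINED\n",         # assumed shape of an unused code
);

my %table;
for my $line (@sample) {
    my ($byte, $code) = parse_mapping_line($line) or next;
    $table{$byte} = $code;
}

printf "0x%02X => U+%04X\n", $_, $table{$_}
    for sort { $a <=> $b } keys %table;
# prints:
# 0xCA => U+00A0
# 0xD2 => U+201C
```

Only the two lines that carry both a byte and a Unicode value end up in %table; the comment line and the entry without a Unicode value are skipped.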

For example, from the Mac-Roman table:

    0xCA    0x00A0  # NO-BREAK SPACE

So, on Mac, chr(0xCA) is the non-breaking space ("&nbsp;" in HTML lingo),
which is character code 0x00A0 in Unicode, and thus chr(0xA0) in
ISO-Latin-1.

The advantage of this approach is that you can also convert characters from
Mac to Windows that do not exist in ISO-Latin-1, such as the "smart
quotes". For example:

    Mac:        0xD2    0x201C  # LEFT DOUBLE QUOTATION MARK
    Windows:    0x93    0x201C  #LEFT DOUBLE QUOTATION MARK

Note that DOS and Windows use different character sets. I don't know which
is the so-called "OEM" character encoding on a PC, but it must be one of the
files in <ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/>.

Unix generally uses ISO-Latin-1. I guess that by "ASCII" you actually mean
ISO-Latin-1. To be absolutely clear: you don't need such a conversion table
for ISO-Latin-1, because it would only consist of lines like

    0xEF    0x00EF  #...

Now, how can you process these files in Perl? Something like this:

    # FILE is a filehandle opened on one of the mapping files
    while(<FILE>) {
        tr/\r\n//d;     # strip the line ending, whatever the platform
        s/#.*//;        # strip the comment
        my($char, $uni) = split /\s+/;
        $uni or next;   # skip empty lines or unused characters
        $unicode{chr hex $char} = pack 'v', hex $uni;
    }

This populates a hash that maps each character (byte) to its 2-byte Unicode
character, this time in little-endian format. A decoding hash can be
constructed by just reversing this hash.

For example, if you have a %mac2unicode hash, based on the Mac ROMAN.TXT
file, and a %win2unicode hash, based on the CP1252.TXT file, then converting
from Mac to Windows can basically be done using:

    %unicode2win = reverse %win2unicode;
    s/(.)/$unicode2win{$mac2unicode{$1}}/sg;

but, [A], this doesn't properly handle holes in the conversion table (a Mac
character without a Windows equivalent is replaced by nothing), and [B], it
isn't optimized. Conversions for character codes under 128 are unnecessary
(apart from the CR/LF thing, which must be dealt with separately anyway).
And you can easily construct a %mac2win hash, like:

    %mac2win = map { chr($_), $unicode2win{$mac2unicode{chr($_)}} } 0 .. 255;

and the conversion can be simplified to

    s/([\200-\377])/$mac2win{$1}/g;

-- 
Bart.
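
P.S. Regarding point [A]: below is a minimal sketch of one way to deal with the holes, by falling back to a replacement character. The sub name load_table, the choice of "?" as the fallback, and the few table lines inlined as strings are my own assumptions, there only to keep the example self-contained; with the real ROMAN.TXT and CP1252.TXT files you would read the full tables instead.

```perl
use strict;
use warnings;

# Tiny excerpts of the two tables, inlined so the sketch runs without the
# downloaded files. With these excerpts, most high bytes are "holes".
my $mac_roman = "0xCA\t0x00A0\t# NO-BREAK SPACE\n"
              . "0xD2\t0x201C\t# LEFT DOUBLE QUOTATION MARK\n"
              . "0xD3\t0x201D\t# RIGHT DOUBLE QUOTATION MARK\n"
              . "0xB9\t0x03C0\t# GREEK SMALL LETTER PI\n";
my $cp1252    = "0xA0\t0x00A0\t#NO-BREAK SPACE\n"
              . "0x93\t0x201C\t#LEFT DOUBLE QUOTATION MARK\n"
              . "0x94\t0x201D\t#RIGHT DOUBLE QUOTATION MARK\n";

# Build a byte => little-endian Unicode hash from the text of a table.
# (Hypothetical helper; same logic as the while loop in the message.)
sub load_table {
    my ($text) = @_;
    my %unicode;
    for (split /\n/, $text) {
        s/#.*//;                          # strip the comment
        my ($char, $uni) = split ' ';
        defined $uni or next;             # skip empty or unused entries
        $unicode{chr hex $char} = pack 'v', hex $uni;
    }
    return %unicode;
}

my %mac2unicode = load_table($mac_roman);
my %win2unicode = load_table($cp1252);
my %unicode2win = reverse %win2unicode;

# Map only the high half; anything without a Windows equivalent becomes "?".
my %mac2win;
for my $code (128 .. 255) {
    my $mac = chr $code;
    my $uni = $mac2unicode{$mac};
    my $win = defined $uni ? $unicode2win{$uni} : undef;
    $mac2win{$mac} = defined $win ? $win : '?';
}

# Mac text: curly quotes around "pi", then the Greek letter pi itself.
my $text = "\xD2pi\xD3 is \xB9";
(my $converted = $text) =~ s/([\200-\377])/$mac2win{$1}/g;

print join(' ', map { sprintf('%02X', ord $_) } split //, $converted), "\n";
# prints: 93 70 69 94 20 69 73 20 3F
```

The quotes come out as 0x93 and 0x94, and the Greek pi, which has no place in CP-1252, comes out as "?" instead of silently vanishing.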