Re: [MacPerl] ASCII-Translations



On Sun, 10 Dec 2000 03:27:16 +0100, apeiros wrote:

>I need something like an "ASCII translation table" (I don't know how to call
>this). What I mean is a table with 4 columns. The first column with the
>chars with ASCII-Code from 0 to 255 (MacOS System), second column the
>according ASCII chars (which means simply an enumeration from 0 to 255),
>the third column the according ASCII chars in DOS/Windows systems and the
>last column the ASCII codes for UNIX systems.
>Has someone got something like this? I would be glad if he or she could send it
>to me!

I'm sorry, but ASCII runs only from 0 to 127. Within that range, the only
difference between systems is that silly difference in end-of-line marker:
CR on Mac, LF on Unix, CR+LF on PC. A translation table won't even help for
this, because the count of characters may differ.
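
To illustrate why the end-of-line difference needs a substitution rather than a per-character table, here is a minimal sketch. The sub name normalize_eol is my own invention, and the octal escapes are used because the meaning of "\n" depends on the platform (under MacPerl it is CR).

```perl
use strict;
use warnings;

# Normalize any mix of CR+LF (PC), CR (Mac) and LF (Unix) line endings
# to a single target ending. normalize_eol is a hypothetical helper name.
sub normalize_eol {
    my ($text, $eol) = @_;
    $eol = "\012" unless defined $eol;   # default target: Unix LF
    # CR+LF is tried first so it is not treated as two separate line ends.
    $text =~ s/\015\012|\015|\012/$eol/g;
    return $text;
}

my $pc   = "one\015\012two\015\012";
my $unix = normalize_eol($pc);
# The byte count changes, which a one-to-one table cannot express.
printf "%d bytes before, %d bytes after\n", length($pc), length($unix);
```

Passing "\015" or "\015\012" as the second argument gives Mac or PC line ends instead.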

But, whenever I need something like this, I take Unicode as a reference.
ISO-Latin-1 is a subset of this, characters 0 through 255, and Ascii is
the 0 - 127 subset of this. Windows (CP-1252) is an extended version of
ISO-Latin-1, with some characters added in previously unused places, in
the range 128-159.

Now, after this intro, the nitty-gritty: you can get nice conversion
tables in Ascii text files, from the subdirectories in
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/>. For example, the
ordinary Windows character set is in
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT>,
and the ordinary Mac character set is in
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT>.

The format is:

 * comment lines start with "#"

 * table contents are in 3 columns, separated by tabs:

   - hex code of the character in local character set (byte)
   - hex code of equivalent Unicode character (16 bit int)
   - optionally, comment (description) starting with "#".

For example, from the Mac-Roman table:

	0xCA	0x00A0	# NO-BREAK SPACE

So, on Mac, chr(0xCA) is the non-breaking space ("&nbsp;" in HTML lingo),
which is character code 0x00A0 in Unicode, and thus, chr(0xA0) in
ISO-Latin-1.
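
As a small worked example of reading one such line, here is a sketch that parses the NO-BREAK SPACE row quoted above. The sub name parse_mapping_line is made up for illustration; it is not part of any module.

```perl
use strict;
use warnings;

# Parse one mapping line of the form "0xXX<TAB>0xXXXX<TAB># comment".
# Returns (local byte value, Unicode code point), or an empty list for
# comment lines, blank lines and bytes with no Unicode column.
# parse_mapping_line is a hypothetical helper name.
sub parse_mapping_line {
    my ($line) = @_;
    $line =~ s/#.*//s;                      # drop the comment and line end
    my ($local, $uni) = split ' ', $line;   # split on runs of whitespace
    return unless defined $uni;
    return (hex($local), hex($uni));
}

my ($mac, $unicode) = parse_mapping_line("0xCA\t0x00A0\t# NO-BREAK SPACE");
printf "Mac byte 0x%02X is U+%04X\n", $mac, $unicode;
# Code points below 256 are also valid ISO-Latin-1 bytes.
printf "ISO-Latin-1 byte 0x%02X\n", $unicode if $unicode < 256;
```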

The advantage of this approach is that you can convert characters from
Mac to Windows that need not exist in ISO-Latin-1, such as the "smart
quotes". For example:

Mac:
	0xD2	0x201C	# LEFT DOUBLE QUOTATION MARK
Windows:
	0x93	0x201C	#LEFT DOUBLE QUOTATION MARK

Note that DOS and Windows use different character sets. I don't know
which is the so-called "OEM" character encoding on a PC, but it must be
one of the files in
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/>.

Unix commonly uses ISO-Latin-1. I guess that by "Ascii" you actually imply
ISO-Latin-1.

To be absolutely clear: you don't need such a conversion table for
ISO-Latin-1, because it would only consist of lines like

	0xEF	0x00EF	#...


Now, how can you process these files in Perl? Something like this:

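	my %unicode;
	while(<FILE>) {
	    tr/\r\n//d;   # strip the line ending, whichever kind it is
	    s/#.*//;      # strip comments
	    my($char, $uni) = split /\s+/;
	    $uni or next; # skip empty lines or unused characters
	    # key: the local byte; value: Unicode as 16 bit little endian
	    $unicode{chr hex $char} = pack 'v', hex $uni;
	}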

This populates a hash that maps each character (byte) to its 2-byte
Unicode character, this time in little-endian format. A decoding hash
can be constructed by just reversing this hash. For example, if you
have a %mac2unicode hash, based on the Mac/Roman.txt file, and a
%win2unicode hash, based on the CP1252.txt file, then converting from
Mac to Windows can basically be done using:

	%unicode2win = reverse %win2unicode;

	s/(.)/$unicode2win{$mac2unicode{$1}}/sg;

but, [A], this doesn't properly handle holes in the conversion table,
and [B], it isn't optimized. Conversions for character codes under 128
are unnecessary (apart from the CR/LF thing, which must be dealt with
separately, anyway). And you can easily construct a %mac2win hash,
like:

	%mac2win = map { chr($_), $unicode2win{$mac2unicode{chr($_)}} }
	  0 .. 255;

and the conversion can be simplified to

	s/([\200-\377])/$mac2win{$1}/g;
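
As for point [A], one possible way to deal with the holes is to substitute a placeholder for every byte that has no Windows equivalent. The following is only a sketch under a few assumptions: the two tables are tiny excerpts typed in by hand instead of being read from ROMAN.TXT and CP1252.TXT, the hashes are keyed by number rather than by packed string, and the '?' fallback and the name mac_to_win are arbitrary choices of mine.

```perl
use strict;
use warnings;

# Tiny excerpts of the two tables; real code would read the full files.
# The 0xA0 row relies on CP-1252 matching ISO-Latin-1 outside 128-159.
my %mac2unicode = (0xCA => 0x00A0, 0xD2 => 0x201C);
my %win2unicode = (0xA0 => 0x00A0, 0x93 => 0x201C);
my %unicode2win = reverse %win2unicode;

# Build the direct byte-to-byte table for the high half only.
# Bytes without a Windows equivalent (the holes) get $fallback.
my $fallback = '?';   # assumption: any placeholder would do
my %mac2win;
for my $byte (128 .. 255) {
    my $uni = $mac2unicode{$byte};
    my $win = defined $uni ? $unicode2win{$uni} : undef;
    $mac2win{chr $byte} = defined $win ? chr($win) : $fallback;
}

# mac_to_win is a hypothetical helper name.
sub mac_to_win {
    my ($text) = @_;
    $text =~ s/([\200-\377])/$mac2win{$1}/g;   # low half stays untouched
    return $text;
}

my $out = mac_to_win("A\xD2B\xCA\xFF");
print join(' ', map { sprintf('%02X', ord($_)) } split(//, $out)), "\n";
```

With only these excerpt rows loaded, chr(0xFF) counts as a hole and comes out as '?'. With the complete tables loaded, only the genuinely unmappable bytes would.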

-- 
	Bart.
