
Re: [MacPerl-Modules] XML::Parser: support for native Mac (Roman) character set?



On Fri, 28 Apr 2000 11:04:42 -0700, Mat Marcus wrote:

>I am trying to parse XML files output by CodeWarrior on the Mac. 
>Typically these files contain some native mac roman characters -- µ 
>(mu) for example. I was searching the archives of this list and 
>noticed that Bart Lateur posted some relevant scripts some time back. 
>Unfortunately these don't appear to be accessible via the archive. 
>What was the final outcome? Is there a file I can drop into my 
>site_perl somewhere to make XML::Parser happy? Could someone forward 
>me the scripts so that I can try it myself?

Somebody kick me. I deserve it.

I did indeed get things to work. It needed a lot of cleaning up (it
still does), but it worked for me and I pretty much forgot about it.
Besides, we didn't reach a real consensus on the proper name for the
character set, which is another neat excuse for me. ;-)

For complete support, you need two things: (a) an ".enc" file for
XML::Parser, and (b) a decoder for converting UTF-8 back to Mac-Roman.
Now, for (b), theoretically you're best off using Unicode::String. I
haven't tried it yet, but I'm going to Real Soon Now (tm).  I don't know
if it supports Mac-Roman out of the box, or if you need an extra file. I
had my own little module, in plain Perl, which worked nicely. I am a bit
curious about how it compares to Unicode::String, speedwise.
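
A minimal sketch of step (b), assuming a newer Perl that ships the core
Encode module (5.8 or later) and that Encode knows the character set
under the name "MacRoman". It is neither the plain-Perl module mentioned
above nor Unicode::String, and the helper name utf8_to_macroman is made
up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert UTF-8 encoded bytes to Mac-Roman bytes.
# Characters without a Mac-Roman equivalent come out as Encode's
# default substitution character.
sub utf8_to_macroman {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);   # bytes -> characters
    return encode('MacRoman', $chars);          # characters -> Mac-Roman bytes
}

my $mac = utf8_to_macroman("\xC2\xB5");   # UTF-8 for U+00B5, the mu sign
printf "0x%02X\n", ord $mac;              # prints 0xB5
```

If your strings are already decoded character strings rather than raw
UTF-8 bytes, skip the decode step and call encode directly.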

Now, back to (a). You can, as Chris Nandor wrote, download the
XML::Encoding package from CPAN, fiddle with it to make things work on
the Mac, and all of us come up with the same file. That is a bit silly,
if you ask me, and a bit of overkill too: for single-byte character
sets the format of the ".enc" file is pretty simple, and a script to
process the mapping files available from
<ftp://ftp.unicode.org/Public/MAPPINGS/> is only 7 lines of code.

So I present two straightforward ways to come up with the ".enc" file.
Once you have it, put it in the folder ":XML:Parser:Encodings", which
most likely is in your "site_perl" library folder.

A: if all you need is the encoding file for Mac-Roman, for the really
lazy people: here is a script that will simply generate the ".enc" file
for Mac-Roman.

Since we couldn't agree on the name, this script generates two almost
identical (and equivalent) copies, under the names "mac-roman.enc" and
"macintosh.enc". You actually need only one: you could just as well
replace the other one by an alias to the first one. But, the files are
only 1072 bytes, so you wouldn't actually save any disk space.

#! perl -w
my @map = (0 .. 127);
@map[128..255] = (0xC4, 0xC5, 0xC7, 0xC9, 0xD1, 0xD6, 0xDC, 0xE1,
  0xE0, 0xE2, 0xE4, 0xE3, 0xE5, 0xE7, 0xE9, 0xE8, 0xEA, 0xEB, 0xED,
  0xEC, 0xEE, 0xEF, 0xF1, 0xF3, 0xF2, 0xF4, 0xF6, 0xF5, 0xFA, 0xF9,
  0xFB, 0xFC, 0x2020, 0xB0, 0xA2, 0xA3, 0xA7, 0x2022, 0xB6, 0xDF,
  0xAE, 0xA9, 0x2122, 0xB4, 0xA8, 0x2260, 0xC6, 0xD8, 0x221E, 0xB1,
  0x2264, 0x2265, 0xA5, 0xB5, 0x2202, 0x2211, 0x220F, 0x3C0,
  0x222B, 0xAA, 0xBA, 0x3A9, 0xE6, 0xF8, 0xBF, 0xA1, 0xAC, 0x221A,
  0x192, 0x2248, 0x2206, 0xAB, 0xBB, 0x2026, 0xA0, 0xC0, 0xC3,
  0xD5, 0x152, 0x153, 0x2013, 0x2014, 0x201C, 0x201D, 0x2018,
  0x2019, 0xF7, 0x25CA, 0xFF, 0x178, 0x2044, 0x20AC, 0x2039,
  0x203A, 0xFB01, 0xFB02, 0x2021, 0xB7, 0x201A, 0x201E, 0x2030,
  0xC2, 0xCA, 0xC1, 0xCB, 0xC8, 0xCD, 0xCE, 0xCF, 0xCC, 0xD3, 0xD4,
  0xF8FF, 0xD2, 0xDA, 0xDB, 0xD9, 0x131, 0x2C6, 0x2DC, 0xAF, 0x2D8,
  0x2D9, 0x2DA, 0xB8, 0x2DD, 0x2DB, 0x2C7);

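foreach my $encoding (qw(Mac-Roman Macintosh)) {
    # "\L" lowercases the file name: mac-roman.enc, macintosh.enc
    open OUT, ">\L$encoding.enc" or die "Cannot create \L$encoding.enc\E: $!";
    binmode OUT;    # the file is binary; matters on non-Mac platforms
    # header: magic number, 40-byte name, two 16-bit zeros, 256-entry map
    print OUT pack 'Na40n2N256', 0xFEEBFACE, $encoding, 0, 0, @map;
    close OUT or die "Cannot write \L$encoding.enc\E: $!";
}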
__END__
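
To check a generated file, the same pack template can be run in reverse.
The following is only a sketch: the helper name parse_enc_header and the
hash keys are invented for illustration, and it works on an in-memory
buffer with an identity map rather than on a real ".enc" file.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a single-byte ".enc" buffer with the
# same template the generator script uses to write it.
sub parse_enc_header {
    my ($bytes) = @_;
    my ($magic, $name, $short1, $short2, @map) = unpack 'Na40n2N256', $bytes;
    $name =~ s/\0+\z//;    # "a40" pads the name with NUL bytes
    return {
        magic  => $magic,
        name   => $name,
        shorts => [$short1, $short2],   # both zero for single-byte sets
        map    => \@map,
    };
}

# Build a buffer in memory instead of reading a real file.
my $bytes = pack 'Na40n2N256', 0xFEEBFACE, 'Mac-Roman', 0, 0, 0 .. 255;
my $enc   = parse_enc_header($bytes);
printf "%d bytes, magic 0x%08X, name %s\n",
    length $bytes, $enc->{magic}, $enc->{name};
# prints: 1072 bytes, magic 0xFEEBFACE, name Mac-Roman
```

The 1072 bytes are 4 for the magic number, 40 for the name, 2 x 2 for
the two short fields, and 256 x 4 for the map.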


B: If you ever need an update of that file, or if you need ".enc" files
for other single-byte character sets, here is a script that will
populate the conversion map from the files available from
<ftp.unicode.org>, and write out the ".enc" file. Save it as a droplet.
Give the downloaded file an appropriate name (like "Mac-Roman.txt" or
"Macintosh.txt"), and drop it on the droplet. For the Mac, get
the file <ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT>.

Another obvious choice is the western Windows character set, which is
almost ISO-Latin-1, but not really. I used the file from
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT>,
and generated "cp1252.enc". Here is the script.

#! perl -w
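while(@ARGV) {
    my $encoding = shift;
    open MAP, $encoding or die "Cannot open file $encoding: $!";
    for ($encoding) {   # reduce the path to a bare encoding name
        s/.*[\/\\:]//s; # path
        s/\.[^.]*$//s;  # extension
    }
    # ASCII maps to itself; -1 marks bytes the mapping file leaves undefined
    my @map = (0 .. 127, (-1) x 128);
    while (<MAP>) {
        # only lines of the form "0xNN  0xNNNN ..." are used
        /^\s*0x([0-9A-F]{2})\s+0x([0-9A-F]{4})\b/i or next;
        $map[hex $1] = hex $2;
    }
    close MAP;
    open OUT, ">\L$encoding.enc" or die "Cannot create \L$encoding.enc\E: $!";
    binmode OUT;    # the file is binary; matters on non-Mac platforms
    print OUT pack 'Na40n2N256', 0xFEEBFACE, $encoding, 0, 0, @map;
    close OUT or die "Cannot write \L$encoding.enc\E: $!";
}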
__END__
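
To see what the line filter in that script keeps and what it skips, here
is a small sketch that runs the same kind of match over a few in-memory
lines. The sample lines are written in the style of the mapping files
and are not copied from them; the helper name build_map is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: apply mapping-file lines to a fresh 256-entry map.
sub build_map {
    my @lines = @_;
    my @map = (0 .. 127, (-1) x 128);   # ASCII as-is, high half unmapped
    for (@lines) {
        # a byte value and a four-digit code point, both in hex
        /^\s*0x([0-9A-F]{2})\s+0x([0-9A-F]{4})\b/i or next;
        $map[hex $1] = hex $2;
    }
    return @map;
}

my @map = build_map(
    "# a comment line, skipped\n",
    "0x80\t0x20AC\t#EURO SIGN\n",
    "0x81      \t#UNDEFINED\n",          # no code point, so it stays -1
);
printf "0x80 -> U+%04X, 0x81 -> %d\n", @map[0x80, 0x81];
# prints: 0x80 -> U+20AC, 0x81 -> -1
```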

-- 
	Bart.
