On Fri, 28 Apr 2000 11:04:42 -0700, Mat Marcus wrote:

>I am trying to parse XML files output by CodeWarrior on the Mac.
>Typically these files contain some native mac roman characters -- µ
>(mu) for example. I was searching the archives of this list and
>noticed that Bart Lateur posted some relevant scripts some time back.
>Unfortunately these don't appear to be accessible via the archive.
>What was the final outcome? Is there a file I can drop into my
>site_perl somewhere to make XML::Parser happy? Could someone forward
>me the scripts so that I can try it myself?

Somebody kick me. I deserve it.

I did indeed get things to work. It needed a lot of cleaning up (it
still does), but it worked for me and I pretty much forgot about it.
Besides, we didn't reach a real consensus on the proper name for the
character set, which is another neat excuse for me. ;-)

For complete support, you need two things: (a) an ".enc" file for
XML::Parser, and (b) a decoder for converting UTF-8 back to Mac-Roman.

Now, for (b), theoretically you're best off using Unicode::String. I
haven't tried it yet, but I'm going to Real Soon Now (tm). I don't know
if it supports Mac-Roman out of the box, or if you need an extra file.
I had my own little module, in plain Perl, which worked nicely. I am a
bit curious how it compares to Unicode::String, speedwise.

Now, back to (a). You can, as Chris Nandor wrote, download the
XML::Encoding package from CPAN, fiddle with it to make things work on
the Mac, and all of us come up with the same file. That is a bit silly,
if you ask me, especially since for single-byte character sets, the
format of the ".enc" file is pretty simple. It's also overkill, as a
script to process the mapping files available from
<ftp://ftp.unicode.org/Public/MAPPINGS/> is only 7 lines of code.

So I present two straightforward ways to come up with the ".enc" file.
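As for (b), the plain-Perl module mentioned above is not shown in this
message. Purely as a sketch of what such a decoder does, and assuming a
Perl recent enough to ship the core Encode module (5.8 or later, so not
the MacPerl of this thread), the conversion could look like the
following. The function name utf8_to_macroman is made up for the
example, and it expects raw UTF-8 bytes as input.

```perl
#! perl -w
use strict;
use Encode qw(decode encode);

# Hypothetical helper: convert raw UTF-8 bytes to Mac-Roman bytes.
# Assumption: Encode's "MacRoman" table agrees with Apple's ROMAN.TXT.
# Characters without a Mac-Roman equivalent are replaced by Encode's
# substitution character, because no CHECK argument is passed.
sub utf8_to_macroman {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);   # bytes -> Perl characters
    return encode('MacRoman', $chars);          # characters -> Mac-Roman bytes
}
```

For instance, utf8_to_macroman("\xC2\xB5") takes the two-byte UTF-8
form of the mu from the original question and returns the single byte
0xB5. Plain ASCII passes through unchanged.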
Once you have the ".enc" file, put it in the folder
":XML:Parser:Encodings", which most likely is in your "site_perl"
library folder.

A: If all you need is the encoding file for Mac-Roman, then for the
really lazy people, here is a script that will simply generate the
".enc" file for Mac-Roman. Since we couldn't agree on the name, this
script generates two almost identical (and equivalent) copies, under
the names "mac-roman.enc" and "macintosh.enc". You actually need only
one: you could just as well replace the other one by an alias to the
first one. But the files are only 1072 bytes each, so you wouldn't
actually save any disk space.

#! perl -w
# Bytes 0-127 are plain ASCII and map to themselves.
my @map = (0 .. 127);
# Bytes 128-255: Unicode code points for the upper half of Mac-Roman.
@map[128..255] = (
    0xC4, 0xC5, 0xC7, 0xC9, 0xD1, 0xD6, 0xDC, 0xE1,
    0xE0, 0xE2, 0xE4, 0xE3, 0xE5, 0xE7, 0xE9, 0xE8,
    0xEA, 0xEB, 0xED, 0xEC, 0xEE, 0xEF, 0xF1, 0xF3,
    0xF2, 0xF4, 0xF6, 0xF5, 0xFA, 0xF9, 0xFB, 0xFC,
    0x2020, 0xB0, 0xA2, 0xA3, 0xA7, 0x2022, 0xB6, 0xDF,
    0xAE, 0xA9, 0x2122, 0xB4, 0xA8, 0x2260, 0xC6, 0xD8,
    0x221E, 0xB1, 0x2264, 0x2265, 0xA5, 0xB5, 0x2202, 0x2211,
    0x220F, 0x3C0, 0x222B, 0xAA, 0xBA, 0x3A9, 0xE6, 0xF8,
    0xBF, 0xA1, 0xAC, 0x221A, 0x192, 0x2248, 0x2206, 0xAB,
    0xBB, 0x2026, 0xA0, 0xC0, 0xC3, 0xD5, 0x152, 0x153,
    0x2013, 0x2014, 0x201C, 0x201D, 0x2018, 0x2019, 0xF7, 0x25CA,
    0xFF, 0x178, 0x2044, 0x20AC, 0x2039, 0x203A, 0xFB01, 0xFB02,
    0x2021, 0xB7, 0x201A, 0x201E, 0x2030, 0xC2, 0xCA, 0xC1,
    0xCB, 0xC8, 0xCD, 0xCE, 0xCF, 0xCC, 0xD3, 0xD4,
    0xF8FF, 0xD2, 0xDA, 0xDB, 0xD9, 0x131, 0x2C6, 0x2DC,
    0xAF, 0x2D8, 0x2D9, 0x2DA, 0xB8, 0x2DD, 0x2DB, 0x2C7,
);
foreach my $encoding (qw(Mac-Roman Macintosh)) {
    my $file = "\L$encoding.enc";    # lower-cased file name
    open OUT, ">$file" or die "Cannot create file $file: $!";
    binmode OUT;                     # the file is binary data
    # magic number, 40-byte name, two zero 16-bit counts, 256 map entries
    print OUT pack 'Na40n2N256', 0xFEEBFACE, $encoding, 0, 0, @map;
    close OUT;
}
__END__

B: If ever you need an update on that file, or if you need ".enc" files
for other single-byte character sets, here is a script that will
populate the conversion map from the files available from
<ftp.unicode.org>, and write out the ".enc" file. Save as a droplet.
Give the downloaded file an appropriate name (like "Mac-Roman.txt" or
"Macintosh.txt"), and drop it on the droplet. The script takes the
encoding name from the file name, minus path and extension.

For the Mac, get the file
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT>

Another obvious choice is the western Windows character set, which is
almost ISO-Latin-1, but not really. I used the file from
<ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT>,
and generated "cp1252.enc".

Here is the script.

#! perl -w
while (@ARGV) {
    my $encoding = shift;
    open MAP, $encoding or die "Cannot open file $encoding: $!";
    for ($encoding) {      # reduce the file name to the encoding name
        s/.*[\/\\:]//s;    # path
        s/\.[^.]*$//s;     # extension
    }
    # ASCII maps to itself; the upper half stays -1 until the file says otherwise
    my @map = (0 .. 127, (-1) x 128);
    while (<MAP>) {
        # data lines start with a byte value and a code point, both in hex
        /^\s*0x([0-9a-f]{2})\s+0x([0-9a-f]{4})\b/i or next;
        $map[hex $1] = hex $2;
    }
    close MAP;
    my $file = "\L$encoding.enc";
    open OUT, ">$file" or die "Cannot create file $file: $!";
    binmode OUT;           # the file is binary data
    print OUT pack 'Na40n2N256', 0xFEEBFACE, $encoding, 0, 0, @map;
    close OUT;
}
__END__

-- 
	Bart.
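A quick way to sanity-check a generated file is to read it back with
the same template the scripts pass to pack. The sketch below is only an
illustration: read_enc is a made-up name, the layout is inferred solely
from 'Na40n2N256' as used above, and it uses lexical filehandles and
three-argument open, so it assumes Perl 5.6 or later.

```perl
#! perl -w
use strict;

# Hypothetical helper: read a single-byte ".enc" file and return its
# encoding name and a reference to its 256-entry map.
sub read_enc {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot open $file: $!";
    binmode $in;
    my $data = do { local $/; <$in> };   # slurp the whole file
    close $in;

    # 4 (magic) + 40 (name) + 2 + 2 (counts) + 256 * 4 (map) = 1072
    defined $data && length($data) == 1072
        or die "$file: not 1072 bytes long";

    my ($magic, $name, $pfsize, $bmsize, @map) = unpack 'Na40n2N256', $data;
    $magic == 0xFEEBFACE or die "$file: bad magic number";
    $pfsize == 0 && $bmsize == 0
        or die "$file: not a single-byte encoding";

    $name =~ s/\0+\z//;                  # strip the NUL padding from 'a40'
    return ($name, \@map);
}
```

Called as my ($name, $map) = read_enc('mac-roman.enc'), the entry
$map->[0xB5] should come back as 0xB5, the code point for mu. Entries
that script B left at -1 read back as 4294967295, because the 'N'
template is unsigned.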