On Sun, 14 May 2000 20:57:56 -0300, Arved Sandstrom wrote:

>At 01:32 PM 5/14/00 -0400, M. Christian Hanson wrote:
>>I am taking the output of the XML parser in MacPerl and want to push
>>it into a text processor that is really itching for Latin-1, not the
>>UTF-8 that the XML parser hands me. Anybody have any advice?

>You can check the archives for this list, for one. This just came up
>recently. Bart Lateur, if I recall correctly, has looked at this.

Says the guy who ported the Unicode::String module to MacPerl. See
<http://pudge.net/cgi-bin/mmp.plx>.

It's time that I put my code where my mouth is. I had only mentioned
that I have a solution, but I'm still having a big problem finalizing
it. There are so many ways to do it, and none looks ideal. Worse: the
finished code gets five times more complicated than the bare essence
of it.

I posted some code in comp.lang.perl.misc, and the original module
author, Gisle Aas, pointed me to this module. I have taken a look at
it, but I just can't seem to get my brain to wrap around it; I simply
don't get it.

So here's the code again. I'll provide you with enough clues to finish
it off by yourself, in a way that YOU like.

The kernel of the code is this little function, which turns a 16-bit
Unicode character number into a (multibyte) UTF-8 character:

sub UTF8::chr {
    my $ord = 0 + shift;
    if ($ord && $ord < 0x80) {
        return chr $ord;
    } elsif ($ord < 0x800) {
        # note: chr(0) deliberately falls through to this two-byte
        # form, so that NUL, too, gets replaced by the substitutions
        return pack 'C*', $ord>>6|0xC0, $ord&0x3F|0x80;
    } else {    # $ord <= 0xFFFF
        return pack 'C*', $ord>>12|0xE0, $ord>>6&0x3F|0x80, $ord&0x3F|0x80;
    }
}

Of course, calling this function for every single character to convert
is rather inefficient. Since we're talking about only 128 different
characters that need conversion, a substitution pattern with a look-up
encoder hash is a simple, compact and fast way of doing it. With an
inverted decoder hash, you can decode the UTF-8 back into your own
character set.
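Before wiring the function into a substitution, it's worth a quick
sanity check against byte sequences worked out by hand from the UTF-8
encoding rules (the function is repeated here so the snippet runs on
its own):

```perl
# Sanity check for UTF8::chr: compare its output for a few code points
# against the bytes the UTF-8 encoding rules predict.
sub UTF8::chr {
    my $ord = 0 + shift;
    if ($ord && $ord < 0x80) {
        return chr $ord;
    } elsif ($ord < 0x800) {
        return pack 'C*', $ord>>6|0xC0, $ord&0x3F|0x80;
    } else {    # $ord <= 0xFFFF
        return pack 'C*', $ord>>12|0xE0, $ord>>6&0x3F|0x80, $ord&0x3F|0x80;
    }
}

# U+0041 ('A') stays a single byte; U+00E9 (e-acute, Latin-1 0xE9)
# should come out as the two bytes C3 A9; U+20AC (the euro sign) as
# the three bytes E2 82 AC.
for my $code (0x41, 0xE9, 0x20AC) {
    printf "U+%04X => %s\n", $code,
        join ' ', map { sprintf '%02X', ord } split //, UTF8::chr($code);
}
# prints:
#   U+0041 => 41
#   U+00E9 => C3 A9
#   U+20AC => E2 82 AC
```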
As you may know, the character codes of ISO Latin-1 are the same as
the Unicode characters with codes 0 through 255. So populating the
encoder hash is easy:

my(%encode, %decode);
for my $c (0, 128 .. 255) {
    $encode{chr $c} = UTF8::chr($c);
}
%decode = reverse %encode;

And finally, here are the functions that use these hashes to do the
actual conversion:

sub L1ToUTF8 {    # encode
    my @arg = @_;
    for (@arg) {
        s/([\000\200-\377])/$encode{$1}/g;
    }
    return wantarray ? @arg : $arg[-1];
}

sub UTF8ToL1 {    # decode
    my @arg = @_;
    for (@arg) {
        s/([\300-\377][\200-\277]+)/
          $decode{$1} || (defined($decode{$1}) ? $decode{$1} : "\177$1")/ge;
    }
    return wantarray ? @arg : $arg[-1];
}

This last function does a rather efficient implementation of the
non-existent operator '??', which would be a short-circuit 'or' that
only continues if the LHS is undefined (as opposed to false). I used
the character '\177', AKA chr(127) or DEL ('delete'), as a
conversion-failure flag; this character will probably never appear in
any text file.

Now, take all these functions, put them in a simple library file,
which I called 'UTF8ToL1.pl', and don't forget to put "1;" at the end.

Here is a test program, which generates a character table. Save the
text output and look at it with a hex file viewer: next to each number
on the left, you should see the character whose ("ASCII") code is that
same number.

#! perl -w
require 'UTF8ToL1.pl';
use XML::Parser;

$p1 = new XML::Parser(Handlers         => { Char => \&Print },
                      ProtocolEncoding => 'ISO-8859-1');
$" = "\n";
my $xml = <<"__EOT__";
<charset>
@{[ map { sprintf "%02X%s", $_, "\t&#$_;" } (32 .. 255) ]}
</charset>
__EOT__

sub Print {
    my $self = shift;
    local $_ = shift;
    tr/\r/\n/;    # works around a small bug in XML::Parser on Mac
    print UTF8ToL1($_);
}

$p1->parse($xml);
__END__

And finally, how can you use this to encode/decode from other
character sets? Well, assuming one-byte character sets, of which the
section 1 ..
127 is plain ASCII, you can simply reuse the above encoding/decoding
functions, but with different encoding and decoding hashes. If anybody
knows a good and simple way of implementing this generically, I'd like
to know. I had implemented them as closures, but using those in your
own code turned out to be not so trivial, and I don't really like
Perl's OO. Anyway, suggestions welcome.

How can you populate the encoding hash? It's easy. Get the proper
mapping text file from <ftp://ftp.unicode.org/Public/MAPPINGS/> (for
the Mac, it's under VENDORS/APPLE/ROMAN.TXT). I'll assume you open
that file using the file handle ENC. (Trust me, doing this inside a
module is far from easy.) Once you get past that hurdle, here's how
you can populate the hash:

my(%encode, %decode);
$encode{chr 0} = UTF8::chr(0);
while (<ENC>) {
    next unless /^\s*0x(\w+)\s+0x(\w+)/;
    $encode{chr hex $1} = UTF8::chr(hex $2);
}
%decode = reverse %encode;

There. That's basically all there is to it. Wrapping it all up neatly
in a nice, generic module is another matter, though. If anybody has
any really bright ideas about it, I'd like to hear them.

-- 
Bart.

# ===== Want to unsubscribe from this list?
# ===== Send mail with body "unsubscribe" to macperl-request@macperl.org
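P.S. For what it's worth, here is one shape such a closure-based
wrapper could take. This is only a sketch: make_codec is a made-up
name, not an existing module, and a one-entry in-line map stands in
for a real ROMAN.TXT so the snippet runs on its own.

```perl
# A closure-based codec factory (sketch): make_codec() takes pairs of
# "native code point => Unicode code point" and returns an encode sub
# and a decode sub that close over their own private hashes.

sub UTF8::chr {
    my $ord = 0 + shift;
    if ($ord && $ord < 0x80) {
        return chr $ord;
    } elsif ($ord < 0x800) {
        return pack 'C*', $ord>>6|0xC0, $ord&0x3F|0x80;
    } else {    # $ord <= 0xFFFF
        return pack 'C*', $ord>>12|0xE0, $ord>>6&0x3F|0x80, $ord&0x3F|0x80;
    }
}

sub make_codec {
    my %map = @_;    # native code point => Unicode code point
    my (%encode, %decode);
    $encode{chr 0} = UTF8::chr(0);
    while (my ($native, $uni) = each %map) {
        $encode{chr $native} = UTF8::chr($uni);
    }
    %decode = reverse %encode;
    my $encode_sub = sub {
        my $s = shift;
        # bytes without a mapping pass through unchanged
        $s =~ s/([\000\200-\377])/defined $encode{$1} ? $encode{$1} : $1/ge;
        return $s;
    };
    my $decode_sub = sub {
        my $s = shift;
        # unknown multibyte sequences get the "\177" failure flag
        $s =~ s/([\300-\377][\200-\277]+)/
                defined $decode{$1} ? $decode{$1} : "\177$1"/ge;
        return $s;
    };
    return ($encode_sub, $decode_sub);
}

# MacRoman maps byte 0xCA to U+00A0 (non-breaking space), for instance:
my ($enc, $dec) = make_codec(0xCA => 0x00A0);
my $utf8 = $enc->("a\xCAb");    # "a\xC2\xA0b"
print $dec->($utf8) eq "a\xCAb" ? "round trip ok\n" : "round trip FAILED\n";
```

Each call to make_codec gets its own %encode/%decode, so you can hold
codecs for several character sets at once without them stepping on
each other -- which is what made the plain-hash version awkward.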