Re: [MacPerl] UTF8 conversion



On Sun, 14 May 2000 20:57:56 -0300, Arved Sandstrom wrote:

>At 01:32 PM 5/14/00 -0400, M. Christian Hanson wrote:
>>I am taking the output of the XML parser in macperl and want to push 
>>it into a text processor that is really itching for Latin-1 not the 
>>UTF8 that the xml parser hands me.  Any body have any advice?

>You can check the archives for this list, for one. This just came up 
>recently. Bart Lateur, if I recall correctly, has looked at this.

Says the guy who ported the Unicode::String module to MacPerl. See
<http://pudge.net/cgi-bin/mmp.plx>.

It's time that I put my code where my mouth is. I had only mentioned
that I got a solution, but I'm still having big problems finalizing
it. There are so many ways to do it, and none looks ideal. Worse: the
finished code gets 5 times more complicated than the bare essence of it.

I posted some code in comp.lang.perl.misc, and the original module
author, Gisle Aas, pointed me to this module. I have taken a look at it,
but I just can't seem to get my brain to wrap around it. I simply don't
get it.

So here's the code again. I provide you with enough clues to finish it
off by yourself, in a way that YOU like.

The kernel of the code is this little function, which turns a 16-bit
Unicode character number into a (multibyte) UTF-8 character.

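  sub UTF8::chr {
      my $ord = 0+shift;    # force the argument to a number
      if($ord && $ord < 0x80) {
          # 0x01 .. 0x7F: plain Ascii, one byte
          return chr $ord;
      } elsif ($ord < 0x800) {
          # 0x80 .. 0x7FF: two bytes, 110xxxxx 10xxxxxx.
          # 0 also lands here and becomes "\xC0\x80" (an overlong form
          # that strict UTF-8 decoders reject), not a raw NUL byte.
          return pack 'C*', $ord>>6|0xC0, $ord&0x3F|0x80;
      } else {    # 0x800 <= $ord <= 0xFFFF: three bytes
          return pack 'C*', $ord>>12|0xE0, $ord>>6&0x3F|0x80,
            $ord&0x3F|0x80;
      }
  }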
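
To see what the function produces, here is a small sketch that dumps
the bytes for three code points (one, two and three bytes long). The
helper hexbytes() is hypothetical and only there for display; the
function itself is repeated so the snippet runs on its own.

```perl
use strict;
use warnings;

# Copy of the function from this post, so the example is self-contained.
sub UTF8::chr {
    my $ord = 0+shift;
    if ($ord && $ord < 0x80) {
        return chr $ord;
    } elsif ($ord < 0x800) {
        return pack 'C*', $ord>>6|0xC0, $ord&0x3F|0x80;
    } else {    # $ord <= 0xFFFF
        return pack 'C*', $ord>>12|0xE0, $ord>>6&0x3F|0x80,
          $ord&0x3F|0x80;
    }
}

# Hypothetical helper, for display only: hex dump of a byte string.
sub hexbytes {
    my $bytes = shift;
    return join ' ', map { sprintf '%02X', ord $_ } split //, $bytes;
}

for my $ord (0x41, 0xE9, 0x20AC) {
    printf "U+%04X -> %s\n", $ord, hexbytes(UTF8::chr($ord));
}
# prints:
#   U+0041 -> 41
#   U+00E9 -> C3 A9
#   U+20AC -> E2 82 AC
```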

Of course, calling this function for every single character to convert
is rather inefficient. Since we're talking about only 128 different
characters that need conversion, a substitution pattern with a look-up
encoder hash is a simple, compact and fast way of doing that. With an
inverted decoder hash, you can decode the UTF-8 back into your own
character set.

As you may know, the character codes of ISO-Latin-1 are the same as the
Unicode characters with code 0 through 255. So populating the encoder
hash is easy:

  my(%encode, %decode);

  for my $c (0, 128 .. 255) {
      $encode{chr $c} = UTF8::chr($c);
  }

  %decode = reverse %encode;

And finally, here are the functions that use these hashes to do the
actual conversion:

  sub L1ToUTF8 {   # encode
      my @arg = @_;
      for(@arg) {
          s/([\000\200-\377])/$encode{$1}/g;
      }
      return wantarray?@arg:$arg[-1];
  }

  sub UTF8ToL1 {   # decode
      my @arg = @_;
      for(@arg) {
          s/([\300-\377][\200-\277]+)/$decode{$1} ||
            (defined($decode{$1})?$decode{$1}:"\177$1")/ge;
      }
      return wantarray?@arg:$arg[-1];
  }

The substitution in this last function is a rather efficient
implementation of the non-existent operator '??', which would have been
a short-circuit 'or' that only continues to its RHS if the LHS is not
defined (as opposed to false).

I used the character '\177', AKA chr(127) or 'DEL', 'delete', as a
conversion failure flag. This character probably will never ever appear
in any text file.
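
A minimal sketch of a round trip, including the failure flag. It
repeats the code from above so that it runs on its own; the sample
strings are my own choice. The euro sign (U+20AC) has no Latin-1
equivalent, so its UTF-8 bytes come back untouched, prefixed with
"\177".

```perl
use strict;
use warnings;

sub UTF8::chr {
    my $ord = 0+shift;
    if ($ord && $ord < 0x80) {
        return chr $ord;
    } elsif ($ord < 0x800) {
        return pack 'C*', $ord>>6|0xC0, $ord&0x3F|0x80;
    } else {    # $ord <= 0xFFFF
        return pack 'C*', $ord>>12|0xE0, $ord>>6&0x3F|0x80,
          $ord&0x3F|0x80;
    }
}

my (%encode, %decode);
for my $c (0, 128 .. 255) {
    $encode{chr $c} = UTF8::chr($c);
}
%decode = reverse %encode;

sub L1ToUTF8 {   # encode
    my @arg = @_;
    for (@arg) {
        s/([\000\200-\377])/$encode{$1}/g;
    }
    return wantarray ? @arg : $arg[-1];
}

sub UTF8ToL1 {   # decode
    my @arg = @_;
    for (@arg) {
        s/([\300-\377][\200-\277]+)/$decode{$1} ||
          (defined($decode{$1}) ? $decode{$1} : "\177$1")/ge;
    }
    return wantarray ? @arg : $arg[-1];
}

my $latin1 = "caf\351";                     # "cafe" with e-acute, in Latin-1
my $utf8   = L1ToUTF8($latin1);
print unpack('H*', $utf8), "\n";            # 636166c3a9
print UTF8ToL1($utf8) eq $latin1 ? "round trip ok\n" : "mismatch\n";

my $euro = UTF8::chr(0x20AC);               # not representable in Latin-1
print unpack('H*', UTF8ToL1("5 $euro")), "\n";   # 35207fe282ac
```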


Now, take all these functions, put them in a simple library file, which
I called 'UTF8ToL1.pl', and don't forget to put a "1;" at the end. Below
is a test program, which generates a character table. Save the text
output, and look at it with a hex file viewer. Next to each number on
the left, you should see a single character whose character ("Ascii")
code is identical to that number.

#! perl -w

require 'UTF8ToL1.pl';
use XML::Parser;

  my $p1 = XML::Parser->new(Handlers => { Char => \&Print },
    ProtocolEncoding => 'ISO-8859-1');

  $" = "\n";
  my $xml = <<"__EOT__";
<charset>
@{[ map { sprintf "%02X%s", $_, "\t&#$_;" } (32 .. 255) ]}
</charset>
__EOT__

  sub Print {
      my $self = shift;
      local $_ = shift;
      tr/\r/\n/; # solves a small bug in XML::Parser on Mac
      print UTF8ToL1($_);
  }

  $p1->parse($xml);
__END__


And finally, how can you use this to encode/decode from other character
sets? Well, assuming one-byte character sets in which the range 1 ..
127 is plain Ascii, you can simply reuse the above encoding/decoding
functions, but with different encoding and decoding hashes. If anybody
knows a good and simple way of implementing this generally, I'd like to
know. I had implemented them as closures, but using those in your own
code turned out to be not so trivial. And I don't really like Perl's OO.
Anyway, suggestions welcome.
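
For what it's worth, here is one way the closure idea could look. This
is only a sketch, not a finished design: make_codec() is a hypothetical
name, and it assumes the table is handed over as a hash ref that maps
byte values to Unicode code points. Each call returns an encoder and a
decoder that share their own private pair of hashes. Bytes that have no
entry in the table are passed through unchanged by the encoder, which
is a choice of this sketch.

```perl
use strict;
use warnings;

sub UTF8::chr {
    my $ord = 0+shift;
    if ($ord && $ord < 0x80) {
        return chr $ord;
    } elsif ($ord < 0x800) {
        return pack 'C*', $ord>>6|0xC0, $ord&0x3F|0x80;
    } else {    # $ord <= 0xFFFF
        return pack 'C*', $ord>>12|0xE0, $ord>>6&0x3F|0x80,
          $ord&0x3F|0x80;
    }
}

# Hypothetical factory: $table maps a byte value to a code point.
sub make_codec {
    my ($table) = @_;
    my (%encode, %decode);
    $encode{chr $_} = UTF8::chr($table->{$_}) for keys %$table;
    %decode = reverse %encode;

    my $encoder = sub {
        my @arg = @_;
        for (@arg) {
            # bytes without a table entry are left as they are
            s/([\000\200-\377])/exists $encode{$1} ? $encode{$1} : $1/ge;
        }
        return wantarray ? @arg : $arg[-1];
    };
    my $decoder = sub {
        my @arg = @_;
        for (@arg) {
            # "\177" flags a sequence the table cannot decode
            s/([\300-\377][\200-\277]+)/
              defined $decode{$1} ? $decode{$1} : "\177$1"/ge;
        }
        return wantarray ? @arg : $arg[-1];
    };
    return ($encoder, $decoder);
}

# Latin-1: byte value and code point coincide.
my %latin1 = map { $_ => $_ } 0, 128 .. 255;
my ($l1_to_utf8, $utf8_to_l1) = make_codec(\%latin1);

# A two-entry excerpt of Mac Roman, just for illustration.
my %mac = (0x8E => 0xE9, 0xA5 => 0x2022);
my ($mac_to_utf8, $utf8_to_mac) = make_codec(\%mac);

print unpack('H*', $l1_to_utf8->("\351")), "\n";        # c3a9
print unpack('H*', $mac_to_utf8->("\216")), "\n";       # c3a9
print unpack('H*', $utf8_to_mac->("\303\251")), "\n";   # 8e
```

The same e-acute comes out as the same two UTF-8 bytes from either
character set, and the two codecs never touch each other's tables.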

How can you populate the encoding hash? It's easy. Get the proper
encoding text file from <ftp://ftp.unicode.org/Public/MAPPINGS/> (for
the Mac, it's under VENDORS/APPLE/ROMAN.TXT).

I assume you have opened that file using the file handle ENC. Trust me,
doing this in a module is far from easy. Once you get past that hurdle,
here's how you can populate the hash:

  my(%encode, %decode);
  $encode{chr 0} = UTF8::chr(0);
  while(<ENC>) {
      next unless /^\s*0x(\w+)\s+0x(\w+)/;
      $encode{chr hex $1} = UTF8::chr(hex $2);
  }
  %decode = reverse %encode;
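
To try this loop without downloading anything, the sketch below feeds
it a few lines laid out like those mapping files, read through an
in-memory file handle (which needs a reasonably recent perl). The three
sample entries are Mac Roman codes typed in by hand; the real file has
many more lines and is the authority.

```perl
use strict;
use warnings;

sub UTF8::chr {
    my $ord = 0+shift;
    if ($ord && $ord < 0x80) {
        return chr $ord;
    } elsif ($ord < 0x800) {
        return pack 'C*', $ord>>6|0xC0, $ord&0x3F|0x80;
    } else {    # $ord <= 0xFFFF
        return pack 'C*', $ord>>12|0xE0, $ord>>6&0x3F|0x80,
          $ord&0x3F|0x80;
    }
}

# Hand-made sample in the layout of the mapping files:
# local code, Unicode code point, then a comment.
my $sample = <<'__TXT__';
# lines that do not start with a hex code are skipped
0x80  0x00C4  # LATIN CAPITAL LETTER A WITH DIAERESIS
0x8E  0x00E9  # LATIN SMALL LETTER E WITH ACUTE
0xA5  0x2022  # BULLET
__TXT__

open ENC, '<', \$sample or die "can't open sample: $!";

my (%encode, %decode);
$encode{chr 0} = UTF8::chr(0);
while (<ENC>) {
    next unless /^\s*0x(\w+)\s+0x(\w+)/;
    $encode{chr hex $1} = UTF8::chr(hex $2);
}
close ENC;
%decode = reverse %encode;

for my $byte (sort keys %encode) {
    next if ord($byte) < 0x80;
    printf "%02X -> %s\n", ord($byte), unpack 'H*', $encode{$byte};
}
# prints:
#   80 -> c384
#   8E -> c3a9
#   A5 -> e280a2
```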

There. That's basically all there is to it. Wrapping it all up neatly in
a nice and generic module is another matter, though. If anybody has any
really bright ideas about it, I'd like to hear them.

-- 
	Bart.
