
Re: [MacPerl] UTF8 conversion



Thanks for the info on closures, Bart.

I do want to make one thing clear to list readers who aren't familiar 
with the problems of encoding UNICODE for UNIX. UTF-8 is an encoding 
scheme that gives character sets an expanding (multi-byte) encoding 
and is (somewhat) compatible with a number of parsing techniques used in 
standard UNIX tools. It is being applied to UNICODE in some of the UNIX 
implementations. It is called UTF-8 because the basic parse unit is the 
octet, that is, eight bits (we usually call octets bytes, but "byte" has 
certain ambiguities).

UTF-8 uses the high bits of the lead octet (byte) to show how many 
octets make up the character, and so how many remain to be read before 
you have a complete character. It also uses the high bits of the 
remaining octets to mark them as continuation octets, so they can never 
be mistaken for a lead octet. It is designed so a character can not 
extend beyond six octets in the original definition, so you don't have 
to worry about getting caught in an endless loop just parsing past one 
character, and you can also find the character boundaries even if you 
start scanning in the middle, whether going forward or backward.
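
To make the bit layout concrete, here is a minimal sketch in Python. 
The function names (sequence_length, is_continuation, char_start) are 
my own, made up for illustration, and the code follows the original 
six-octet scheme. RFC 3629 later capped sequences at four octets, which 
this sketch does not enforce. It also does no validation beyond looking 
at the high bits.

```python
def sequence_length(lead: int) -> int:
    """Return the total number of octets announced by a lead octet.

    Returns 0 if the octet is a continuation octet (10xxxxxx) or is
    not a valid lead in the six-octet scheme (0xFE, 0xFF).
    """
    if lead < 0x80:          # 0xxxxxxx: single-octet (ASCII) character
        return 1
    if lead < 0xC0:          # 10xxxxxx: continuation octet, not a lead
        return 0
    length = 0
    mask = 0x80
    while lead & mask:       # count the leading 1 bits
        length += 1
        mask >>= 1
    return length if length <= 6 else 0


def is_continuation(octet: int) -> bool:
    """True if the two high bits are 10, marking a non-lead octet."""
    return (octet & 0xC0) == 0x80


def char_start(data: bytes, pos: int) -> int:
    """Scan backward from pos to the lead octet of the enclosing character."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos
```

Scanning forward works the same way: from any position, skip octets 
while is_continuation() is true and you are at the next boundary.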

Why not make everything a constant 2-byte width and be done with it? 
Two bytes give you only 65,536 codes in total. Early estimates for CJKV 
were about 3,000 characters each for everyday use, with lots of overlap, 
so the 20,000 allocated in UNICODE were thought to be enough. But the 
real numbers are coming in, and each is counting more than 50,000. And 
the overlap is not as useful as we want to believe.

How about making everything a constant 32 bits wide? Other than political 
problems, it might work. Might not. A lot of the core algorithms in UNIX 
(and other systems) are dependent on the approximate width of the byte.
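
One example of that dependence, as I read it (my illustration, not the 
only issue): a lot of C code treats a zero octet as the end of a string. 
A rough Python sketch, using the standard "utf-8" and "utf-32-be" codecs 
and a hypothetical helper name, shows that plain ASCII text picks up zero 
octets in a fixed 32-bit encoding but not in UTF-8:

```python
def has_zero_octet(data: bytes) -> bool:
    """True if the encoded data contains an octet with value zero."""
    return b"\x00" in data


text = "perl"
utf8 = text.encode("utf-8")        # one octet per ASCII character
utf32 = text.encode("utf-32-be")   # four octets per character, no BOM

print(len(utf8), has_zero_octet(utf8))
print(len(utf32), has_zero_octet(utf32))
```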

Anyway, UTF-8 is an encoding scheme, not a character set. It is 
intended to let major UNIX applications run without choking on the 
large character sets. UNICODE has been mapped onto the encoding scheme.


Joel Rees
----------------------------------------
                       Keeping the Faith
<joel_rees@sannet.ne.jp>
<http://www.page.sannet.ne.jp/joel_rees>
(free account:) <reiisi@nettaxi.com>
<http://www.nettaxi.com/citizens/reiisi>
----------------------------------------

