
Re: [MacPerl] UTF8 conversion



On Tue, 23 May 00 23:01:37 -0000, Joel Rees wrote:

>>I had implemented them as closures, but using those in your own
>>code turned out to be not so trivial.
>
>Probably a stupid question, but what's a closure? (just point me to pod 
>or whatever.)

It's not really a stupid question, except that it's in the FAQ. :-) It's
one of the more advanced things in Perl. I learned of it from the book
"Advanced Perl Programming", page 56 and onward. But it's in the online
docs too: see perlfaq7 ("Perl Language Issues"), "What's a closure?", and
perlref, point 4. The example in perlref in particular is very cute.

It is, in short, a way to define a sub on the fly, while baking in some
variables that are set at creation time. Each created sub has its own
copy of those (static) variables. In fact, the ONLY difference between
the different subs is that static data.
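
A minimal sketch of that idea (make_adder is a made-up name, used only
for illustration): the same anonymous sub is created twice, and each
copy remembers the value of $n it was created with.

```perl
use strict;
use warnings;

# make_adder() returns a new anonymous sub each time it is called.
# Each returned sub keeps its own private copy of $n.
sub make_adder {
    my ($n) = @_;
    return sub { return $_[0] + $n };
}

my $add2  = make_adder(2);
my $add10 = make_adder(10);

print $add2->(5),  "\n";   # 7
print $add10->(5), "\n";   # 15
```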

So how does it apply to my code? Well, as you may recall, the only
difference between encoding/decoding for different character sets is in
the encoding/decoding hashes. Same code, different (static) data.
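
To make "same code, different data" concrete, here is a rough sketch.
It is not the module's code: the subs encoder, decoder and utf8_bytes,
the %tables hash with one entry per character set, and the '?' used for
unmappable characters are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical one-entry tables: byte value => Unicode code point.
my %tables = (
    'Macintosh'    => { 0x8E => 0x00E9 },   # e-acute in Mac Roman
    'Windows-1252' => { 0xE9 => 0x00E9 },   # e-acute in Windows-1252
);

# Turn one code point (below 0x10000) into its UTF-8 byte sequence.
sub utf8_bytes {
    my ($cp) = @_;
    return chr($cp) if $cp < 0x80;
    return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F)) if $cp < 0x800;
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

# Returns a closure that converts text in the named character set to UTF-8.
sub encoder {
    my ($name) = @_;
    my $table = $tables{$name} or die "unknown character set '$name'";
    my %to_utf8;
    $to_utf8{ chr($_) } = utf8_bytes($table->{$_}) for keys %$table;
    return sub {
        my ($text) = @_;
        # Only high-half bytes are looked up; unmapped ones pass through.
        $text =~ s/([\200-\377])/exists $to_utf8{$1} ? $to_utf8{$1} : $1/ge;
        return $text;
    };
}

# Returns a closure that converts UTF-8 back to the named character set.
sub decoder {
    my ($name) = @_;
    my $table = $tables{$name} or die "unknown character set '$name'";
    my %from_utf8;
    $from_utf8{ utf8_bytes($table->{$_}) } = chr($_) for keys %$table;
    my $seq = qr/[\300-\337][\200-\277]|[\340-\357][\200-\277]{2}/;
    return sub {
        my ($text) = @_;
        # Two- and three-byte sequences are looked up; unknown ones become '?'.
        $text =~ s/($seq)/exists $from_utf8{$1} ? $from_utf8{$1} : '?'/ge;
        return $text;
    };
}

my $WinToUTF8 = encoder('Windows-1252');
my $UTF8ToMac = decoder('Macintosh');

my $wintext = "caf\xE9";
my $mactext = $UTF8ToMac->($WinToUTF8->($wintext));
print unpack("H*", $mactext), "\n";   # 6361668e
```

Both generated subs run exactly the same substitution code; only the
hash captured at creation time differs.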

So here is, more or less, how my stuff could be used:

	use UTF8::Simple;
	*UTF8ToMac = UTF8::Simple::decoder('Macintosh');
	*WinToUTF8 = UTF8::Simple::encoder('Windows-1252');
	
	$mactext = UTF8ToMac(WinToUTF8($wintext)); # Win -> Mac

or 

	my $UTF8ToMac = UTF8::Simple::decoder('Macintosh');
	my $WinToUTF8 = UTF8::Simple::encoder('Windows-1252');

	$mactext = $UTF8ToMac->($WinToUTF8->($wintext));


One of the major problems with the first approach is with "use strict":
you need to predeclare

	use vars qw(*UTF8ToMac *WinToUTF8);

which isn't too obvious to most people. Even I tend to forget about it
sometimes.
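
As a small stand-alone sketch of the first approach (make_wrapper and
shout are invented names, not part of the module), this is what the
predeclared glob plus the assignment of a closure looks like:

```perl
use strict;
use warnings;
use vars qw(*shout);            # predeclare the glob

# Returns a closure that upper-cases its argument and appends $suffix.
sub make_wrapper {
    my ($suffix) = @_;
    return sub { return uc($_[0]) . $suffix };
}

*shout = make_wrapper('!');     # install the closure as the named sub shout()

print shout('hello'), "\n";     # HELLO!
```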

A few remarks:

 * The name UTF8::Simple refers to two limitations on the character sets
to be encoded/decoded:

   A) They must be single-byte character sets, or the encoding hash
would get too big;

   B) The lower half (1 .. 127) *must* be ASCII-compatible. For speed,
my encoding substitution pattern only tries to replace characters with
a character code between 128 and 255, which are usually a minority in
the string.

/$pattern/o doesn't work "properly" inside closures. It compiles the
regex the first time any one of the generated functions is called. For
the rest of the program's lifetime, all generated functions will use
that one pattern. Therefore, I use a predefined pattern
[\000\200-\377] instead.
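
A small sketch to try this out (matcher_with_o and matcher_without_o
are made-up names). As described above, the /o closures should all end
up using whichever pattern was compiled first, while the ones without
/o each keep their own pattern. The result of the /o case is left for
you to check on your own Perl.

```perl
use strict;
use warnings;

# Closure factory using /o: the pattern is meant to be compiled only once.
sub matcher_with_o {
    my ($pattern) = @_;
    return sub { return $_[0] =~ /$pattern/o ? 1 : 0 };
}

# Same factory without /o: each call interpolates its own $pattern.
sub matcher_without_o {
    my ($pattern) = @_;
    return sub { return $_[0] =~ /$pattern/ ? 1 : 0 };
}

my $has_a_o = matcher_with_o('a');
my $has_b_o = matcher_with_o('b');
$has_a_o->('a');                       # first call: the regex gets compiled here
print "with /o:    ", $has_b_o->('b'), "\n";

my $has_a = matcher_without_o('a');
my $has_b = matcher_without_o('b');
$has_a->('a');
print "without /o: ", $has_b->('b'), "\n";   # without /o: 1
```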

 * The strings 'Macintosh' and 'Windows-1252' refer to encoding files,
which need to be loaded into the encoding and decoding hashes the first
time an encoding is used.

These could be the text files, as downloaded from
<ftp://ftp.unicode.org/Public/MAPPINGS/>, used directly and simply
parsed every time.
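
A sketch of what "simply parsed" could look like. It assumes that each
data line starts with the byte value and the Unicode code point, both
written as 0x hex numbers, followed by a # comment, and that any other
line can be skipped. The name load_mapping and the sample lines are
invented for illustration.

```perl
use strict;
use warnings;

# Read a mapping table from a filehandle and return two hash refs:
# byte value => code point, and code point => byte value.
sub load_mapping {
    my ($fh) = @_;
    my (%to_unicode, %from_unicode);
    while (my $line = <$fh>) {
        # Skip comments, blank lines and entries without a code point.
        next unless $line =~ /^0x([0-9A-Fa-f]{2})\s+0x([0-9A-Fa-f]{4,6})/;
        my ($byte, $cp) = (hex($1), hex($2));
        $to_unicode{$byte} = $cp;
        $from_unicode{$cp} = $byte;
    }
    return (\%to_unicode, \%from_unicode);
}

# Invented sample data in the assumed layout.
my $sample = <<'END';
# Sample mapping lines
0x41    0x0041  # LATIN CAPITAL LETTER A
0x8E    0x00E9  # LATIN SMALL LETTER E WITH ACUTE
0x81            # UNDEFINED
END

open(my $fh, '<', \$sample) or die "cannot open in-memory sample: $!";
my ($to, $from) = load_mapping($fh);
close($fh);

printf "%d entries\n", scalar keys %$to;             # 2 entries
printf "0x%02X -> U+%04X\n", 0x8E, $to->{0x8E};      # 0x8E -> U+00E9
```

For a real file, the in-memory open would be replaced by an open on the
downloaded text file.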

Or, I could reuse the ".enc" files as used by XML::Parser. The downside
is that you would need to have XML::Parser (and these files) installed.
I'm not convinced about the speed-up.

One of the bigger problems in implementing this is finding out where
those encoding files are!

It is difficult for people to provide alternatives to my code when they
don't have anything to start from. So, I will post the module code in a
few days, but first, I have some more cleaning up to do. Don't worry,
the code won't be very big. Maybe even smaller than the length of this
post... :-)

-- 
	Bart.
