
Re: [MacPerl] hi-bit characters in regex's



At 19.40 -0600 2000.03.11, Kevin van Haaren wrote:
>Anyway to make a long story short, I discovered that both mac and
>windows allow high-order characters in the filenames (discovered this
>on the Husker Du album, the u's have those 2 little dot's over them).
>Does anyone know how I can test for these characters in a regex?  Is
>there a standard octal code for these characters (I always thought
>they were font specific)?

Well, you can type the character code directly.  Type "option-u, u".  That
will give you the actual character code which you can put directly in your
script:

   print "has u with umlaut" if $text =~ /ü/;

Works just fine.  You can easily find out the octal / hex value this way:

   $_ = "ü";
   printf "octal %o, hex %X", ord $_, ord $_;

Returns:

   octal 237, hex 9F
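
Once you know the value, you can also match on it with an escape instead
of typing the character itself, which keeps the script readable in any
editor or font.  A minimal sketch, assuming the text holds raw MacRoman
bytes; the sub names and the sample filename are made up for illustration:

```perl
use strict;

# True if the string contains byte 0x9F (octal 237), which is
# u with umlaut in MacRoman.  Hypothetical helper name.
sub has_u_umlaut {
    my ($text) = @_;
    return $text =~ /\x9F/ ? 1 : 0;
}

# True if the string contains any byte with the high bit set
# (0x80 through 0xFF), whatever character that byte stands for.
sub has_high_bit {
    my ($text) = @_;
    return $text =~ /[\x80-\xFF]/ ? 1 : 0;
}

my $name = "H\x9Fsker D\x9F";   # sample filename, for illustration only
print "has u with umlaut\n"        if has_u_umlaut($name);
print "has a high-bit character\n" if has_high_bit($name);
```

The second sub is probably closer to what you want for filenames, since it
catches every high-order character rather than one particular letter.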

And that character is also included in the \w character class if you use
locale:

   $_ = "\237";
   print /\w/;  # nothing

And:

   use locale;
   $_ = "\237";
   print /\w/;  # 1

Yes, the character is font-specific.  However, on a standard Mac keyboard
layout, "option-u, u" is guaranteed to produce octal 237 / hex 9F.  Also,
that value is guaranteed to be included in \w in MacPerl when use locale is
in effect, no matter what.  How that character is rendered by the font
depends on the font.  For MacRoman fonts, it will be a u with an umlaut.
For Latin-1 fonts, it will be something else.  For Symbol, dingbats, and
others, it will be something else.  But MacPerl doesn't care about the font
or rendered character, only the value.  And it basically uses the MacRoman
character set for "locale" settings.
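
Since you mentioned Windows as well: the same letter has a different value
there.  In Latin-1, u with umlaut is hex FC, not 9F.  A rough sketch of
converting between the two, assuming a Perl that ships the Encode module
(which the MacPerl of this thread does not); the sub name is made up for
illustration:

```perl
use strict;
use Encode qw(decode encode);

# Re-encode a byte string from MacRoman to Latin-1 (ISO-8859-1).
# Hypothetical helper; assumes every character exists in both sets.
sub macroman_to_latin1 {
    my ($bytes) = @_;
    my $chars = decode('MacRoman', $bytes);   # bytes -> characters
    return encode('iso-8859-1', $chars);      # characters -> Latin-1 bytes
}

my $mac    = "\x9F";                          # u with umlaut in MacRoman
my $latin1 = macroman_to_latin1($mac);
printf "MacRoman %X -> Latin-1 %X\n", ord $mac, ord $latin1;
```

So a regex written against the Mac value will not find the same letter in
Latin-1 text unless you convert first or test for both values.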

-- 
Chris Nandor          mailto:pudge@pobox.com         http://pudge.net/
%PGPKey = ('B76E72AD', [1024, '0824090B CE73CA10  1FF77F13 8180B6B6'])

# ===== Want to unsubscribe from this list?
# ===== Send mail with body "unsubscribe" to macperl-request@macperl.org