
[MacPerl] Hi-bit characters in regex's



My apologies for getting back to the list on this so late.  I had a bit of a family emergency that prevented me from working on any Perl stuff for a while.



Thanks to Chris, Brian and David for their help.  Using Perl under VirtualPC and MacPerl on the Mac side, I've made some interesting discoveries.

First, if a font has accented characters, each one always seems to sit at the same character code.  Most alphanumeric fonts include the accented characters, so while this is not a complete standard, there is some standardization.

Second, the character codes for the accented characters differ between Mac and Windows (anyone surprised by this, please raise your hand).

Third, my own use, in filenames, seems to be taken care of by the various apps that move the files.  For example, when I drag and drop a file named "ькт" from the Mac to VirtualPC 3.0/Win98, the filename is retained correctly.  The same happens when Toast burns a CD in the hybrid Joliet/HFS format.

Fourth, well, it's been far too many years since I've actually translated anything from Latin (and I was far more interested in learning 6502 assembly code at the time), but with a little help from the web <http://www.nd.edu/~archives/latgramm.htm>, Brian's phrase (Exceptio probat regulam de rebus non exceptus. Exceptis excipiendis) seems to be:
The exception proves the rule about rules not excepting. Expect the exception.
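
To make the second point above concrete, here is a small sketch that prints the byte each platform uses for a few accented letters. It assumes a Perl recent enough to ship the core Encode module (5.8 or later), which is not the MacPerl discussed in this thread, and the helper name byte_in is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the hex byte for one character in a named
# single-byte encoding (assumes the character exists in that encoding).
sub byte_in {
    my ($encoding, $char) = @_;
    return sprintf '%02x', ord(encode($encoding, $char));
}

# Characters are written as Unicode code points so this script stays pure ASCII.
my %sample = (
    'e-acute'  => "\x{e9}",
    'u-umlaut' => "\x{fc}",
    'n-tilde'  => "\x{f1}",
);

for my $name (sort keys %sample) {
    printf "%-9s MacRoman 0x%s  cp1252 0x%s\n",
        $name,
        byte_in('MacRoman', $sample{$name}),
        byte_in('cp1252',   $sample{$name});
}
```

For e-acute this reports 0x8e on the Mac side and 0xe9 on the Windows side, which is the kind of mismatch the moving app has to paper over.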




Here is the subroutine I use to change a Mac name to a legal Windows name (watch out for word wrap if you paste this into anything, which brings up the question: is there a reason tr/// doesn't support the x modifier that s/// has?):


# Changes a string to replace illegal Windows filename characters.
# Rules:
#  known illegal characters (\/*?<>|) are replaced with a -, multiple - in a row are squished to 1 -
#  " is replaced with '
#  whitespace stripped from beginning and end of name
#	whitespace at end of name is definitely illegal (although Toast allows it)
#	whitespace at beginning of name stripped because I don't like it
#  any character not in the known good list replaced with a _, multiple _ are squished
#	this is where extended characters are specifically allowed; translation of Mac character
#	codes to PC character codes is handled by the app that moves the file between systems (hopefully)
sub changestring {
	my $old_string = shift;
	
	# Replace \/*?<>| with a -, multiple consecutive -'s are squished to one -
	$old_string =~ tr{\\\/\*\?\<\>\|}{-}s;

	# Replace " with ', multiples are not squished
	$old_string =~ tr{\"}{\'};
	
	# names can't begin with whitespace
	$old_string =~ s/^\s+//;
	
	# names can't end with whitespace
	$old_string =~ s/\s+$//;
		
	# OK those were known bad characters.  The following characters
	# are known good ones so we will replace anything else with an _
	# multiples will be squished.
	# (contiguous Mac character codes are written as ranges to keep the line shorter)
	$old_string =~ tr{a-zA-Z0-9\$\%\'\`\-\@\{\}\~\!\#\(\)\&\_\^\ \.\x80-\x9f\xa1-\xa4\xa6\xa7\xa9\xab\xac\xae\xaf\xb4\xb5\xbb\xbc\xbe-\xc2\xcb-\xcd\xd6\xd8\xe5-\xef\xf1-\xf4}{_}cs;
	
	return $old_string;
}
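
The two tr modifiers the subroutine leans on can be tried in isolation. This is only a sketch: the sample names and the helper names squeeze_illegal and replace_unknown are invented, and the allowed list is cut down to a handful of ASCII characters to keep the line short.

```perl
use strict;
use warnings;

# s (squeeze): a run of characters that all translate to '-' collapses to one '-'.
sub squeeze_illegal {
    my $name = shift;
    $name =~ tr{\\/*?<>|}{-}s;
    return $name;
}

# c (complement) plus s: anything NOT in the list becomes '_', and runs collapse.
# The allowed list here is a shortened stand-in, not the full one above.
sub replace_unknown {
    my $name = shift;
    $name =~ tr{a-zA-Z0-9 ._\-}{_}cs;
    return $name;
}

print squeeze_illegal('draft<<1>>?.txt'), "\n";      # draft-1-.txt
print replace_unknown("notes\t\t+final.txt"), "\n";  # notes_final.txt
```

Because tr works on a fixed character list rather than a pattern, the known good list has to be spelled out in full, which is why the real line gets so long.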
