[Date Prev][Date Next][Thread Prev][Thread Next] [Search] [Date Index] [Thread Index]

Re: [MacPerl] File conversion script problem



Ero Brown wrote:
> 
> 
> I wrote the following script to convert files from one line
> ending format to another (i.e., Mac2Unix, Dos2Mac, etc.).
> 
> 1) Please let me know if you know of a better script to do
>    this sort of thing.
> 
> 2) This script has a bug, when performing several of the
>    conversions (Mac2Dos for one), where any run of
>    multiple carriage returns get converted to a single
>    cr-lf pair.
> 
> The skinny of what happens in the script is:
> 
> $/ = "";   # enable paragraph mode

This is the reason for the bug you describe in (2).

from perlvar:

     $INPUT_RECORD_SEPARATOR
 
     $RS
 
     $/      The input record separator, newline by default.
             [...]
             Note that setting it to "\n\n" means something slightly
             different than setting it to "", if the file contains
             consecutive empty lines.  ***Setting it to "" will treat two
             or more consecutive empty lines as a single empty line.***
             Setting it to "\n\n" will blindly assume that the next input
             character belongs to the next paragraph, even if it's a
             newline.  (Mnemonic: / is used to delimit line boundaries when
             quoting poetry.)  [my emph]


You don't really need paragraph mode anyway, do you?


> $* = 1;    # enable multi-line pattern matching

$* is deprecated.  Don't use it.  Based on the regular expressions in your
script, I'm not even sure why you set it.


> # ...
> # ...
> $fromExpression = "(\x0D{1}\x0A?|\x0A{1})";

\x0D{1} is redundant.  Any unquantified subexpression will match exactly
once.  Just write \x0D.


> $toExpression = "\x0D\x0A";
> # ...
> # ...
> eval($filestr =~ s/$fromExpression/$toExpression/g);
> 
> I realize the 'g' option will cause a "global" replace,
> but I was thinking the the explicit "\x0D{1}" would
> cause only one <cr> to be replaced at a time, not a
> whole bunch of them globbed together.

Only one <cr> is replaced at a time, with or without the unnecessary {1}.
The 'globbing together' has occured before you ever get to the
substitution, as explained above.

> I'm sure I'm just overlooking something stupid.

Yup.  :-)



> TIA for your help!
> 

More comments below...

> 
> 
> 
> #========================================================================
> #========================================================================
> #
> # Convert2.pl
> #
> # Author: Ero Brown <ero@fiber.net>
> #
> # Version: 1.0
> #
> #========================================================================
> #========================================================================
> 
> #!/usr/local/bin/perl
> 
> ### required modules needed for this script
> use File::Find;
> use Getopt::Long;
> 

No use strict?

> 
> #========================================================================
> 
> ### Set some special Perl system variables that effect pattern matching
> $/ = "";   # enable paragraph mode
> $* = 1;    # enable multi-line pattern matching

See above.

> ### define true and false.
> $TRUE = 1;
> $FALSE = 0;
> 

Please don't define TRUE and FALSE.  '', 0, '0', and undef are FALSE.
Everything else is TRUE.

> 
> #========================================================================
> 
> ### Forward declare all sub-routines used
> sub Usage;
> sub GetHostOS;
> sub GetDirSeperatorChar;
> sub ReadArguments;
> sub DoAFile;
> sub SearchFilter;
> 
> 
> #========================================================================
> # Function: Usage
> # Purpose:  this details the usage of this script -- we can call this
> #           function when a invalid usage error occurs.
> #========================================================================
> sub Usage {
>    die <<END_USAGE;
> # usage: $0 [-p] [-r] [-x] [-f <expression>] -t <translation>
> #                                            <folder(s) &| file(s) . . .>
> #
> #   -p   print progress information
> #   -r   recursively process encountered sub-directories
> #   -x   fix the file type if possible
> #   -f   file expression representing a file filter
> #   -t   translation to be performed, where <translation> is:
> #           Cnv2Dos | Mac2Dos | Unx2Dos = convert to a DOS format
> #           Cnv2Mac | Dos2Mac | Unx2Mac = convert to a Mac format
> #           Cnv2Unx | Dos2Unx | Mac2Unx = convert to a Unix format
> #
> #   *Parameters can be given in any order.
> #
> #   *Fully resolved pathnames are required, partial pathnames will
> #    cause this script to barf.
> #
> #   Version 1.0
> END_USAGE
> }
> 

Outputting the usage with # prepended to each line seems rather unusual.
If you are doing this for formatting/syntax-coloring in your editor,
consider using something like the following:

($usage = <<END_USAGE) =~ s/^# //mg;
# usage: ...
END_USAGE

die $usage;


> 
> #========================================================================
> # Function: GetHostOS
> # Purpose:  Determine what OS we are running on -- $^O is a Perl variable
> #           that contains to name of the OS that Perl was compiled on.
> #========================================================================
> sub GetHostOS
> {
>    return($^O);
> }
> 

Do you really need a subroutine for this?

> 
> #========================================================================
> # Function: GetDirSeperatorChar
> # Purpose:  Each platform (OS) uses a unique character to seperate
> #           directory names within a full path specification string.
> #           Here we set a variable to contain the correct directory
> #           seperator character to be correct for the OS that we are
> #           currently running this script on.
> #========================================================================
> sub GetDirSeperatorChar
> {
>    my ($tOSString) = @_;
>    my ($tDirSepChar);
>    if ($tOSString eq 'MacOS') {  #it's a Macintosh!
>       $tDirSepChar = ':';
>    }
>    elsif ($tOSString =~ /nix/i) {  #it's probably some flavor of Unix
>       $tDirSepChar = '/';
>    }
>    else {  #it could be other things, but assume a Wintel PC
>       $tDirSepChar = '/';  #!?! -- same as unix, not '\' as expected!
>    }
>    return ($tDirSepChar);
> }
> 

Use the File::Spec module for this, please.

> 
> #========================================================================
> # Function: ReadArguments
> # Purpose:  Parse the arguments array and set the appropriate flags.
> #========================================================================
> sub ReadArguments
> {
>    #if no arguments are passed in, assume the user needs some help
>    if (@ARGV < 1) {
>       Usage;
>    }
> 
>    $res = Getopt::Long::GetOptions("r", "p", "x",
>                            "f=s" => \$fileFilter,
>                            "t=s" => \$translation);
>    if (!$res) {
>       Usage;
>    }
> 
>    #if no directories/files are passed in, assume the user needs some help
>    if (@ARGV < 1) {
>       print "# At least one directory or file to search must be specified.\n";
>       Usage;
>    }
> 
>    $recurseOpt = $opt_r;
>    $progressOpt = $opt_p;
>    $fixTypeOpt = $opt_x;
>    $fileFilterOpt = $fileFilter ne '';
> 
>    if ($translation eq '') {
>       print "# The required -t (translation) option was NOT specified.\n";
>       Usage;
>    }
> 
>    $translation =~ s/Conv/Cnv/gi;
>    $translation =~ s/Pc/Dos/gi;
>    $translation =~ s/Unix/Unx/gi;
>    if ($translation =~ /(Cnv2Dos|Mac2Dos|Unx2Dos)/i) {

Wouldn't it make more sense just to specify the target system?  Especially
since you appear to ignore the source system anyway?

>       $fromExpression = "(\x0D{1}\x0A?|\x0A{1})";

See above.  {1} is redundant.

>       $toExpression = "\x0D\x0A";
>    }
>    elsif ($translation =~ /(Cnv2Mac|Dos2Mac|Unx2Mac)/i) {
>       $fromExpression = "\x0D?\x0A{1}";

See above.  {1} is redundant.

>       $toExpression = "\x0D";
>    }
>    elsif ($translation =~ /(Cnv2Unx|Dos2Unx|Mac2Unx)/i) {
>       $fromExpression = "\x0D{1}\x0A?";

See above.  {1} is redundant.

>       $toExpression = "\x0A";
>    }

It seems like you could just use the same from expression in all cases,
especially since, as I said, you ignore the source system.

$fromExpression = "\x0D\x0A?|\x0A";

>    else {
>       print "# Unkown -t (translation) option was specified.\n";
>       Usage;
>    }
> 
> }
> 
> 
> #========================================================================
> # Function: DoAFile
> # Purpose:  process the file/directory
> #           this is where the action is -- look at every file and
> #           directory and perform the desired operations.
> #========================================================================
> sub DoAFile {
>    my ($theFileName) = @_;
>    my ($result,$filestr);
>    my ($shortName, $doSearch);
>    my (@dirNameParts);
> 
>    if (-f $theFileName) {
>       $fileCount++;
>       $shortName = $theFileName;
>       @dirNameParts = split($dirSepChar,$shortName);  #break full path apart
>       $shortName = $tempName = pop(@dirNameParts);  #filename by itself

Use the File::Basename module for this.

>       if ($progressOpt) {
>          print "# FILE: $shortName -- \"$theFileName\"\n";
>       }
> 

Ah, I guess the # at the beginning of all the out was inspired by the way
MacPerl outputs all the warning and die messages...  It's unnecessary, and
will look out of place on other platforms.

>       $doSearch = $TRUE;
>       if ($fileFilterOpt) {
>          $doSearch = $FALSE;
>          if (eval($shortName =~ /$fileFilter/oi)) {
>             $doSearch = $TRUE;
>          }

Need error checking with $@.

>       }
> 
>       $result = "";
>       if ($doSearch) {
>          $fileCount++;
>          if (open(THEFILE, "< $theFileName\0")) {
>             $result = "";
>             while ($filestr= <THEFILE>) {
>                eval($filestr =~ s/$fromExpression/$toExpression/g);

Why are you using eval here?  Aren't $fromExpression and $toExpression
always defined within the program?

>                $result = join('',$result,$filestr);

$result .= $filestr;

>             }
>             close(THEFILE);
> 
>             if ($fixTypeOpt) {
>                if ($osString eq 'MacOS') {
>                   &MacPerl'SetFileInfo("MPS ","TEXT", $theFileName);

Use :: in place of ' as the package separator.  (Except when using the D'oh
module, of course.  :-)

What is "MPS "?  All my text files are "R*ch".  This should be a
user-controlled specifification.

>                }
>             }
> 
>             if (open(THEFILE, "> $theFileName\0")) {

You may need to call binmode() on the filehandle, to make sure your
line-endings aren't changed to the native format upon output.

>                print THEFILE $result;
>                close(THEFILE);
>                $processedCount++;
>                if ($progressOpt) {
>                   print "\t# CHANGED FILE: $shortName\n";
>                }
>             }
>             else {
>                $errorCount++;
>                print "# ERROR: Couldn't write to the file: \"$theFileName\"\n";

Why couldn't you write to the file?  $!

>             }
>          }
>          else {
>             $errorCount++;
>             print "# ERROR: Couldn't open the file: \"$theFileName\"\n";

Why couldn't you close the file?  $!

>          }
>       }
>    }
>    # it must be a directory (-d $theFileName)
>    else {
>       $dirCount++;
>    }
> }
> 
> 
> #========================================================================
> # Function: SearchFilter
> # Purpose:  We need to provide this function for the File::Find::find()
> #           function in the File::Find module -- this is so we can control
> #           (via filter) what kind of files and/or directories get handled.
> #========================================================================
> sub SearchFilter
> {
>    #filter so we only process the files/directories we want
>    (
>       ( -e $_ && -s $_ ) &&
>       (
>          ( -f $_ && -r $_ && -T $_ ) ||

You might also want to check that the file is writeable.

>          ( -d $_ && -x $_ &&
>          ( $recurseOpt || ( !$recurseOpt && ($File::Find::prune = 1) ) ) )
>       )
>    ) &&
>    DoAFile($File::Find::name);  #call our function that does the checking
> }
> 
> 
> #========================================================================
> #============================   MAIN   ==================================
> #========================================================================
> 
> ### main part of the program --
> ### we go through every file and directory passed in on
> ### the command line and check it for validity.
> 
> #what OS are we running on?
> $osString = GetHostOS;
> 
> #what's the right dir seperator char to use?
> $dirSepChar = GetDirSeperatorChar($osString);
> 
> #parse the command line for commands and arguments
> ReadArguments;
> 
> if ($progressOpt) {
>    print $lst, "Executing this script on a computer running $osString...\n";
> }
> 
> #initialize these counter variables
> $fileCount = 0;
> $dirCount = 0;
> $processedCount = 0;
> $errorCount = 0;
> 
> #process every directory (and/or file) provided on the command line
> while (@ARGV) {
>    $arg = shift(@ARGV);  #get the next directory/file to examine
>    if ((-f $arg) || (-d $arg)) {
>       File::Find::find(\&SearchFilter, $arg);  #'find' calls "SearchFilter"
>    }   #which we use as a filter and to call our "DoAFile".
>    else {
>       print STDERR "# COULD NOT READ \"$arg\"\n\n";

Why all caps?

This error is misleading.  The problem is not that $arg could not be read,
but that it is of the wrong type, i.e. not a plain file or a directory.

>    }
> }
> 
> if ($progressOpt) {
>    print $lst, "Changed $processedCount files from Pc to Mac text format.\n";

This message needs to reflect the actual conversion performed.

> 
>    print $lst, "Examined $fileCount files, in $dirCount directories";
>    if ($errorCount > 0) {
>       print " -- encountered $errorCount errors";
>    }
>    print ".\n";
> 
>    if ($unknownCount > 0) {
>       print $lst, "THERE WERE $unknownCount UNKNOWN ENTITIES ENCOUNTERED.\n";

Why all caps?

What is this undefined $lst variable that you keep printing?  The only
references to it are in these print statements.

>    }
> }
> exit;
> 
> #========================================================================
> #========================================================================
> 


I like the use of File::Find to do a big conversion with one execution of
the program.  Looks good!
Hope that helps!

Ronald

***** Want to unsubscribe from this list?
***** Send mail with body "unsubscribe" to mac-perl-request@iis.ee.ethz.ch