Re: [MacPerl] Unexpected behaviour with hashes

To: robinmcf@altern.org
Subject: Re: [MacPerl] Unexpected behaviour with hashes
From: Ronald J Kimball <rjk@linguist.dartmouth.edu>
Date: Wed, 4 Aug 1999 16:33:41 -0400
Cc: MacPerl List <macperl@macperl.org>
In-Reply-To: <v04011702b3ce1fdbc9a9@[210.142.124.98]>; from robinmcf@altern.org on Thu, Aug 05, 1999 at 02:52:38AM +0900
References: <v04011702b3ce1fdbc9a9@[210.142.124.98]>

On Thu, Aug 05, 1999 at 02:52:38AM +0900, robinmcf@altern.org wrote:
> tech specs: Performa 5430 with 48Mb ram running OS8.5, using 
> MacPerl5.2.4or  with 15Mb of ram to play with 
> 
> Anybody care to venture a suggestion as to what is going on with this 
> script? 
> It reads in a text file, does a word count and then converts the word 
> count to a frequency of use percentage. So far so good. However certain 
> words (ie verbs) make multiple appearances (am,is,are,was,were,been to 
> name just one) so I wrote a sub script that reads in from a separate 
> file and eliminates these double entries adding the accumulated word 
> count value to a single base form, and then deletes the now unwanted 
> derivates. 
> 
> Separately these two doodads work as I wrote them to, but when I merge 
> them, then things start doing "something completely different" - the 
> double entries don't get deleted and in some cases the values don't 
> seem to get added into the main count in the base form hash. Oddly 
> enough though - checking the values of the hashes they contain the data 
> I expect them to: 

I'm going to go through and offer miscellaneous comments, as well as try to
figure out the overall problem you describe.  I hope that's okay.


> #!/usr/bin/perl-w  
> 
> #---------------- 
> #declare includes 
> use strict; 
> use Mac::Files; 
> require "StandardFile.pl"; 
> 
> #------------------ 
> #declare variables 
> my(%wordcount,%freq,%dictionary,); 
> my(@values,@terms,@words,); 
> my(@derived,@definition,); 
> my($percent,$count,$key,$base,$word,); 
> my ($file,$output,$definition,$terms,$total,$temp,); 
> 
> #------------------- 
> $|=1; 
> print "Compiling verbs dictionary.... \n"; 
> open (VERBS, "Path:to:verbs.dict")|| die "can't  open it :-$!"; 
> 
> #read in words that will be doubled (verbs) 
> #this must come here otherwise the whole read in  
> #is affected by setting paragraph mode. 

If you want to restrict changes to global variables, consider using local().
For example, you could put a block around this while loop, and start it off
with local($/) = "\n", and a block around the following while loop, and
start that one with local($/) = "".

Just something to keep in mind.
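
A minimal sketch of that idea. It assumes a perl with in-memory filehandles (5.8 or later) so that it can run without real files; the sample data and variable names are made up for illustration and are not from the script.

```perl
use strict;
use warnings;

# Stand-ins for the verbs file and the text file (hypothetical data).
my $verbs_data = "be:am\nbe:is\n";
my $text_data  = "First paragraph,\nline two.\n\nSecond paragraph.\n";

my (@verb_lines, @paragraphs);

{
    local $/ = "\n";                  # line mode, only inside this block
    open(my $verbs, '<', \$verbs_data) or die "can't open verbs: $!";
    while (my $line = <$verbs>) {
        chomp $line;
        push @verb_lines, $line;
    }
    close $verbs;
}

{
    local $/ = "";                    # paragraph mode, only inside this block
    open(my $in, '<', \$text_data) or die "can't open text: $!";
    while (my $para = <$in>) {
        push @paragraphs, $para;
    }
    close $in;
}

# Outside both blocks, $/ is back to whatever it was before.
print scalar(@verb_lines), " verb lines, ", scalar(@paragraphs), " paragraphs\n";
```

Because each change is local() to its block, the order of the two loops no longer matters.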


> while (<VERBS>){                                                        
>  
>     #create hashes containing the values 
>     ($definition,$terms) = split(/:/); 
>     chomp($definition,$terms);                                          

You could chomp $_ first, then split.  $definition doesn't need to be
chomped either way.


>     unshift(@terms,$terms); 
>     unshift(@definition,$definition); 
> } 
> close (VERBS); 
> #create a hash for later use containing 
> #the data from the irregs file 
> @dictionary{@terms}=@definition; 

Wouldn't it make more sense to build up the hash as you go along, rather
than store up these big lists of keys and definitions just so you can do
one big assignment to the hash?
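
A small sketch of building the hash directly, with a single chomp before the split. The sample lines are hypothetical; they assume the "base:derived" layout that the split(/:/) in the script implies.

```perl
use strict;
use warnings;

# Hypothetical lines standing in for what <VERBS> would return.
my @lines = ("be:am\n", "be:is\n", "go:went\n");

my %dictionary;
foreach my $line (@lines) {
    chomp $line;                                  # one chomp, before the split
    my ($definition, $terms) = split /:/, $line;
    $dictionary{$terms} = $definition;            # derived form => base form
}

print "$_ => $dictionary{$_}\n" foreach sort keys %dictionary;
```

This drops the @terms and @definition arrays entirely.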


> 
> print "Getting set up to parse the file ...\n"; 
> # Enable paragraph mode. 
> $/ = "";   
> # Enable multi-line patterns. 
> $* = 1;                                                                 

Keep in mind that $* is deprecated; /s and /m control single-line and
multi-line semantics.
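
For illustration, a short sketch of what the two modifiers do on a made-up two-line string:

```perl
use strict;
use warnings;

my $para = "first line\nsecond line\n";

# /m: ^ and $ match at the start and end of every line in the string.
my @starts = $para =~ /^(\w+)/mg;             # ("first", "second")

# Without /m, ^ matches only at the very start of the string.
my @start_only = $para =~ /^(\w+)/g;          # ("first")

# /s: . also matches a newline, so .* can run across lines.
my ($across) = $para =~ /first(.*)second/s;   # " line\n"

print "@starts | @start_only\n";
```

The modifiers apply per pattern, so nothing global has to be set or restored.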


> 
> #slect the input file and create an output file 
> $file= &StandardFile::GetFile("what file do you want", "TEXT"); 
> if ($file) { 
> 	open (IN, $file)|| die "hmm, now that's odd......:- $!" ; 
>     } else { 
> 	print".... and next time think before you start clicking on 
> stuff!\n"; 
> 	die "user killed script\n"; 
>     } 
> 
> #create an output file through a dialog 
> $output = &StandardFile'PutFile("what shall we call it?", 
> "::newfile.freq"); 
> if(-e $output) { 
>     open (OUT, ">$output")|| die "Script has lost the keys to the 
> file:- $!"; 
> }elsif(! -e $output) { 
>     FSpCreate("$output", 'ttxt', 'TEXT')|| die "can't write the file: 
> $!"; 
>     open (OUT, ">>$output")|| die "Script has lost the keys to the 
> file:- $!";   
> } else { 
>     die "\nScript died heroically (though in vain) while attempting to 
> fulfill its destiny .\n"; 
> } 

This if block seems odd.  If the file exists, then clobber it.  Else if the
file doesn't exist, create a new file and open it for append.  Else (not
sure what else there is) die.
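
A sketch of the simpler alternative: opening with ">" creates the file if it is missing and truncates it if it exists, so one open covers both cases. The path below is a hypothetical temporary file standing in for the PutFile result, and the sketch does not set the Mac file type or creator that the FSpCreate call provided.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical output path; in the script this comes from the PutFile dialog.
my $output = tempdir(CLEANUP => 1) . "/newfile.freq";

# One open handles both "exists" and "doesn't exist".
open(OUT, ">$output") or die "can't open $output for writing: $!";
print OUT "number of words used:0\n";
close(OUT) or die "can't close $output: $!";

print "wrote ", -s $output, " bytes\n";
```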


> # Now read each paragraph and split into words.  Record each 
> # instance of a word in the %wordcount hash. 
> print "reading from the file\n"; 
> 
> while (<IN>) { 
>     #kill punctuation 
>     tr %[\!\?"',\=\+\-\.;:\<\>\(\)\*\&]\\\/% %ds;  

Backslashitis.  The only characters that need to be backslashed within a
translation are the delimiter and backslash (plus a hyphen, if it sits
where it would otherwise form a range).

The use of /d and /s is odd here.  You want to squash runs of left square
brackets to a space, but delete all the other characters?

I think this is more what you want:

tr %[]!?"',=+.;:<>()*&/\\-%%d;
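
To make the difference visible, here is a sketch that runs the original translation and a plain deletion on the same made-up string. In the original, only the first character of the search list ("[") has a counterpart in the one-character replacement list; /d deletes everything else.

```perl
use strict;
use warnings;

my $orig = my $fixed = "[[Hello]], world!! (a+b)";

# The original: "[" becomes a space (runs squashed by /s), the rest of the
# search list is deleted by /d.
$orig =~ tr %[\!\?"',\=\+\-\.;:\<\>\(\)\*\&]\\\/% %ds;

# Plain deletion of the same characters, with minimal backslashing.
# The hyphen goes last so that it cannot form a range.
$fixed =~ tr%[]!?"',=+.;:<>()*&/\\-%%d;

print "[$orig]\n[$fixed]\n";
```

If the punctuation should become spaces instead of vanishing, drop /d and give a replacement list; with /d, neighbouring words such as "a+b" are glued together.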


>     #kill numbers 
>     tr/0-9//ds; 

/d and /s again.  /s is not useful if you are deleting _all_ the characters
specified.


>     #kill hyphenations. 
>     s/-\n/\s/g; 

\s in a replacement is odd.  You are replacing "-\n" with "s".  Only you're
not, because you already deleted all the hyphens in the first translation.
You should do this step first.
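
A sketch of the reordering, assuming the intent is to rejoin the two halves of a word broken across a line end (use " " as the replacement if a space was meant). The sample text and the shortened punctuation list are made up.

```perl
use strict;
use warnings;

my $text = "a hyphen-\nated word, and a well-known one\n";

# Rejoin words broken across a line end first, while the hyphens still exist.
$text =~ s/-\n//g;

# Only afterwards strip punctuation (a shortened, hypothetical list here).
$text =~ tr/,.;:-//d;

print $text;
```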


>     #upper case to lower case. 
>     tr/A-Z/a-z/;                                                        

Consider using lc() instead if your code will ever need to handle locales.


>     @words = split(/\b\W*\b\s+\b\W*\b/, $_); 

That split pattern is very odd.  All \s characters are also \W characters,
and you can never have a word boundary between two \W characters.
The pattern is equivalent to /\b\s+\b/.

Do you in fact intend not to split on the space in, e.g., "foo' bar"?
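
A sketch comparing the original pattern, the shorter equivalent, and a plain whitespace split on that example string:

```perl
use strict;
use warnings;

my $text = "foo' bar baz";

# The original pattern and the shorter equivalent: both need a word
# character on each side of the whitespace, so neither splits after the
# apostrophe.
my @yours   = split /\b\W*\b\s+\b\W*\b/, $text;   # ("foo' bar", "baz")
my @shorter = split /\b\s+\b/, $text;             # ("foo' bar", "baz")

# Splitting on whitespace alone treats every run of whitespace as a separator.
my @plain = split /\s+/, $text;                   # ("foo'", "bar", "baz")

print join("|", @yours), "\n", join("|", @plain), "\n";
```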


>     foreach $word (@words) { 
> 	#kill any spaces and new lines 
> 	$word=~s/\s*|\n{2,}|\0x2f //eg;                                 

\s* will always match, because it can match zero characters.  The other two
alternatives will never be matched.

\0x2f is the null character followed by the characters x2f.  Perhaps you
meant \x2f, which can be more easily written as \/.  But you already
deleted all the slashes in that first translation.

Why do you need to /e-valuate an empty replacement?
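
A sketch that counts which alternative wins on a made-up string. The third alternative is written here as \x2f followed by a space, which is a guess at what was intended; the tr line at the end is one possible direct way to strip whitespace and slashes, not something taken from the script.

```perl
use strict;
use warnings;

my $word = "to /\n\n\nfro";

# \s* is tried first and can always match (possibly nothing), so the other
# two alternatives never get a turn.
my %wins;
(my $copy = $word) =~ s{(\s*)|(\n{2,})|(\x2f )}{
    $wins{ defined $1 ? 'first' : defined $2 ? 'second' : 'third' }++; ''
}eg;

# If the goal is just to strip whitespace and slashes, say so directly.
(my $clean = $word) =~ tr{ \t\n/}{}d;

print "substitution gave '$copy', tr gave '$clean'\n";
```

Note that the slash survives the substitution, because only the \s* branch ever matches.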


> 	# Increment the entry. 
> 	$wordcount{$word}++;                                            
>           
>     } 
> } 
> close (IN); 
> 
> print "taking verbs back to their base form\n"; 
> #iterate through the dictionary containing the base forms and the 
> "doubles" 
>     foreach $key (sort keys(%dictionary)){ 

Do you really need to process the keys in sorted order?  It just seems like
an extra, unnecessary step.


> 	chomp($key); 
> 	$base=$dictionary{$key};

If $key really had a trailing newline, then chomp()ing it would be the
wrong thing to do.  $dictionary{"foo"} is not the same as
$dictionary{"foo\n"}.  Remove that chomp() from the code.
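
A tiny sketch of what goes wrong, using a hypothetical key that kept its newline:

```perl
use strict;
use warnings;

my %dictionary = ("am\n" => "be");   # hypothetical key with a trailing newline

my $key    = (keys %dictionary)[0];
my $before = $dictionary{$key};      # "be"

chomp($key);                         # $key is now "am"
my $after  = $dictionary{$key};      # undef: there is no "am" entry

print defined $after ? "found\n" : "lost the entry\n";
```

If the keys might contain newlines, the place to fix that is where the hash is built, not where it is read.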

 
> 	#on a positive match of an item from the dictionary list 
> 	if (exists $wordcount{$key}){                                  
> 	    #the following two lines are for testing the content of the 
> variables before deletion 
> 	    print"\$base is $base, \$wordcount\{\$base\}= 
> $wordcount{$base}\n"; 
> 	    print "\$key is $key, \$wordcount\{\$key\}= 
> $wordcount{$key}\n"; 
> 	    #delete the non-base form from the hash 
> 	    print "deleting $key from the hash (value 
> =$wordcount{$key})\n"; 
> 	    delete $wordcount{$key};  
> 	    print " $key deleted from the hash (value 
> =$wordcount{$key})\n"; 

Okay, here's where you delete the derivative form from the wordcount hash.
You seem to be missing the step where you add the wordcount for the
derivative form to the wordcount for the original form.

Before you delete the key:

$wordcount{$base} += $wordcount{$key};
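
Put together, the loop might look something like this sketch. The two hashes hold hypothetical sample data in the shape the script builds (derived form => base form, and word => count).

```perl
use strict;
use warnings;

my %dictionary = (am => 'be', is => 'be', went => 'go');
my %wordcount  = (be => 1, am => 2, is => 3, went => 4, cat => 5);

foreach my $key (keys %dictionary) {
    next unless exists $wordcount{$key};
    my $base = $dictionary{$key};
    $wordcount{$base} += $wordcount{$key};   # fold the count into the base form
    delete $wordcount{$key};                 # then drop the derived form
}

print "$_: $wordcount{$_}\n" foreach sort keys %wordcount;
```

The += also creates the base form's entry when the base form itself never appeared in the text, as with "go" here.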


> 	} 
>     } 
> 
> print "commiting to file ...........\n"; 
> 
> @values=values(%wordcount);  
> #get the number of entries 
> $total=$#values + 1;   

If you don't need the actual values(), then saving them to an array just to
count them is silly.

$total = keys %wordcount;

In a scalar context, keys() returns the number of entries in the hash,
without constructing an intermediate list of keys.
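
A quick sketch showing that both approaches give the same number on a made-up hash:

```perl
use strict;
use warnings;

my %wordcount = (be => 6, go => 4, cat => 5);

my $total = keys %wordcount;         # number of entries (distinct words)

my @values  = values %wordcount;     # the original approach, for comparison
my $old_way = $#values + 1;

print "$total entries\n";
```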


> #stamp the total at the top of the file 
> print OUT  "number of words used:$total\n";                             
>           
> 
> 
> foreach $word (sort keys(%wordcount)) { 
>     #calculate the frequency percentage  
>     $percent=($wordcount{$word}/($total/100));                          
>      
>                                             
>     $percent= sprintf("%.3f", $percent);                                
>                   
>     $wordcount{$word}=$percent."%"; 
>      #round up the percentage 
>      #copy the most frequently used words into their own hash 
>      if ($percent >1.0){ 
> 	 $freq{$word}= $wordcount{$word};   
> 	 #count them 
> 	 $count++;                                                      
>                                                 
>      } 
>      #write data to file 
>      print OUT $word,":",$wordcount{$word},"\n";                        
>          
>  } 

A little verbose, but it works.  :)


> print "Done it!\n\n", "~<>"x20,"~\n\n"; 
> print "document contains $total words:\n"; 
> print "$count frequently used words\n";                                 
>              
> foreach $key (sort keys(%freq)){ 
>     #print the most frequently used words to STDOUT  
>     print "$freq{$key}\t $key\n"; 
> }   
> close(OUT); 
> 
> 
> #_END_ 
> 


So, the problem seems to be that you never add the counts for the
derivative forms to the count for the base form before deleting them.

HTH!

Ronald
