On Thu, Aug 05, 1999 at 02:52:38AM +0900, robinmcf@altern.org wrote:
> tech specs: Performa 5430 with 48Mb ram running OS8.5, using
> MacPerl5.2.4or with 15Mb of ram to play with
>
> Anybody care to venture a suggestion as to what is going on with this
> script?
> It reads in a text file, does a word count and then converts the word
> count to a frequency of use percentage. So far so good. However certain
> words (ie verbs) make multiple appearances (am,is,are,was,were,been to
> name just one) so I wrote a sub script that reads in from a separate
> file and eliminates these double entries adding the accumulated word
> count value to a single base form, and then deletes the now unwanted
> derivates.
>
> Separately these two doodads work as I wrote them to, but when I merge
> them, then things start doing "something completely different" - the
> double entries don't get deleted and in some cases the values don't
> seem to get added into the main count in the base form hash. Oddly
> enough though - checking the values of the hashes they contain the data
> I expect them to:

I'm going to go through and offer miscellaneous comments, as well as try to figure out the overall problem you describe. I hope that's okay.

> #!/usr/bin/perl-w
>
> #----------------
> #declare includes
> use strict;
> use Mac::Files;
> require "StandardFile.pl";
>
> #------------------
> #declare variables
> my(%wordcount,%freq,%dictionary,);
> my(@values,@terms,@words,);
> my(@derived,@definition,);
> my($percent,$count,$key,$base,$word,);
> my ($file,$output,$definition,$terms,$total,$temp,);
>
> #-------------------
> $|=1;
> print "Compiling verbs dictionary.... \n";
> open (VERBS, "Path:to:verbs.dict")|| die "can't open it :-$!";
>
> #read in words that will be doubled (verbs)
> #this must come here otherwise the whole read in
> #is affected by setting paragraph mode.

If you want to restrict changes to global variables, consider using local().
For example, you could put a block around this while loop, and start it off with local($/) = "\n", and a block around the following while loop, and start that one with local($/) = "". Just something to keep in mind.

> while (<VERBS>){
>
> #create hashes containing the values
> ($definition,$terms) = split(/:/);
> chomp($definition,$terms);

You could chomp $_ first, then split. $definition doesn't need to be chomped either way.

> unshift(@terms,$terms);
> unshift(@definition,$definition);
> }
> close (VERBS);
> #create a hash for later use containing
> #the data from the irregs file
> @dictionary{@terms}=@definition;

Wouldn't it make more sense to build up the hash as you go along, rather than store up these big lists of keys and definitions just so you can do one big assignment to the hash?

>
> print "Getting set up to parse the file ...\n";
> # Enable paragraph mode.
> $/ = "";
> # Enable multi-line patterns.
> $* = 1;

Keep in mind that $* is deprecated; /s and /m control single-line and multi-line semantics.

>
> #slect the input file and create an output file
> $file= &StandardFile::GetFile("what file do you want", "TEXT");
> if ($file) {
> open (IN, $file)|| die "hmm, now that's odd......:- $!" ;
> } else {
> print".... and next time think before you start clicking on
> stuff!\n";
> die "user killed script\n";
> }
>
> #create an output file through a dialog
> $output = &StandardFile'PutFile("what shall we call it?",
> "::newfile.freq");
> if(-e $output) {
> open (OUT, ">$output")|| die "Script has lost the keys to the
> file:- $!";
> }elsif(! -e $output) {
> FSpCreate("$output", 'ttxt', 'TEXT')|| die "can't write the file:
> $!";
> open (OUT, ">>$output")|| die "Script has lost the keys to the
> file:- $!";
> } else {
> die "\nScript died heroically (though in vain) while attempting to
> fulfill its destiny .\n";
> }

This if block seems odd. If the file exists, then clobber it. Else if the file doesn't exist, create a new file and open it for append.
Else (not sure what else there is) die.

> # Now read each paragraph and split into words. Record each
> # instance of a word in the %wordcount hash.
> print "reading from the file\n";
>
> while (<IN>) {
> #kill punctuation
> tr %[\!\?"',\=\+\-\.;:\<\>\(\)\*\&]\\\/% %ds;

Backslashitis. The only characters that need to be backslashed within a translation are the delimiter and backslash. The use of /d and /s is odd here. As written, only the first character of the search list (the "[") is translated to a space, with runs squashed, and all the other characters are deleted. Is that really what you want? I think this is more what you want:

    tr %[!?"',=+-.;:<>()*&/\\%%d;

> #kill numbers
> tr/0-9//ds;

/d and /s again. /s is not useful if you are deleting _all_ the characters specified.

> #kill hyphenations.
> s/-\n/\s/g;

\s in a replacement is odd. You are replacing "-\n" with "s". Only you're not, because you already deleted all the hyphens in the first translation. You should do this step first.

> #upper case to lower case.
> tr/A-Z/a-z/;

Consider using lc() instead if your code will ever need to handle locales.

> @words = split(/\b\W*\b\s+\b\W*\b/, $_);

That split pattern is very odd. All \s characters are also \W characters, and you can never have a word boundary between two \W characters. The pattern is equivalent to /\b\s+\b/. Do you in fact intend not to split on the space in, e.g., "foo' bar"?

> foreach $word (@words) {
> #kill any spaces and new lines
> $word=~s/\s*|\n{2,}|\0x2f //eg;

\s* will always match, because it can match zero characters. The other two alternatives will never be matched. \0x2f is the null character followed by the characters x2f. Perhaps you meant \x2f, which can be more easily written as \/. But you already deleted all the slashes in that first translation. Why do you need to /e-valuate an empty replacement?

> # Increment the entry.
> $wordcount{$word}++;
>
> }
> }
> close (IN);
>
> print "taking verbs back to their base form\n";
> #iterate through the dictionary containing the base forms and the
> "doubles"
> foreach $key (sort keys(%dictionary)){

Do you really need to process the keys in sorted order? It just seems like an extra, unnecessary step.

> chomp($key);
> $base=$dictionary{$key};

If $key really had a trailing newline, then chomp()ing it would be the wrong thing to do. $dictionary{"foo"} is not the same as $dictionary{"foo\n"}. Remove that chomp() from the code.

> #on a positive match of an item from the dictionary list
> if (exists $wordcount{$key}){
> #the following two lines are for testing the content of the
> variables before deletion
> print"\$base is $base, \$wordcount\{\$base\}=
> $wordcount{$base}\n";
> print "\$key is $key, \$wordcount\{\$key\}=
> $wordcount{$key}\n";
> #delete the non-base form from the hash
> print "deleting $key from the hash (value
> =$wordcount{$key})\n";
> delete $wordcount{$key};
> print " $key deleted from the hash (value
> =$wordcount{$key})\n";

Okay, here's where you delete the derivative form from the wordcount hash. You seem to be missing the step where you add the wordcount for the derivative form to the wordcount for the original form. Before you delete the key:

    $wordcount{$base} += $wordcount{$key};

> }
> }
>
> print "commiting to file ...........\n";
>
> @values=values(%wordcount);
> #get the number of entries
> $total=$#values + 1;

If you don't need the actual values(), then saving them to an array just to count them is silly.

    $total = keys %wordcount;

In a scalar context, keys() returns the number of entries in the hash, without constructing an intermediate list of keys.
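
To make those last two points concrete, here is a minimal sketch of the merge step: add each derived form's count to its base form before deleting the derived key, then count what is left with keys() in scalar context. The sample data and the helper name fold_derived() are made up for illustration and are not from your script; I'm also assuming that %dictionary maps each derived form to its base form, as your code appears to.

```perl
use strict;

# Hypothetical sample data standing in for the real hashes.
my %wordcount  = (be => 2, is => 3, was => 1, cat => 4);
my %dictionary = (is => 'be', was => 'be', been => 'be');  # derived => base

# fold_derived() is an illustrative helper, not part of the original script.
# It folds the count of every derived form into its base form and returns
# the number of entries left in the count hash.
sub fold_derived {
    my ($count, $dict) = @_;
    foreach my $key (keys %$dict) {
        next unless exists $count->{$key};   # derived form never seen
        my $base = $dict->{$key};
        next if $base eq $key;               # don't fold a word into itself
        $count->{$base} += $count->{$key};   # add first ...
        delete $count->{$key};               # ... then delete
    }
    return scalar keys %$count;              # entry count, no temporary list
}

my $total = fold_derived(\%wordcount, \%dictionary);
print "$total entries; be = $wordcount{be}\n";
```

With this sample data, "is" and "was" are folded into "be" and removed, and "been" is skipped because it never appeared in the text. The order of the two statements in the loop is the whole point: once the key is deleted, its count is gone.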

> #stamp the total at the top of the file
> print OUT "number of words used:$total\n";
>
>
>
> foreach $word (sort keys(%wordcount)) {
> #calculate the frequency percentage
> $percent=($wordcount{$word}/($total/100));
>
>
> $percent= sprintf("%.3f", $percent);
>
> $wordcount{$word}=$percent."%";
> #round up the percentage
> #copy the most frequently used words into their own hash
> if ($percent >1.0){
> $freq{$word}= $wordcount{$word};
> #count them
> $count++;
>
> }
> #write data to file
> print OUT $word,":",$wordcount{$word},"\n";
>
> }

A little verbose, but it works. :)

> print "Done it!\n\n", "~<>"x20,"~\n\n";
> print "document contains $total words:\n";
> print "$count frequently used words\n";
>
> foreach $key (sort keys(%freq)){
> #print the most frequently used words to STDOUT
> print "$freq{$key}\t $key\n";
> }
> close(OUT);
>
>
> #_END_
>

So, the problem seems to be due to the fact that you never add in the frequency for the derivative forms.

HTH!

Ronald