[Date Prev][Date Next][Thread Prev][Thread Next] [Search] [Date Index] [Thread Index]

[MacPerl] Unexpected behaviour with hashes



tech specs: Performa 5430 with 48Mb ram running OS8.5, using MacPerl5.2.4or with 15Mb of ram to play with

Anybody care to venture a suggestion as to what is going on with this script?
It reads in a text file, does a word count and then converts the word count to a frequency of use percentage. So far so good. However certain words (ie verbs) make multiple appearances (am,is,are,was,were,been to name just one) so I wrote a sub script that reads in from a separate file and eliminates these double entries adding the accumulated word count value to a single base form, and then deletes the now unwanted derivates.

Separately these two doodads work as I wrote them to, but when I merge them, then things start doing "something completely different" - the double entries don't get deleted and in some cases the values don't seem to get added into the main count in the base form hash. Oddly enough though - checking the values of the hashes they contain the data I expect them to:

#!/usr/bin/perl-w

#----------------
#declare includes
use strict;
use Mac::Files;
require "StandardFile.pl";

#------------------
#declare variables
my(%wordcount,%freq,%dictionary,);
my(@values,@terms,@words,);
my(@derived,@definition,);
my($percent,$count,$key,$base,$word,);
my ($file,$output,$definition,$terms,$total,$temp,);

#-------------------
$|=1;
print "Compiling verbs dictionary.... \n";
open (VERBS, "Path:to:verbs.dict")|| die "can't open it :-$!";

#read in words that will be doubled (verbs)
#this must come here otherwise the whole read in
#is affected by setting paragraph mode.

while (<VERBS>){

#create hashes containing the values
($definition,$terms) = split(/:/);
chomp($definition,$terms);

unshift(@terms,$terms);
unshift(@definition,$definition);
}
close (VERBS);
#create a hash for later use containing
#the data from the irregs file
@dictionary{@terms}=@definition;


print "Getting set up to parse the file ...\n";
# Enable paragraph mode.
$/ = "";
# Enable multi-line patterns.
$* = 1;


#slect the input file and create an output file
$file= &StandardFile::GetFile("what file do you want", "TEXT");
if ($file) {
open (IN, $file)|| die "hmm, now that's odd......:- $!" ;
} else {
print".... and next time think before you start clicking on stuff!\n";
die "user killed script\n";
}

#create an output file through a dialog
$output = &StandardFile'PutFile("what shall we call it?", "::newfile.freq");
if(-e $output) {
open (OUT, ">$output")|| die "Script has lost the keys to the file:- $!";
}elsif(! -e $output) {
FSpCreate("$output", 'ttxt', 'TEXT')|| die "can't write the file: $!";
open (OUT, ">>$output")|| die "Script has lost the keys to the file:- $!";
} else {
die "\nScript died heroically (though in vain) while attempting to fulfill its destiny .\n";
}

# Now read each paragraph and split into words. Record each
# instance of a word in the %wordcount hash.
print "reading from the file\n";

while (<IN>) {
#kill punctuation
tr %[\!\?"',\=\+\-\.;:\<\>\(\)\*\&]\\\/% %ds;
#kill numbers
tr/0-9//ds;
#kill hyphenations.
s/-\n/\s/g;
#upper case to lower case.
tr/A-Z/a-z/;

@words = split(/\b\W*\b\s+\b\W*\b/, $_);
foreach $word (@words) {
#kill any spaces and new lines
$word=~s/\s*|\n{2,}|\0x2f //eg;

# Increment the entry.
$wordcount{$word}++;

}
}
close (IN);

print "taking verbs back to their base form\n";
#iterate through the dictionary containing the base forms and the "doubles"
foreach $key (sort keys(%dictionary)){
chomp($key);
$base=$dictionary{$key};
#on a positive match of an item from the dictionary list
if (exists $wordcount{$key}){
#the following two lines are for testing the content of the variables before deletion
print"\$base is $base, \$wordcount\{\$base\}= $wordcount{$base}\n";
print "\$key is $key, \$wordcount\{\$key\}= $wordcount{$key}\n";
#delete the non-base form from the hash
print "deleting $key from the hash (value =$wordcount{$key})\n";
delete $wordcount{$key};
print " $key deleted from the hash (value =$wordcount{$key})\n";
}
}

print "commiting to file ...........\n";

@values=values(%wordcount);
#get the number of entries
$total=$#values + 1;
#stamp the total at the top of the file
print OUT "number of words used:$total\n";



foreach $word (sort keys(%wordcount)) {
#calculate the frequency percentage
$percent=($wordcount{$word}/($total/100));



$percent= sprintf("%.3f", $percent);

$wordcount{$word}=$percent."%";
#round up the percentage
#copy the most frequently used words into their own hash
if ($percent >1.0){
$freq{$word}= $wordcount{$word};
#count them
$count++;

}
#write data to file
print OUT $word,":",$wordcount{$word},"\n";

}

print "Done it!\n\n", "~<>"x20,"~\n\n";
print "document contains $total words:\n";
print "$count frequently used words\n";

foreach $key (sort keys(%freq)){
#print the most frequently used words to STDOUT
print "$freq{$key}\t $key\n";
}

close(OUT);


#_END_

-----------------
Verb dictionary :
-----------------
be:am
be:is
be:are
be:was
be:were
be:been
become:became
begin:began
begin:begun
blow:blew
blow:blown
break:broke
break:broken
bring:brought
build:built
burn:burnt
buy:bought
can:could
catch:caught
choose:chose
choose:chosen
come:came
cost:cost
cut:cut
do:does
do:did
do:done
draw:drew
draw:drawn
dream:dreamt
drink:drank
drink:drunk
drive:drove
drive:driven
eat:ate
eat:eaten
fall:fell
fall:fallen
feel:felt
find:found
fly:flew
fly:flown
forget:forgot
forget:forgotten
get:got
give:gave
give:given
go:went
go:gone
grow:grew
grow:grown
have:has
have:had
hear:heard
hit:hit
hold:held
hurt:hurt
keep:kept
know:knew
know:known
lead:led
learn:learnt
leave:left
lend:lent
lose:lost
make:made
mean:meant
meet:met
must:had to
pay:paid
put:put
read:read
ring:rang
ring:rung
rise:rose
rise:risen
run:ran
run:run
say:said
see:saw
see:seen
sell:sold
send:sent
show:showed
show:shown
shut:shut