
[MacPerl] Parsing Script



According to Strider:
> This is frustrating. I've done everything here, and I can't get this script
<snip>

> Input File (sum.tab) :
> 
> S	9/5/97	554	1	0	c3
> t	9/2/97	14403	1	0	w2
> t	9/3/97	14404	1	0	w2
> t	9/5/97	33059	3	0	w2
> c	9/1/97	652	1	0	b4
> c	9/2/97	24123	13	2	b4
> c	9/3/97	5758	6	1	b4
> c	9/4/97	23898	17	0	b4
> c	9/5/97	104	1	0	b4
> r	9/2/97	355	1	0	w2
> G	9/2/97	1897	1	0	c3
> s	9/5/97	1539	1	0	o2
> a	9/2/97	2569	1	0	p6
> a	9/3/97	1273	1	0	o2
> a	9/4/97	2460	1	0	p6
> a	9/5/97	4465	1	0	p6
> r	9/2/97	1678	1	0	p6
> r	9/3/97	1238	1	0	p6
> r	9/4/97	446	1	0	p6
> s	9/2/97	1840	1	0	w2
> s	9/4/97	1326	1	0	b4
> s	9/5/97	10466	3	0	w2
> e	9/2/97	1692	1	0	w2
> e	9/4/97	6097	4	1	c3
> e	9/5/97	3927	2	0	w2
> s	9/2/97	3089	1	0	c3
> s	9/3/97	2726	1	0	c3
> s	9/4/97	2283	1	0	c3
> s	9/5/97	7027	1	0	c3
> R	9/2/97	177	1	0	w2
> R	9/3/97	3365	1	0	w2
> R	9/4/97	6291	2	0	w2
> W	9/2/97	677	1	0	w2
> W	9/3/97	2710	1	0	w2
> W	9/5/97	1079	1	0	w2
> 

I looked at your script and I believe you are overcomplicating
things a bit.  Why not something like this:

#
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
#
#
#	Assuming you read in whatever amount of data you want into
#	an array called @theArray.
#
	for( $i=0; $i<=$#theArray; $i++ ){
#
#	Split up the line (the fields are tab-separated).
#
		chomp( $theArray[$i] );
		@theLine = split( /\t/, $theArray[$i] );
#
#	Re-arrange the information so the date comes first,
#	keeping all six fields, and put it back in the array.
#
		$theArray[$i] = join( "\t", $theLine[1], $theLine[0],
			$theLine[2], $theLine[3], $theLine[4], $theLine[5] ) . "\n";
		}
#
#	Now sort it.
#
	@newArray = sort @theArray;
#
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
#

The above re-arranges the fields of each line so that the
lines compare in the proper order.  That lets you use
Perl's built-in sort function directly.

However, there are a couple of problems.  The first is that
the date field varies in width, from 1/1/97 to 12/31/97.
To put that another way - you need to convert a date like
1/1/97 into 01/01/97 so all of the parts of the date are
the same width and the lines compare correctly as strings.
You can do this with the sprintf function.
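For example, a minimal sketch of padding a m/d/yy date with
sprintf (the variable names are my own):

```perl
# Pad each part of a m/d/yy date to two digits so all
# dates end up the same width for string comparison.
$date = "1/1/97";
( $month, $day, $year ) = split( /\//, $date );
$date = sprintf( "%02d/%02d/%02d", $month, $day, $year );
# $date is now "01/01/97"
```

Note that a string sort on m/d/yy still compares the month
before the year; since all of the sample data is from 1997
that is harmless here, but for multi-year data you would
want to emit the year first.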

The second problem is the size of the database.  300mb
won't all fit into memory, so you will have to sort the
information some other way.  My suggestion is to break the
file up into some number of smaller files which you can
pull into memory individually.  These files should be no
larger than about 1/3 of your available memory, since you
need room for two of them plus the sorted copy.  Thus, if
you have MacPerl's partition set to 8192k and about 2mb of
that free, your files should be no larger than about 500k.
Then the easiest thing to do (although it is a bit slow) is
the following:

#
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
#
	for( $i=0; $i<=$numFiles; $i++ ){
		open( THEFILE, "file.$i" ) || die $!;
		@theInfo = <THEFILE>;
		close( THEFILE );
#
#	Start $j at $i+1, not $i - merging a file with itself
#	would just duplicate its contents.
#
		for( $j=$i+1; $j<=$numFiles; $j++ ){
			open( THEFILE, "file.$j" ) || die $!;
			@moreInfo = <THEFILE>;
			close( THEFILE );
#
#	Combine the two chunks and sort them together.
#
			@newArray = sort( @theInfo, @moreInfo );
#
#	The smaller lines go back into file.$i's chunk, the
#	larger ones into file.$j's chunk.
#
			@moreInfo = @newArray[ $#theInfo+1 .. $#newArray ];
			@theInfo  = @newArray[ 0 .. $#theInfo ];

			open( THEFILE, ">file.$j" ) || die $!;
			print THEFILE @moreInfo;
			close( THEFILE );

			undef @newArray;
			}

		open( THEFILE, ">file.$i" ) || die $!;
		print THEFILE @theInfo;
		close( THEFILE );
		}
#
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
#

This does a sort/merge: two of the files are opened, their
combined contents are sorted, and the results are written
back into their respective files.  It is basically a bubble
sort expanded across files.  I'm sure there are more
efficient ways to do this, but this should work.
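One of those more efficient ways, once each chunk file has
already been sorted on its own, is a straight two-file
merge: read one line at a time from each sorted file and
always write out the smaller line.  A sketch, where the
subroutine name and file names are my own invention:

```perl
# Merge two files that are each already sorted into one
# sorted output file, holding only one line of each in
# memory at a time.
sub merge_files {
	my( $fileA, $fileB, $fileOut ) = @_;
	open( INA, $fileA )      || die $!;
	open( INB, $fileB )      || die $!;
	open( OUT, ">$fileOut" ) || die $!;
	my $lineA = <INA>;
	my $lineB = <INB>;
	while( defined( $lineA ) && defined( $lineB ) ){
		if( $lineA le $lineB ){
			print OUT $lineA;
			$lineA = <INA>;
			}
		else{
			print OUT $lineB;
			$lineB = <INB>;
			}
		}
#	One file is now exhausted; flush the leftover line
#	and whatever remains of the other file.
	print OUT $lineA if defined( $lineA );
	print OUT $lineB if defined( $lineB );
	print OUT <INA>;
	print OUT <INB>;
	close( INA ); close( INB ); close( OUT );
	}
```

Merging the chunks pairwise this way touches each line only
a few times, where the bubble approach above re-sorts the
same lines over and over.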

If this is a continuation of the single-entry problem you
wrote about earlier, then you will probably want to
continue using the hash entries.  However, 300mb will not
fit into your computer's memory unless you have about 450mb
of RAM, because of the overhead of creating the strings,
the hash entries, and the like.  So unless you have that
much memory you need to go to a disk-based approach.
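Breaking the big file into those smaller chunk files can be
done by counting lines as you read.  A sketch, where the
subroutine name, the chunk size, and the "file.N" naming
are my own choices:

```perl
# Split a large file into numbered files ("file.0",
# "file.1", ...) of at most $chunkSize lines apiece.
# Returns the number of chunk files written.
sub split_file {
	my( $bigFile, $chunkSize ) = @_;
	my $numFiles = 0;
	my $count    = 0;
	open( BIG, $bigFile )  || die $!;
	open( OUT, ">file.0" ) || die $!;
	while( my $line = <BIG> ){
		if( $count >= $chunkSize ){
			close( OUT );
			$numFiles++;
			open( OUT, ">file.$numFiles" ) || die $!;
			$count = 0;
			}
		print OUT $line;
		$count++;
		}
	close( OUT );
	close( BIG );
	return $numFiles + 1;
	}
```

For the 300mb case you would call something like
split_file( "sum.tab", 10000 ) and tune the line count so
each chunk stays under the 500k limit mentioned above.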

Believe me - disk based solutions are slow.  But they do
work.  :-)

***** Want to unsubscribe from this list?
***** Send mail with body "unsubscribe" to mac-perl-request@iis.ee.ethz.ch