
[MacPerl] My programs for HTML document indexing



One of the projects where I've used MacPerl is producing
an index page of the HTML files on my Creativity Web pages
at http://www.ozemail.com.au/~caveman/Creative

I wanted one file listing the TITLE of each HTML document,
sorted by title, with links to the appropriate pages. My
approach was to use a perl program (originally developed in
Unix perl) to process the htm files in the current directory
and write a temporary file in that directory.
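
The title extraction step can be tried on its own, away from the
Mac-specific file dialogs. This is only a sketch: extract_title is a
made-up name for illustration and is not part of the programs below.
It assumes, as the first program does, that the TITLE tag starts at
the beginning of a line and that the whole title sits on that line.

```perl
# Sketch only: extract_title is a hypothetical helper, not part of htm_href.pl.
# Assumes <TITLE> starts the line and the whole title is on that one line.
sub extract_title {
    my ($line) = @_;
    return undef unless $line =~ /^<TITLE>/i;   # not a title line
    $line =~ s/<TITLE>//i;      # strip the opening tag, any case
    $line =~ s/<\/TITLE>//i;    # strip the closing tag, any case
    $line =~ s/\s+$//;          # drop trailing whitespace and the newline
    return $line;
}

foreach $line ("<title>Actual Minds, Possible Worlds</title>\n",
               "<H1>Not a title</H1>\n") {
    $title = extract_title($line);
    print defined($title) ? "title: $title\n" : "no title on this line\n";
}
```

A title split over several lines would be missed by this approach.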

Because I have subdirectories requiring processing (called
Books, Authors and Software), I ran the perl program in each
directory, sort-merged the resulting files together and used
another perl program to produce the final index page.

The temporary files have two fields separated by a closing brace
character, } (my choice of field delimiter). The first field is the
title. The second field is the URL, which must carry the correct
prefix, e.g. ./Books/
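
As an illustration of that record layout, here is a small sketch that
builds one record and splits it again. The names make_record and
parse_record are hypothetical and do not appear in the programs below.
The title and URL are taken from the sample output at the end of this
message.

```perl
# Sketch only: make_record and parse_record are hypothetical helpers.
sub make_record {
    my ($title, $prefix, $filename) = @_;
    return $title . "}" . $prefix . $filename;   # title}prefix+filename
}

sub parse_record {
    my ($record) = @_;
    # A limit of 2 keeps any later brace inside the URL field.
    my ($title, $url) = split(/}/, $record, 2);
    return ($title, $url);
}

$record = make_record("99% Inspiration", "./Books/", "B30620.htm");
print "$record\n";               # 99% Inspiration}./Books/B30620.htm
($title, $url) = parse_record($record);
print "title=$title url=$url\n";
```

Note that a title which itself contains a } would be split in the wrong
place, so the delimiter only works if titles never use that character.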

The programs have been moved to MacPerl, and last night
I wrote the intermediate sort program, which takes a set of
files dragged over the droplet and produces a sorted output file.
Before that, I would include the files in a BBEdit document
and use the "Line Sort" BBEdit extension. Very clumsy in
retrospect!

Here we go...

===================================================================

PROGRAM ONE

#!/usr/local/bin/perl
#
# htm_href.pl
#
# print out a list of *htm documents and the Title lines in
#  order to produce a sorted table of contents
require "GUSI.ph" ;

$folder = &MacPerl'Choose(&GUSI'AF_FILE, 0, "", "", &GUSI'CHOOSE_DIR);

print "Searching the folder: $folder \n\n";

$prefix = &MacPerl'Ask('prefix for cindex filenames ');
# for example  ./Books/

opendir(DIR, $folder) || die "die: Could not open $folder\n" ;
$outfile = ">".$folder.":cindex.raw";
print "opening the file $outfile\n";

open(CINDEX, $outfile) || die "could not create the cindex.raw file in $folder\n";
while ($filename = readdir(DIR)) {
     $_ = $filename;
     print "$_ \n";
     if (/\.html?$/i) {   # only names ending in .htm or .html
     print " $filename is htm!\n";
     open(HTFILE, $folder.":".$filename) || die "could not open the HTML file $filename\n";
     $notitle = 1;
     while (<HTFILE>) {
       chop;
       # print "*** $_\n";
       if (/^\<TITLE\>/i) {
           s/<TITLE>//i;     # strip out the TITLE tags, whatever their case
           s/<\/TITLE>//i;

           print CINDEX "$_}$prefix$filename\n";
           $notitle = 0;
       }  # endif
     } #end while
     if ($notitle == 1) { print "   no TITLE line found in $filename\n" };
     close(HTFILE);
  }
}
closedir(DIR);
close(CINDEX);
print "\nSearch has finished\n";


===================================================================
MY SORT PROGRAM:

# sort
# author: Charles Cave   16th April 1996
#  this program takes all the files dragged on to the droplet
#  and sort merges them together into a file specified
#  by the user
#
require "GUSI.ph";

@buffer = ();

$outfile = &MacPerl'Choose(&GUSI'AF_FILE, 0, "", "",
   &GUSI'CHOOSE_NEW + &GUSI'CHOOSE_DEFAULT, "sort output");
open (OUTFILE,">".$outfile) || die "could not open $outfile\n";

foreach $file (@ARGV) {
  # print "Processing file $file\n";
  open (INFILE, $file) || die "could not open $file\n";
  while (<INFILE>) {
    push(@buffer, $_);
  }
  close(INFILE);
}

foreach $line (sort(@buffer)) {
   print OUTFILE "$line";
}
close (OUTFILE);
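
One thing to be aware of: the plain sort above compares lines by
character code, so an uppercase letter sorts before any lowercase one.
If that ordering is not wanted, the lines can be compared on the title
field without regard to case. The following is a sketch under that
assumption. The routine name by_title is hypothetical, and
"AI and Creativity" is a made-up title used only to show the difference.

```perl
# Sketch only: by_title is a hypothetical comparison routine.
@buffer = (
    "AI and Creativity}ai.htm\n",                  # made-up record
    "Add Book Details to the Web (Form)}newbook.htm\n",
);

sub by_title {
    my ($ta) = split(/}/, $a);   # title is the first field
    my ($tb) = split(/}/, $b);
    # Compare titles ignoring case; fall back to the whole line on a tie.
    return lc($ta) cmp lc($tb) || $a cmp $b;
}

@plain  = sort @buffer;           # "AI ..." first, because I sorts before d
@sorted = sort by_title @buffer;  # "Add ..." first, because d sorts before i
print @sorted;
```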


===================================================================

FINALLY, THE INDEX HTML GENERATOR
This reads the result of the previous operation...


require "GUSI.ph" ;

$filename = &MacPerl'Choose(&GUSI'AF_FILE, 0, "", "" );
$cindex = ">".$filename.".index";

open (CINDEX, $filename) || die "could not open $filename\n";
open (NEWINDEX, $cindex) || die "could not create $filename.index\n";
print NEWINDEX "<HTML>\n<HEAD>\n<TITLE>Index of Web Documents</TITLE>\n";
print NEWINDEX "</HEAD> <BODY>\n";
print NEWINDEX "<H1>Index of Documents by Name</H1>\n\n";
while (<CINDEX>) {
  chop;
  ($desc, $href) = split(/}/, $_);
  print NEWINDEX "<A HREF=\"$href\">$desc</A><BR>\n";
  }
print NEWINDEX "<HR> Last updated \n</BODY> </HTML>\n";

close(CINDEX);
close(NEWINDEX);


===================================================================

Sample of the final result:

<HTML>
<HEAD>
<TITLE>Index of Web Documents</TITLE>
</HEAD> <BODY>
<H1>Index of Documents by Name</H1>

<A HREF="./Books/B30620.htm">99% Inspiration</A><BR>
<A HREF="./Software/ACTA.htm">ACTA (Software)</A><BR>
<A HREF="crefaq33.htm">Action steps for improving creativity </A><BR>
<A HREF="./Books/B16890.htm">Actual Minds, Possible Worlds</A><BR>
<A HREF="newbook.htm">Add Book Details to the Web (Form)</A><BR>
<A HREF="mbagents.htm">Agents & Creativity (Margaret Boden)</A><BR>

===================================================================


------------------------------------------------------
Charles Cave
Sydney, Australia
Email: charles@jolt.mpx.com.au
URL:   http://www.ozemail.com.au/~caveman
------------------------------------------------------