[MacPerl] How To Scripts

To: "macperl@macperl.org" <macperl@macperl.org>
Subject: [MacPerl] How To Scripts
From: "ken towry" <ktowry@austin.rr.com>
Date: Tue, 22 Jun 1999 23:50:55 -0500
These are the scripts I'm having trouble with from the book "Perl 5 How
To".
Unfortunately, whoever is responsible for this book was cheap and
didn't include the text on the CD-ROM or the website, so I had to enter
it by hand; it may contain a few typos. I did a syntax check and it
returned no errors. I've also included parsehtm.pl in its entirety at
the end of this email.

Problem: One of the major tasks of a CGI script is to provide some kind
of dynamic HTML.
I would like to use HTML created by a graphic artist, but I would like
to alter some of the contents dynamically. This requires me to parse the
HTML file created by an artist and insert my own information.

Technique

Instead of treating each HTML file individually, create the general
parsing engine, illustrated in Figure 10-1. This engine will allow you
to write Perl subroutines to handle each HTML tag encountered. For
example, the subroutine inputHandler could be called every time an INPUT
tag is encountered. This handler could change the value of the tag or,
in the case of a radio button, turn it on or off. The handlers will be
expected to accept text from the file and return the text that is sent
to the client. The handler may return different text than it received as
input.

Tags will be classified into three categories: unary, binary, and
end-of-line. Unary tags, such as INPUT, have no end tag. Binary tags,
such as TEXTAREA, bracket some form of body text between themselves and
an end tag. End-of-line tags, such as OPTION, rely on the text that
follows them, in which the end of the line acts as an end tag.
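
As a rough sketch of how the three categories could be told apart, the
fragment below uses the %endTags convention that the later steps rely
on. The helper name classifyTag and the sample registrations are
assumptions made for illustration; they are not part of the book's
library.

```perl
# Hypothetical registrations; the tags chosen here are only examples.
%endTags = (
    "TEXTAREA" => "</TEXTAREA>",  # binary: body runs up to the end tag
    "OPTION"   => "eol",          # end-of-line: body runs to the newline
);

# classifyTag is a hypothetical helper that reports a tag's category.
sub classifyTag
{
    my($tag) = @_;
    my $endTag = $endTags{$tag} || "";

    return "end-of-line" if $endTag eq "eol";
    return "binary"      if $endTag;
    return "unary";       # nothing registered, so no end tag is expected
}

foreach $tag ("INPUT", "TEXTAREA", "OPTION")
{
    print "$tag is ", &classifyTag($tag), "\n";
}
```

A tag with no entry in %endTags falls through to the unary case, which
matches the way the library treats tags such as INPUT.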

The parsing engine will be built from several subroutines. The primary
routine is parseHtml. This routine takes a file name as an argument and
returns a string containing the parsed HTML. Using this parser will
involve registering handlers for the tags you are interested in and
calling parseHtml. Because this parser will not provide any interesting
functionality until various tag handler subroutines are provided, you
will not test it in this How-To.

This engine is rather lengthy. If you are not interested in the details
of parsing HTML, you may prefer to read through this section without
writing the code and proceed to later sections in this chapter that
focus on building handlers for various tags.

Steps

1. Create and open the file parseHtm.pl. This file will contain all of
the primary subroutines for the HTML parser. You will need this file in
each of the How-Tos in this chapter.

2. Start creating the parseHtml subroutine. This routine takes a file
name and returns a string of parsed HTML. All the parsing is handled by
a subroutine called mainHtmlParser.

sub parseHtml
{

3. Declare a local variable called $fileName for the argument and
another, $retVal, for the parsed HTML string that the subroutine
returns.

    # Declare variables to hold the arguments
    local($fileName) = @_;

    # Declare a variable to store the return value
    local($retVal);

4. Open the HTML file using the filehandle htmlFile. This filehandle is
a global value used in all of the parsing routines.

    # Open the file
    open(htmlFile,$fileName);

5. Call the main parser. This main parsing routine, mainHtmlParser,
looks for a stop string or stop character. If no stopper is provided,
the routine will read to the end of the file and return the entire
parsed file.

    # If the file opened, call the parser on it
    $retVal = &mainHtmlParser("",0) if defined(fileno(htmlFile));

6. Close the file and return the parsed HTML.

    # Close the file
    close(htmlFile);

    # Return the string parsed from the file
    return $retVal;
}

7. Start the mainHtmlParser subroutine. This is a large subroutine. It
reads characters from the HTML file looking for tags, plain text, the
stop string, and the stop character. When either a tag or plain text is
encountered, another subroutine is called to handle the text. These
other subroutines are handlePlainText and handleTag. The main parser
uses two buffers, $mainBuffer and $tmpBuffer. $mainBuffer is used to
keep track of the total parsed text. $tmpBuffer is used to keep track of
text as it is being parsed, for example, the text between the < and >
characters.

sub mainHtmlParser
{

8. Declare local variables to store the arguments. Declare another set
to maintain the main buffer, the temporary buffer, and the current
character, and to determine whether or not a tag is being read.

    # Declare locals to store the arguments
    local($stopStr,$stopChar) = @_;

 # Declare several local variables
    local($char,$inTag,$tmpBuffer,$mainBuffer);

9. Initialize the main buffer and the $inTag variable.

 # Initialize the main buffer, this is what is returned
    $mainBuffer = "";

    # $inTag is used to denote when we are inside <>'s
    $inTag = 0;

10. Start the main parsing loop. Use the do-until syntax.

 # Loop until the end of the file, or
 # we encounter the stop string or stop character.
    do
    {

11. Get the next character from the file htmlFile. Store the
character in the $char variable. You will use getc to grab characters
from the file. This is not the most efficient way to read a file, but it
will make your parsing code cleaner.

     # Get the next character from the file.
     # This is not the most efficient method of reading a file
     # But makes our code cleaner

  $char = getc(htmlFile);

12. Check if the character read is a <. This character will start the
tags in an HTML file.

  # Check if we are at the start of a tag
  if($char eq "<")
  {

13. If you got a <, then check if you are in a tag. Don't let tags exist
inside other tags.

   # Don't allow any tags inside other tags
      if($inTag)
      {
    die "This is an invalid html file.\n";
      }

14. If the parser is not already in a tag, set $inTag to 1, because you
are now in one.

      else
      {
       # Denote that we are in a tag
    $inTag = 1;

15. Check if you have anything in $tmpBuffer. If so, then handle the
temporary buffer as plain text and add the parsed plain text to the main
buffer. Add a < to $tmpBuffer. This concludes the if($char eq "<")
statement.

    # If we were reading plain text
    if($tmpBuffer ne "") # a plain "0" must not be dropped
    {
     # Handle the plain text
        $mainBuffer .= &handlePlainText($tmpBuffer);

        # Reset the tmp buffer
        $tmpBuffer = "";
    }

    # Start the new tmp buffer
    $tmpBuffer = "<";
      }
  }

16. See if the new character is a >. This indicates the end of a tag.

  elsif($char eq ">") # Check if we are at the end of a tag
  {

17. Make sure that you are in a tag, and die if you are not. If you got
a > but are not in a tag, this is a bad HTML file.

   # Don't allow end tags without start tags
      if(! $inTag)
      {
    die "This is an invalid html file.\n";
      }

18. Handle the end of the current tag. Add the > to the end of the
temporary buffer. Then check if this tag is the stop string. In this
case, the subroutine is supposed to return. Otherwise, handle the tag,
then add the parsed tag to the main buffer and reset the temporary
buffer.

      else
      {
       # Denote the end of the tag
    $inTag = 0;

    # Finish the tmp buffer
    $tmpBuffer .= ">";

    # See if we are at the stop string
    if($stopStr && ($tmpBuffer =~ /$stopStr/i))
    {
        return $mainBuffer;#we have read to the stop string
    }
    else
    {
     # If not handle the tag, and keep reading
        $tmpBuffer = &handleTag($tmpBuffer);

        # Add the tmp buffer to the main buffer
        $mainBuffer .= $tmpBuffer;

        # Reset the tmp buffer
        $tmpBuffer = "";
    }
      }
  }

19. Check if you are at the end of the file or if you got the stop
character. A stop character is required by tags that need the
information at the end of a line, such as OPTION.

  elsif(eof(htmlFile)

       || ($stopChar && ($char eq $stopChar))) # check for stopchar
  {

20. Handle errors. If you are at the end of the file or found the stop
character and are in a tag, then die, because this is considered a
failure.

   # Don't allow the parsing to end inside a tag
      if($inTag)
      {
    die "This is an invalid html file.\n";
      }

21. Finalize the temporary buffer. You either got the stop character or
are at the end of the file. Handle the plain text in $tmpBuffer, add the
parsed text to the main buffer, reset $tmpBuffer, and return the main
buffer.

     else
      {
       # Add the character to the tmp buffer
    $tmpBuffer .= $char if (!eof(htmlFile));

    # Add the tmp buffer to the main buffer,
    # after handling it.
    $mainBuffer .= &handlePlainText($tmpBuffer);

    # Reset the tmp buffer
    $tmpBuffer = "";
      }

   # We are at the end of the file, or found
   # the stop character, so return the main buffer
      return $mainBuffer;
  }

22. Handle the case of the nonspecial, not < or >, character. Append it
to the temporary buffer.

  else # If nothing else add the character to the tmp buffer
  {
      $tmpBuffer .= $char;
  }

    }

23. Close the do-until loop. Let the loop continue until the end of the
file. If a stop character or stop string is provided, it will be caught
earlier than this. Return the main buffer and close the mainHtmlParser
subroutine.

    until(eof(htmlFile));

 # Return the main buffer
    return $mainBuffer;
}

24. Create the subroutine used to handle tags encountered by the
mainHtmlParser subroutine. This subroutine handles the different cases
in which the tag handler wants to have a stopping tag or wants to
process all of the data from the initial tag to the end of the line.
Call this subroutine handleTag.

sub handleTag
{
 # Declare local variables for the argument, as well
 # as the other required locals.

25. handleTag requires a number of local variables. These include one to
hold the argument, one for an associative array that will make access to
the tag string easier, scalars for the handler's name, the end tag, and
the text between the initial tag and the end tag. This subroutine uses
the eval subroutine to call the tag's handler. You need a local scalar
to store the string that you will send to eval.

    local($tagString) = @_;
    local(%tagDict,$endTag,$handler,$argString);
    local($evalString);

26. Use the dictForTag subroutine, created in later steps, to parse the
tag string into an associative array. This will take everything between
the < and > and return an array with keys like TAG, NAME, and VALUE. All
the keys will be capitalized.

 # Create an associative array containing the data for the
 # tag string.

    %tagDict = &dictForTag($tagString);

27. See if an end tag was registered for the tag. Use the tag dictionary
to find the name of the tag and the global associative array, %endTags,
to find the end tag. End tags are registered by the programmer writing
the handler for that tag.

 # Look for an end tag. These are registered in the %endTags
 # global associative array.

    $endTag = $endTags{$tagDict{"TAG"}};

28. See if a handler has been registered for the tag. Again, a global
associative array variable is used. In this case, it is called
%handlerDict.

 # Look for a handler subroutine for the tag.
 # These are registered in the %handlerDict global
 # associative array.

    $handler = $handlerDict{$tagDict{"TAG"}};

29. If this tag doesn't have a registered handler, then treat it as
plain text. Call the subroutine handlePlainText and return the result.
You will write this subroutine in later steps.

 # If no handler is found, treat the tag as plain text, and
 # return the parsed data.

    if(!($handler))
    {
  $tagString = &handlePlainText($tagString);

  return $tagString;
    }

30. Build the eval string. Based on the tag's registered end tag, you
may need to read to the end of the line or read to the end tag. Evaluate
the string and catch the resulting parsed HTML.

 # If the tag wants the data to the end of the line
 # use mainHtmlParser to read to the end of the line, then
 # call the tag's handler subroutine with the data to the
 # end of the line.

    if($endTag eq "eol") # Tag that needs data to eol
    {
  $argString = &mainHtmlParser("","\n");

  $evalString = "&".$handler.'($tagString,$argString,0,%tagDict);';
    }
    elsif($endTag)  # Tag with an end tag
    {
     # Use mainHtmlParser to read any text, up to
     # the end tag. Remove the end tag from the string
     # if it is still attached.

  $argString = &mainHtmlParser($endTag,0);
  $argString =~ s/\Q$endTag\E$//i; # Remove only a trailing end tag

  # Call the tag's handler
  $evalString =
"&".$handler.'($tagString,$argString,$endTag,%tagDict);';
    }
    else   # General unary tag
    {
     #For unary tags, simply call the handler.
  $evalString = "&".$handler.'($tagString,0,0,%tagDict);';
    }

    $tagString = eval($evalString);

31. Return the result from the tag handler. Close the subroutine
definition for handleTag.

    # Return the parsed text.
    return $tagString;
}

32. Define the subroutine handlePlainText. This is called whenever text
is encountered outside a tag or when a tag without a handler is
encountered. handlePlainText is like handleTag, except no end tags are
used. A default handler is used for all plain text.

sub handlePlainText
{
 # Declare the locals

    local($plainString) = @_;
    local($handler,$evalString);

    # Look for a default handler for plain text
    $handler = $handlerDict{"DEFAULT"};

 #If there is a handler, call it and catch the return value.

    if($handler)
    {
  $evalString = "&".$handler.'($plainString,0,0,0);';
  $plainString = eval($evalString);
    }

 # Return either the text passed in, or the parsed text if there
 # was a default handler.

    return $plainString;
}

33. Start the subroutine dictForTag. This subroutine takes a tag string
as an argument. A tag string is all the text between and including a <
and a > character. dictForTag breaks the string into a tag, key-value
pairs, and unary attributes. These are inserted into an associative
array that is then returned. The tag is inserted into the array as the
value for the key TAG, and key-value pairs are added to the array as is,
after the key is capitalized. Unary attributes are added to the array,
capitalized, as both the key and the value.

sub dictForTag
{

34. Declare the locals for the argument and the working associative
array. A scalar is also used to track the keys you create.

 # Declare locals
    local($tagString) = @_;
    local(%tagDict,$key);

35. Look for the tag. It should be at the front of the tag string and
consist only of alphanumeric characters. You are using a regular
expression to identify the tag. The parentheses indicate that the
matching pattern should be stored in the $1 special variable. If the tag
is found, remove it from the tag string to make further parsing easier.
Capitalize the tag using tr and add it to the associative array
%tagDict. If no tag is found, this is an error; return an empty tag
dictionary.

 # Look for the tag
 # Remove it from the tag string
 # Capitalize the tag, and put it into the dict
 # with the key, TAG
 # If no tag is found, then this is not a tag string.

    if(($tagString =~ s/^<(\w*)[\s>]//) && $1)
    {
  ($key = $1) =~ tr/a-z/A-Z/; # Make the tag upper case

  $tagDict{"TAG"} = $key;
    }
    elsif(($tagString =~ s/^<!--(\w*)[\s>]//) && $1)
    {
  ($key = $1) =~ tr/a-z/A-Z/; # Make the tag upper case

  $tagDict{"TAG"} = $key;
    }
    else
    {
  return %tagDict;
    }

36. Look for key-value strings. Again, a regular expression is used to
find the strings. In this case, you are looking for a single word
followed by zero or more spaces, then an = character. After the =, look
for zero or more spaces and any pattern inside quotes. This does require
that all key-value attributes have their value in quotes. Once a pattern
is found, the parentheses in the regular expression cause the key to be
placed in the scalar $1 and the value in the scalar $2. Capitalize the
key and add it and the value to the associative array.

    # Find all of the tag's key/value attributes
 # Remove them from the tag string.

    while($tagString =~ s/(\w*)\s*=\s*\"([^\"]*)\"//)
    {

  if($1)
  {
      ($key = $1) =~ tr/a-z/A-Z/;  # Make upper case

      if($2)
      {
   $tagDict{$key} = $2; # Add the key to the dict
      }
      else
      {
   $tagDict{$key} = "";
      }
  }
    }

37. Look for single attributes. Use a regular expression with
parentheses. When an attribute is found, remove it from the string,
capitalize it, and add it to tagDict as a value with itself as the key.

    # Find the single attributes
 # and remove them from the string.
    while($tagString =~ s/\s+(\w*)[\s>]*//)
    {
  if($1)
  {
      ($key = $1) =~ tr/a-z/A-Z/;  # Make upper case
      $tagDict{$key} = $key; # Add to the dict
  }
    }

38. Return the tag dictionary and close the definition of dictForTag.

    return %tagDict;
}

39. The last subroutine in the parsing toolkit is not really used in
parsing. stringForTagDict takes the dictionary for a tag, like the one
created by dictForTag, and returns a string. This string will have a <
followed by the tag, key-value attributes, unary attributes, and the
closing >. The implementation of stringForTagDict uses foreach to find
the keys in the dictionary, then creates the return string with
concatenation. This routine will be useful when you are writing
tag-handling routines. It is not used by the other library routines.

sub stringForTagDict
{
    # Declare locals
    local(%tagDict) = @_;
    local($tagString);

 # If there was a tag dictionary passed in
    if(%tagDict)
    {
     #If the tag dictionary has a TAG in it, build the tag string
  if($tagDict{"TAG"})
  {
   # Start the string with a < and the tag

      $tagString .= "<";
      $tagString .= $tagDict{"TAG"};

   # Add the keys to the string

      foreach $key (keys %tagDict)
      {
    # Ignore TAG, we already added it

    if($key eq "TAG")
    {
        next;
    }
    elsif($key eq $tagDict{$key}) # unary attribute
    {
        $tagString .= " ";
        $tagString .= $key;
    }
    elsif($tagDict{$key}) #key/value attributes
    {
        $tagString .= " ";
        $tagString .= $key;
        $tagString .= "= \"";
        $tagString .= $tagDict{$key};
        $tagString .= "\"";
    }
      }

   #Close the tag string
      $tagString .= ">";
  }
    }

 #Return the tag string
    return $tagString;
}

40. Return 1 at the end of parseHtm.pl. This will ensure that require
will accept the file appropriately.

1;

How It Works

The HTML parsing code is made up of a set of subroutines that separate
the task of parsing HTML into reasonably sized chunks. This code is
intended to provide a library of useful subroutines. The library itself
really has only two public subroutines: parseHtml and stringForTagDict.
A developer's primary interaction with the library is by defining tag
handler subroutines and registering them in the global associative array
%handlerDict. If the handler wants to receive the data between a tag and
its end tag, or a tag and the end of the line it is on, then the
programmer also registers the end tag in the global associative array
%endTags. In the case of a tag wishing to receive data to the end of its
line, the string eol should be placed in %endTags.

Once a programmer registers all the tag handlers that he or she is
interested in, the programmer calls the subroutine parseHtml with the
name of the HTML file as an argument. parseHtml will open the HTML
file, using the global file handle htmlFile. If the file opens
successfully, then parseHtml calls the subroutine mainHtmlParser to do
the actual parsing.
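
One way to make the "if the file opens successfully" check explicit is
to test the return value of open directly. The following is only a
sketch: the subroutine name readIfOpen and the filehandle SKETCHFILE
are made up for this example, and the body simply reads the file
instead of parsing it.

```perl
# readIfOpen is a hypothetical helper: it returns the file's text,
# or an empty string when the file cannot be opened.
sub readIfOpen
{
    my($fileName) = @_;
    my $text = "";

    # open returns a true value only when the file was really opened
    if (open(SKETCHFILE, $fileName))
    {
        while (<SKETCHFILE>)
        {
            $text .= $_;
        }
        close(SKETCHFILE);
    }
    else
    {
        warn "Cannot open $fileName: $!\n";
    }

    return $text;
}

# A missing file yields an empty string rather than a parse attempt.
print "Got ", length(&readIfOpen("no_such_file_for_sketch.htm")),
      " characters\n";
```

Testing the result of open, rather than the filehandle's name, avoids
running the parser on a handle that was never opened.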

The subroutine mainHtmlParser serves two purposes: reading HTML and
parsing HTML. Reading HTML consists of looking for the end-of-file, a
stop character like the end of a line, or a tag that should act as a
stopping string. When any of these is encountered, the subroutine
returns the parsed text. Parsing the HTML is the process of looking for
tags and plain text. When either of these is encountered, another
subroutine is called to parse the actual text. The resulting parsed text
is then added to a buffer, which is ultimately returned to the caller of
the mainHtmlParser subroutine.
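
The character-by-character reading described above can be sketched in a
much smaller form. The subroutine below, splitHtml, is a hypothetical
and simplified stand-in for mainHtmlParser: it has no stop string, no
stop character, and no handlers, and it only separates tags from plain
text. It also assumes a Perl recent enough (5.8 or later) to open a
filehandle on a string, which the original library does not need.

```perl
# splitHtml is a hypothetical, cut-down illustration of the getc loop.
sub splitHtml
{
    my($html) = @_;
    my(@pieces, $char);
    my $buffer = "";
    my $inTag  = 0;

    # Assumption: in-memory filehandles are available (Perl 5.8+)
    open(my $fh, "<", \$html) || die "Cannot read the string: $!";

    while (defined($char = getc($fh)))
    {
        if ($char eq "<")
        {
            die "This is an invalid html file.\n" if $inTag;
            push(@pieces, "TEXT:$buffer") if $buffer ne "";
            $buffer = "<";
            $inTag  = 1;
        }
        elsif ($char eq ">")
        {
            die "This is an invalid html file.\n" if !$inTag;
            push(@pieces, "TAG:$buffer>");
            $buffer = "";
            $inTag  = 0;
        }
        else
        {
            $buffer .= $char;   # ordinary character, keep collecting
        }
    }
    close($fh);

    die "This is an invalid html file.\n" if $inTag;
    push(@pieces, "TEXT:$buffer") if $buffer ne "";
    return @pieces;
}

print join("\n", &splitHtml('<B>Hi</B> there')), "\n";
```

In the real library each TAG piece would go to handleTag and each TEXT
piece to handlePlainText, with the results appended to the main buffer.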

The subroutines mainHtmlParser uses to parse text are handlePlainText
and handleTag. Both of these use the eval function to call an
appropriate handler. Tags that have a handler registered in
%handlerDict will have their handler called. All other tags and plain
text will have the default handler called. This handler is either a
subroutine registered in %handlerDict under the key DEFAULT, or
nothing. In the case where no default handler exists, the text is
returned as is.
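
To illustrate the call-by-name mechanism, here is a minimal sketch of a
default handler being looked up in %handlerDict and invoked through
eval. The names upcaseText and runDefaultHandler are hypothetical and
exist only for this example; the eval string is built the same way
handlePlainText builds it.

```perl
# Register a hypothetical default handler by name.
%handlerDict = ("DEFAULT" => "upcaseText");

# upcaseText is an example handler: it takes the usual four
# arguments and returns the replacement text.
sub upcaseText
{
    my($text, $argString, $endString, $unused) = @_;
    return uc($text);
}

# runDefaultHandler mirrors the lookup-then-eval pattern.
sub runDefaultHandler
{
    my($plainString) = @_;
    my $handler = $handlerDict{"DEFAULT"};

    # With no handler registered, the text passes through unchanged
    return $plainString if !$handler;

    # Build a string such as '&upcaseText($plainString,0,0,0);'
    my $evalString = "&" . $handler . '($plainString,0,0,0);';
    my $result = eval($evalString);
    die $@ if $@;   # surface errors such as a misspelled handler name

    return $result;
}

print &runDefaultHandler("hello"), "\n";
```

Checking $@ after eval is an addition in this sketch; without it, a
handler name that does not match any subroutine fails silently and the
text is replaced by an empty value.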

The final two subroutines in the library are dictForTag and
stringForTagDict. These subroutines translate a tag into an associative
array and back. This translation makes it easier to write handlers,
which can rely on the associative array to provide the tag's name,
value, and other attributes. stringForTagDict allows the developer to
change values in the associative array, then turn it back into a string
before returning it from a handler function. This is much easier than
parsing the tag inside each handler routine.
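
As an illustration of the tag-to-array translation, the sketch below
breaks a tag string into an associative array in roughly the way
described. tagToHash is a hypothetical, shortened variant written for
this example, not the dictForTag routine itself; like the original, it
assumes every attribute value is enclosed in double quotes.

```perl
# tagToHash is a hypothetical, simplified cousin of dictForTag.
sub tagToHash
{
    my($tagString) = @_;
    my %dict;

    # The tag name must come right after the <
    return %dict unless $tagString =~ s/^<(\w+)//;
    $dict{"TAG"} = uc($1);

    # Key-value attributes, for example type="radio"
    while ($tagString =~ s/(\w+)\s*=\s*"([^"]*)"//)
    {
        $dict{uc($1)} = $2;
    }

    # Whatever words remain are unary attributes, for example checked
    while ($tagString =~ s/(\w+)//)
    {
        $dict{uc($1)} = uc($1);
    }

    return %dict;
}

%sketchDict = &tagToHash('<input type="radio" name="size" checked>');
foreach $key (sort keys %sketchDict)
{
    print "$key => $sketchDict{$key}\n";
}
```

A handler can then test or change an entry such as $sketchDict{"NAME"}
without doing any pattern matching of its own.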



Problem

The HTML that I am dynamically displaying has a form on it. I would like
to use the HTML parsing library from How-To 10.1 to set the action and
request methods for the form. I know that I need to write a handler
subroutine for this to work.

Technique

The parsing library from How-To 10.1 provides generic HTML parsing; you
will rely on it for the majority of your work. The library allows
programmers to define handler subroutines for any HTML tag. To manage a
form's method and action, you will write a handler for the FORM tag. A
handler subroutine is passed the string that represents the HTML tag, as
well as a dictionary of information about the tag. The handler uses this
information to return a parsed version of the tag that will ultimately
be sent to a browser.

To facilitate testing, this How-To describes how to build a form handler
in the context of a test script. An HTML file is provided to test the
script.

Steps

1. Create a work directory. You will be using several Perl files, so it
is easier to work on the program if these files are all together.

2. Copy the file parseHtm.pl created in How-To 10.1 into the working
directory. You can find this file on the CD-ROM.

3. Create an HTML file to test the form handler. This test page can be
fairly simple. For example, you might use the page in Figure 10-2 that
displays a message and a Submit button. The HTML for this page is

<HTML>
<HEAD>
<TITLE>CGI How-to, Form Handler Test Page</TITLE>
</HEAD>
<BODY>
<H4><FORM METHOD="POST" ACTION="http:///cgi-bin/form.pl">

This is a POST form with the action: form.pl
Pressing submit will return a page containing a form that has no initial
method or action, but will have the method set to POST and the ACTION to
form.pl
<p>
<INPUT TYPE="SUBMIT" NAME="SUBMIT" VALUE="Run Form Through Script">

</FORM></H4>

</BODY>
</HTML>

The idea of this test page is to provide a Submit button that will
initiate the test script. The script will display another file after
setting its form's action and method. Call the file for the page in
Figure 10-2 form_pl.htm if you would like it to be compatible with the
provided test script code. The HTML for the follow-up page is
<HTML>
<HEAD>
<TITLE>CGI How-to, Form Handler Result Page</TITLE>
</HEAD>
<BODY>
<H4><FORM METHOD="" ACTION="">

This is a form with no initial method and action. Press submit to test
that a method and action was provided by the displaying script.



<p>
<INPUT TYPE="SUBMIT" NAME="SUBMIT" VALUE="Run Form Through Script">

</FORM></H4>

</BODY>
</HTML>

Call the file for the follow-up page f2_pl.htm if you would like it to
be compatible with the provided test script code.

4. Create a Perl file called form.pl. This file is also on the CD-ROM.
It will include the handler for the FORM tag and act as a CGI script.

5. Start the file form.pl with the appropriate comment for describing
this as a Perl script. Make sure that the path used is correct for your
machine.

6. Require the file containing the HTML parsing library. This file is
called parsehtm.pl.

require "parsehtm.pl";


7. Start the form input handler subroutine. Call it formHandler.

sub formHandler
{

8. Declare local variables to hold the subroutine's arguments. All
handler subroutines for the parsing library take four arguments. These
are the tag string (everything between the < and >); a possible argument
string, unused in this case; an end string, also unused; and a
dictionary of information about the tag. This tag dictionary will be the
primary source of information about the tag.

local($tagString,$argString,$endString,%tagDict) = @_;

9. Declare a local to hold the string this handler will return. This
string will be inserted into the HTML file, in place of the original tag
string, before the file is sent to the client's browser.

local($retVal);

10. Change the HTML actually sent to the client. Alter the values in the
tag dictionary, then convert the dictionary to an appropriate string.
Because you are handling a FORM tag, set the dictionary's values for the
keys METHOD and ACTION. All the keys are capitalized by the subroutine
that created the dictionary.

$tagDict{"METHOD"} = "POST";
$tagDict{"ACTION"} = "form.pl";

11. Use the library routine stringForTagDict to turn the updated tag
dictionary into a tag string.

# Get the string for the new dictionary
$retVal = &stringForTagDict(%tagDict);

12. Return the new tag string and close the subroutine.

return $retVal;
}

13. Begin the rest of the test script by adding the formHandler to the
global associative array %handlerDict. Handlers are registered by name,
with the key equal to the tag that they handle.

$handlerDict{"FORM"} = "formHandler";

14. Use the library routine parseHtml to parse the file f2_pl.htm,
created in an earlier step. The return value of this routine is the
newly parsed HTML.

$output = &parseHtml("f2_pl.htm");

15. Print the content type for this script's reply to standard out.

print "Content-type: text/html\n\n";

16. Print the parsed HTML to standard out. This will send it to the
browser.

print $output;

17. Set the permissions on form.pl to allow execution. Be sure to
install the parsehtm.pl file as well as form.pl. Remember that the
script also needs access to the raw HTML file to parse and return it.
Therefore, you also need to put form_pl.htm and f2_pl.htm where form.pl
can find them. Open the test HTML file, form_pl.htm. Press the Submit
button. View the HTML for the follow-up page.

How It Works

Dealing with dynamically parsed HTML can be a complex problem. You
should rely on the library created in How-To 10.1 to handle the majority
of the parsing. Using the library, you have to write handler subroutines
only for the tags you want to handle. In this case you are handling FORM
tags. Handling a tag involves setting the appropriate values in a
dictionary and translating the dictionary into a string. Actually
creating the dictionary is the library's job, as is turning the
dictionary back into a string.
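
The handler pattern can be tried without the library by using a small
stand-in for stringForTagDict. In the sketch below, buildTag and
formHandlerSketch are hypothetical names invented for this example;
buildTag sorts the keys so the output is predictable, which the
library routine does not promise. Note that the dictionary has to be
the last argument, because Perl flattens it into the argument list.

```perl
# buildTag is a hypothetical stand-in for stringForTagDict.
sub buildTag
{
    my(%tagDict) = @_;
    my $tagString = "<" . $tagDict{"TAG"};

    foreach my $key (sort keys %tagDict)
    {
        next if $key eq "TAG";   # already written
        $tagString .= " $key=\"$tagDict{$key}\"";
    }

    return $tagString . ">";
}

# formHandlerSketch follows the four-argument handler convention.
sub formHandlerSketch
{
    my($tagString, $argString, $endString, %tagDict) = @_;

    # Overwrite whatever the artist's HTML contained
    $tagDict{"METHOD"} = "POST";
    $tagDict{"ACTION"} = "form.pl";

    return &buildTag(%tagDict);
}

# Simulate the call the parser would make for <FORM METHOD="" ACTION="">
print &formHandlerSketch('<FORM METHOD="" ACTION="">', 0, 0,
                         "TAG", "FORM", "METHOD", "", "ACTION", ""), "\n";
```

The empty METHOD and ACTION values from the page are replaced before
the tag is rebuilt, which is the whole job of the FORM handler.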

Comments

When completed, your form handler should look like this:

sub formHandler
{
    local($tagString,$argString,$endString,%tagDict)
 = @_;

    local($retVal);

    $tagDict{"METHOD"} = "POST";
    $tagDict{"ACTION"} = "form.pl";

    $retVal = &stringForTagDict(%tagDict);

    return $retVal;
}


            :::::::::::in its entirety:::::::

#!/usr/bin/perl

# This package uses the global file handle htmlFile
# There are two global assoc. arrays, endTags & handlerDict

# parseHtml takes one argument, a filename
# and returns the parsed html in a string

sub parseHtml
{
    # Declare variables to hold the arguments
    local($fileName) = @_;

    # Declare a variable to store the return value
    local($retVal);

    # Open the file
    open(htmlFile,$fileName);

    # If the file opened, call the parser on it
    $retVal = &mainHtmlParser("",0) if defined(fileno(htmlFile));

    # Close the file
    close(htmlFile);

    # Return the string parsed from the file
    return $retVal;
}

# mainHtmlParser takes several arguments
# This subroutine can either take a stop string, or a stop char
# it reads the file htmlFile until either the end of file
# the stopstring or the stop char is encountered.
#
# mainHtmlParser returns a string filtered from the file.
# The filters are tag handlers and a default handler.
# Handlers should take 4 arguments for:
#
# tagString - The string containing the tag
# argString - Any data between the tag and end tag
# endString - The end tag
# tagDict - The dictionary created using dictForTag
#
# Handlers are registered in the global dictionary
# handlerDict.
#
# If the tag has a matching end tag like <HTML> and </HTML>
# then the tag should be registered in the global
# %endTags array, with the value equal to its end tag.
#
# If the tag needs the data up to the end of the line, like
# OPTION, then it should appear in %endTags with the value
# "eol".
#
# Handlers should return the string to replace the tag with.
#
# The default is used for text that wasn't part of a tag.
# Tags are denoted by <text>.
# As plain text is encountered the handler registered under
# the string "DEFAULT" is called.

sub mainHtmlParser
{
    # Declare locals to store the arguments
    local($stopStr,$stopChar) = @_;

 # Declare several local variables
    local($char,$inTag,$tmpBuffer,$mainBuffer);

 # Initialize the main buffer, this is what is returned
    $mainBuffer = "";

    # $inTag is used to denote when we are inside <>'s
    $inTag = 0;

 # Loop until the end of the file, or
 # we encounter the stop string or stop character.
    do
    {

     # Get the next character from the file.
     # This is not the most efficient method of reading a file
     # But makes our code cleaner
     
  $char = getc(htmlFile);
  
  # Check if we are at the start of a tag
  if($char eq "<")
  {
   # Don't allow any tags inside other tags
      if($inTag)
      {
    die "This is an invalid html file.\n";
      }
      else
      {
       # Denote that we are in a tag
    $inTag = 1;
    
    # If we were reading plain text
    if($tmpBuffer ne "") # a plain "0" must not be dropped
    {
     # Handle the plain text
        $mainBuffer .= &handlePlainText($tmpBuffer);
  
        # Reset the tmp buffer
        $tmpBuffer = "";
    }
    
    # Start the new tmp buffer
    $tmpBuffer = "<";
      }
  }
  elsif($char eq ">") # Check if we are at the end of a tag
  {
   # Don't allow end tags without start tags
      if(! $inTag)
      {
    die "This is an invalid html file.\n";
      }
      else
      {
       # Denote the end of the tag
    $inTag = 0;

    # Finish the tmp buffer
    $tmpBuffer .= ">";

    # See if we are at the stop string
    if($stopStr && ($tmpBuffer =~ /$stopStr/i))
    {
        return $mainBuffer;#we have read to the stop string
    }
    else
    {
     # If not handle the tag, and keep reading
        $tmpBuffer = &handleTag($tmpBuffer);
        
        # Add the tmp buffer to the main buffer
        $mainBuffer .= $tmpBuffer;
        
        # Reset the tmp buffer
        $tmpBuffer = "";
    }
      }
  }
  elsif(eof(htmlFile)

       || ($stopChar && ($char eq $stopChar))) # check for stopchar
  {
  
   # Don't allow the parsing to end inside a tag
      if($inTag)
      {
    die "This is an invalid html file.\n";
      }
      else
      {
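       # Add the character to the tmp buffer. eof() is already
       # true once the last character has been read, so test the
       # character itself; otherwise the final character is lost.
    $tmpBuffer .= $char if (defined($char));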
  
    # Add the tmp buffer to the main buffer,
    # after handling it.
    $mainBuffer .= &handlePlainText($tmpBuffer);

    # Reset the tmp buffer
    $tmpBuffer = "";
      }
   
   # We are at the end of the file, or found
   # the stop character, so return the main buffer
      return $mainBuffer;
  }
  else # If nothing else add the character to the tmp buffer
  {
      $tmpBuffer .= $char;
  }

    }
    until(eof(htmlFile));

 # Return the main buffer
    return $mainBuffer;
}

#
# handleTag actually handles the tags for mainHtmlParser

sub handleTag
{
 # Declare local variables for the argument, as well
 # as the other required locals.
 
    local($tagString) = @_;
    local(%tagDict,$endTag,$handler,$argString);
    local($evalString);

 # Create an associative array containing the data for the
 # tag string.
 
    %tagDict = &dictForTag($tagString);

 # Look for an end tag. These are registered in the %endTags
 # global associative array.
 
    $endTag = $endTags{$tagDict{"TAG"}};

 # Look for a handler subroutine for the tag.
 # These are registered in the %handlerDict global
 # associative array.
 
    $handler = $handlerDict{$tagDict{"TAG"}};
 
 # If no handler is found, treat the tag as plain text, and
 # return the parsed data.
 
    if(!($handler))
    {
  $tagString = &handlePlainText($tagString);

  return $tagString;
    }

 # If the tag wants the data to the end of the line
 # use mainHtmlParser to read to the end of the line, then
 # call the tag's handler subroutine with the data to the
 # end of the line.
 
    if($endTag eq "eol") # Tag that needs data to eol
    {
  $argString = &mainHtmlParser("","\n");
 
  $evalString = "&".$handler.'($tagString,$argString,0,%tagDict);';
    }
    elsif($endTag)  # Tag with an end tag
    {
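     # Use mainHtmlParser to read any text, up to
     # the end tag. Remove the end tag from the string.

  $argString = &mainHtmlParser($endTag,0);

  # Remove only the end tag itself, if it is present. The
  # greedy pattern <.*> would also delete any other markup
  # on the last line of the body text.
  $argString =~ s/(?:$endTag)$//i;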

  # Call the tag's handler
  $evalString =
      "&".$handler.'($tagString,$argString,$endTag,%tagDict);';
    }
    else   # General unary tag
    {
     #For unary tags, simply call the handler.
  $evalString = "&".$handler.'($tagString,0,0,%tagDict);';
    }

    $tagString = eval($evalString);

    # Return the parsed text.
    return $tagString;
}

# handlePlainText actually handles plain text for mainHtmlParser

sub handlePlainText
{
 # Declare the locals
 
    local($plainString) = @_;
    local($handler,$evalString);

    # Look for a default handler for plain text
    $handler = $handlerDict{"DEFAULT"};

 #If there is a handler, call it and catch the return value.
 
    if($handler)
    {
  $evalString = "&".$handler.'($plainString,0,0,0);';
  $plainString = eval($evalString); 
    }

 # Return either the text passed in, or the parsed text if there
 # was a default handler.
 
    return $plainString;
}

# Creates an associative array for a tag string

sub dictForTag
{
 # Declare locals
    local($tagString) = @_;
    local(%tagDict,$key);

 # Look for the tag
 # Remove it from the tag string
 # Capitalize the tag, and put it into the dict
 # with the key, TAG
 # If no tag is found, then this is not a tag string.
 
    if(($tagString =~ s/^<(\w*)[\s>]//) && $1)
    {
  ($key = $1) =~ tr/a-z/A-Z/; # Make the tag upper case

  $tagDict{"TAG"} = $key;
    }
    elsif(($tagString =~ s/^<!--(\w*)[\s>]//) && $1)
    {
  ($key = $1) =~ tr/a-z/A-Z/; # Make the tag upper case

  $tagDict{"TAG"} = $key;
    }
    else
    {
  return %tagDict;
    }

    # Find all of the tag's key/value attributes
 # Remove them from the tag string.
 
    while($tagString =~ s/(\w*)\s*=\s*\"([^\"]*)\"//)
    {

  if($1)
  {
      ($key = $1) =~ tr/a-z/A-Z/;  # Make upper case
 
      # $2 is always defined after a match, and testing its
      # truth would turn a value of "0" into "", so store it as is.
      $tagDict{$key} = $2; # Add the key to the dict
  }
    }

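    # Find the single attributes
 # and remove them from the string. The whitespace after the
 # tag name has already been removed, so also accept an
 # attribute at the very start (as in <OPTION SELECTED>).
    while($tagString =~ s/(?:^|\s+)(\w+)[\s>]*//)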
    {
  if($1)
  {
      ($key = $1) =~ tr/a-z/A-Z/;  # Make upper case
      $tagDict{$key} = $key; # Add to the dict
  }
    }

    return %tagDict;
}

# Creates a string from a tag dictionary

sub stringForTagDict
{
    # Declare locals
    local(%tagDict) = @_;
    local($tagString);

 # If there was a tag dictionary passed in
    if(%tagDict)
    {
     #If the tag dictionary has a TAG in it, build the tag string
  if($tagDict{"TAG"})
  {
   # Start the string with a < and the tag
   
      $tagString .= "<";
      $tagString .= $tagDict{"TAG"};
   
   # Add the keys to the string
   
      foreach $key (keys %tagDict)
      {
    # Ignore TAG, we already added it
    
    if($key eq "TAG")
    {
        next;
    }
    elsif($key eq $tagDict{$key}) # unary attribute
    {
        $tagString .= " ";
        $tagString .= $key;
    }
    elsif($tagDict{$key} ne "") # key/value attributes ("0" is a valid value)
    {
        $tagString .= " ";
        $tagString .= $key;
        $tagString .= "= \"";
        $tagString .= $tagDict{$key};
        $tagString .= "\"";
    }
      }
   
   #Close the tag string
      $tagString .= ">";
  }
    }

 #Return the tag string
    return $tagString;
}

1;
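
For anyone checking a hand-typed copy, here is a minimal stand-alone sketch of what dictForTag and stringForTagDict are meant to do with a tag. It is not from the book: parse_tag and build_tag are hypothetical names, it uses my and uc where the listing uses local and tr, and it sorts the keys only so the output is predictable.

```perl
#!/usr/bin/perl
# Stand-alone sketch, not the book's code. parse_tag and build_tag are
# hypothetical names that mirror dictForTag and stringForTagDict.
use strict;

# Turn a tag string such as <INPUT TYPE="text" CHECKED> into a hash.
sub parse_tag {
    my ($tag) = @_;
    my %dict;

    # Pull off the tag name; a string without one is not a tag string.
    return %dict unless $tag =~ s/^<(\w+)[\s>]//;
    $dict{TAG} = uc($1);

    # Quoted KEY="value" attributes, removed as they are found.
    while ($tag =~ s/(\w+)\s*=\s*"([^"]*)"//) {
        $dict{uc($1)} = $2;
    }

    # Any words left over are single attributes, stored as KEY => KEY.
    while ($tag =~ /(\w+)/g) {
        $dict{uc($1)} = uc($1);
    }

    return %dict;
}

# Rebuild a tag string from such a hash.
sub build_tag {
    my (%dict) = @_;
    return "" unless $dict{TAG};

    my $tag = "<" . $dict{TAG};
    foreach my $key (sort keys %dict) {    # sorted so the output is stable
        next if $key eq "TAG";
        if ($dict{$key} eq $key) {         # single attribute
            $tag .= " $key";
        }
        else {                             # key/value attribute
            $tag .= " $key=\"$dict{$key}\"";
        }
    }
    return $tag . ">";
}

my %input = parse_tag('<input type="text" name="user" checked>');
$input{VALUE} = "ken";                     # the kind of change a handler might make
print build_tag(%input), "\n";
```

Run as is, this should print <INPUT CHECKED NAME="user" TYPE="text" VALUE="ken">. If the typed-in dictForTag gives a different set of keys for the same tag, the regular expressions are the first place to look for a typo.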








Follow-Ups:
- Re: [MacPerl] How To Scripts
  - From: Richard Gordon <maccgi@bellsouth.net>
- Re: [MacPerl] How To Scripts
  - From: Ronald J Kimball <rjk@linguist.dartmouth.edu>
- Re: [MacPerl] How To Scripts
  - From: bart.lateur@skynet.be (Bart Lateur)