These are the scripts I'm having trouble with from the book "Perl 5 How To". Unfortunately, whoever is responsible for this book was cheap and didn't include the text on the CD-ROM or the website, so I had to type it in, and it may contain a few typos. I did a syntax check and it returned no errors. I've also included parsehtm.pl in its entirety at the end of this email.

Problem

One of the major tasks of a CGI script is to provide some kind of dynamic HTML. I would like to use HTML created by a graphic artist, but I would like to alter some of the contents dynamically. This requires me to parse the HTML file created by an artist and insert my own information.

Technique

Instead of treating each HTML file individually, create the general parsing engine illustrated in Figure 10-1. This engine will allow you to write Perl subroutines to handle each HTML tag encountered. For example, the subroutine inputHandler could be called every time an INPUT tag is encountered. This handler could change the value of the tag or, in the case of a radio button, turn it on or off. Each handler accepts text from the file and returns the text that is sent to the client, which may differ from the text it received.

Tags will be classified into three categories: unary, binary, and end-of-line. Unary tags, such as INPUT, have no end tag. Binary tags, such as TEXTAREA, bracket some form of body text between themselves and an end tag. End-of-line tags, such as OPTION, rely on the text that follows them, with the end of the line acting as an end tag.

The parsing engine will be built from several subroutines. The primary routine is parseHtml. This routine takes a file name as an argument and returns a string containing the parsed HTML. Using this parser will involve registering handlers for the tags you are interested in and calling parseHtml.
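To make the registration idea concrete, here is a rough sketch of how tags might be classified and dispatched. It is only an illustration under my own assumptions, not the book's code: %handlers, %end_tags, category_for, dispatch_tag and shout_handler are names invented for this sketch, it is written in modern Perl (strict, my), and it stores handlers as code references instead of registering them by name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical registries (names invented for this sketch).
# %handlers maps an upper-case tag name to a code reference.
# %end_tags records how much text a handler wants:
#   no entry -> unary tag, "eol" -> rest of the line, anything else -> end tag.
my %handlers;
my %end_tags;

# A toy handler: returns the tag unchanged, followed by its body in upper case.
sub shout_handler {
    my ($tag_text, $body) = @_;
    return $tag_text . uc($body);
}

$handlers{INPUT}    = sub { my ($tag_text) = @_; return $tag_text };
$handlers{TEXTAREA} = \&shout_handler;
$handlers{OPTION}   = \&shout_handler;
$end_tags{TEXTAREA} = "</TEXTAREA>";
$end_tags{OPTION}   = "eol";

# Classify a tag name into the three categories described above.
sub category_for {
    my ($name) = @_;
    my $end = $end_tags{ uc $name };
    return "unary"       unless defined $end;
    return "end-of-line" if $end eq "eol";
    return "binary";
}

# Call the registered handler, or pass the text through untouched.
sub dispatch_tag {
    my ($name, $tag_text, $body) = @_;
    my $handler = $handlers{ uc $name };
    return $tag_text . $body unless $handler;
    return $handler->($tag_text, $body);
}

print category_for("input"), "\n";                       # unary
print category_for("textarea"), "\n";                    # binary
print category_for("option"), "\n";                      # end-of-line
print dispatch_tag("option", "<OPTION>", "red"), "\n";   # <OPTION>RED
print dispatch_tag("b", "<B>", "bold"), "\n";            # <B>bold
```

Tags without a registered handler fall through unchanged, which is the behavior the engine is described as having for unhandled tags.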
Because this parser will not provide any interesting functionality until various tag handler subroutines are provided, you will not test it in this How-To. This engine is rather lengthy. If you are not interested in the details of parsing HTML, you may prefer to read through this section without writing the code and proceed to later sections in this chapter that focus on building handlers for various tags.

Steps

1. Create and open the file parsehtm.pl. This file will contain all of the primary subroutines for the HTML parser. You will need this file in each of the How-Tos in this chapter.

2. Start creating the parseHtml subroutine. This routine takes a file name and returns a string of parsed HTML. All the parsing is handled by a subroutine called mainHtmlParser.

    sub parseHtml
    {

3. Declare a local variable called $fileName for the argument and another, $retVal, for the parsed HTML string that the subroutine returns.

        # Declare variables to hold the arguments
        local($fileName) = @_;

        # Declare a variable to store the return value
        local($retVal);

4. Open the HTML file using the filehandle htmlFile. This filehandle is a global value used in all of the parsing routines. If the file cannot be opened, return without parsing.

        # Open the file; give up if it cannot be opened
        open(htmlFile,$fileName) || return $retVal;

5. Call the main parser. This main parsing routine, mainHtmlParser, looks for a stop string or stop character. If no stopper is provided, the routine will read to the end of the file and return the entire parsed file.

        # The file opened, so call the parser on it
        $retVal = &mainHtmlParser("",0);

6. Close the file and return the parsed HTML.

        # Close the file
        close(htmlFile);

        # Return the string parsed from the file
        return $retVal;
    }

7. Start the mainHtmlParser subroutine. This is a large subroutine. It reads characters from the HTML file looking for tags, plain text, the stop string, and the stop character. When either a tag or plain text is encountered, another subroutine is called to handle the text.
These other subroutines are handlePlainText and handleTag. The main parser uses two buffers, $mainBuffer and $tmpBuffer. $mainBuffer is used to keep track of the total parsed text. $tmpBuffer is used to keep track of text as it is being parsed, for example, the text between the < and > characters.

    sub mainHtmlParser
    {

8. Declare local variables to store the arguments. Declare another set to maintain the main buffer, the temporary buffer, and the current character, and to determine whether or not a tag is being read.

        # Declare locals to store the arguments
        local($stopStr,$stopChar) = @_;

        # Declare several local variables
        local($char,$inTag,$tmpBuffer,$mainBuffer);

9. Initialize the main buffer and the $inTag variable.

        # Initialize the main buffer, this is what is returned
        $mainBuffer = "";

        # $inTag is used to denote when we are inside <>'s
        $inTag = 0;

10. Start the main parsing loop. Use the do-until syntax.

        # Loop until the end of the file, or
        # we encounter the stop string or stop character.
        do
        {

11. Get the next character from the file htmlFile. Store the character in the $char variable. You will use getc to grab characters from the file. This is not the most efficient way to read a file, but it will make your parsing code cleaner.

            # Get the next character from the file.
            # This is not the most efficient method of reading a file
            # but makes our code cleaner
            $char = getc(htmlFile);

12. Check if the character read is a <. This character will start the tags in an HTML file.

            # Check if we are at the start of a tag
            if($char eq "<")
            {

13. If you got a <, then check if you are in a tag. Don't let tags exist inside other tags.

                # Don't allow any tags inside other tags
                if($inTag)
                {
                    die "This is an invalid html file.\n";
                }

14. If the parser is not already in a tag, set $inTag to 1, because you are now in one.

                else
                {
                    # Denote that we are in a tag
                    $inTag = 1;

15. Check if you have a $tmpBuffer. If so, then handle the temporary buffer as plain text and add the parsed plain text to the main buffer. Then start a new $tmpBuffer with a <. This concludes the if($char eq "<") statement.

                    # If we were reading plain text
                    if($tmpBuffer)
                    {
                        # Handle the plain text
                        $mainBuffer .= &handlePlainText($tmpBuffer);

                        # Reset the tmp buffer
                        $tmpBuffer = "";
                    }

                    # Start the new tmp buffer
                    $tmpBuffer = "<";
                }
            }

16. See if the new character is a >. This indicates the end of a tag.

            elsif($char eq ">") # Check if we are at the end of a tag
            {

17. Make sure that you are in a tag, and die if you are not. If you got a > but are not in a tag, this is a bad HTML file.

                # Don't allow end tags without start tags
                if(! $inTag)
                {
                    die "This is an invalid html file.\n";
                }

18. Handle the end of the current tag. Add the > to the end of the temporary buffer. Then check if this tag is the stop string. In this case, the subroutine is supposed to return. Otherwise, handle the tag, then add the parsed tag to the main buffer and reset the temporary buffer.

                else
                {
                    # Denote the end of the tag
                    $inTag = 0;

                    # Finish the tmp buffer
                    $tmpBuffer .= ">";

                    # See if we are at the stop string
                    if($stopStr && ($tmpBuffer =~ /$stopStr/i))
                    {
                        return $mainBuffer; # we have read to the stop string
                    }
                    else
                    {
                        # If not, handle the tag and keep reading
                        $tmpBuffer = &handleTag($tmpBuffer);

                        # Add the tmp buffer to the main buffer
                        $mainBuffer .= $tmpBuffer;

                        # Reset the tmp buffer
                        $tmpBuffer = "";
                    }
                }
            }

19. Check if you are at the end of the file or if you got the stop character. A stop character is required by tags that need the information at the end of a line, such as OPTION.

            elsif(eof(htmlFile) || ($stopChar && ($char eq $stopChar))) # check for stopchar
            {

20. Handle errors. If you are at the end of the file or found the stop character and are in a tag, then die, because this is considered a failure.

                # Don't allow the parsing to end inside a tag
                if($inTag)
                {
                    die "This is an invalid html file.\n";
                }

21. Finalize the temporary buffer. You either got the stop character or are at the end of the file. Handle the plain text in $tmpBuffer, add the parsed text to the main buffer, reset $tmpBuffer, and return the main buffer.

                else
                {
                    # Add the character to the tmp buffer, if one was read
                    $tmpBuffer .= $char if defined($char);

                    # Add the tmp buffer to the main buffer,
                    # after handling it.
                    $mainBuffer .= &handlePlainText($tmpBuffer);

                    # Reset the tmp buffer
                    $tmpBuffer = "";
                }

                # We are at the end of the file, or found
                # the stop character, so return the main buffer
                return $mainBuffer;
            }

22. Handle the case of the nonspecial character, one that is neither < nor >. Append it to the temporary buffer.

            else # If nothing else add the character to the tmp buffer
            {
                $tmpBuffer .= $char;
            }
        }

23. Close the do-until loop. Let the loop continue until the end of the file. If a stop character or stop string is provided, it will be caught earlier than this. Return the main buffer and close the mainHtmlParser subroutine.

        until(eof(htmlFile));

        # Return the main buffer
        return $mainBuffer;
    }

24. Create the subroutine used to handle tags encountered by the mainHtmlParser subroutine. This subroutine handles the different cases in which the tag handler wants to have a stopping tag or wants to process all of the data from the initial tag to the end of the line. Call this subroutine handleTag.

    sub handleTag
    {
        # Declare local variables for the argument, as well
        # as the other required locals.

25. handleTag requires a number of local variables. These include one to hold the argument, one for an associative array that will make access to the tag string easier, and scalars for the handler's name, the end tag, and the text between the initial tag and the end tag. This subroutine uses the eval function to call the tag's handler.
You need a local scalar to store the string that you will send to eval.

        local($tagString) = @_;
        local(%tagDict,$endTag,$handler,$argString);
        local($evalString);

26. Use the dictForTag subroutine, created in later steps, to parse the tag string into an associative array. This will take everything between the < and > and return an array with keys like TAG, NAME, and VALUE. All the keys will be capitalized.

        # Create an associative array containing the data for the
        # tag string.
        %tagDict = &dictForTag($tagString);

27. See if an end tag was registered for the tag. Use the tag dictionary to find the name of the tag and the global associative array, %endTags, to find the end tag. End tags are registered by the programmer writing the handler for that tag.

        # Look for an end tag. These are registered in the %endTags
        # global associative array.
        $endTag = $endTags{$tagDict{"TAG"}};

28. See if a handler has been registered for the tag. Again, a global associative array variable is used. In this case, it is called %handlerDict.

        # Look for a handler subroutine for the tag.
        # These are registered in the %handlerDict global
        # associative array.
        $handler = $handlerDict{$tagDict{"TAG"}};

29. If this tag doesn't have a registered handler, then treat it as plain text. Call the subroutine handlePlainText and return the result. You will write this subroutine in later steps.

        # If no handler is found, treat the tag as plain text, and
        # return the parsed data.
        if(!($handler))
        {
            $tagString = &handlePlainText($tagString);
            return $tagString;
        }

30. Build the eval string. Based on the tag's registered end tag, you may need to read to the end of the line or read to the end tag. Evaluate the string and catch the resulting parsed HTML.

        # If the tag wants the data to the end of the line
        # use mainHtmlParser to read to the end of the line, then
        # call the tag's handler subroutine with the data to the
        # end of the line.
        if($endTag eq "eol") # Tag that needs data to eol
        {
            $argString = &mainHtmlParser("","\n");
            $evalString = "&".$handler.'($tagString,$argString,0,%tagDict);';
        }
        elsif($endTag) # Tag with an end tag
        {
            # Use mainHtmlParser to read any text, up to
            # the end tag. Remove the end tag from the string.
            $argString = &mainHtmlParser($endTag,0);
            $argString =~ s/<.*>$//; # Remove the end tag

            # Call the tag's handler
            $evalString = "&".$handler.'($tagString,$argString,$endTag,%tagDict);';
        }
        else # General unary tag
        {
            # For unary tags, simply call the handler.
            $evalString = "&".$handler.'($tagString,0,0,%tagDict);';
        }

        $tagString = eval($evalString);

31. Return the result from the tag handler. Close the subroutine definition for handleTag.

        # Return the parsed text.
        return $tagString;
    }

32. Define the subroutine handlePlainText. This is called whenever text is encountered outside a tag or when a tag without a handler is encountered. handlePlainText is like handleTag, except no end tags are used. A default handler is used for all plain text.

    sub handlePlainText
    {
        # Declare the locals
        local($plainString) = @_;
        local($handler,$evalString);

        # Look for a default handler for plain text
        $handler = $handlerDict{"DEFAULT"};

        # If there is a handler, call it and catch the return value.
        if($handler)
        {
            $evalString = "&".$handler.'($plainString,0,0,0);';
            $plainString = eval($evalString);
        }

        # Return either the text passed in, or the parsed text if there
        # was a default handler.
        return $plainString;
    }

33. Start the subroutine dictForTag. This subroutine takes a tag string as an argument. A tag string is all the text between and including a < and a > character. dictForTag breaks the string into a tag, key-value pairs, and unary attributes. These are inserted into an associative array that is then returned. The tag is inserted into the array as the value for the key TAG, and key-value pairs are added to the array as is, after the key is capitalized.
Unary attributes are added to the array, capitalized, as both the key and value.

    sub dictForTag
    {

34. Declare the locals for the argument and the working associative array. A scalar is also used to track the keys you create.

        # Declare locals
        local($tagString) = @_;
        local(%tagDict,$key);

35. Look for the tag. It should be at the front of the tag string and consist only of alphanumeric characters. You are using a regular expression to identify the tag. The parentheses indicate that the matching pattern should be stored in the $1 special variable. If the tag is found, remove it from the tag string to make further parsing easier. Capitalize the tag using tr and add it to the associative array %tagDict. If no tag is found, this is an error; return an empty tag dictionary.

        # Look for the tag
        # Remove it from the tag string
        # Capitalize the tag, and put it into the dict
        # with the key, TAG
        # If no tag is found, then this is not a tag string.
        if(($tagString =~ s/^<(\w*)[\s>]//) && $1)
        {
            ($key = $1) =~ tr/a-z/A-Z/; # Make the tag upper case
            $tagDict{"TAG"} = $key;
        }
        elsif(($tagString =~ s/^<!--(\w*)[\s>]//) && $1)
        {
            ($key = $1) =~ tr/a-z/A-Z/; # Make the tag upper case
            $tagDict{"TAG"} = $key;
        }
        else
        {
            return %tagDict;
        }

36. Look for key-value strings. Again, a regular expression is used to find the strings. In this case, you are looking for a single word followed by zero or more spaces, then an = character. After the =, look for zero or more spaces and any pattern inside quotes. This does require that all key-value attributes have their value in quotes. Once a pattern is found, the parentheses in the regular expression cause the key to be placed in the scalar $1 and the value in the scalar $2. Capitalize the key and add it and the value to the associative array.

        # Find all of the tag's key/value attributes
        # Remove them from the tag string.
        while($tagString =~ s/(\w*)\s*=\s*\"([^\"]*)\"//)
        {
            if($1)
            {
                ($key = $1) =~ tr/a-z/A-Z/; # Make upper case
                if($2)
                {
                    $tagDict{$key} = $2; # Add the key to the dict
                }
                else
                {
                    $tagDict{$key} = "";
                }
            }
        }

37. Look for single attributes. Use a regular expression with parentheses. When an attribute is found, remove it from the string, capitalize it, and add it to %tagDict as a value with itself as the key.

        # Find the single attributes
        # and remove them from the string.
        while($tagString =~ s/\s+(\w*)[\s>]*//)
        {
            if($1)
            {
                ($key = $1) =~ tr/a-z/A-Z/; # Make upper case
                $tagDict{$key} = $key; # Add to the dict
            }
        }

38. Return the tag dictionary and close the definition of dictForTag.

        return %tagDict;
    }

39. The last subroutine in the parsing toolkit is not really used in parsing. stringForTagDict takes the dictionary for a tag, like the one created by dictForTag, and returns a string. This string will have a < followed by the tag, key-value attributes, unary attributes, and the closing >. The implementation of stringForTagDict uses foreach to find the keys in the dictionary, then creates the return string with concatenation. This routine will be useful when you are writing tag-handling routines. It is not used by the other library routines.
    sub stringForTagDict
    {
        # Declare locals
        local(%tagDict) = @_;
        local($tagString);

        # If there was a tag dictionary passed in
        if(%tagDict)
        {
            # If the tag dictionary has a TAG in it, build the tag string
            if($tagDict{"TAG"})
            {
                # Start the string with a < and the tag
                $tagString .= "<";
                $tagString .= $tagDict{"TAG"};

                # Add the keys to the string
                foreach $key (keys %tagDict)
                {
                    # Ignore TAG, we already added it
                    if($key eq "TAG")
                    {
                        next;
                    }
                    elsif($key eq $tagDict{$key}) # unary attribute
                    {
                        $tagString .= " ";
                        $tagString .= $key;
                    }
                    elsif($tagDict{$key}) # key/value attributes
                    {
                        $tagString .= " ";
                        $tagString .= $key;
                        $tagString .= "= \"";
                        $tagString .= $tagDict{$key};
                        $tagString .= "\"";
                    }
                }

                # Close the tag string
                $tagString .= ">";
            }
        }

        # Return the tag string
        return $tagString;
    }

40. Return 1 at the end of parsehtm.pl. This will ensure that require will accept the file appropriately.

    1;

How It Works

The HTML parsing code is made up of a set of subroutines that separate the task of parsing HTML into reasonably sized chunks. This code is intended to provide a library of useful subroutines. The library itself really has only two public subroutines: parseHtml and stringForTagDict. A developer's primary interaction with the library is by defining tag handler subroutines and registering them in the global associative array %handlerDict. If the handler wants to receive the data between a tag and its end tag, or a tag and the end of the line it is on, then the programmer also registers the end tag in the global associative array %endTags. In the case of a tag wishing to receive data to the end of its line, the string eol should be placed in %endTags.

Once a programmer registers all the tag handlers that he or she is interested in, the programmer calls the subroutine parseHtml with the name of the HTML file as an argument. parseHtml will open the HTML file, using the global filehandle htmlFile.
If the file opens successfully, then parseHtml calls the subroutine mainHtmlParser to do the actual parsing.

The subroutine mainHtmlParser serves two purposes: reading HTML and parsing HTML. Reading HTML consists of looking for the end-of-file, a stop character like the end of a line, or a tag that should act as a stopping string. When any of these is encountered, the subroutine returns the parsed text. Parsing the HTML is the process of looking for tags and plain text. When either of these is encountered, another subroutine is called to parse the actual text. The resulting parsed text is then added to a buffer, which is ultimately returned to the caller of the mainHtmlParser subroutine.

The subroutines mainHtmlParser uses to parse text are handlePlainText and handleTag. Both of these use the eval function to call an appropriate handler. Tags that have a handler registered in %handlerDict will have their handler called. All other tags and plain text will have the default handler called. This handler is either a subroutine registered in %handlerDict under the key DEFAULT, or nothing. In the case where no default handler exists, the text is returned as is.

The final two subroutines in the library are dictForTag and stringForTagDict. These subroutines translate a tag into an associative array and back. This translation makes it easier to write handlers, which can rely on the associative array to provide the tag's name, value, and other attributes. stringForTagDict allows the developer to change values in the associative array, then turn it back into a string before returning it from a handler function. This is much easier than parsing the tag inside each handler routine.

Problem

The HTML that I am dynamically displaying has a form on it. I would like to use the HTML parsing library from How-To 10.1 to set the action and request methods for the form. I know that I need to write a handler subroutine for this to work.
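Since such a handler works from the tag dictionary, it may help to see the tag-to-dictionary translation on its own first. The following is only a rough sketch under my own assumptions, not the library's dictForTag: parse_tag is a name invented here, it is written in modern Perl (strict, my, uc) rather than the book's style, and it only understands double-quoted attribute values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the tag-to-dictionary step described above.
sub parse_tag {
    my ($text) = @_;
    my %dict;

    # The tag name follows the opening <; without one, return an empty dict.
    return %dict unless $text =~ s/^<(\w+)//;
    $dict{TAG} = uc $1;

    # Quoted key="value" attributes: the key is upper-cased, the value kept as is.
    while ($text =~ s/(\w+)\s*=\s*"([^"]*)"//) {
        $dict{ uc $1 } = $2;
    }

    # Any words left over are bare (unary) attributes,
    # stored upper-cased as both key and value.
    while ($text =~ s/(\w+)//) {
        $dict{ uc $1 } = uc $1;
    }
    return %dict;
}

my %d = parse_tag('<input type="radio" name="size" checked>');
print join(", ", map { "$_=$d{$_}" } sort keys %d), "\n";
# prints: CHECKED=CHECKED, NAME=size, TAG=INPUT, TYPE=radio
```

The keys are sorted only for printing; a hash itself has no fixed order, which is worth remembering when a dictionary is later turned back into a tag string.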
Technique

The parsing library from How-To 10.1 provides generic HTML parsing; you will rely on it for the majority of your work. The library allows programmers to define handler subroutines for any HTML tag. To manage a form's method and action, you will write a handler for the FORM tag. A handler subroutine is passed the string that represents the HTML tag, as well as a dictionary of information about the tag. The handler uses this information to return a parsed version of the tag that will ultimately be sent to a browser. To facilitate testing, this How-To describes how to build a form handler in the context of a test script. An HTML file is provided to test the script.

Steps

1. Create a work directory. You will be using several Perl files, so it is easier to work on the program if these files are all together.

2. Copy the file parsehtm.pl created in How-To 10.1 into the working directory. You can find this file on the CD-ROM.

3. Create an HTML file to test the form handler. This test page can be fairly simple. For example, you might use the page in Figure 10-2 that displays a message and a Submit button. The HTML for this page is

    <HTML>
    <HEAD>
    <TITLE>CGI How-to, Form Handler Test Page</TITLE>
    </HEAD>
    <BODY>
    <H4><FORM METHOD="POST" ACTION="http:///cgi-bin/form.pl">
    This is a POST form with the action: form.pl
    Pressing submit will return a page containing a form that has
    no initial method or action, but will have the method set
    to POST and the ACTION to form.pl
    <p>
    <INPUT TYPE="SUBMIT" NAME="SUBMIT" VALUE="Run Form Through Script">
    </FORM></H4>
    </BODY>
    </HTML>

The idea of this test page is to provide a Submit button that will initiate the test script. The script will display another file after setting its form's action and method.
Call the file for the page in Figure 10-2 form_pl.htm if you would like it to be compatible with the provided test script code. The HTML for the follow-up page is

    <HTML>
    <HEAD>
    <TITLE>CGI How-to, Form Handler Result Page</TITLE>
    </HEAD>
    <BODY>
    <H4><FORM METHOD="" ACTION="">
    This is a form with no initial method and action.
    Press submit to test that a method and action was
    provided by the displaying script.
    <p>
    <INPUT TYPE="SUBMIT" NAME="SUBMIT" VALUE="Run Form Through Script">
    </FORM></H4>
    </BODY>
    </HTML>

Call the file for the follow-up page f2_pl.htm if you would like it to be compatible with the provided test script code.

4. Create a Perl file called form.pl. This file is also on the CD-ROM. It will include the handler for the FORM tag and act as a CGI script.

5. Start the file form.pl with the appropriate comment for describing this as a Perl script. Make sure that the path used is correct for your machine.

    #!/usr/bin/perl

6. Require the file containing the HTML parsing library. This file is called parsehtm.pl.

    require "parsehtm.pl";

7. Start the form input handler subroutine. Call it formHandler.

    sub formHandler
    {

8. Declare local variables to hold the subroutine's arguments. All handler subroutines for the parsing library take four arguments. These are the tag string (everything between the < and >); a possible argument string, unused in this case; an end string, also unused; and a dictionary of information about the tag. This tag dictionary will be the primary source of information about the tag.

        local($tagString,$argString,$endString,%tagDict) = @_;

9. Declare a local to hold the string this handler will return. This string will be inserted into the HTML file, in place of the original tag string, before the file is sent to the client's browser.

        local($retVal);

10. Change the HTML actually sent to the client. Alter the values in the tag dictionary, then convert the dictionary to an appropriate string. Because you are handling a FORM tag, set the dictionary's values for the keys METHOD and ACTION.
All the keys are capitalized by the subroutine that created the dictionary.

        $tagDict{"METHOD"} = "POST";
        $tagDict{"ACTION"} = "form.pl";

11. Use the library routine stringForTagDict to turn the updated tag dictionary into a tag string.

        # Get the string for the new dictionary
        $retVal .= &stringForTagDict(%tagDict);

12. Return the new tag string and close the subroutine.

        return $retVal;
    }

13. Begin the rest of the test script by adding the formHandler to the global associative array %handlerDict. Handlers are registered by name, with the key equal to the tag that they handle.

    $handlerDict{"FORM"} = "formHandler";

14. Use the library routine parseHtml to parse the file f2_pl.htm, created in an earlier step. The return value of this routine is the newly parsed HTML.

    $output = &parseHtml("f2_pl.htm");

15. Print the content type for this script's reply to standard out.

    print "Content-type: text/html\n\n";

16. Print the parsed HTML to standard out. This will send it to the browser.

    print $output;

17. Set the permissions on form.pl to allow execution. Be sure to install the parsehtm.pl file as well as form.pl. Remember that the script also needs access to the raw HTML file to parse and return it. Therefore, you also need to put form_pl.htm and f2_pl.htm where form.pl can find them.

Open the test HTML file, form_pl.htm. Press the Submit button. View the HTML for the follow-up page.

How It Works

Dealing with dynamically parsed HTML can be a complex problem. You should rely on the library created in How-To 10.1 to handle the majority of the parsing. Using the library, you have to write handler subroutines only for the tags you want to handle. In this case you are handling FORM tags. Handling a tag involves setting the appropriate values in a dictionary and translating the dictionary into a string. Actually creating the dictionary is the library's job, as is turning the dictionary back into a string.
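The "set values, then translate back to a string" idea can be tried without the library or a web server. This is a minimal sketch under my own assumptions, not the book's code: tag_to_string and set_form_target are hypothetical names, the style is modern Perl (strict, my), and unlike stringForTagDict it sorts the attribute names so that the output is the same on every run.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical serializer, similar in spirit to stringForTagDict.
# Attributes are written in sorted order so the result is repeatable.
sub tag_to_string {
    my (%dict) = @_;
    return "" unless $dict{TAG};

    my $out = "<" . $dict{TAG};
    for my $key (sort keys %dict) {
        next if $key eq "TAG";                  # already written
        if ($dict{$key} eq $key) {
            $out .= " $key";                    # unary attribute
        } else {
            $out .= qq{ $key="$dict{$key}"};    # key/value attribute
        }
    }
    return $out . ">";
}

# Hypothetical handler body: overwrite METHOD and ACTION, then serialize.
sub set_form_target {
    my (%dict) = @_;
    $dict{METHOD} = "POST";
    $dict{ACTION} = "form.pl";
    return tag_to_string(%dict);
}

print set_form_target(TAG => "FORM", METHOD => "", ACTION => ""), "\n";
# prints: <FORM ACTION="form.pl" METHOD="POST">
```

Sorting is a design choice for the sketch only; a browser does not care about attribute order, but a fixed order makes it easier to compare the generated HTML against what you expect.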
Comments

When completed, your form handler should look like this:

    sub formHandler
    {
        local($tagString,$argString,$endString,%tagDict) = @_;
        local($retVal);

        $tagDict{"METHOD"} = "POST";
        $tagDict{"ACTION"} = "form.pl";

        $retVal = &stringForTagDict(%tagDict);

        return $retVal;
    }

:::::::::::parsehtm.pl in its entirety:::::::

#!/usr/bin/perl

# This package uses the global file handle htmlFile
# There are two global assoc. arrays, endTags & handlerDict

# parseHtml takes one argument, a filename
# and returns the parsed html in a string

sub parseHtml
{
    # Declare variables to hold the arguments
    local($fileName) = @_;

    # Declare a variable to store the return value
    local($retVal);

    # Open the file; give up if it cannot be opened
    open(htmlFile,$fileName) || return $retVal;

    # The file opened, so call the parser on it
    $retVal = &mainHtmlParser("",0);

    # Close the file
    close(htmlFile);

    # Return the string parsed from the file
    return $retVal;
}

# mainHtmlParser takes several arguments
# This subroutine can either take a stop string, or a stop char
# it reads the file htmlFile until either the end of file
# the stopstring or the stop char is encountered.
#
# mainHtmlParser returns a string filtered from the file.
# The filters are tag handlers and a default handler.
# Handlers should take 4 arguments for:
#
# tagString - The string containing the tag
# argString - Any data between the tag and end tag
# endString - The end tag
# tagDict - The dictionary created using dictForTag
#
# Handlers are registered in the global dictionary
# handlerDict.
#
# If the tag has a matching end tag like <HTML> and </HTML>
# then the tag should be registered in the global
# %endTags array, with the value equal to its end tag.
#
# If the tag needs the data up to the end of the line, like
# OPTION, then it should appear in %endTags with the value
# "eol".
#
# Handlers should return the string to replace the tag with.
#
# The default is used for text that wasn't part of a tag.
# Tags are denoted by <text>.
# As plain text is encountered the handler registered under
# the string "DEFAULT" is called.

sub mainHtmlParser
{
    # Declare locals to store the arguments
    local($stopStr,$stopChar) = @_;

    # Declare several local variables
    local($char,$inTag,$tmpBuffer,$mainBuffer);

    # Initialize the main buffer, this is what is returned
    $mainBuffer = "";

    # $inTag is used to denote when we are inside <>'s
    $inTag = 0;

    # Loop until the end of the file, or
    # we encounter the stop string or stop character.
    do
    {
        # Get the next character from the file.
        # This is not the most efficient method of reading a file
        # but makes our code cleaner
        $char = getc(htmlFile);

        # Check if we are at the start of a tag
        if($char eq "<")
        {
            # Don't allow any tags inside other tags
            if($inTag)
            {
                die "This is an invalid html file.\n";
            }
            else
            {
                # Denote that we are in a tag
                $inTag = 1;

                # If we were reading plain text
                if($tmpBuffer)
                {
                    # Handle the plain text
                    $mainBuffer .= &handlePlainText($tmpBuffer);

                    # Reset the tmp buffer
                    $tmpBuffer = "";
                }

                # Start the new tmp buffer
                $tmpBuffer = "<";
            }
        }
        elsif($char eq ">") # Check if we are at the end of a tag
        {
            # Don't allow end tags without start tags
            if(! $inTag)
            {
                die "This is an invalid html file.\n";
            }
            else
            {
                # Denote the end of the tag
                $inTag = 0;

                # Finish the tmp buffer
                $tmpBuffer .= ">";

                # See if we are at the stop string
                if($stopStr && ($tmpBuffer =~ /$stopStr/i))
                {
                    return $mainBuffer; # we have read to the stop string
                }
                else
                {
                    # If not, handle the tag and keep reading
                    $tmpBuffer = &handleTag($tmpBuffer);

                    # Add the tmp buffer to the main buffer
                    $mainBuffer .= $tmpBuffer;

                    # Reset the tmp buffer
                    $tmpBuffer = "";
                }
            }
        }
        elsif(eof(htmlFile) || ($stopChar && ($char eq $stopChar))) # check for stopchar
        {
            # Don't allow the parsing to end inside a tag
            if($inTag)
            {
                die "This is an invalid html file.\n";
            }
            else
            {
                # Add the character to the tmp buffer, if one was read
                $tmpBuffer .= $char if defined($char);

                # Add the tmp buffer to the main buffer,
                # after handling it.
                $mainBuffer .= &handlePlainText($tmpBuffer);

                # Reset the tmp buffer
                $tmpBuffer = "";
            }

            # We are at the end of the file, or found
            # the stop character, so return the main buffer
            return $mainBuffer;
        }
        else # If nothing else add the character to the tmp buffer
        {
            $tmpBuffer .= $char;
        }
    }
    until(eof(htmlFile));

    # Return the main buffer
    return $mainBuffer;
}

#
# handleTag actually handles the tags for mainHtmlParser

sub handleTag
{
    # Declare local variables for the argument, as well
    # as the other required locals.
    local($tagString) = @_;
    local(%tagDict,$endTag,$handler,$argString);
    local($evalString);

    # Create an associative array containing the data for the
    # tag string.
    %tagDict = &dictForTag($tagString);

    # Look for an end tag. These are registered in the %endTags
    # global associative array.
    $endTag = $endTags{$tagDict{"TAG"}};

    # Look for a handler subroutine for the tag.
    # These are registered in the %handlerDict global
    # associative array.
    $handler = $handlerDict{$tagDict{"TAG"}};

    # If no handler is found, treat the tag as plain text, and
    # return the parsed data.
    if(!($handler))
    {
        $tagString = &handlePlainText($tagString);
        return $tagString;
    }

    # If the tag wants the data to the end of the line
    # use mainHtmlParser to read to the end of the line, then
    # call the tag's handler subroutine with the data to the
    # end of the line.
    if($endTag eq "eol") # Tag that needs data to eol
    {
        $argString = &mainHtmlParser("","\n");
        $evalString = "&".$handler.'($tagString,$argString,0,%tagDict);';
    }
    elsif($endTag) # Tag with an end tag
    {
        # Use mainHtmlParser to read any text, up to
        # the end tag. Remove the end tag from the string.
        $argString = &mainHtmlParser($endTag,0);
        $argString =~ s/<.*>$//; # Remove the end tag

        # Call the tag's handler
        $evalString = "&".$handler.'($tagString,$argString,$endTag,%tagDict);';
    }
    else # General unary tag
    {
        # For unary tags, simply call the handler.
        $evalString = "&".$handler.'($tagString,0,0,%tagDict);';
    }

    $tagString = eval($evalString);

    # Return the parsed text.
    return $tagString;
}

# handlePlainText actually handles plain text for mainHtmlParser

sub handlePlainText
{
    # Declare the locals
    local($plainString) = @_;
    local($handler,$evalString);

    # Look for a default handler for plain text
    $handler = $handlerDict{"DEFAULT"};

    # If there is a handler, call it and catch the return value.
    if($handler)
    {
        $evalString = "&".$handler.'($plainString,0,0,0);';
        $plainString = eval($evalString);
    }

    # Return either the text passed in, or the parsed text if there
    # was a default handler.
    return $plainString;
}

# Creates an associative array for a tag string

sub dictForTag
{
    # Declare locals
    local($tagString) = @_;
    local(%tagDict,$key);

    # Look for the tag
    # Remove it from the tag string
    # Capitalize the tag, and put it into the dict
    # with the key, TAG
    # If no tag is found, then this is not a tag string.
    if(($tagString =~ s/^<(\w*)[\s>]//) && $1)
    {
        ($key = $1) =~ tr/a-z/A-Z/; # Make the tag upper case
        $tagDict{"TAG"} = $key;
    }
    elsif(($tagString =~ s/^<!--(\w*)[\s>]//) && $1)
    {
        ($key = $1) =~ tr/a-z/A-Z/; # Make the tag upper case
        $tagDict{"TAG"} = $key;
    }
    else
    {
        return %tagDict;
    }

    # Find all of the tag's key/value attributes
    # Remove them from the tag string.
    while($tagString =~ s/(\w*)\s*=\s*\"([^\"]*)\"//)
    {
        if($1)
        {
            ($key = $1) =~ tr/a-z/A-Z/; # Make upper case
            if($2)
            {
                $tagDict{$key} = $2; # Add the key to the dict
            }
            else
            {
                $tagDict{$key} = "";
            }
        }
    }

    # Find the single attributes
    # and remove them from the string.
    while($tagString =~ s/\s+(\w*)[\s>]*//)
    {
        if($1)
        {
            ($key = $1) =~ tr/a-z/A-Z/; # Make upper case
            $tagDict{$key} = $key; # Add to the dict
        }
    }

    return %tagDict;
}

# Creates a string from a tag dictionary

sub stringForTagDict
{
    # Declare locals
    local(%tagDict) = @_;
    local($tagString);

    # If there was a tag dictionary passed in
    if(%tagDict)
    {
        # If the tag dictionary has a TAG in it, build the tag string
        if($tagDict{"TAG"})
        {
            # Start the string with a < and the tag
            $tagString .= "<";
            $tagString .= $tagDict{"TAG"};

            # Add the keys to the string
            foreach $key (keys %tagDict)
            {
                # Ignore TAG, we already added it
                if($key eq "TAG")
                {
                    next;
                }
                elsif($key eq $tagDict{$key}) # unary attribute
                {
                    $tagString .= " ";
                    $tagString .= $key;
                }
                elsif($tagDict{$key}) # key/value attributes
                {
                    $tagString .= " ";
                    $tagString .= $key;
                    $tagString .= "= \"";
                    $tagString .= $tagDict{$key};
                    $tagString .= "\"";
                }
            }

            # Close the tag string
            $tagString .= ">";
        }
    }

    # Return the tag string
    return $tagString;
}

1;