Re: [MacPerl] Extracting elements in a HTML page.




4/21/99, Dave Johnson wrote:
 >If there was an example that showed how to extract a single paragraph from
 >a webpage I think I could figure out the rest. But my searches for examples
 >have come up dry.

Dave,

OO is definitely a different mindset.  Here's how HTML::Parser works.  You
write a "subclass" module as in the skeleton example below.   You save
state across method calls by using the $self->{} hash, as shown.  $X, $Y,
etc., stand for your variables...

    #------------------------------------------
    #!/usr/bin/perl -w
    use diagnostics ;
    use strict ;

    package HTMLYourSubclass ;
        require HTML::Parser ;
        use vars qw(@ISA) ;
        @ISA = qw(HTML::Parser) ;

    sub new    # Construct the parser object; parsing starts later.
    {
        my ( $class, $Arg1, $Arg2 ) = @_ ;
        my $self = HTML::Parser->new ;
        bless $self, $class ;

        ## Begin your code.
        my ( $X, $Y ) ;   # etc. etc.
        ## End your code

        ## Save your persistent variables.
        @$self{qw( X Y Arg1 Arg2 )} = ( $X, $Y, $Arg1, $Arg2 ) ;

        $self ;
    }

    sub start   # Process a start tag.
    {
        my ( $self, $Tag, $Attr, $AttrSeq, $TagText ) = @_ ;

        ## Get whatever persistent variables you need.
        my ( $X, $Y, $Arg2 ) = @$self{qw( X Y Arg2 )} ;

        # Your code goes here.
        my $Z ;  # etc. etc.

        ## Save whatever you've changed.
        @$self{qw( Y Z )}  = ( $Y, $Z ) ;
    }

    sub end     # Process an end tag.
    {
        my( $self, $Tag ) = @_ ;

        ## Same again re/ variables and code.
    }

    sub text    # Process a run of text, e.g. append it to a saved ParaChunk.
    {
        my ( $self, $Text ) = @_ ;

        ## Same again re/ variables and code.
    }

    1 ;

    #------------------------------------------


From your main script you do this:

    #------------------------------------------
    ...
    my $DocObj = HTMLYourSubclass->new ( $Arg1, $Arg2 ) ;
    $DocObj->parse_file( $DocFile ) ;
    ...
    #------------------------------------------

Your subclass methods then get called as each element is encountered:
start() for each start tag, end() for each end tag, and text() for the text
in between.  Instead of passing a file, you can send text dynamically to the
parser this way:

    #------------------------------------------
    ...
    my $DocObj = HTMLYourSubclass->new ( $Arg1, $Arg2 ) ;
    ...
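    $DocObj->parse( $chunk ) ;   # Call as often as needed, once per chunk.
    ...
    $DocObj->eof ;               # After the last chunk, to flush buffered text.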
    #------------------------------------------

You return information to your main script through the arguments you passed
to new ($Arg1, $Arg2), which can be references to arrays, hashes, etc. that
your methods fill in as they go.
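
To make the skeleton concrete, here is a minimal sketch of a subclass that collects the text of the first paragraph it sees, which is roughly what Dave asked for.  The names FirstParagraph, $Result, InPara and Done are made up for this illustration, and it assumes HTML::Parser is installed and is calling the start/end/text methods in the subclassing style described above.  Treat it as a starting point rather than a finished tool: it ignores nested markup subtleties and simply stops at the next <p> or </p>.

```perl
#!/usr/bin/perl -w
use strict ;

package FirstParagraph ;    # hypothetical name, for illustration only
    require HTML::Parser ;
    use vars qw(@ISA) ;
    @ISA = qw(HTML::Parser) ;

sub new    # $Result is a reference to a scalar owned by the caller.
{
    my ( $class, $Result ) = @_ ;
    my $self = HTML::Parser->new ;
    bless $self, $class ;

    $$Result = '' ;
    ## Persistent state: are we inside the paragraph, and are we finished?
    @$self{qw( InPara Done Result )} = ( 0, 0, $Result ) ;

    $self ;
}

sub start   # A <p> opens the paragraph; a second <p> closes it implicitly.
{
    my ( $self, $Tag ) = @_ ;
    return if $self->{Done} or $Tag ne 'p' ;

    if ( $self->{InPara} ) { $self->{Done}   = 1 }
    else                   { $self->{InPara} = 1 }
}

sub end     # </p> finishes the paragraph we were collecting.
{
    my ( $self, $Tag ) = @_ ;
    $self->{Done} = 1 if $Tag eq 'p' and $self->{InPara} ;
}

sub text    # Append text to the caller's scalar while inside the paragraph.
{
    my ( $self, $Text ) = @_ ;
    ${ $self->{Result} } .= $Text if $self->{InPara} and not $self->{Done} ;
}

package main ;

my $Para ;
my $DocObj = FirstParagraph->new( \$Para ) ;

## Feed the document in two chunks, then signal the end of input.
$DocObj->parse( '<html><body><h1>Title</h1><p>First <b>para' ) ;
$DocObj->parse( 'graph</b>.</p><p>Second.</p></body></html>' ) ;
$DocObj->eof ;

print "$Para\n" ;
```

Because text() appends rather than assigns, it does not matter if the parser hands over the paragraph in several pieces, which can happen when the input arrives in chunks or when inline tags such as <b> split the text.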

Good luck!

rkm
Wexford, Ireland
http://cyberjournal.org



