[MacPerl] Web crawler in Perl?



I'm doing something similar and think Perl is a fine choice for the task.

But I'm writing everything from the ground up, so I can't suggest existing robots. I'm designing a separate approach for each site that gets searched, which results in better indexing. This takes care of problems such as the one you describe below (the Denver Post's titles). The end result will be something of a cross between a search directory and a search engine. In the future I plan to write something that will check periodically to see whether a site has changed its format (e.g., the Denver Post starts using different and more accurate titles) and alert me so that I can make the necessary modifications to my code.
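
A minimal sketch of the per-site idea, written in Python purely for illustration (the same structure maps onto Perl regexes and hashes). Everything named here is hypothetical: the site key `denverpost.example`, the assumption that a useful headline sits in an `<h1>` tag, the function names, and the 50% threshold are placeholders, not details of any real site or of the code described above.

```python
import re
from html import unescape

# Hypothetical per-site rules: each indexed site gets its own pattern
# for locating a title more useful than the page's <title> tag.
SITE_RULES = {
    "denverpost.example": re.compile(r"<h1[^>]*>(.*?)</h1>", re.I | re.S),
}
GENERIC_TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)
TAG = re.compile(r"<[^>]+>")


def clean(fragment):
    """Strip tags, decode HTML entities, and collapse whitespace."""
    text = unescape(TAG.sub("", fragment))
    return " ".join(text.split())


def extract_title(site, html):
    """Return (title, used_site_rule) for one page.

    Tries the site's custom rule first and falls back to <title>.
    """
    rule = SITE_RULES.get(site)
    if rule is not None:
        match = rule.search(html)
        if match and clean(match.group(1)):
            return clean(match.group(1)), True
    match = GENERIC_TITLE.search(html)
    return (clean(match.group(1)) if match else ""), False


def format_changed(site, pages, threshold=0.5):
    """Guess whether a site was redesigned.

    Returns True when the site's custom rule fails on more than
    `threshold` of the sampled pages. Sites without a custom rule
    are never flagged.
    """
    if site not in SITE_RULES or not pages:
        return False
    misses = sum(1 for html in pages if not extract_title(site, html)[1])
    return misses / len(pages) > threshold
```

Run periodically over a handful of freshly fetched pages, `format_changed` gives a rough signal that a site's rule needs attention. The fallback to `<title>` means indexing keeps working, with poorer titles, until the rule is updated.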

I'd love to trade ideas on the approach for building this type of search engine, but again can't give you advice on where to find existing robots.

This is a bit unrelated, but since you mentioned performance issues... does anyone have thoughts on Mac OS X Server 1.2? I'm about to begin using it for a high-traffic site, and it looks very promising.

Where does MacPerl come into play now that OS X Server is here? It supposedly outperforms any standard Mac server, but it runs Apache (which of course lets you compile PHP with the server, etc.), leaving MacPerl out of the picture.

For me, though, the one benefit that remains is being able to test code locally from BBEdit.

- Jonathan Daniel


>
>Hello,
>
>For some time now, I've been working on a Frontier and FileMaker-based web
>crawler. My intention is to index a collection of Denver-related sites,
>including the two daily newspapers. I hope to provide a better service than
>the big search engines by indexing the sites more often (daily, in the case
>of the newspapers), and using a bit of intelligence to better account for
>idiosyncrasies of the various sites (the Denver Post, for instance, titles
>every page "Denver Post Online" -- not very helpful in a list of search
>results. My index will extract a more useful title from the text).
>
>Because of the volume of pages to index, performance is crucial. It's my
>hope that MacPerl's text-processing tools and built-in TCP functions will
>provide better performance than Frontier (we'll worry about the database end
>of things later).
>
>So, I'm looking for input on two fronts. First, do the Perl mavens on this
>list think my hopes for Perl are justified? And, second, lest I re-invent
>the wheel, can anyone point me to any Perl-based robots that I can use as a
>starting point? (I realize that Perl has a robot verb, but I'm looking for
>help with the total solution.)
>
>Many thanks.
>
>-Dan 

