Re: [Corpora-List] Query on the use of Google for corpus research

From: Tom Emerson (tree@basistech.com)
Date: Mon May 30 2005 - 21:54:28 MET DST


    Mark P. Line writes:
    > There's a protocol for robotic web crawlers that you should honor, whereby
    > websites can specify how they wish such crawlers to behave when their site
    > is encountered during a crawl. Other than that, I wouldn't worry too much
    > about traffic caused by your harvesting. Kids build web mining
    > applications in Java 101 these days. Heck, they're probably doing it in
    > high school. *shrug*

    This is, with all due respect, a very naive thing to say. If every
    research group decided to unleash impolite crawlers on the world's
    websites, I can guarantee you would get a lot of hostile email from
    webmasters very quickly. Writing a useful crawler is a lot more
    difficult than you let on, especially if you plan on crawling a
    non-trivial number of sites. As far as traffic goes, a careless
    crawler can easily saturate a T3 line, bringing your local IT
    department down on you.
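
    To make "polite" concrete: at minimum a crawler should honor
    robots.txt, identify itself with a User-Agent string, and rate-limit
    its requests per host. A minimal Python sketch of that baseline (the
    User-Agent string and the five-second delay are illustrative choices
    of mine, not any particular crawler's defaults):

        import time
        import urllib.robotparser
        import urllib.request
        from urllib.parse import urlsplit

        USER_AGENT = "ResearchCrawler/0.1 (contact: you@example.edu)"  # illustrative
        PER_HOST_DELAY = 5.0  # seconds between hits to one host; illustrative

        def polite_fetch(url, last_hit, robots_cache):
            """Fetch url only if robots.txt allows it, waiting out the delay."""
            host = urlsplit(url).netloc

            # Fetch and cache this host's robots.txt on first contact.
            if host not in robots_cache:
                rp = urllib.robotparser.RobotFileParser()
                rp.set_url("http://%s/robots.txt" % host)
                rp.read()
                robots_cache[host] = rp
            if not robots_cache[host].can_fetch(USER_AGENT, url):
                return None  # the site asked crawlers to stay out

            # Enforce a minimum delay between requests to the same host.
            wait = PER_HOST_DELAY - (time.time() - last_hit.get(host, 0.0))
            if wait > 0:
                time.sleep(wait)
            last_hit[host] = time.time()

            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()

    Even this much is more than the "Java 101" version does, and it is
    still nowhere near what a production crawler needs.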

    > My take is that indexing can usefully be as (linguistically or otherwise)
    > sophisticated as anybody cares and has the money to make it (once you've
    > actually captured the text), whereas harvesting tends to gain little from
    > anything but the most rudimentary filtering.

    This is also rather naive. Let's say you start a crawl with 2300 seed
    URLs. How deep into a site do you go? How do you deal with spider
    traps? Do you follow links outside the seed sites? How do you
    prevent yourself from crawling the same content more than once? Or
    what if you want to recrawl certain sites with some regularity? What
    about sites that require login or cookies? How do you schedule the
    URLs to be crawled? How do you store the millions of documents that
    you download?
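
    Most of those questions reduce to maintaining a URL frontier: a
    queue plus a seen-set for deduplication, a depth cap as a crude
    guard against spider traps, and a host filter that decides whether
    to leave the seed sites. A toy sketch (the class and its parameters
    are my invention for illustration, not any real crawler's API):

        from collections import deque
        from urllib.parse import urlsplit

        class Frontier:
            """Toy URL frontier: dedup, depth cap, stay-on-seed-hosts filter."""

            def __init__(self, seeds, max_depth=5, follow_offsite=False):
                self.seen = set()
                self.queue = deque()
                self.seed_hosts = {urlsplit(u).netloc for u in seeds}
                self.max_depth = max_depth        # crude spider-trap guard
                self.follow_offsite = follow_offsite
                for u in seeds:
                    self.add(u, depth=0)

            def _normalize(self, url):
                # Drop fragments and query strings so one page is not queued
                # many times under trivially different URLs (a common trap).
                p = urlsplit(url)
                return "%s://%s%s" % (p.scheme, p.netloc, p.path)

            def add(self, url, depth):
                key = self._normalize(url)
                if key in self.seen or depth > self.max_depth:
                    return
                if not self.follow_offsite and \
                        urlsplit(url).netloc not in self.seed_hosts:
                    return
                self.seen.add(key)
                self.queue.append((url, depth))

            def pop(self):
                # Returns (url, depth) or None when the crawl is exhausted.
                return self.queue.popleft() if self.queue else None

    Recrawl scheduling, login and cookie handling, and storage for
    millions of downloaded documents are separate subsystems on top of
    this, and none of them is high-school material either.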

    In any event, I expect that the people behind Heritrix or UbiCrawler
    or any of the other scalable, high-performance crawlers will disagree
    with your glib dismissal of their area of expertise.

        -tree

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
    


