Re: [Corpora-List] Query on the use of Google for corpus research

From: Tom Emerson (tree@basistech.com)
Date: Mon May 30 2005 - 21:54:28 MET DST


    Mark P. Line writes:
    > There's a protocol for robotic web crawlers that you should honor, whereby
    > websites can specify how they wish such crawlers to behave when their site
    > is encountered during a crawl. Other than that, I wouldn't worry too much
    > about traffic caused by your harvesting. Kids build web mining
    > applications in Java 101 these days. Heck, they're probably doing it in
    > high school. *shrug*

    This is, with all due respect, a very naive thing to say. If every
    research group decided to unleash impolite crawlers on the world's
    websites, I can guarantee you would get a lot of hostile email from
    webmasters very quickly. Writing a useful crawler is a lot more
    difficult than you let on, especially if you plan on crawling a
    non-trivial number of sites. As far as traffic goes, a careless
    crawler can easily saturate a T3 line, bringing your local IT
    department down on you.
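
    To make "polite" concrete: at minimum a crawler should honor
    robots.txt, identify itself with a User-Agent string, and rate-limit
    its requests per host. A minimal Python sketch of that baseline (the
    User-Agent string and the five-second delay are illustrative choices
    of mine, not any particular crawler's defaults):

        import time
        import urllib.robotparser
        import urllib.request
        from urllib.parse import urlsplit

        USER_AGENT = "ResearchCrawler/0.1 (contact: you@example.edu)"  # illustrative
        PER_HOST_DELAY = 5.0  # seconds between hits to one host; illustrative

        def polite_fetch(url, last_hit, robots_cache):
            """Fetch url only if robots.txt allows it, waiting out the delay."""
            host = urlsplit(url).netloc

            # Fetch and cache this host's robots.txt on first contact.
            if host not in robots_cache:
                rp = urllib.robotparser.RobotFileParser()
                rp.set_url("http://%s/robots.txt" % host)
                rp.read()
                robots_cache[host] = rp
            if not robots_cache[host].can_fetch(USER_AGENT, url):
                return None  # the site asked crawlers to stay out

            # Enforce a minimum delay between requests to the same host.
            wait = PER_HOST_DELAY - (time.time() - last_hit.get(host, 0.0))
            if wait > 0:
                time.sleep(wait)
            last_hit[host] = time.time()

            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()

    Even this much is more than the "Java 101" version does, and it is
    still nowhere near what a production crawler needs.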

    > My take is that indexing can usefully be as (linguistically or otherwise)
    > sophisticated as anybody cares and has the money to make it (once you've
    > actually captured the text), whereas harvesting tends to gain little from
    > anything but the most rudimentary filtering.

    This is also rather naive. Let's say you start a crawl with 2300 seed
    URLs. How deep into a site do you go? How do you deal with spider
    traps? Do you follow links outside the seed sites? How do you
    prevent yourself from crawling the same content more than once? Or
    what if you want to recrawl certain sites with some regularity? What
    about sites that require login or cookies? How do you schedule the
    URLs to be crawled? How do you store the millions of documents that
    you download?
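
    Most of those questions reduce to maintaining a URL frontier: a
    queue plus a seen-set for deduplication, a depth cap as a crude
    guard against spider traps, and a host filter that decides whether
    to leave the seed sites. A toy sketch (the class and its parameters
    are my invention for illustration, not any real crawler's API):

        from collections import deque
        from urllib.parse import urlsplit

        class Frontier:
            """Toy URL frontier: dedup, depth cap, stay-on-seed-hosts filter."""

            def __init__(self, seeds, max_depth=5, follow_offsite=False):
                self.seen = set()
                self.queue = deque()
                self.seed_hosts = {urlsplit(u).netloc for u in seeds}
                self.max_depth = max_depth        # crude spider-trap guard
                self.follow_offsite = follow_offsite
                for u in seeds:
                    self.add(u, depth=0)

            def _normalize(self, url):
                # Drop fragments and query strings so one page is not queued
                # many times under trivially different URLs (a common trap).
                p = urlsplit(url)
                return "%s://%s%s" % (p.scheme, p.netloc, p.path)

            def add(self, url, depth):
                key = self._normalize(url)
                if key in self.seen or depth > self.max_depth:
                    return
                if not self.follow_offsite and \
                        urlsplit(url).netloc not in self.seed_hosts:
                    return
                self.seen.add(key)
                self.queue.append((url, depth))

            def pop(self):
                # Returns (url, depth) or None when the crawl is exhausted.
                return self.queue.popleft() if self.queue else None

    Recrawl scheduling, login and cookie handling, and storage for
    millions of downloaded documents are separate subsystems on top of
    this, and none of them is high-school material either.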

    In any event, I expect that the people behind Heritrix or UbiCrawler
    or any of the other scalable, high-performance crawlers will disagree
    with your glib dismissal of their area of expertise.

        -tree

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
    


