Re: [Corpora-List] Query on the use of Google for corpus research

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Tue May 31 2005 - 10:26:21 MET DST

    > > How do you deal with spider traps?
    >
    > Why would spider traps be a concern (apart from knowing to give up on the
    > site if my IP address has been blocked by their spider trap) when all I'm
    > doing is constructing a sample of text data from the Web?

    First of all, your crawler has to understand that it fell into a trap.
    Second, some spider traps generate dynamic pages containing random text
    for you to follow -- now, that's a problem if you're trying to build a
    linguistic corpus, isn't it?
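
    As a rough sketch of the first point, a crawler could apply a few cheap
    URL heuristics before following a link. Everything here is an assumption
    made for illustration: the function name looks_like_trap, the three
    thresholds, and the idea that the crawler keeps a per-host page count.
    It is not a description of any particular crawler.

```python
from collections import Counter
from urllib.parse import urlsplit

# Illustrative thresholds (assumptions, not tuned values).
MAX_PATH_DEPTH = 12        # very deep paths often come from generated links
MAX_SEGMENT_REPEATS = 3    # e.g. /a/b/a/b/a/b/a/b
MAX_PAGES_PER_HOST = 5000  # hard budget of pages fetched from one host


def looks_like_trap(url, pages_seen_on_host):
    """Return True if the URL trips any of the cheap trap heuristics."""
    # Give up on a host once its page budget is used up.
    if pages_seen_on_host >= MAX_PAGES_PER_HOST:
        return True

    # Split the path into its non-empty segments.
    segments = [s for s in urlsplit(url).path.split("/") if s]

    # Unusually deep paths are suspicious.
    if len(segments) > MAX_PATH_DEPTH:
        return True

    # So are paths in which the same segment keeps recurring.
    counts = Counter(segments)
    if counts and max(counts.values()) > MAX_SEGMENT_REPEATS:
        return True

    return False
```

    Checks like these only catch traps that show up in the URL structure or
    in the sheer number of pages. They say nothing about whether the text on
    a page is randomly generated, so the second point still needs a separate
    filter on the content itself.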

    Incidentally, a "spider trap" query on Google returns many more results
    about crawlers, robots.txt files, etc. than about how to capture
    eight-legged arachnids... a good example of why one should be careful
    when using the web as a way to gather knowledge about the world...

    Regards,

    Marco


