Re: [Corpora-List] Query on the use of Google for corpus research

From: Tom Emerson (tree@basistech.com)
Date: Mon May 30 2005 - 21:43:08 MET DST

    Dominic Widdows writes:
    > Is there good reliable software out there, for those who would still be
    > fearful of hacking up a harvester for themselves?
    > There is the Internet Archive's Heritrix crawler
    > (http://crawler.archive.org/). Has anyone used this and found it
    > suitable for linguistic purposes?

    Yes, I use it for large-scale crawls for linguistic research, and will
    be presenting some of my work at the "Web as Corpus" workshop being
    held with Corpus Linguistics 2005. Heritrix is an outstanding piece of
    software.

    > This still leaves some of the traditional benefits of corpora
    > unaccounted for - what about normalising the text content (presuming
    > the traditional notion that text content is the linguistic phenomenon
    > you're interested in), tagging, perhaps getting all the data into the
    > same character set, etc.? These are some of the creature comforts that
    > organizations such as the LDC have traditionally provided. We can
    [...]

    And these are the dirty little details that most researchers just wave
    off with a swish of their hand. When it comes down to it, crawling
    data is only a small part of the problem.

        -tree

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
    


