Re: [Corpora-List] Query on the use of Google for corpus research

From: Tom Emerson (tree@basistech.com)
Date: Mon May 30 2005 - 21:43:08 MET DST

Next message: Tom Emerson: "Re: [Corpora-List] Query on the use of Google for corpus research"

Previous message: Mark P. Line: "Re: [Corpora-List] Query on the use of Google for corpus research"
In reply to: Dominic Widdows: "Re: [Corpora-List] Query on the use of Google for corpus research"
Next in thread: Bryar Family: "RE: [Corpora-List] Query on the use of Google for corpus research"
Next in thread: Chris Jordan: "Re: [Corpora-List] Query on the use of Google for corpus research"

Dominic Widdows writes:
> Is there good reliable software out there, for those who would still be
> fearful of hacking up a harvester for themselves?
> There is the Internet Archive's Heritrix crawler
> (http://crawler.archive.org/). Has anyone used this and found it
> suitable for linguistic purposes?

Yes, I use it for large-scale crawls for linguistic research, and will
be presenting some of my work at the "Web as Corpus" workshop being
held with Corpus Linguistics 2005. Heritrix is an outstanding piece of
software.

> This still leaves some of the traditional benefits of corpora
> unaccounted for - what about normalising the text content (presuming
> the traditional notion that text content is the linguistic phenomenon
> you're interested in), tagging, perhaps getting all the data into the
> same character set, etc.? These are some of the creature comforts that
> organizations such as the LDC have traditionally provided. We can
[...]

And these are the dirty little details that most researchers just wave
off with a swish of their hand. When it comes down to it, crawling
data is only a small part of the problem.

-tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"

