Re: [Corpora-List] Query on the use of Google for corpus research

From: Mark P. Line (mark@polymathix.com)
Date: Mon May 30 2005 - 16:29:26 MET DST

Next message: Mark P. Line: "RE: [Corpora-List] Query on the use of Google for corpus research"

Previous message: Ute Römer: "[Corpora-List] Looking for a MICASE (key)wordlist..."
In reply to: Dominic Widdows: "Re: [Corpora-List] Query on the use of Google for corpus research"
Next in thread: Dominic Widdows: "Re: [Corpora-List] Query on the use of Google for corpus research"
Next in thread: Chris Jordan: "Re: [Corpora-List] Query on the use of Google for corpus research"
Reply: Dominic Widdows: "Re: [Corpora-List] Query on the use of Google for corpus research"

Dominic Widdows said:
>
> The main problem is that "using the Web" on a large scale puts you at
> the mercy of the commercial search engines, which leads to the grim
> mess that Jean documents, especially with Google.

Actually, I don't think it's really true anymore that large-scale corpus
extraction from the Web necessarily puts you at the mercy of commercial
search engines. It's no longer very difficult to throw together a software
agent that will crawl the Web directly. (IOW: The indexing part of
commercial search engines may be rocket science, but the harvesting part
of them is not.)
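
To give a rough idea of how little the harvesting part involves, here is a minimal sketch of a breadth-first crawler using only the Python standard library. It is an illustration under assumptions, not a description of any particular tool: the names `crawl`, `keep_url`, `fetch_over_http` and `LinkAndTextParser` are made up for this example, and a real agent would also need to honour robots.txt, rate-limit its requests, and deal with character encodings and non-HTML content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects href targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def fetch_over_http(url):
    """Default fetcher (an assumption): download a page, decode leniently."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, max_pages=100, keep_url=lambda url: True, fetch=fetch_over_http):
    """Breadth-first harvest starting from `seeds`.

    `keep_url` is the filter applied to newly discovered links.
    Returns a dict mapping each harvested URL to its extracted text.
    """
    queue = deque(seeds)
    seen = set(seeds)
    corpus = {}
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable or broken page: skip it
        parser = LinkAndTextParser()
        parser.feed(html)
        corpus[url] = " ".join(parser.text_parts)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if (link.startswith(("http://", "https://"))
                    and link not in seen and keep_url(link)):
                seen.add(link)
                queue.append(link)
    return corpus
```

The seed list and the `keep_url` filter are exactly the things one would report when describing the harvest. Passing `fetch` as a parameter is a design choice for this sketch so that the crawl logic can be tried against a canned set of pages without touching the network.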

> This situation may hopefully change as WebCorp
> (http://www.webcorp.org.uk/) teams up with
> a dedicated search engine. In the meantime, it's clearly true that you
> can get more results from the web, but you can't vouch for them
> properly, and so a community that values both recall and precision is
> left reeling.

I think that if you describe your harvesting procedure accurately (what
you seeded it with, and what filters you used if any), and monitor and
report on a variety of statistical parameters as your corpus is growing,
there's no reason why the resulting data wouldn't serve as an adequate
sample for many purposes -- assuming that's what you meant by "vouch for
them properly".
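
As one possible reading of "monitor and report on a variety of statistical parameters", the sketch below tracks a few cumulative figures as documents are added. The choice of statistics (token count, type count, type/token ratio, hapax legomena) and the crude `\w+` tokenizer are assumptions made for illustration, and `growth_report` is a hypothetical name, not part of any existing tool.

```python
import re
from collections import Counter

# Deliberately crude tokenizer (an assumption): runs of word characters.
TOKEN_PATTERN = re.compile(r"\w+")


def growth_report(documents, every=1):
    """Yield one row of cumulative statistics after every `every` documents."""
    frequencies = Counter()
    total_tokens = 0
    for index, text in enumerate(documents, start=1):
        tokens = TOKEN_PATTERN.findall(text.lower())
        frequencies.update(tokens)
        total_tokens += len(tokens)
        if index % every == 0:
            types = len(frequencies)
            # Hapax legomena: word types seen exactly once so far.
            hapaxes = sum(1 for count in frequencies.values() if count == 1)
            yield {
                "documents": index,
                "tokens": total_tokens,
                "types": types,
                "type_token_ratio": types / total_tokens if total_tokens else 0.0,
                "hapax_legomena": hapaxes,
            }


if __name__ == "__main__":
    sample = ["the cat sat", "the dog sat down"]
    for row in growth_report(sample):
        print(row)
```

Publishing rows like these alongside the seed list and filters would let a reader judge whether the sample had stabilised for the purpose at hand.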

-- Mark

Mark P. Line
Polymathix
San Antonio, TX

