Re: [Corpora-List] Query on the use of Google for corpus research

From: Mark P. Line (mark@polymathix.com)
Date: Mon May 30 2005 - 16:29:26 MET DST

    Dominic Widdows said:
    >
    > The main problem is that "using the Web" on a large scale puts you at
    > the mercy of the commercial search engines, which leads to the grim
    > mess that Jean documents, especially with Google.

    Actually, I don't think it's true anymore that large-scale corpus
    extraction from the Web necessarily puts you at the mercy of commercial
    search engines. It's no longer very difficult to put together a software
    agent that crawls the Web directly. (In other words: the indexing part
    of commercial search engines may be rocket science, but the harvesting
    part is not.)
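
    To give a rough idea of how little the harvesting part takes, here is
    a minimal sketch in Python of a breadth-first crawling agent. It is an
    illustration under assumptions, not a description of any existing
    tool: the names LinkExtractor, fetch_http and crawl are hypothetical,
    and the fetch function is passed in as a parameter so the crawl logic
    can be tried out without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_http(url):
    """Default fetcher: download a page and decode it leniently."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch=fetch_http, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    Returns a dict mapping each fetched URL to its raw HTML.
    """
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

    Calling crawl(["http://www.example.org/"]) would start from that one
    seed. The sketch deliberately leaves out things a real agent needs,
    such as honoring robots.txt, pausing between requests, and filtering
    by content type or language.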

    > This situation may hopefully change as WebCorp
    > (http://www.webcorp.org.uk/) teams up with
    > a dedicated search engine. In the meantime, it's clearly true that you
    > can get more results from the web, but you can't vouch for them
    > properly, and so a community that values both recall and precision is
    > left reeling.

    I think that if you describe your harvesting procedure accurately (what
    you seeded it with, and what filters you used if any), and monitor and
    report on a variety of statistical parameters as your corpus is growing,
    there's no reason why the resulting data wouldn't serve as an adequate
    sample for many purposes -- assuming that's what you meant by "vouch for
    them properly".
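
    As an illustration of that kind of reporting, here is a minimal sketch
    in Python that records the seeds and filters and takes a snapshot of a
    few statistics after each document is added. The choice of statistics
    (token count, type count, type/token ratio), the crude \w+ tokenizer,
    and the name CorpusMonitor are assumptions made for the example; which
    parameters are worth tracking depends on what the corpus is for.

```python
import re
from dataclasses import dataclass, field

# Crude tokenizer, assumed for illustration only.
TOKEN_RE = re.compile(r"\w+")


@dataclass
class CorpusMonitor:
    """Tracks how a harvested corpus was built and how it grows."""

    seeds: list      # URLs the harvest started from
    filters: list    # human-readable descriptions of filters applied
    tokens: int = 0
    vocabulary: set = field(default_factory=set)
    history: list = field(default_factory=list)

    def add_document(self, text):
        """Update the running counts and return a snapshot of them."""
        words = [w.lower() for w in TOKEN_RE.findall(text)]
        self.tokens += len(words)
        self.vocabulary.update(words)
        ratio = len(self.vocabulary) / self.tokens if self.tokens else 0.0
        snapshot = {
            "documents": len(self.history) + 1,
            "tokens": self.tokens,
            "types": len(self.vocabulary),
            "type_token_ratio": ratio,
        }
        self.history.append(snapshot)
        return snapshot
```

    The history list, together with seeds and filters, is the sort of
    record that could be published alongside the corpus so that others
    can judge what the sample is a sample of.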

    -- Mark

    Mark P. Line
    Polymathix
    San Antonio, TX


