Re: [Corpora-List] Query on the use of Google for corpus research

From: Mark P. Line (mark@polymathix.com)
Date: Mon May 30 2005 - 21:22:59 MET DST


    Dominic Widdows said:
    > Mark P. Line said:
    >
    >> Actually, I don't think it's really true anymore that large-scale
    >> corpus extraction from the Web necessarily puts you at the mercy of
    >> commercial search engines. It's no longer very difficult to throw
    >> together a software agent that will crawl the Web directly.
    >
    > But is it not quite difficult to "throw something together" that
    > doesn't cause all sorts of traffic problems? I have always shied away
    > from actually trying this, under the impression that it's a bit of a
    > dangerous art, but then this is certainly partly due to ignorance.

    There's a protocol for robotic web crawlers that you should honor, the
    Robots Exclusion Protocol (a site's robots.txt file), whereby websites
    can specify how they wish such crawlers to behave when their site is
    encountered during a crawl. Other than that, I wouldn't worry too much
    about traffic caused by your harvesting. Kids build web mining
    applications in Java 101 these days. Heck, they're probably doing it in
    high school. *shrug*
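
    For example, a site's robots.txt (fetched from the site root, e.g.
    http://www.example.com/robots.txt) might look like the following
    made-up illustration:

        # Keep all robots out of CGI scripts and search results.
        User-agent: *
        Disallow: /cgi-bin/
        Disallow: /search

        # Ask one particular crawler to pause between requests.
        # (Crawl-delay is a common extension, not part of the original spec.)
        User-agent: HypotheticalCorpusBot
        Crawl-delay: 10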

    >> (IOW: The indexing part of commercial search engines may be rocket
    >> science, but the harvesting part of them is not.)
    >
    > That's intriguing; as someone who's worked more in indexing, I'd have
    > said precisely the opposite :-)
    > Delighted if I'm wrong.

    My take is that indexing can usefully be as (linguistically or otherwise)
    sophisticated as anybody cares to make it and can afford (once you've
    actually captured the text), whereas harvesting tends to gain little from
    anything but the most rudimentary filtering.
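
    To make "rudimentary" concrete, a harvest-time filter might be nothing
    more than this sketch in Java (the checks and thresholds are my own
    illustrative guesses, not anybody's published recipe):

        // A rudimentary harvest-time filter: keep a page only if it claims
        // to be text and contains a plausible amount of running prose.
        public class HarvestFilter {
            public static boolean keep(String contentType, String body) {
                if (contentType == null || body == null) return false;
                if (!contentType.startsWith("text/")) return false; // no PDFs etc.
                String text = body.replaceAll("<[^>]*>", " ");      // crude tag strip
                int words = text.trim().split("\\s+").length;
                return words >= 200;            // skip stubs and link farms
            }
        }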

    > Is there good reliable software out there, for those who would still be
    > fearful of hacking up a harvester for themselves?

    There are lots of web robots out there. Here's a good starting point:

        http://www.robotstxt.org/wc/robots.html

    If you do decide you'd like to roll your own, here's a starting point for
    that:

        http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
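
    For a sense of scale, here's a minimal breadth-first crawler sketch in
    Java (not the code from the article above, just an illustration of the
    moving parts; a real harvester would also honor robots.txt, throttle
    per host rather than globally, and check content types):

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.net.URL;
        import java.util.*;
        import java.util.regex.*;

        public class TinyCrawler {
            private static final Pattern HREF =
                Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

            public static void main(String[] args) throws Exception {
                Queue<String> frontier = new LinkedList<String>();
                Set<String> seen = new HashSet<String>();
                frontier.add(args[0]);                  // seed URL
                int budget = 100;                       // page limit for the demo
                while (!frontier.isEmpty() && budget-- > 0) {
                    String url = frontier.poll();
                    if (!seen.add(url)) continue;       // already fetched
                    StringBuilder page = new StringBuilder();
                    try {
                        BufferedReader in = new BufferedReader(
                            new InputStreamReader(new URL(url).openStream()));
                        for (String line; (line = in.readLine()) != null; )
                            page.append(line).append('\n');
                        in.close();
                    } catch (Exception e) { continue; } // dead link, etc.
                    System.out.println("fetched: " + url);
                    // ... hand the page text to your corpus pipeline here ...
                    Matcher m = HREF.matcher(page);
                    while (m.find()) frontier.add(m.group(1));
                    Thread.sleep(1000);                 // politeness delay
                }
            }
        }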

    >> I think that if you describe your harvesting procedure accurately
    >> (what you seeded it with, and what filters you used if any), and
    >> monitor and report on a variety of statistical parameters as your
    >> corpus is growing, there's no reason why the resulting data wouldn't
    >> serve as an adequate sample for many purposes -- assuming that's what
    >> you meant by "vouch for them properly".
    >
    > Yes, that is part of what I meant. Do we have a good sense of what
    > these statistical parameters should be?

    As in all cases of statistical sampling, it depends on the inferences you
    hope to be able to justify about the underlying population. My usual
    advice is that the research be designed in the following order:

    (1) assumptions about the population you wish to characterize;

    (2) kinds of characterizations you'd like to be able to make and justify
    about the population;

    (3) statistical techniques that will provide you with those kinds of
    characterizations of a population (given a sample, usually);

    (4) sampling requirements of those techniques;

    (5) sampling procedures that meet those requirements;

    (6) a dataset that was collected by those procedures;

    (7) statistical characterization of the sample as required by your
    inferential techniques;

    (8) inferential results about the population, based on your
    characterization of the sample.

    You'll need a very different kind of sample if you want to say something
    about the passivization of closed-class verbs in English than if you want
    to say something about the diffusion of neologisms in English
    biotechnology jargon.
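
    As a toy instance of steps (3) through (8): if the characterization you
    want is as simple as "what proportion of clause tokens are passive?",
    the technique might be a proportion estimate with a normal-approximation
    confidence interval, whose sampling requirement is (roughly) independent
    draws from the population of clauses. A sketch, with invented counts:

        // Estimate a population proportion from a sample, with a 95%
        // normal-approximation confidence interval. Counts are invented.
        public class ProportionEstimate {
            public static void main(String[] args) {
                int n = 5000;      // sampled clause tokens
                int k = 412;       // of which passive
                double p = (double) k / n;
                double se = Math.sqrt(p * (1 - p) / n);   // standard error
                double z = 1.96;                          // 95% two-sided
                System.out.printf("p = %.4f, 95%% CI = [%.4f, %.4f]%n",
                                  p, p - z * se, p + z * se);
            }
        }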

    > To what extent is there a code of practice for saying exactly what you
    > did?

    I think that the code of practice should be that of statistics. It's a
    well-established practice in most of the other sciences, after all. :)

    > Again, we run into
    > standard empiricist questions - using your proposal, one could
    > guarantee to reproduce the "initial conditions" of someone's
    > experiment, but you could at best expect similar outcomes.

    Yes. That's very similar to the situation with empirical research in, say,
    wetland ecology. Science can progress usefully in either field, even
    though nobody would ever expect literally identical outcomes when a study
    is replicated.

    > This still leaves some of the traditional benefits of corpora
    > unaccounted for - what about normalising the text content (presuming
    > the traditional notion that text content is the linguistic phenomenon
    > you're interested in), tagging, perhaps getting all the data into the
    > same character set, etc.?

    I don't see how any of that is prevented by harvesting your own set of raw
    texts from the Web.
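
    Character-set normalization, for instance, is a few lines of Java once
    you know (or have guessed) each page's source encoding; detecting the
    encoding is the harder part and would take a library, but the
    conversion itself is this simple:

        import java.io.*;
        import java.nio.charset.Charset;

        public class Normalize {
            // Rewrite a harvested file as UTF-8, given its source encoding.
            public static void toUtf8(File in, String srcEncoding, File out)
                    throws IOException {
                Reader r = new InputStreamReader(
                    new FileInputStream(in), Charset.forName(srcEncoding));
                Writer w = new OutputStreamWriter(
                    new FileOutputStream(out), Charset.forName("UTF-8"));
                char[] buf = new char[8192];
                for (int len; (len = r.read(buf)) != -1; )
                    w.write(buf, 0, len);
                r.close();
                w.close();
            }
        }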

    > However, there is still the problem that the more sophisticated stuff
    > you throw at your data, the harder it is for anyone to replicate or
    > extend your results, and ideally, I would like to see a system where
    > the data itself is made available as a standard part of practice.
    > Ideally, we would still work on the same datasets if possible, rather
    > than duplicating similar datasets for each isolated project.

    That might be laudable, were it not for the fact that different kinds of
    questions require different kinds of samples. I think the approach of
    providing ever more megalomaniacal Global Universal General-Purpose
    Standard corpora has taken us about as far as it's going to. :)

    > From an engineering point of view, storage isn't really a problem here,
    > but bandwidth is - you have to keep the files you've trawled and
    > processed on disk somewhere, but you might not be able to foot the bill
    > for other researchers hitting your web server every time they fancy
    > half-a-billion words of nice corpus data.

    Right. That's why many people (especially non-linguists) use statistics,
    and express their findings in statistical (as opposed to fictitiously
    absolutist) terms. :)

    Replication of statistical results REQUIRES the use of a different sample,
    to show that the inferences about the population were not an artefact of
    the sampling procedures or of the particular sample obtained for the
    original study.

    So, the goal would be to express your findings in such a way that they can
    be replicated (or not!) statistically by anybody who cares to crank your
    methods on a fresh sample from the same population.
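
    Concretely, a replication could compare its fresh estimate against the
    original with something like a two-proportion z-test (a sketch; all
    counts invented):

        // Two-proportion z-test comparing an original estimate against a
        // fresh, independent sample from the same population.
        public class ReplicationCheck {
            public static void main(String[] args) {
                int n1 = 5000, k1 = 412;   // original study's sample
                int n2 = 4000, k2 = 301;   // fresh sample
                double p1 = (double) k1 / n1, p2 = (double) k2 / n2;
                double pooled = (double) (k1 + k2) / (n1 + n2);
                double se = Math.sqrt(pooled * (1 - pooled)
                                      * (1.0 / n1 + 1.0 / n2));
                double z = (p1 - p2) / se;
                // |z| < 1.96: no evidence of disagreement at the 95% level
                System.out.printf("p1 = %.4f, p2 = %.4f, z = %.2f%n", p1, p2, z);
            }
        }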

    -- Mark

    Mark P. Line
    Polymathix
    San Antonio, TX


