Re: [Corpora-List] Query on the use of Google for corpus research

From: Mark P. Line (mark@polymathix.com)
Date: Wed Jun 01 2005 - 00:00:53 MET DST


    Tom Emerson said:
    >
    > We're obviously talking about differences in many orders of
    > magnitude. When you say "some sample texts off the Web" I assume you
    > mean a few hundred at most.

    I mean as many as it takes to construct a sample to support the study. A
    single sample might be 1 million words or 10 million words.

    Obviously, there is a break-even point where it starts making more sense
    to use high-performance tools and less sense to roll your own.

    My points have been that

    - the break-even point is significantly greater than zero, probably on
    the order of 10 million words,
    - most academic researchers answer most of their questions on corpora that
    are significantly smaller than that,
    - such a corpus does not need to be web-exhaustive or even domain-exhaustive,
    - source diversity is a parameter that depends on your research questions,
    - the researcher can carry out any number of sampling iterations until the
    sample has the right characteristics to support the research agenda,
    - there's no reason not to expect the research team to do any amount of
    eyeballing at any stage in the sampling process, and
    - all of this can be done easily and safely with relatively simple,
    relatively easy-to-construct tools (pretty much with a naked Java
    development kit and a database server; see the sketch just after this
    list).

    I'd like to cite this little article for the second time in this thread:

       http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/

    Are we to assume that Sun has done something utterly unspeakable by
    suggesting that a Java developer might have reason to sit down and build
    her own web crawler, and here's how?

    > Researchers use (or are trying to use) Google to
    > quantify linguistic phenomena because it (and the other commercial
    > search engines) has a large body of natural language text to work
    > with.

    Yes, that's where I came into this thread, because somebody expressed
    their concern about the construction of web corpora being dependent on
    search engines -- to which I replied that it's possible to use a crawler
    to harvest texts from the web without using a search engine at all, and
    that it's not very difficult to build your own crawler to do just what you
    need.

    I continue to advise against the use of Google hitcounts to quantify
    linguistic phenomena in anything but a grossly informal and exploratory
    way. (Is "modeling" more frequent than "modelling"? Does the same hold for
    "traveling" and "travelling"?)

    > If you grab content from a few dozen sites then your sample size is
    > simply too small to make any meaningful statement about the behavior
    > you are studying.

    What behavior am I studying, and how big is the sample I acquired from the
    few dozen sites, in number of words?

    >> > Because you may be building a synchronic corpus.
    >>
    >> I guess I'm going to have to get you to connect the dots for me. How
    >> does
    >> revisiting sites with some regularity help me to build a synchronic
    >> corpus
    >> in a way that I cannot build it if I never revisit any site again?
    >>
    >> Or did you mean a _diachronic_ corpus, in the belief that processes of
    >> language change can usefully be detected by means of periodic scans of
    >> websites?
    >
    > Right, I mistyped.

    Okay. I doubt that very much could be said about language change by
    revisiting websites to track text revisions in them, but if somebody
    wanted to try, I don't see that it would be much of a problem for a
    home-grown crawler.

    >> My point has been that I will not generally *need* more URL's than I can
    >> crawl at any one time. I'm not updating the Google index. I'm not
    >> acquiring named entities for an exhaustive lexical database or ontology.
    >> I'm just collecting enough text to answer certain research questions
    >> about my target language.
    >
    > What is enough text?

    What's my research question?

    >> Why in the world would I store corpus text as millions of small files,
    >> even if I were operating at such a large scale (which, again, again, is
    >> not the typical case I've been advising for here)?
    >
    > Well, a naive crawler will do just that.

    So, you're saying that nobody who builds their own crawler is going to
    have a clue about any more sophisticated means of data management than
    dropping millions of small files into the file system.

    Why do you say that?
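    To make the alternative concrete: the crawl loop doesn't have to write
    files at all. One obvious arrangement is a single table of (url, fetch
    time, text) rows on the database server I mentioned above. A sketch in
    plain JDBC; the connection string, credentials, and table layout are
    placeholders for whatever server you happen to run:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;

        /** Store harvested pages as rows in one table instead of files on disk. */
        public class PageStore {
            private final Connection conn;
            private final PreparedStatement insert;

            public PageStore(String jdbcUrl, String user, String pass) throws Exception {
                // Depending on the driver you may first need Class.forName("your.jdbc.Driver").
                conn = DriverManager.getConnection(jdbcUrl, user, pass);
                // Assumes a table like:
                //   CREATE TABLE page (url VARCHAR(1024), fetched TIMESTAMP, body TEXT)
                insert = conn.prepareStatement(
                    "INSERT INTO page (url, fetched, body) VALUES (?, CURRENT_TIMESTAMP, ?)");
            }

            /** Called once per fetched page from the crawl loop. */
            public void save(String url, String body) throws Exception {
                insert.setString(1, url);
                insert.setString(2, body);
                insert.executeUpdate();
            }

            public void close() throws Exception {
                insert.close();
                conn.close();
            }
        }

    From there, sampling iterations and eyeballing are a matter of queries
    rather than directory traversals.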

    > Heck, just grab 'wget' and
    > let it go. You'll mirror the whole site on your disk. Simple.

    You already accepted in an earlier post that corpus linguists do *not*
    typically need scalable, high-performance crawlers to capture web corpora
    safely. So what's the need for hyperbole here?

    -- Mark

    Mark P. Line
    Polymathix
    San Antonio, TX


