Re: [Corpora-List] Query on the use of Google for corpus research

From: Tom Emerson (tree@basistech.com)
Date: Tue May 31 2005 - 22:54:01 MET DST


    Mark P. Line writes:
    [...]
    > But none of this is new, and none of it is going to be much of a problem
    > for a researcher who merely wants to capture some sample texts off the
    > Web.

    We're obviously talking about differences of many orders of
    magnitude. When you say "some sample texts off the Web" I assume you
    mean a few hundred at most.

    [...]
    > And you believe that's typical for linguists wishing to capture a research
    > corpus from the Web?

    Yes, I hope so. Researchers use (or are trying to use) Google to
    quantify linguistic phenomena because it, like the other commercial
    search engines, has a large body of natural language text to work
    with.

    If you grab content from a few dozen sites then your sample size is
    simply too small to make any meaningful statement about the behavior
    you are studying. That is one reason the crawls that I do are so
    large.

    > Again, do you believe that's typical for linguists wishing to capture a
    > research corpus from the Web?

    Yes. The BNC is 100 million words. The LOB is 1 million words. The
    Brown Corpus is 1 million words. The LDC has Chinese, English, and
    Arabic gigaword corpora. The UN parallel text corpus has almost 150
    million words across three languages.

    So yes, I would say that researchers who are looking to build their
    own corpora want to crawl at the scale that I am.

    Also, as I mentioned before, I fully expect that 40-60% of the
    documents I get in my crawls will end up being discarded.

    > (You'll note that the subject line of this thread still says something
    > about "corpus research". I didn't think this was ever about
    > high-performance product development.)

    Much of my corpus work doesn't end up directly in our products, FWIW.

    > It would be an insignificant burden on leonardo (my Linux machine) to
    > track hundreds of millions of URL's if I wanted to.

    Undoubtedly so: the machine I'm running my big crawl on can handle
    this just fine. But there *is* a cost. Currently the Heritrix state
    database for my large crawl weighs in at 88 GB on disk, compared to
    43 GB for the compressed content I've downloaded (fortunately, HTML
    compresses well). I'm currently pulling 2.5 MB/s through the
    crawler, a rate capped by our IT staff, since without the cap I was
    consuming almost all of our available bandwidth. Doing a non-trivial
    crawl will use a lot of resources.
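
    To make that concrete, a quick back-of-the-envelope sketch in
    Python (the rounding is mine; it uses only the figures above):

        # Rough estimate: time to pull 43 GB of compressed content at the
        # capped crawler rate of 2.5 MB/s (figures quoted above).
        GB = 1024 ** 3                 # bytes per gigabyte
        MB = 1024 ** 2                 # bytes per megabyte

        content_bytes = 43 * GB        # compressed content on disk
        rate_bytes_per_sec = 2.5 * MB  # IT-imposed bandwidth cap

        hours = content_bytes / rate_bytes_per_sec / 3600
        print(f"{hours:.1f} hours of continuous downloading")
        # ~4.9 hours for the compressed bytes alone; per-host politeness
        # delays and the 88 GB of crawler state push the real wall-clock
        # time far higher.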

    [...]
    > > Because you may be building a synchronic corpus.
    >
    > I guess I'm going to have to get you to connect the dots for me. How does
    > revisiting sites with some regularity help me to build a synchronic corpus
    > in a way that I cannot build it if I never revisit any site again?
    >
    > Or did you mean a _diachronic_ corpus, in the belief that processes of
    > language change can usefully be detected by means of periodic scans of
    > websites?

    Right, I mistyped.

    > Why would I ignore their robot exclusion rules? This assumption surprises
    > me, since you have expressed concern that readers of this thread might be
    > encouraged to do things that webmasters might not like.

    I'm not implying that you yourself would, but it is surprising how
    often people ask why they can't slurp the entire New York Times or
    Washington Post sites.
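
    For what it's worth, honoring those exclusion rules costs almost
    nothing. A minimal Python sketch (the crawler name and URLs are
    only illustrative):

        # Check robots.txt before fetching a page; skip anything disallowed.
        from urllib import robotparser

        rp = robotparser.RobotFileParser()
        rp.set_url("http://www.nytimes.com/robots.txt")
        rp.read()

        page = "http://www.nytimes.com/2005/05/31/some-article.html"  # hypothetical
        if rp.can_fetch("MyResearchCrawler", page):
            print("allowed:", page)      # go ahead and download it
        else:
            print("disallowed:", page)   # robots.txt says hands off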

    > My point has been that I will not generally *need* more URL's than I can
    > crawl at any one time. I'm not updating the Google index. I'm not
    > acquiring named entities for an exhaustive lexical database or ontology.
    > I'm just collecting enough text to answer certain research questions about
    > my target language.

    What is enough text?

    > Why in the world would I store corpus text as millions of small files,
    > even if I were operating at such a large scale (which, again, again, is
    > not the typical case I've been advising for here)?

    Well, a naive crawler will do just that. Heck, just grab 'wget' and
    let it go. You'll mirror the whole site onto your disk, one file per
    page. Simple.
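
    The alternative, roughly what Heritrix does with its aggregate
    archive files, is to append every fetched page as a record to one
    compressed file rather than one file per page. A minimal sketch,
    with illustrative names and format:

        # Append each fetched page as one record to a single gzipped
        # JSON-lines archive instead of writing millions of small files.
        import gzip, json

        def store(archive_path, url, html):
            record = {"url": url, "length": len(html), "content": html}
            with gzip.open(archive_path, "at", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")

        store("crawl-20050531.jsonl.gz", "http://example.com/", "<html>...</html>")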

    > I think we're starting to see the outlines of a paradigm divide here. :)

    I think so.

    > Many are happy to have gotten the grant money to acquire anything more
    > than an office computer in the first place.

    This I have no argument with. It is often the same in industry,
    contrary to what many may think. ;-)

    Peace,

        -tree

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
    


