Re: [Corpora-List] Query on the use of Google for corpus research

From: Mark P. Line (mark@polymathix.com)
Date: Tue May 31 2005 - 23:00:22 MET DST

    Marco Baroni said:
    >
    > Consider the following spider trap (quoted in the heritrix
    > documentation):
    >
    > http://spidrs.must.dye.notttttttt/ [obfuscated]

    So, you've just inserted a link to a spider trap into the Corpora-List
    archive?

    > [snip]
    > Perhaps, there are other heuristics that are simpler and/or better (any
    > suggestion?), but in any case this means that you have to add yet another
    > module to your corpus-crawling/processing architecture,
    > and if you happen to download a few gigabytes of data from sites like the
    > one above things can get really annoying...

    If you've received grant money for a proposal in which you made your
    entire program of research dependent on the availability of corpus texts
    acquired from <spidrs.must.dye.notttttttttt>, then I guess you might have
    painted yourself into a corner.

    Fortunately, that's seldom going to be the case in real-life corpus
    research. If you can't get text from one site, you'll get it from another.

    There are lots of possible heuristics that work just fine if all you're
    doing is collecting some sample texts for a research corpus, such as
    limiting the amount of time you spend harvesting from any given website.
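
    To make that concrete, here is a minimal sketch of such a per-site time
    budget in Python. Everything in it is illustrative rather than
    prescriptive -- the 60-second budget, the 10-second timeout and the
    function names are all made up for the example:

        # Minimal sketch: never spend more than a fixed time budget on any
        # one host. The numbers are arbitrary illustrative choices.
        import time
        import urllib.request
        from urllib.parse import urlparse

        PER_SITE_BUDGET = 60.0   # seconds per website

        def harvest(urls, budget=PER_SITE_BUDGET):
            """Fetch pages, but never spend more than `budget` seconds per host."""
            spent = {}    # host -> seconds already spent there
            texts = []
            for url in urls:
                host = urlparse(url).netloc
                if spent.get(host, 0.0) >= budget:
                    continue   # budget exhausted: a spider trap cannot hold us here
                start = time.time()
                try:
                    with urllib.request.urlopen(url, timeout=10) as resp:
                        texts.append(resp.read())
                except OSError:
                    pass       # skip unreachable pages; this is sampling, not indexing
                spent[host] = spent.get(host, 0.0) + (time.time() - start)
            return texts

    A trap that generates endless links can still feed you URLs, but it can
    only ever cost you one budget's worth of time.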

    Another processing technique, one that may never occur to somebody in
    the web-mining/indexing industry, is for the researcher to actually
    eyeball the texts that come in to see whether the sampling procedure
    needs to be enhanced.
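
    In practice that can be as simple as pulling a random handful of each
    run's captures for someone to skim. A toy sketch, again in Python, with
    the directory name and sample size invented for the example:

        # Toy sketch: print the opening of a few randomly chosen captures
        # so a human can skim them for obvious junk.
        import random
        from pathlib import Path

        def eyeball_sample(capture_dir="captured_texts", k=20):
            files = list(Path(capture_dir).glob("*.txt"))
            for path in random.sample(files, min(k, len(files))):
                print("=" * 60)
                print(path.name)
                # the first 500 characters are usually enough to spot junk
                print(path.read_text(errors="replace")[:500])

        if __name__ == "__main__":
            eyeball_sample()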

    Of course there is a high-powered product development industry out there
    that couldn't possibly contemplate even a little bit of human intervention
    in many of the large-scale, high-performance upstream processing steps.
    But that's not what the question starting this thread was about, and it's
    not what I've been trying to sketch solution approaches for.

    > Moreover, as spammers are getting smarter all the time, anti-spammers are
    > also becoming more sophisticated -- suppose that somebody built a spider
    > trap by generating random _sentences_ instead of words: that would be
    > very hard to detect...

    Can you show me a list of random sentences that can fool any native
    speaker into believing it's a valid text?

    You have to get away from the high-tech product development paradigm of
    "by human hands untouched" to the scruffy, underfunded, underpowered
    paradigm in which undergraduate interns eyeball the results of each
    night's run to see if anything obviously bogus came through.

    No, you can't do that when you're updating the Google index or building an
    exhaustive named entity ontology. But I'm having more and more difficulty
    understanding why we can't just focus in this thread on the much
    smaller-scale problem actually at hand: on-the-fly capture of sample texts
    for a linguistic research corpus.

    >> > Incidentally, a "spider trap" query on google returns many more
    >> > results about crawlers, robots.txt files etc. than about how to
    >> > capture eight-legged arachnids... one good example of how one
    >> > should be careful when using the web as a way to gather knowledge
    >> > about the world...
    >>
    >> I believe there's a huge difference between using the web as a way to
    >> gather knowledge about the world (especially if this is being done
    >> automatically) and using the web as a way to populate a corpus for
    >> linguistic research. The latter use is much less ambitious, and simply
    >> doesn't need to be weighed down by most of the concerns that web-mining
    >> or indexing applications do.
    >
    > I agree that, as linguists, even if what we get does not correspond to
    > the "truth" in the outside world, we do not need to worry, but factors
    > like the distribution of senses of a word in our corpus should be of our
    > concern. For example, if I were to extract the semantics of the word
    > "spider" from a corpus, I would rather get the eight-legged-creepy-crawly
    > creature reference as the central sense. In web-data, this could be
    > tricky (of course, I'm not saying that it would be impossible -- I'm just
    > saying that one should be a bit careful about what one can find in
    > web-data...)

    That goes back to my earlier comments about statistical research design.
    You can characterize the distribution of senses of a word in a sample, and
    make inferences (which may be justifiable inferences if you're a capable
    statistician or have one in your project) about the underlying population
    from which your sample was drawn.

    You cannot, however, make justifiable inferences about supersets of the
    underlying population. (That would be an over-generalization.) One
    important trick in selling statistical results is being able to
    demonstrate that you know what your population is: that you know its
    boundary constraints, and that you haven't over-generalized in your
    inferences.

    So, with appropriate statistical techniques, you _might_ be able to
    characterize the distribution of word senses of "spider" in a sample of
    texts captured from the web and then to infer something justifiable about
    the distribution of word senses of "spider" in web-served HTML and
    plaintext documents (your "underlying population" in the jargon of
    statistics).
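
    To put a toy number on it (the counts below are invented): suppose 135
    of 200 sampled occurrences of "spider" in your web captures carry the
    arachnid sense. A textbook normal-approximation interval then licenses
    a statement about web-served documents, and only about web-served
    documents:

        # Hypothetical counts: 135 arachnid readings out of 200 sampled
        # occurrences of "spider" in web-captured text.
        import math

        def proportion_ci(successes, n, z=1.96):
            """Normal-approximation 95% confidence interval for a proportion."""
            p = successes / n
            half = z * math.sqrt(p * (1 - p) / n)
            return p, max(0.0, p - half), min(1.0, p + half)

        p, lo, hi = proportion_ci(135, 200)
        # -> roughly 0.68, with an interval of about 0.61 to 0.74, as an
        #    estimate for web documents -- not for written English.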

    But if you tried to sell me an inference from that web sample about the
    distribution of word senses of "spider" in written English, much less
    English full-stop, then I wouldn't be buying: I'd point out the flaw in
    your research design. Such an inference would be over-generalized and
    almost certainly not justified on the basis of your sample, because your
    sample would not have been representative of written English, much less
    English full-stop.

    Take a look at research journals in epidemiology, psychology or sociology
    and you'll find that this kind of over-generalization, rebuttal and
    subsequent redefinition of the underlying population goes on all the time.
    It's a natural part of the way science is generally done when statistical
    measures are the only way to fly.

    >> Most corpus linguists who are constructing a dataset on the fly are just
    >> interested
    >
    > I am surprised by how you seem to know so much about what corpus linguists
    > do and like -- personally, I am not even sure I have understood who
    > qualifies as a corpus linguist, yet...

    I guess you qualify as a corpus linguist if you spend a not-insignificant
    proportion of your time doing corpus linguistics. :)

    I've been building computer corpora and the software to acquire, store and
    process them off and on since the mid-1970s (you know, back when getting
    a grant to purchase Brown or London-Lund on magnetic tape was a Big Deal).
    Although it's certainly the case that, if pressed for precision, my idea
    of what qualifies as corpus linguistics may differ from that of others
    with equal or greater exposure to the field, I guess I'm surprised at the
    notion that I wouldn't know corpus linguistics when I see it.

    >> and are usually willing to add or change samples indefinitely
    >> until their corpus has the characteristics they need.
    >
    > In my experience, adding and changing samples indefinitely until I have
    > about 1 billion words of web-data with the characteristics I need turns
    > out to be a pretty difficult thing to do... if you can suggest a
    > procedure to do this in an easy way, I (and, I suspect, "most corpus
    > linguists") would be very grateful.

    By what procedure did you arrive at 1 billion words as your required
    sample size? Why not 500 million or 5 billion?
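
    The kind of arithmetic I would want to see is a back-of-envelope one
    like the sketch below (purely illustrative, and deliberately naive: it
    only covers the precision of a simple proportion estimate). The margin
    of error shrinks only with the square root of the sample size, so
    doubling from 500 million to 1 billion observations narrows it by a
    factor of about 1.4. Serious sample-size arguments usually hinge on how
    often you need to see the rarest phenomenon you care about, which is a
    different calculation:

        # Naive illustration: margin of error of a proportion estimate as a
        # function of the number of observations (worst case p = 0.5).
        import math

        def margin_of_error(n, p=0.5, z=1.96):
            return z * math.sqrt(p * (1 - p) / n)

        for n in (500_000_000, 1_000_000_000, 5_000_000_000):
            print(f"{n:>13,} observations -> +/- {margin_of_error(n):.6%}")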

    That said, if you do need a corpus that big and you really don't know how
    to build one from web data with the characteristics you need, and you're
    reasonably confident that the characteristics can be achieved with a web
    sample, then there are probably several of us here who could help you. You
    could start a new thread, since that's a very different problem domain
    from the one we've been addressing here -- one that would certainly profit
    from a high-performance off-the-shelf crawler and other components.

    -- Mark

    Mark P. Line
    Polymathix
    San Antonio, TX



    This archive was generated by hypermail 2b29 : Tue May 31 2005 - 23:03:13 MET DST