Re: [Corpora-List] Query on the use of Google for corpus research

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Tue May 31 2005 - 20:00:41 MET DST


    > It's not much of a problem unless you presuppose that a corpus linguist
    > would have difficulty finding a way to distinguish between a valid text in
    > her target language and a random text generated by a spider trap.

    Consider the following spider trap (cited in the Heritrix documentation):

    http://spiders.must.die.net/

    It looks like it generates text from a unigram model, so I guess you could
    use heuristics to detect that it is not real English text, e.g. using a
    bigram model in some way (comparing the bigram entropy of a page with that
    of a corpus of genuine English? Although then there is the risk that you
    bias your crawl towards documents that look more like the ones in a corpus
    you already have...), or using some kind of POS-pattern filter (which would
    require POS tagging). Perhaps there are other heuristics that are simpler
    and/or better (any suggestions?), but in any case this means that you have
    to add yet another module to your corpus-crawling/processing architecture,
    and if you happen to download a few gigabytes of data from sites like the
    one above, things can get really annoying...
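
    (Just to make the bigram idea a bit more concrete, here is a rough,
    untested sketch in Python of the kind of filter I have in mind: train a
    word-bigram model on some English text you trust, and flag pages whose
    per-token bigram cross-entropy is much higher than what real English
    normally gets. The file name and the threshold are of course invented --
    you would have to tune them on pages you know to be good and bad.)

    import math
    import re
    from collections import Counter

    def tokens(text):
        # crude lowercased word tokenizer; enough for a heuristic
        return re.findall(r"[a-z']+", text.lower())

    class BigramModel:
        def __init__(self, text):
            toks = tokens(text)
            self.unigrams = Counter(toks)
            self.bigrams = Counter(zip(toks, toks[1:]))
            self.vocab = len(self.unigrams) + 1  # +1 for unseen words

        def logprob(self, w1, w2):
            # add-one smoothed log2 P(w2 | w1)
            num = self.bigrams[(w1, w2)] + 1
            den = self.unigrams[w1] + self.vocab
            return math.log2(num / den)

        def cross_entropy(self, text):
            # average bits per token of `text` under this bigram model;
            # unigram-generated word salad should score noticeably higher
            toks = tokens(text)
            if len(toks) < 2:
                return float("inf")
            total = sum(self.logprob(a, b) for a, b in zip(toks, toks[1:]))
            return -total / (len(toks) - 1)

    # reference_corpus.txt and the 11-bit threshold are hypothetical
    model = BigramModel(open("reference_corpus.txt").read())

    def looks_like_word_salad(page_text, threshold_bits=11.0):
        return model.cross_entropy(page_text) > threshold_bits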

    Moreover, just as anti-spammers are becoming more sophisticated, spammers
    are getting smarter all the time -- suppose that somebody built a spider
    trap that generated random _sentences_ instead of random words: that would
    be very hard to detect...

    > > Incidentally, a "spider trap" query on google returns many more results
    > > about crawlers, robots.txt files etc. than about how to capture
    > > eight-legged arachnids... one good example of how one should be careful
    > > when using the web as a way to gather knowledge about the world...
    >
    > I believe there's a huge difference between using the web as a way to
    > gather knowledge about the world (especially if this is being done
    > automatically) and using the web as a way to populate a corpus for
    > linguistic research. The latter use is much less ambitious, and simply
    > doesn't need to be weighed down by most of the concerns that web-mining or
    > indexing applications do.

    I agree that, as linguists, we do not need to worry if what we get does
    not correspond to the "truth" in the outside world, but factors like the
    distribution of the senses of a word in our corpus should concern us. For
    example, if I were to extract the semantics of the word "spider" from a
    corpus, I would want the eight-legged-creepy-crawly-creature reading to
    come out as the central sense. With web data, this could be tricky (of
    course, I'm not saying that it would be impossible -- I'm just saying that
    one should be a bit careful about what one can find in web data...)

    > Most corpus linguists who are constructing a dataset on the fly are just
    > interested

    I am surprised by how much you seem to know about what corpus linguists
    do and like -- personally, I am not even sure I understand yet who
    qualifies as a corpus linguist...

    > and are usually willing to add or change samples indefinitely
    > until their corpus has the characteristics they need.

    In my experience, adding and changing samples indefinitely until I have
    about 1 billion words of web data with the characteristics I need turns
    out to be a pretty difficult thing to do... if you can suggest a procedure
    for doing this in an easy way, I (and, I suspect, "most corpus linguists")
    would be very grateful.

    Regards,

    Marco


