[Corpora-List] web-corpora, big and small

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Wed Jun 01 2005 - 01:20:25 MET DST

  • Next message: liuqun: "Re: [Corpora-List] Chinese POS tagger and syntactic parser."

    Did I mention the Corpus Linguistics 2005 Web-as-Corpus worskhop? ;-)

    http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html

    > > http://spidrs.must.dye.notttttttt/ [obfuscated]
    >
    > So, you've just inserted a link to a spider trap into the Corpora-List
    > archive?

    I'll tell you more: there is a link to it from the faqs of heritrix,
    probably the most popular publicly available crawler. We do this so that
    the simple-minded crawlers written by naive Java developers are doomed.
    ;-)

    > > Moreover, as spammers are getting smarter all the time, anti-spammers are
    > > also becoming more sophisticated -- suppose that somebody built a spider
    > > track by generating random _sentences_ instead of words: that would be
    > > very hard to detect...
    >
    > Can you show me a list of random sentences that can fool any native
    > speaker into believing it's a valid text?

    Suppose I eyeball a random sample of my data by hand. I estimate that 20%
    of what I collected is random sentences from various spider traps. Then,
    I'll still need to identify that 20% in some automated way in order to
    discard it (of course, automation is needed only if my corpus is big, but
    size is undoubtedly one of the reasons why "some" corpus linguists -- the
    evil ones, of course -- are attracted by the web, ).

    > You have to get away from the high-tech product development paradigm of
    > "by human hands untouched" to the scruffy, underfunded, underpowered
    > paradigm in which undergraduate interns eyeball the results of each
    > night's run to see if anything obviously bogus came through.

    In my experience, humans cost more than machines, and unfortunately I do
    not have access to an unlimited supply of undergraduate interns.
     
    > But I'm having more and more difficulty
    > understanding why we can't just focus in this thread on the much
    > smaller-scale problem actually at hand: on-the-fly capture of sample texts
    > for a linguistic research corpus.

    By the time I joined this conversation, it was already about spider traps
    and such things (but you'll notice I changed the topic just in case).

    > But if you tried to sell me an inference from that web sample about the
    > distribution of word senses of "spider" in written English, much less
    > English full-stop, then I wouldn't be buying: I'd point out the flaw in
    > your research design. Such an inference would be over-generalized and
    > almost certainly not justified on the basis of your sample, because your
    > sample would not have been representative of written English, much less
    > English full-stop.

    True. But there is a lot of recent empirical work (e.g., by Peter Turney)
    indicating that, despite its unrepresentativeness, the web, by its sheer
    size, can teach us things about the meaning of English words (English as
    in English full-stop) that do not emerge from smaller, carefully balanced
    corpora (success is often measured by comparison with human performance).

    I think that this has something to do with the fact that, while the
    underlying statistical population is "html English", for certain tasks and
    purposes html English is similar enough to English-period (or at least
    written-English-period) that html English is a good surrogate of
    English-period.

    That, and the issue of the Zipfian-ness of word frequency distributions in
    corpora, which makes BNC-sized corpora too small to even try to use them
    for some tasks, so that one has to go for larger data-sets, although they
    will typically not be balanced nor representative like the BNC is meant to
    be.

    And for languages other than English often the web is the only way to
    build even BNC-sized corpora...

    <dangerous_aside> I am also not so convinced that a language can be
    identified with the population of all sentences ever produced (or
    currently being produced?) in that language, in the same way in which I
    suppose that in sociology or geography it makes sense to define the
    population of, say, Californians as made of all the people living in
    California. Which means that I'm nost sure that we are on much more solid
    grounds when drawing inferences about "English" from a good,
    old-fashioned balanced corpus... but this is another story.
    </dangerous_aside>

    > of what qualifies as corpus linguistics may differ from that of others
    > with equal or greater exposure to the field, I guess I'm surprised at the
    > notion that I wouldn't know corpus linguistics when I see it.

    Well, for example I would think that corpus-based ontology building,
    lexicon extraction and named entity recognition qualify as legit
    activities for corpus linguists, whereas I gather from your replies to Tom
    Emerson that you are very confident that a corpus linguist could not
    possibly be interested in that.
     
    > By what procedure did you arrive at 1 billion words as your required
    > sample size? Why not 500 million or 5 billion?

    We (since luckily I am not alone in this: http://wacky.sslmit.unibo.it --
    although what I'm saying here only represents my own intepretation of why
    I'm doing this) would like to have as many data as possible, both for
    exploratory studies of what the web has to offer to linguists and because
    we are interested in seeing how the behaviour of certain methods,
    measures, algorithms changes as sample size increases. 1 billion words is
    an arbitrary starting point -- chosen to be as big as the largest existing
    corpora we are aware of.
     
    > That said, if you do need a corpus that big and you really don't know how
    > to build one from web data with the characteristics you need, and you're
    > reasonably confident that the characteristics can be achieved with a web
    > sample, then there are probably several of us here who could help you. You
    > could start a new thread, since that's a very different problem domain
    > from the one we've been addressing here -- one that would certainly profit
    > from a high-performance off-the-shelf crawler and other components.

    I certainly do not hesitate to ask specific questions to this or other,
    sometimes more appropriate lists (such as the heritrix crawler list), and
    I'm glad that corpus linguists and crawlers are such friendly and helpful
    comminities. My point was simply that retrieving large-ish corpora from
    the web (at least if you want them to be composed of non-duplicate,
    natural, connected text) is not a trivial task, as I (mis?)understood you
    were implying.

    Regards,

    Marco



    This archive was generated by hypermail 2b29 : Wed Jun 01 2005 - 01:25:40 MET DST