RE: [Corpora-List] Brown Corpus

From: Amsler, Robert (Robert.Amsler@hq.doe.gov)
Date: Tue Jun 21 2005 - 14:39:28 MET DST


    I'm somewhat surprised by Martin Wynne's comments against using fixed-size
    corpus samples.
    You have to realize that not only do the intended uses of the corpus
    change what counts as an appropriate sampling strategy, but whatever
    sampling strategy you employ will introduce some bias into the corpus.

    If one is constructing a corpus to sample vocabulary statistics, then it
    would be very hard to argue that you should not use fixed-size samples.
    Different sizes of samples could
    seriously skew vocabulary statistics. Alternatively, if one is building a
    corpus to study narrative style, it would be hard to argue that anything
    other than large whole rhetorical text units would be adequate. There is a
    lot of middle ground between gathering statistics on word frequency and
    narrative style and those factors should also be brought to bear on corpus
    sampling strategy.
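
    As a rough illustration of why sample size matters for vocabulary
    statistics, the sketch below computes a type/token ratio (distinct word
    forms divided by running words) on prefixes of different lengths taken
    from the same text. The toy sentence and the function name
    type_token_ratio are assumptions made for this example only; they are not
    part of any particular corpus toolkit.

```python
def type_token_ratio(tokens):
    """Distinct word forms divided by the number of running words."""
    return len(set(tokens)) / len(tokens)


# Hypothetical toy text: one sentence repeated 50 times. Real text grows its
# vocabulary more slowly than its length, which this repetition exaggerates.
sentence = "the cat sat on the mat and the dog sat on the rug"
tokens = sentence.split() * 50

# The same text measured at three different sample sizes.
for size in (13, 130, 650):
    print(size, round(type_token_ratio(tokens[:size]), 3))
```

    The ratio falls as the sample grows, so figures taken from samples of
    unequal length are not directly comparable.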

    I am not certain there is ONE strategy for creating samples that would please
    everyone. One idea might be to gather larger samples of text and provide one
    or more sub-corpora of samples within the larger corpus to produce more
    reasonable vocabulary counts. Nothing says your texts can have only one
    corpus made from them, any more than photographs must be presented
    exactly as they were shot rather than cropped to make other pictures.
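
    The "cropping" idea can be sketched as follows: keep the larger texts,
    and derive a sub-corpus by cutting one fixed-size run of tokens from each.
    This is only a minimal sketch under stated assumptions. The names
    fixed_size_sample and build_subcorpus are hypothetical, texts are assumed
    to be already tokenized, and the default of 2000 tokens simply echoes the
    roughly 2,000-word samples of the Brown Corpus.

```python
import random


def fixed_size_sample(tokens, size, rng):
    """Return a contiguous run of `size` tokens, or None if the text is too short."""
    if len(tokens) < size:
        return None
    start = rng.randrange(len(tokens) - size + 1)
    return tokens[start:start + size]


def build_subcorpus(texts, size=2000, seed=0):
    """Crop one fixed-size sample from each text that is long enough.

    `texts` maps a text name to its list of tokens (an assumed layout).
    """
    rng = random.Random(seed)  # seeded so the crop is reproducible
    subcorpus = {}
    for name, tokens in texts.items():
        sample = fixed_size_sample(tokens, size, rng)
        if sample is not None:
            subcorpus[name] = sample
    return subcorpus
```

    The full texts remain available for work on narrative style, while the
    cropped sub-corpus gives equal-length samples for vocabulary counts.
    Texts shorter than the chosen size are skipped here, which is itself a
    sampling decision and introduces its own bias.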


