RE: [Corpora-List] Brown Corpus

From: Amsler, Robert (Robert.Amsler@hq.doe.gov)
Date: Tue Jun 21 2005 - 14:39:28 MET DST

Next message: Andrea Kowalski: "[Corpora-List] DGfS-06 http://www.spectrum.uni-bielefeld.de/DGfS/ Workshop on Corpus-based Approaches to Non-compositional Phenomena"

Previous message: Adam Kilgarriff: "RE: [Corpora-List] Brown Corpus"
Maybe in reply to: Jörg Schuster: "[Corpora-List] Brown Corpus"
Next in thread: Bryar Family: "RE: [Corpora-List] publication -specific corpus requirements."
Next in thread: Jörg Schuster: "[Corpora-List] Re: Brown Corpus"
Reply: Bryar Family: "RE: [Corpora-List] publication -specific corpus requirements."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

I'm somewhat surprised by Martin Wynne's comments against using fixed-size
corpus samples. You have to realize that not only do the intended uses of
the corpus change what is an appropriate sampling strategy, but whatever
sampling strategy you employ will introduce some bias into the corpus.

If one is constructing a corpus to gather vocabulary statistics, then it
would be very hard to argue that you should not use fixed-size samples.
Samples of different sizes could seriously skew vocabulary statistics,
since the number of distinct words in a text does not grow in proportion
to its length. Alternatively, if one is building a corpus to study
narrative style, it would be hard to argue that anything other than large,
whole rhetorical text units would be adequate. There is a lot of middle
ground between gathering word-frequency statistics and studying narrative
style, and the needs of those intermediate uses should also be brought to
bear on corpus sampling strategy.
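
To make the point about sample size concrete, here is a minimal sketch in
Python. The toy text, the function names and the choice of type/token ratio
as the statistic are all my own assumptions for illustration; nothing here
comes from the Brown Corpus itself. It only shows that the same text yields
a different ratio depending on how much of it you sample.

```python
def type_token_ratio(tokens):
    """Distinct word types divided by running tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def ratios_by_sample_size(tokens, sizes):
    """Type/token ratio of the first n tokens, for each n in sizes."""
    return {n: type_token_ratio(tokens[:n]) for n in sizes}


# Toy text, purely illustrative; a real corpus would need proper tokenization.
text = ("the cat sat on the mat and the dog sat on the log "
        "while the cat and the dog looked at the mat and the log")
tokens = text.split()

ratios = ratios_by_sample_size(tokens, [5, 10, 20])
for size, ratio in ratios.items():
    print(size, round(ratio, 2))
```

The ratio drops as the sample grows, so comparing this statistic across
samples of unequal length mostly compares their lengths.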

I am not certain there is ONE strategy for creating samples that would
please everyone. One idea might be to gather larger samples of text and
provide one or more sub-corpora of fixed-size samples within the larger
corpus to produce more reasonable vocabulary counts. Nothing says your
texts can have only one corpus made from them, any more than photographs
can only be presented exactly as they were shot rather than cropped to
make other pictures.
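
The "cropping" idea might be sketched as follows. This is only one possible
reading of it, under assumptions of my own: each text is already a list of
tokens, one contiguous sample is taken per text, texts shorter than the
sample size are skipped, and the names fixed_size_sample and build_subcorpus
are hypothetical.

```python
import random


def fixed_size_sample(tokens, size, rng):
    """Crop one contiguous run of `size` tokens; None if the text is too short."""
    if len(tokens) < size:
        return None
    start = rng.randrange(len(tokens) - size + 1)
    return tokens[start:start + size]


def build_subcorpus(texts, size, seed=0):
    """Build a fixed-size-sample sub-corpus from a dict of name -> token list.

    The full texts are left untouched, so several sub-corpora with
    different sizes or seeds can be cut from the same collection.
    """
    rng = random.Random(seed)  # seeded so the sub-corpus is reproducible
    samples = {}
    for name, tokens in texts.items():
        sample = fixed_size_sample(tokens, size, rng)
        if sample is not None:
            samples[name] = sample
    return samples


# Illustrative data: one long text and one too short to sample.
texts = {
    "long": [f"w{i}" for i in range(50)],
    "short": ["a", "b", "c"],
}
subcorpus = build_subcorpus(texts, size=10, seed=1)
print({name: len(sample) for name, sample in subcorpus.items()})
```

Whether to skip short texts, pad them, or take several samples from long
ones is itself a sampling decision that introduces its own bias.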

This archive was generated by hypermail 2b29 : Tue Jun 21 2005 - 14:54:37 MET DST