Corpora: Size of a representative corpus

Tony Berber Sardinha (tony4@uol.com.br)
Wed, 19 Aug 1998 19:39:47 -0300

Hi,

The question of how large (in tokens) a representative corpus
must be came up in our classes, and one possibility we considered
is to frame the issue as follows:

'A representative corpus should include the majority of the types
in the language as recorded in a comprehensive dictionary.
Thus:
(a) assuming that a dictionary entry is analogous to a type;
(b) dictionary x is comprehensive;
(c) dictionary x has 100,000 entries;
(d) a majority is one half plus one, i.e. 50,001 entries.
A representative corpus would need to have as many tokens
as necessary to include 50,001 types.'

Since there are no references to this hypothesis in the literature
(or are there?), we would like to know people's reactions to it:
Would this be a proper criterion? What are the possible
flaws in the argument?

Also, how could we estimate the number of tokens needed
to yield 50,001 types?
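
One rough way to approach such an estimate, offered only as a sketch:
measure how the number of types grows with the number of tokens in a
sample corpus, fit a power-law growth curve V = K * N^beta to those
measurements (the relation usually called Heaps' law), and invert it
for the target number of types. The power-law form is an assumption
here, not something established above, and the function names
(vocabulary_growth, fit_heaps, tokens_needed) are hypothetical.
Lowercased word forms stand in for types.

```python
import math


def vocabulary_growth(tokens, step):
    """Return (tokens_seen, types_seen) pairs, measured every `step` tokens."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())  # assumption: a type is a lowercased word form
        if i % step == 0:
            points.append((i, len(seen)))
    return points


def fit_heaps(points):
    """Least-squares fit of log V = log K + beta * log N.

    `points` is a list of (N, V) pairs with at least two distinct N values.
    Returns (K, beta) for the assumed growth curve V = K * N**beta.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


def tokens_needed(target_types, k, beta):
    """Invert V = K * N**beta to get the N that yields `target_types`."""
    return (target_types / k) ** (1.0 / beta)
```

To use it, one would pass a tokenised sample to vocabulary_growth,
feed the resulting points to fit_heaps, and call
tokens_needed(50001, k, beta). The result is an extrapolation well
beyond the sample, so it should be read as an order of magnitude at
best; K and beta also vary with language, genre and tokenisation.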

thanks in advance for any thoughts on this.

cheers,

tony.
------------------------------------------------------------------------
Dr Tony Berber Sardinha
Catholic University of Sao Paulo, Brazil
tony4@uol.com.br
http://sites.uol.com.br/tony4/homepage.html
http://www.liv.ac.uk/~tony1/homepage.html
http://www.liv.ac.uk/~tony1/corpus.html
http://members.wbs.net/homepages/c/o/r/corpuslinguistics.html
------------------------------------------------------------------------