WG: Corpora: Size of a representative corpus

Sabathy, Hellfried (hellfried.sabathy@bifab.de)
Thu, 20 Aug 1998 15:41:48 +0200

Hi,

>(a) assuming that a dictionary entry is analogous to a type;
>(b) dictionary x is comprehensive
>(c) dictionary x has 100,000 entries
>(d) a majority is 1/2 + 1
>A representative corpus would need to have as many tokens
>as necessary to include 50,001 types.
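
As a rough illustration of the quoted proposal, the following sketch counts how many running tokens have to be read before a target number of distinct types has been seen. The function name and the whitespace-free token list are assumptions made for the example; how tokens map to dictionary entries (lemmatisation, case folding) is left open here.

```python
from typing import Iterable, Optional


def tokens_needed_for_types(tokens: Iterable[str], target_types: int) -> Optional[int]:
    """Return how many tokens must be read to see `target_types` distinct types.

    Returns None if the token stream ends before the target is reached.
    """
    seen = set()
    for count, token in enumerate(tokens, start=1):
        seen.add(token)  # a type is counted once, however often it recurs
        if len(seen) >= target_types:
            return count
    return None


if __name__ == "__main__":
    sample = ["a", "b", "a", "c", "d"]
    # The third distinct type ("c") appears at the fourth token.
    print(tokens_needed_for_types(sample, 3))
```

With the figures quoted above, target_types would be 50,001; the resulting token count depends entirely on the texts that are fed in.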

I would rather argue:
of these 100,000 types, 20,000 account for 80% of the tokens
in the corpora from which the dictionary was compiled;
therefore, a corpus containing "most of" these 20,000
types can be considered a representative model of the
original corpus.
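
A minimal sketch of this frequency-based idea, assuming the source corpus is available as a plain list of tokens: first find the most frequent types that together cover a given share of all tokens, then check what fraction of those core types a candidate corpus contains. The function names are hypothetical, and what counts as "most of" is left to the reader as a threshold on the returned fraction.

```python
from collections import Counter
from typing import Iterable, List


def core_types(tokens: Iterable[str], coverage: float = 0.8) -> List[str]:
    """Return the most frequent types that together cover `coverage` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    core = []
    covered = 0
    # Walk the types from most to least frequent until the share is reached.
    for word, freq in counts.most_common():
        if covered >= coverage * total:
            break
        core.append(word)
        covered += freq
    return core


def share_of_core_present(core: List[str], candidate_tokens: Iterable[str]) -> float:
    """Return the fraction of core types that occur at least once in the candidate."""
    if not core:
        return 0.0
    present = set(candidate_tokens)
    return sum(1 for word in core if word in present) / len(core)


if __name__ == "__main__":
    source = ["the"] * 6 + ["of"] * 2 + ["cat", "dog"]
    core = core_types(source, coverage=0.8)
    print(core)
    print(share_of_core_present(core, ["the", "cat"]))
```

On a real corpus the core list would be the 20,000 high-frequency types, and a candidate corpus would count as representative once the returned fraction passes whatever cut-off is chosen for "most of".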

Any better solutions?

Best regards
Hellfried Sabathy