Re: Corpora: Size of a representative corpus

Henry Kucera (Henry_Kucera@brown.edu)
Thu, 20 Aug 1998 15:58:20 -0500

Ted E. Dunning wrote:

>
> ts> Also, how could we estimate the number of tokens needed to
> ts> make up for 50,001 types?

John B. Carroll's lognormal model of word-frequency distribution makes
such predictions for English: "On Sampling from a lognormal model of
word-frequency distribution," in H. Kucera and W.N. Francis,
Computational Analysis of Present-Day American English, Brown
University Press, Providence, RI, 1967, pp. 406-424.
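
For readers who want to experiment, here is a rough sketch of the general idea. It is not Carroll's procedure and does not use his fitted parameters. It assumes a hypothetical vocabulary whose word probabilities are drawn from a lognormal distribution (mu and sigma below are placeholders), assumes tokens are sampled independently, and uses the standard expectation E[V(N)] = sum over words of 1 - (1 - p)^N to find how many tokens are needed to reach a target number of types.

```python
import math
import random


def lognormal_vocabulary(size, mu=0.0, sigma=2.0, seed=0):
    """Return `size` word probabilities with lognormally distributed weights.

    mu and sigma are illustrative placeholders, not values fitted to English.
    """
    rng = random.Random(seed)
    weights = [rng.lognormvariate(mu, sigma) for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_types(probs, n_tokens):
    """Expected number of distinct types seen in n_tokens independent draws."""
    # 1 - (1 - p)^N, computed via log1p/expm1 to stay accurate for tiny p.
    return sum(-math.expm1(n_tokens * math.log1p(-p)) for p in probs)


def tokens_needed(probs, target_types):
    """Smallest N whose expected type count reaches target_types."""
    if not 0 < target_types < len(probs):
        raise ValueError("target must be positive and below the vocabulary size")
    lo, hi = 0, 1
    # Double the upper bound until it is large enough, then bisect.
    while expected_types(probs, hi) < target_types:
        lo, hi = hi, hi * 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if expected_types(probs, mid) < target_types:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    probs = lognormal_vocabulary(size=20000)
    print(tokens_needed(probs, target_types=5000))
```

The answer depends heavily on the assumed vocabulary size and on sigma, so any real estimate would need parameters fitted to an actual corpus.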

Carroll's analysis is based on the graphic definition of types, i.e.
distinct forms, not on lexemes (or lemmas, as such groups of forms are
usually called). The quantitative relation between types in this sense
and lemmas is discussed at length in Francis and Kucera, Frequency
Analysis of English Usage, Houghton Mifflin Co., Boston, 1982.
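
To make the type/lemma distinction concrete, here is a toy illustration. The lemma table is invented for this example and is not taken from either book, and folding case before counting is an assumption.

```python
# Hypothetical, hand-made lemma table covering only the words in the example.
LEMMA_OF = {
    "go": "go", "goes": "go", "went": "go", "gone": "go",
    "book": "book", "books": "book",
}


def count_types_and_lemmas(tokens, lemma_of):
    """Return (number of graphic types, number of lemmas) for a token list."""
    # Graphic types: distinct written forms (case-folded here by assumption).
    types = {t.lower() for t in tokens}
    # Lemmas: forms grouped under a headword; unknown forms stand for themselves.
    lemmas = {lemma_of.get(t, t) for t in types}
    return len(types), len(lemmas)


tokens = "She goes where he went and books go where books have gone".split()
print(count_types_and_lemmas(tokens, LEMMA_OF))  # (10, 7)
```

The twelve tokens contain ten graphic types but only seven lemmas, because "goes", "went", "go" and "gone" collapse into one lemma and "books" maps to "book".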

Regards, Henry Kucera