For example, if you start from the assumption that all your 50,001 words
must appear in it, then you must ask, 'how many times?'.
The way in which a word is used does not really pop out until you've seen a
number of examples (personally, I tend to use 5 as a lower limit, but that
is for VERY empirical work).
Because of Zipf's law, to go from at least 1 occurrence in the corpus to at
least 5 requires more than 5 times the size of text (assuming that you are
picking your text randomly rather than to suit your need to get 5
examples!).
Equally, if you start from a dictionary and get half of the words in it in
your corpus, you will end up with vastly more word forms in your corpus
than in the dictionary! What do you do with these extra words? (In
scanning a 90 million word corpus taken from commercial news sources, I
found that the number of distinct word forms increased as the square root
of the number of words - roughly half a million separate word forms for the
90 million words. By this estimate, 50,000 distinct word forms would
require 900,000 tokens.)
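
The square-root estimate above can be sketched in a few lines. This is only
a rough sketch: it assumes the relation is exactly forms = K * sqrt(tokens),
with K calibrated from the single data point quoted above (roughly 500,000
word forms in 90 million tokens). The function and constant names are my own,
for illustration.

```python
import math

# Calibration point quoted above: ~500,000 word forms in 90 million tokens.
CAL_TOKENS = 90_000_000
CAL_FORMS = 500_000

# Assumed model (not a fitted curve): forms = K * sqrt(tokens).
K = CAL_FORMS / math.sqrt(CAL_TOKENS)


def forms_for_tokens(tokens):
    """Estimated number of distinct word forms in a corpus of `tokens` words."""
    return K * math.sqrt(tokens)


def tokens_for_forms(forms):
    """Estimated corpus size needed to see `forms` distinct word forms."""
    return (forms / K) ** 2


if __name__ == "__main__":
    # Corpus size needed for 50,000 distinct word forms: about 900,000 tokens.
    print(round(tokens_for_forms(50_000)))
```

Since the constant comes from one corpus of news text, other genres may well
give a different K (or a different exponent altogether).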
Another way to get at the number of tokens is to use Zipf's law as a
sequence (frequency is proportional to 1/rank).
So if the frequency of rank 50,000 is one, then the frequency of rank 1 is
50,000, so total tokens = 50000 * (1/1 + 1/2 + ... + 1/50000), which a
little program I wrote (rather than try maths!) indicates is about 570,000.
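
For anyone who wants to repeat the sum, here is a minimal sketch of that sort
of little program. It is not the original program, just a reconstruction that
assumes the pure 1/rank law with the rarest word occurring exactly once; the
function name is made up for illustration. The logarithm approximation to the
harmonic sum is included as a cross-check.

```python
import math


def zipf_total_tokens(vocab_size):
    """Total tokens when frequency(rank) = vocab_size / rank.

    Assumes an idealised Zipf law in which the word at rank `vocab_size`
    occurs exactly once.
    """
    harmonic = sum(1.0 / rank for rank in range(1, vocab_size + 1))
    return vocab_size * harmonic


total = zipf_total_tokens(50_000)

# Cross-check: the harmonic sum H_n is close to ln(n) + 0.5772 (Euler's constant).
approx = 50_000 * (math.log(50_000) + 0.5772156649)

if __name__ == "__main__":
    print(round(total))   # roughly 570,000
    print(round(approx))  # agrees to within about one token
```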
To get 5 occurrences at rank 50,000 seems to require 250,000 distinct
word forms (so that rank 250,000 has frequency 1), requiring some 3.25
million words.
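
The same calculation can be generalised to any minimum count. Again a sketch
under the idealised 1/rank assumption, with hypothetical function names: if
the word at the target rank must occur min_count times, the constant in
frequency = C / rank is min_count * target_rank, and the frequency-1 word
then sits at rank C, which is also the number of distinct word forms.

```python
def harmonic(n):
    """Sum of 1/1 + 1/2 + ... + 1/n."""
    return sum(1.0 / rank for rank in range(1, n + 1))


def corpus_for_min_count(target_rank, min_count):
    """Return (distinct_forms, total_tokens) under an idealised Zipf law.

    Assumes frequency(rank) = C / rank, with frequency(target_rank) equal to
    min_count and the rarest word in the corpus occurring once.
    """
    distinct_forms = min_count * target_rank  # this is the constant C
    total_tokens = distinct_forms * harmonic(distinct_forms)
    return distinct_forms, total_tokens


forms, tokens = corpus_for_min_count(50_000, 5)
_, tokens_once = corpus_for_min_count(50_000, 1)
ratio = tokens / tokens_once

if __name__ == "__main__":
    print(forms)            # 250000 distinct word forms
    print(round(tokens))    # roughly 3.25 million tokens
    print(round(ratio, 1))  # more than 5 times the single-occurrence corpus
```

The ratio comes out above 5 because the harmonic sum keeps growing as the
vocabulary gets larger, which is the "more than 5 times" effect mentioned
earlier.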
You will notice that the theory and experiment at 50,000 words are out by a
factor of less than 2 - not bad, eh?
They're worse at the larger numbers - perhaps Zipf didn't have the benefit
of large computers for his word counting and the 1/n rule is a poor
approximation at large numbers!
Does anyone else have any real-life values to add to this?
Iain Downs