RE: Corpora: Size of a representative corpus

chogan@york.mt.cs.cmu.edu
Thu, 20 Aug 98 14:57:46 EDT

Iain Downs writes:

> You will notice that the theory and experiment at 50,000 words are
> out by a factor of less than 2 - not bad, eh?

> They're worse at the larger numbers - perhaps Zipf didn't have the
> benefit of large computers for his word counting and the 1/n rule is
> a poor approximation at large numbers!

Granted, Zipf didn't have the benefit of large computers (or even,
presumably, large corpora) when he formulated his laws. Nevertheless,
I do believe that he tested them on a large amount (for his time) of
data.

The times that I have tested data against Zipf's laws, the agreement
has been fairly good.
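
For anyone who wants to repeat this kind of check, here is a minimal
sketch in Python. It is only an illustration, not the procedure I
actually used: the function names rank_frequencies and zipf_slope are
made up for this example, the tokenizer is a crude assumption
(lowercase runs of letters and apostrophes), and the fit is a plain
least-squares line through log(frequency) against log(rank). Under
the simple 1/n rule the slope should come out near -1.

```python
import math
import re
from collections import Counter


def rank_frequencies(text):
    """Return word counts sorted from most to least frequent.

    Tokenization here is an assumption: lowercase the text and keep
    runs of letters and apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Ranks start at 1. Needs at least two frequencies. A slope close
    to -1 matches the simple 1/n form of the law.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

To try it on real data, read a corpus file into a string and call
zipf_slope(rank_frequencies(text)). A single straight-line fit hides
where the disagreement sits, so it is worth also looking at the
high-rank tail separately, since that is where the 1/n rule is
usually said to drift.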

A very interesting Web page on this topic is the following:
http://sun1.bham.ac.uk/G.Landini/evmt/zipf.htm

The page is about applying Zipf's laws to the Voynich manuscript, but
it has a very good description of Zipf's laws, and several references
concerning modifications to the laws to make them more closely model
the data.

--Chris

------- end -------

christopher m. hogan language technologies institute
chogan@cs.cmu.edu carnegie mellon university
http://www.cs.cmu.edu/~chogan pittsburgh, pa