Corpora: Text vs speech w.r.t Zipf's law

Steve Finch (steve.finch@thomson.com)
Thu, 10 Dec 1998 13:35:16 -0500

Hi All,

I have just examined the distribution of words in a transcribed corpus
of conversational speech and (surprisingly) found it to be very
different from that of text.

The log-log graph of rank (X) vs frequency (Y) seems to asymptote to
a slope of -1.5ish for speech, but, as is well known, only -1.1ish for
text.
This clearly has major implications for the number of words you have
to consider to get, say, 98% coverage.
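
To make the coverage point concrete, here is a rough Python sketch. It
is not the analysis I actually ran. The function names, the plain
least-squares fit over all ranks, and the idealised model in which
frequency is proportional to rank**-exponent over a finite vocabulary
are all assumptions made purely for illustration. In particular, a fit
over every rank is only a crude stand-in for the asymptotic slope of
the tail.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Illustrative assumption: one straight line is fitted over all
    ranks, which only approximates the asymptotic slope of the tail.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct word types")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def words_for_coverage(exponent, vocab_size, target=0.98):
    """Smallest number of top-ranked words covering `target` of tokens.

    Illustrative assumption: word probabilities are exactly
    proportional to rank**-exponent over a vocabulary of `vocab_size`.
    """
    weights = [rank ** -exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    running = 0.0
    for rank, weight in enumerate(weights, start=1):
        running += weight
        if running >= target * total:
            return rank
    return vocab_size


if __name__ == "__main__":
    # Hypothetical vocabulary size, chosen only for the comparison.
    for label, exponent in (("speech", 1.5), ("text", 1.1)):
        needed = words_for_coverage(exponent, vocab_size=100_000)
        print(f"{label}: exponent {exponent}, words for 98% = {needed}")
```

Under that model the steeper (speech-like) exponent needs far fewer
word types than the shallower (text-like) one to reach 98% coverage.
The exact counts depend heavily on the assumed vocabulary size.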

This result must be well-known (in fact Redington found it earlier for
"child-speak" from the CHILDES corpus of mother/child interactions,
but I thought it was peculiar to that type of discourse; apparently
not!). Could anyone enlighten me on where I could read about it
(confirmation, disconfirmation, implications for ramblings on the
a priori necessity of Zipf's law, etc.)?

Cheers,

Steve

------------------------------------------------------------------------
When you steal from one person, it's called plagiarism;
When you steal from many, it's research. -- Wilson Mizner
------------------------------------------------------------------------
Steve Finch http://www.thomtech.com/nlp/steve.html

Thomson Labs/NLP | sfinch@thomtech.com
1375, Piccard Drive, | +1 301 548 4093 (voice)
Rockville, MD, 20850 | +1 301 527 4080 (paper)