Re: Corpora: MWUs and frequency

James L. Fidelholtz (jfidel@siu.buap.mx)
Thu, 8 Oct 1998 11:01:20 -0500 (CDT)

On Thu, 8 Oct 1998, Przemyslaw KASZUBSKI wrote:

>... And also its composition - to use a genre-eschewing generalisation:
>spoken data will arguably show many more frequently used clusters than
>written data. Or perhaps just "different"?

It is a well-known fact that in spoken language, the number of
types (at least word types, but surely bigrams, etc., also) tends to be
reduced relative to written language. The implication is that, e.g. for
words, there is some frequency at which spoken and written frequency
are in general the same (I have argued that for English this frequency
is 5 per million). For words occurring above this frequency, their
spoken frequency will generally be higher than in written language. For words
occurring below this frequency, spoken frequency is below written
frequency. There are, of course, some exceptions (e.g. words that occur
'only' in spoken [berserk] or only in written [albeit] language; of
course my examples are not great, or even correct, but they'll give you
an idea of what I mean; I have actually heard people utter the word
'albeit'), but this is the general tendency.
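
To make the tendency concrete, here is a minimal sketch in Python of how
one might check it against a pair of corpora. The function names, the
use of the written frequency as the reference point, and the treatment
of the crossover value itself are all assumptions made for illustration;
only the 5 per million figure comes from the argument above.

```python
from collections import Counter

# Crossover frequency argued for English above (occurrences per million).
CROSSOVER_PER_MILLION = 5.0


def per_million(tokens):
    """Return {word: occurrences per million tokens} for one corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n * 1_000_000 / total for word, n in counts.items()}


def follows_tendency(spoken_pm, written_pm, crossover=CROSSOVER_PER_MILLION):
    """Check one word against the claimed tendency.

    Assumption: the written per-million frequency is used as the
    reference for 'above' and 'below' the crossover.
    """
    if written_pm > crossover:
        # Frequent words: spoken frequency expected to be the higher one.
        return spoken_pm > written_pm
    if written_pm < crossover:
        # Rare words: spoken frequency expected to be the lower one.
        return spoken_pm < written_pm
    # Exactly at the crossover no direction is checked here.
    return True
```

With real data one would tokenise a spoken and a written corpus, pass
each token list to per_million, and then count how many shared words
satisfy follows_tendency; words found in only one of the two corpora
are the kind of exception mentioned above.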
Jim

James L. Fidelholtz e-mail: jfidel@siu.buap.mx
Maestría en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO