Re: Corpora: Statistics in genre differences

James L. Fidelholtz (jfidel@siu.buap.mx)
Mon, 22 Mar 1999 08:48:54 -0600 (CST)

On Fri, 19 Mar 1999, Kristen Precht wrote:
[snip]
>..., it's hard to assume that the reader would notice the difference
>between 2 per thousand words and 5 per thousand words.

Not only is it not hard, it is impossible to assume that they WOULDN'T
notice such a gross difference. In a 1976 article in the _Chicago
Linguistic Society Papers_ (approx. pp. 200-213, or some such), I showed
that English vowel reduction in certain contexts is 'statistically'
dependent on word frequency. If the word occurs more than about 5 times
per MILLION [!], lax unstressed vowels in the initial syllable before a
consonant cluster reduce (_a_stronomy, m_i_stake). If the frequency is
below 5/M, they do not reduce (g_a_stronomy, m_i_stook). Note the
Thorndike-Lorge frequencies for these words (or substitute your own
favorite count): astronomy 4/M [note: this word may be unreduced, as it
is 'on the cusp']; gastronomy 1/M; mistake [I don't remember exactly,
but it's several tens per M]; mistook [again, about 1/M, as I recall].
To compound my earlier statistical errors, let me say that the results
were VERY statistically significant (p < .[several zeros]1 using the
chi square test). What I really mean by that is that if you look at the
tables, the results of any statistical test are evident before doing
it. So if the difference between 1/M and 10/M has a statistically
significant effect on linguistic behavior, it is not possible to
believe that the difference between 2000/M and 5000/M (i.e., 2 vs. 5 per
thousand) goes unnoticed. Just ask naive speakers which of two words
(e.g. 'the' and 'walk', to take a perhaps slightly unfair example) they
feel is more familiar, or common. By the way, not very many words
occur at 2-5/K [I don't have any counts here, so I can't check the
numbers, but it's certainly not over a couple of hundred].
I should comment that the reduction of vowels in English may be
affected by a number of factors, but the examples above control for
most of them, and the overall results (on the basis of an exhaustive
check of Kenyon and Knott's pronouncing dictionary) leave no room for
doubt.
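
For anyone who wants to see what a chi square test on a table of this
kind looks like, here is a minimal sketch in Python. The counts are
INVENTED purely for illustration (they are not the figures from the 1976
article), and the function name chi_square_2x2 is my own; the rows stand
for words above and below 5/M, the columns for reduced and unreduced
vowels. With 1 degree of freedom the p-value can be computed from the
complementary error function, so nothing beyond the standard library is
needed.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    Returns (chi2, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# HYPOTHETICAL counts, not data from the article:
# rows:    frequency >= 5/M, frequency < 5/M
# columns: vowel reduced,    vowel unreduced
hypothetical_counts = [[90, 10], [15, 85]]
chi2, p = chi_square_2x2(hypothetical_counts)
print(f"chi-square = {chi2:.2f}, p = {p:.3g}")
```

With a table this lopsided the test only confirms what is visible by
inspection, which is the point made above.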
Jim

James L. Fidelholtz e-mail: jfidel@siu.buap.mx
Maestría en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO