Re: Corpora: frequency lists for clusters & MWU

Ted E. Dunning (ted@aptex.com)
Wed, 7 Oct 1998 11:30:41 -0700

hmmm....

It is true that you may well find interesting things in any particular
list such as the one I sent around. I noted a few such things.
Many others are likely to exist (note the prevalence of the third
person, especially in sentence-initial position).

But what I was trying to say had more to do with the futility of using
such frequency-sorted lists as generalizations. The features that I
pointed out and that you pointed out demonstrate exactly this point.
Essentially all of these points of interest are due *precisely* to the
specific nature of the text that I analysed. The fact that the
particular nature of the text I used is this prominent is a strong
argument *against* the general utility of such frequency-sorted lists
of collocates.

On the other hand, it is very clear that such lists (and especially
statistical tests based on the underlying counts) can be used in
comparative studies to find *differences* between corpora. In fact,
this utility is exactly what makes most text retrieval systems useful
at all. To be repetitious, this utility is also what makes these
lists dubious as generalizations.
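
To make the comparative use concrete, here is a rough sketch of one
way to score bigrams by how differently they behave in two corpora.
The choice of the log-likelihood ratio (G^2) statistic is an
assumption made for illustration; the paragraph above does not name a
particular test. The function names (bigram_counts, g2, compare) and
the whitespace tokenization are likewise hypothetical.

```python
import math
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def g2(k1, n1, k2, n2):
    """Log-likelihood ratio (G^2) for k1 hits out of n1 vs k2 out of n2.

    Compares the observed 2x2 table against the expectation that the
    item has the same relative frequency in both corpora.
    """
    def term(observed, expected):
        # A cell with zero observations contributes nothing to the sum.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    p = (k1 + k2) / (n1 + n2)   # pooled relative frequency
    e1, e2 = n1 * p, n2 * p     # expected hits in each corpus
    return 2.0 * (
        term(k1, e1) + term(n1 - k1, n1 - e1)
        + term(k2, e2) + term(n2 - k2, n2 - e2)
    )


def compare(tokens_a, tokens_b):
    """Return (score, bigram) pairs, most corpus-distinguishing first."""
    a, b = bigram_counts(tokens_a), bigram_counts(tokens_b)
    na, nb = sum(a.values()), sum(b.values())
    # Counter returns 0 for bigrams absent from one corpus.
    scored = [(g2(a[bg], na, b[bg], nb), bg) for bg in set(a) | set(b)]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    corpus_a = "but the cat sat and but the dog sat".split()
    corpus_b = "the cat sat on the mat with the dog".split()
    for score, bigram in compare(corpus_a, corpus_b)[:5]:
        print(f"{score:8.3f}  {' '.join(bigram)}")
```

A bigram with the same relative frequency in both corpora scores zero,
so what rises to the top of such a list is whatever is peculiar to one
text or the other, not anything general about the language.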

From what I read, I think that Kay and I actually agree quite closely
on these matters.

>>>>> "KBW" == Kay Wikberg <k.b.wikberg@iba.uio.no> writes:

kbw> It is true that most bigrams are uninteresting, but even the
kbw> short list you offer contains more than you seem to be aware
kbw> of. ... [ observations such as people start sentences with
kbw> "But" ] ...