Corpora: MWUs and frequency

Jean Hudson (jhudson@cup.cam.ac.uk)
Wed, 07 Oct 1998 10:49:49 +0100

Przemek Kaszubski wrote:

>Are there frequency lists of English
>(lemmatised/non-lemmatised) 2-3-4-5 word clusters available anywhere,
>preferably retrieved from large balanced corpora? Or
>frequency lists of multi-word-units?

Ted Dunning is right in saying that "frequency lists for single words are
highly suspect, especially below roughly the thousandth most common word.
The utility of a frequency list for multi-word units is even more
doubtful", though I'm curious to know what you mean by "interesting
collocations"... (frequent? unexpected? or something to do with how they function
in discourse?)

There are (at least) two important issues here: 1) how to extract MWUs from
a corpus, and 2) how to interpret the results of that exercise. I'll leave
the first question to computational expertise (eg Ted's reference to his
own paper and many others, though my preference is WordSmith Tools).

Interpreting the data is another matter. I'd say that even the most
frequent words are suspect, viewed as single words. Words like 'of', 'as'
and 'all' are easily ignored when we interpret frequency lists, but take a
look at how they're drawn to MWUs - especially in informal language.
Extracting MWUs with these words from any corpus leaves you with a very
different frequency count for the individual word. In other words, if you
treat the MWU as a word in its own right then (depending on the focus of
your analysis) you should perhaps be subtracting the occurrences of the
component words from the final list. I don't know of any computational
tools that do this; probably they can't since the extraction of meaningful
MWUs requires manual intervention.
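
To make the subtraction idea concrete, here is a minimal sketch in Python.
It assumes the MWU list has already been compiled by hand (the manual
intervention mentioned above), and it simply counts each MWU hit as one unit
instead of counting its component words. The function name adjusted_counts,
the MWU list and the sample sentence are all invented for illustration; this
is not a description of any existing tool.

```python
from collections import Counter


def adjusted_counts(tokens, mwus):
    """Return (raw, adjusted) frequency counts for a list of tokens.

    tokens: list of lower-cased word tokens.
    mwus:   hand-compiled MWUs, each given as a tuple of words
            (assumption: chosen by manual analysis, not extracted here).
    """
    # Try longer MWUs first so a longer unit wins over a shorter one
    # that starts at the same position.
    by_length = sorted(set(mwus), key=len, reverse=True)

    raw = Counter(tokens)   # ordinary single-word frequency list
    adjusted = Counter()    # MWUs counted as units, components not counted

    i = 0
    while i < len(tokens):
        for mwu in by_length:
            if tuple(tokens[i:i + len(mwu)]) == mwu:
                adjusted[" ".join(mwu)] += 1
                i += len(mwu)       # skip the words consumed by the MWU
                break
        else:
            adjusted[tokens[i]] += 1  # no MWU starts here: count the word
            i += 1
    return raw, adjusted


# Invented sample sentence and MWU list, for illustration only.
tokens = "it is all right as long as all of us are all right".split()
mwus = [("all", "right"), ("as", "long", "as")]
raw, adjusted = adjusted_counts(tokens, mwus)

print(raw["all"], adjusted["all"], adjusted["all right"])
```

In this toy example 'all' occurs three times in the raw list but only once
in the adjusted list, because two of its occurrences belong to 'all right'.
A greedy left-to-right match like this is only one possible design; it
cannot tell an MWU from a chance sequence of the same words, which is
exactly where the manual checking comes in.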

Finally, what does it mean that an MWU is frequent? My answer here would be
that it's emerging as a unit of meaning, ie undergoing the transition from
MWU to single word status, with accompanying change in meaning and function
(eg the most frequent MWU with 'all': 'all right' > 'alright'). Does this
mean that we should be teaching learners of English the most frequent MWUs?
Or what?

I'd be interested to hear what Przemek intends to use frequency lists for
and, indeed, what others have to say about the significance of frequency.

(- my own reference on the subject is: Perspectives on fixedness: applied
and theoretical. Lund UP.)

- regards
Jean

Jean Hudson
Research Editor
Cambridge University Press / ELT
Direct line: +1223-325123