Corpora: Re: MWUs and frequency

Jean Hudson (jhudson@cup.cam.ac.uk)
Thu, 08 Oct 1998 14:23:22 +0100

Yes, MDL techniques might provide a basis for determining which multi-word
units behave as single words, which would be a welcome development indeed.

The point of my last mail was not to claim to "know" what is and what is
not a "word", but to question the interpretation of frequency lists of
multi-word units.

There is strong evidence from diachronic studies of English that the
re-assignment of word boundaries, whether by compounding, re-bracketing
('affixation' / 'agglutination'), or the fixation of phrasal syntagms
('structuration'), is a unidirectional process. Words group, fuse, and
subsequently reduce, after which they can participate in the process again.
The duration of the cycle varies, from a relatively brief span to
infinity (i.e., it need not be completed). Concurrently, meaning shifts take
place, which enhances the possibility of computational recognition (since
the semantic and syntactic patterns around the unit also change).

There are implications for the synchronic study of the phenomenon: the
transition from ad-hoc expression to multi-word expression (and perhaps to
"word") involves routinization (John Haiman) and entrenchment (see Brian
MacWhinney's recent mailing to the Funknet list, on storage parsimony) -
and hence frequency. This is perhaps parenthetical to the interests of
computational linguists and statisticians, but, as I think I indicated in
the previous message, the fact that a multi-word unit occurs frequently is
an indication of ongoing change, rather than a measure of the degree to
which it is a fixed unit of meaning/reference. Following on from this,
lower-frequency multi-word units might not be so easy to detect
computationally?