Re: Corpora: MWUs and frequency

Przemyslaw KASZUBSKI (przemka@main.amu.edu.pl)
Thu, 8 Oct 1998 15:52:46 +0000

Many thanks for a most stimulating discussion! To those who have contacted me about a summary: I
promise to post one, as some of the responses have reached only my mailbox.

1. As John Milton has said, my area of interest is learner corpora, so I use information derived
from native English corpora as the criteria against which to examine IL (interlanguage) material.
The approach is descriptive, but at times prescriptive too.

2. Currently, I'm exploring EFL learners' use of core vocabulary in writing (you may recall my
earlier query about "simplified English", though I have since been advised not to use that term, as
it is a kind of controlled English claimed by the aerospace industry; thanks are due to Jeff Allen).
While core vocabularies have traditionally been presented as lists of lexemes (= what corpus
linguists would call "lemmas"), one of my interests is whether their wordforms are evenly
distributed, or whether certain wordforms are more typical than others (possibly even
"prototypical"). If the latter is the case, the next question is: are EFL learners aware of this
when they use any of the core words?
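
To make the question in point 2 concrete, here is a minimal Python sketch of one way the share of
each wordform within a lemma could be counted. It is only an illustration: the lemma table
LEMMA_FORMS, the function name wordform_shares and the sample sentence are all invented for the
example, and a real study would take its lemma-to-wordform mapping from a proper lemma list.

```python
from collections import Counter

# Hypothetical lemma table (an assumption for illustration only):
# each core lemma maps to the wordforms that realise it.
LEMMA_FORMS = {
    "take": {"take", "takes", "took", "taken", "taking"},
    "make": {"make", "makes", "made", "making"},
}


def wordform_shares(tokens, lemma_forms=LEMMA_FORMS):
    """Return {lemma: {wordform: share}} for the wordforms attested in tokens."""
    counts = Counter(t.lower() for t in tokens)
    shares = {}
    for lemma, forms in lemma_forms.items():
        total = sum(counts[f] for f in forms)
        if total == 0:
            continue  # lemma not attested in this sample at all
        # Keep only attested forms; each share is that form's part of the lemma total.
        shares[lemma] = {f: counts[f] / total for f in sorted(forms) if counts[f]}
    return shares


if __name__ == "__main__":
    sample = "He took the bus and she took a taxi but they take the train".split()
    print(wordform_shares(sample))
```

Running the same function over a native corpus and a learner corpus would give two sets of shares
per lemma that can be put side by side; whether a skewed share counts as "prototypical" remains a
matter of interpretation.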

3. To turn to MWUs and clusters, I deliberately put them together in one group. I'm mostly after
combinations - also of more than just 2 words - in which _lexical_ core words are used. I am aware
that routine retrieval of any 2-word clusters is likely to produce "in the", "out of" and the
like, as Ted Dunning has convincingly demonstrated. Other than the frequency factor, I was unaware
of all the statistical intricacies affecting the decision as to which clusters in a corpus are,
say, "important". I'm inclined to think, however, that these are problems that analysts of very
large corpora have to face first. It seems to me that restricting my research to
the most frequent combinations with at least one lexical core lexeme in them is a relatively
reliable heuristic, sufficient for my applied purpose. So even if I get a lot of potentially
useless clusters from, say, the BNC, I believe I can handle them on this basis. PLUS I know I will
not escape looking at this data later!
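
The heuristic described in point 3 could be sketched roughly as follows in Python: count all
n-word clusters, keep only those containing at least one lexical core wordform, and rank them by
raw frequency. The set CORE_FORMS, the function names and the sample text are assumptions made up
for the illustration; the sketch also ignores sentence boundaries and uses no statistical
association measure, only frequency.

```python
from collections import Counter

# Hypothetical set of lexical core wordforms (an assumption for illustration);
# a real list would come from a published core vocabulary.
CORE_FORMS = {"take", "took", "time", "make", "made"}


def ngrams(tokens, n):
    """Yield successive n-word clusters from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def frequent_core_clusters(tokens, n=2, core=CORE_FORMS, top=10):
    """Return the most frequent n-word clusters containing at least one core wordform."""
    tokens = [t.lower() for t in tokens]
    counts = Counter(g for g in ngrams(tokens, n) if any(w in core for w in g))
    return counts.most_common(top)


if __name__ == "__main__":
    sample = "it took a long time and it took a lot of time in the end".split()
    for cluster, freq in frequent_core_clusters(sample, n=3, top=5):
        print(freq, " ".join(cluster))
```

Purely grammatical clusters such as "in the" never enter the counts here, because none of their
words is in the core set; everything that does survive still has to be inspected by hand.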

Whether I regard the status of these clusters or MWUs as equivalent to that of a word is a
secondary issue, I think, though certainly not without relevance.

Chris Tribble's method looks very interesting, and I will certainly try it. Its weakness, however,
is that it binds me to a corpus I actually have, and to compile the lists I want I'd rather use a
much bigger one than LOB (which I neither have access to nor can afford to buy).

NB. If reliability of frequency counts drops around the thousandth word, what might be the limit
for a bigram, trigram etc? I presume it is the size of the corpus that plays the major role,
after all? And also its composition - to use a genre-eschewing generalisation: spoken data will
arguably show many more frequently used clusters than written data. Or perhaps just
"different"?

Przemek Kaszubski

On 7 Oct 98 at 10:49, Jean Hudson wrote:


> There are (at least) two important issues here: 1) how to extract MWUs from
> a corpus, and 2) how to interpret the results of that exercise. I'll leave
> the first question to computational expertise (eg Ted's reference to his
> own paper and many others, though my preference is Wordsmith tools).
>
> Interpreting the data is another matter. I'd say that even the most
> frequent words are suspect, viewed as single words. Words like 'of', 'as'
> and 'all' are easily ignored when we interpret frequency lists, but take a
> look at how they're drawn to MWUs - especially in informal language.
> Extracting MWUs with these words from any corpus leaves you with a very
> different frequency count for the individual word. In other words, if you
> treat the MWU as a word in its own right then (depending on the focus of
> your analysis) you should perhaps be subtracting the occurrences of the
> component words from the final list. I don't know of any computational
> tools that do this; probably they can't since the extraction of meaningful
> MWUs requires manual intervention.
>
> Finally, what does it mean that an MWU is frequent? My answer here would be
> that it's emerging as a unit of meaning, ie undergoing the transition from
> MWU to single word status, with accompanying change in meaning and function
> (eg the most frequent MWU with 'all': 'all right' > 'alright'). Does this
> mean that we should be teaching learners of English the most frequent MWUs?
> Or what?
>
> I'd be interested to hear what Przemek intends to use frequency lists for
> and, indeed, what others have to say about the significance of frequency.
>
> (- my own reference on the subject is: Perspectives on fixedness: applied
> and theoretical. Lund UP.)
>
> - regards
> Jean
>
>
> Jean Hudson
> Research Editor
> Cambridge University Press / ELT
> Direct line: +1223-325123
>
>
>
==========================================
Przemyslaw Kaszubski, M.A.
przemka@amu.edu.pl
http://elex.amu.edu.pl/ifa/skaszub.htm

MY (ENGLISH) (LEARNER) CORPORA PAGE:
http://main.amu.edu.pl/~przemka

School of English
Adam Mickiewicz University
Al. Niepodleglosci 4
61-874 Poznan, POLAND
tel: +48 61 8528820
fax: +48 61 8523103
=========================================