Re: Corpora: frequency lists for clusters & MWU

Ted E. Dunning (ted@aptex.com)
Tue, 6 Oct 1998 17:24:39 -0700

Frequency lists for single words are highly suspect, especially below
roughly the thousandth most common word. The utility of a frequency
list for multi-word units is even more doubtful.

That being said, I would be happy to offer up several of the most
common bigrams from a small corpus (1M words) as an illustration of
how little you are likely to learn from frequency-sorting bigrams:

#S The
of the
in the
said #S
to the
AP #S
#S #D
on the
for the
and the
said the
in a
at the
#S He
#S In
by the
to be
#S But
with the
of a

Here #S indicates a sentence boundary and #D a document boundary. The
only items of interest are the bigrams that include the word "said".
They are so common because this text came from the AP newswire.
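
For readers who want to reproduce a list like the one above, here is a
minimal Python sketch of frequency-sorting bigrams with boundary markers.
The function names (token_stream, bigram_counts) and the input layout (a
list of documents, each a list of tokenized sentences) are assumptions
made for illustration, as is the exact placement of #S and #D in the
stream. The post does not describe how the original counts were produced.

```python
from collections import Counter


def token_stream(documents):
    """Yield tokens with boundary markers inserted.

    Assumed layout: documents is a list of documents, each document a
    list of sentences, each sentence a list of word tokens. "#S" is
    emitted before every sentence and after the last one in a document;
    "#D" closes each document.
    """
    for doc in documents:
        for sentence in doc:
            yield "#S"
            yield from sentence
        yield "#S"
        yield "#D"


def bigram_counts(documents):
    """Count every adjacent pair of tokens, markers included."""
    tokens = list(token_stream(documents))
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    # Tiny made-up corpus, only to show the call pattern.
    docs = [
        [["The", "mayor", "said"], ["He", "left"]],
        [["The", "vote", "failed"]],
    ]
    for (first, second), freq in bigram_counts(docs).most_common(5):
        print(first, second, freq)
```

Even on a toy corpus the top of the list is dominated by boundary
markers and function words, which is the point being made above.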

There *are* other ways to look at word cooccurrence besides frequency
sorting. I tend to like to plug my Computational Linguistics paper
(CL volume 19, number 1, pages 61-74) where I introduced a useful
statistical measure for finding interesting collocations. There are
many other measures which people use for various purposes.
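
The post does not name the measure. The sketch below assumes it is the
log-likelihood ratio statistic (G^2) computed over a 2x2 contingency
table for each bigram, which is the measure usually associated with the
cited paper. The function names (llr, score_bigrams) and the way the
table is derived from raw bigram counts are illustrative assumptions,
not the author's code.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of (A, B)        k12: count of (A, not B)
    k21: count of (not A, B)    k22: count of (not A, not B)
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            k = cells[i][j]
            if k > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += k * math.log(k / expected)
    return 2.0 * total


def score_bigrams(counts):
    """Return {bigram: G^2} given a Counter of (first, second) pairs."""
    n = sum(counts.values())
    left = Counter()   # how often each word occurs as first element
    right = Counter()  # how often each word occurs as second element
    for (a, b), c in counts.items():
        left[a] += c
        right[b] += c
    scores = {}
    for (a, b), k11 in counts.items():
        k12 = left[a] - k11
        k21 = right[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores
```

Sorting the bigrams by this score instead of by raw frequency favors
pairs that occur together more often than their individual frequencies
would predict, rather than pairs that are simply common.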

>>>>> "pk" == Przemyslaw KASZUBSKI <przemka@main.amu.edu.pl> writes:

pk> Another question: Are there frequency lists of English
pk> (lemmatised/non-lemmatised) 2-3-4-5 word clusters available
pk> anywhere, preferably retrieved from large balanced corpora? Or
pk> frequency lists of multi-word-units?