Re: Corpora: frequency lists for clusters & MWU

John Milton (lcjohn@uxmail.ust.hk)
Wed, 7 Oct 1998 18:08:04 +0800 (HKT)

I've found comparing words and n-grams extremely useful, although, as Ted
cautions, I try to stick to high-frequency items. The comparisons I am
interested in involve text corpora in which various cohorts of writers
attempt to target the same discourse style (also Przemyslaw's interest --
i.e., NS vs. NNS writers). Using Ted's log-likelihood test (a good
implementation is Mike Scott's WordSmith), I find, for example, that
Chinese learners of English undergenerate words such as 'was', 'has',
'been', 'an' and 'where' compared with English NS students of the same
age and education who are purportedly writing in the same genre. Chinese
learners overgenerate words such as 'think', 'can', many personal
pronouns, etc. I also find very interesting differences in multi-word
units, which describe and quantify the particular discourse accent of
this cohort of learners.
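
A rough sketch of the kind of log-likelihood comparison described above,
in Python. This is not WordSmith's actual code, and the word counts and
corpus sizes in the example are invented purely for illustration:

import math

def log_likelihood(a, b, corpus1_size, corpus2_size):
    # Dunning-style log-likelihood (G2) for a word occurring a times in
    # a corpus of corpus1_size tokens and b times in one of corpus2_size
    # tokens.  Large values flag words that are markedly over- or
    # under-used in one corpus relative to the other.
    # Expected counts under the null hypothesis that the word is equally
    # likely in both corpora:
    e1 = corpus1_size * (a + b) / (corpus1_size + corpus2_size)
    e2 = corpus2_size * (a + b) / (corpus1_size + corpus2_size)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Hypothetical counts for 'was': 850 hits in a 500,000-token learner
# corpus vs. 2,100 in a 500,000-token NS corpus.  The sign of
# a/corpus1_size - b/corpus2_size shows whether the learners over- or
# under-generate the word; the G2 score shows how unlikely the
# difference is to be chance.
print(log_likelihood(850, 2100, 500000, 500000))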

John
.............................................
John Milton
Hong Kong University of Science & Technology
lcjohn@usthk.ust.hk

On Tue, 6 Oct 1998, Ted E. Dunning wrote:

>
> Frequency lists for single words are highly suspect, especially below
> roughly the thousandth most common word. The utility of a frequency
> list for multi-word units is even more doubtful.
>
> That being said, I would be happy to offer up several of the most
> common bigrams from a small corpus (1M words) as an illustration of
> how little you are likely to learn from frequency sorting bigrams:
>
> #S The
> of the
> in the
> said #S
> to the
> AP #S
> #S #D
> on the
> for the
> and the
> said the
> in a
> at the
> #S He
> #S In
> by the
> to be
> #S But
> with the
> of a
>
> Here #S indicates a sentence boundary and #D a document boundary. The
> only items of interest are the bigrams which include the word "said".
> Their prevalence reflects the fact that this text came from the AP
> newswire.
>
> There *are* other ways to look at word co-occurrence besides frequency
> sorting. I tend to like to plug my Computational Linguistics paper
> (CL volume 19, number 1, pages 61-74) where I introduced a useful
> statistical measure for finding interesting collocations. There are
> many other measures which people use for various purposes.
>
> >>>>> "pk" == Przemyslaw KASZUBSKI <przemka@main.amu.edu.pl> writes:
>
> pk> Another question: Are there frequency lists of English
> pk> (lemmatised/non-lemmatised) 2-3-4-5 word clusters available
> pk> anywhere, preferably retrieved from large balanced corpora? Or
> pk> frequency lists of multi-word-units?
>
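
Both halves of Ted's point (that raw frequency sorting of bigrams tells
you little, and that a statistical measure such as the log-likelihood
ratio from his CL 19(1) paper is more informative) can be sketched in a
few lines of Python. The following is only an illustration, not code
from that paper; 'corpus.txt' is a placeholder path and the tokenization
is deliberately crude:

import math
import re
from collections import Counter

def tokenize(text):
    # Crude tokenization: split into sentences on .!? and wrap each
    # sentence with a #S boundary marker, roughly as in Ted's list above.
    tokens = []
    for sentence in re.split(r'[.!?]+', text):
        words = sentence.split()
        if words:
            tokens.extend(['#S'] + words)
    tokens.append('#S')
    return tokens

def g2(k11, k12, k21, k22):
    # Log-likelihood ratio over a 2x2 contingency table of counts:
    # k11 = w1 followed by w2, k12 = w1 followed by something else,
    # k21 = something else followed by w2, k22 = neither.
    def h(*ks):
        total = sum(ks)
        return sum(k * math.log(k / total) for k in ks if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

with open('corpus.txt') as f:          # placeholder file name
    tokens = tokenize(f.read())

bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
n = len(tokens) - 1                    # number of bigram positions

def score(item):
    (w1, w2), k11 = item
    k12 = unigrams[w1] - k11           # approximate: ignores corpus edges
    k21 = unigrams[w2] - k11
    k22 = max(n - k11 - k12 - k21, 0)  # guard against degenerate cases
    return g2(k11, k12, k21, k22)

print('Top bigrams by raw frequency (dominated by pairs like "of the"):')
for (w1, w2), freq in bigrams.most_common(10):
    print(freq, w1, w2)

print('Top bigrams by log-likelihood (content collocations tend to rise):')
for (w1, w2), freq in sorted(bigrams.items(), key=score, reverse=True)[:10]:
    print(freq, w1, w2)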