John
.............................................
John Milton
Hong Kong University of Science & Technology
lcjohn@usthk.ust.hk
On Tue, 6 Oct 1998, Ted E. Dunning wrote:
>
> Frequency lists for single words are highly suspect, especially below
> roughly the thousandth most common word. The utility of a frequency
> list for multi-word units is even more doubtful.
>
> That being said, I would be happy to offer up several of the most
> common bigrams from a small corpus (1M words) as an illustration of
> how little you are likely to learn from frequency sorting bigrams:
>
> #S The
> of the
> in the
> said #S
> to the
> AP #S
> #S #D
> on the
> for the
> and the
> said the
> in a
> at the
> #S He
> #S In
> by the
> to be
> #S But
> with the
> of a
>
> Here #S indicates a sentence boundary and #D a document boundary. The
> only items of interest are the bigrams which include the word "said".
> Their prevalence is caused by the fact that this text was from the AP
> newswire.
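
A minimal sketch of the frequency sorting described above, in Python.
The function name bigram_counts and the two toy sentences are invented
for illustration; only the #S boundary marker comes from the quoted
post, and #D document boundaries are left out for brevity.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent token pairs, marking sentence boundaries with '#S'."""
    counts = Counter()
    for sentence in sentences:
        # Pad each sentence so that boundary bigrams such as ('#S', 'The')
        # are counted, as in the list quoted above.
        tokens = ["#S"] + sentence.split() + ["#S"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Toy input, made up for this example (not the AP newswire corpus).
sentences = [
    "The cat sat on the mat",
    "The dog sat on the rug",
]

counts = bigram_counts(sentences)

# Frequency sorting: most common bigrams first.
for (w1, w2), n in counts.most_common(5):
    print(w1, w2, n)
```

Even on this tiny input the top of the list is taken up by boundary
and function-word pairs, which is the point made in the quoted post.
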
>
> There *are* other ways to look at word co-occurrence besides frequency
> sorting. I tend to like to plug my Computational Linguistics paper
> (CL volume 19, number 1, pages 61-74) where I introduced a useful
> statistical measure for finding interesting collocations. There are
> many other measures which people use for various purposes.
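
To my knowledge, the measure in the cited paper is the log-likelihood
ratio (often written G^2), computed over a 2x2 contingency table for
each bigram. What follows is only a rough sketch under that assumption,
not the paper's own code. The function name, the argument names k11..k22
and the counts in the usage lines are all hypothetical.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of bigram counts.

    k11: w1 followed by w2          k12: w1 followed by some other word
    k21: some other word before w2  k22: bigrams with neither w1 nor w2
    """
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing (0 * ln 0 := 0)
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Invented counts for one bigram in a hypothetical 1M-bigram corpus.
score = log_likelihood_ratio(20, 80, 180, 999720)
print(score)
```

A score near zero means the two words co-occur about as often as their
individual frequencies predict; larger scores mean a stronger departure
from independence. Sorting bigrams by this score rather than by raw
frequency is one way to push pairs like "of the" down the list.
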
>
> >>>>> "pk" == Przemyslaw KASZUBSKI <przemka@main.amu.edu.pl> writes:
>
> pk> Another question: Are there frequency lists of English
> pk> (lemmatised/non-lemmatised) 2-3-4-5 word clusters available
> pk> anywhere, preferably retrieved from large balanced corpora? Or
> pk> frequency lists of multi-word-units?
>