Re: Corpora: phrase (n-gram) frequency information

eric@scs.leeds.ac.uk
Tue, 29 Jun 1999 07:55:19 +0100

David,
>Hello the list! Does anyone have information to offer on the most
>common English phrases in use in a given body of text? That is, what
>4-word, 5-word (10-word, whatever) phrases appear most frequently in the
>Bible, in Shakespeare, in Tom Clancy novels, in newspapers, in any known
>corpora? Any information on this would be greatly appreciated.

I think this varies greatly depending on the type of text: whereas the list of
individual words which appear frequently is comparatively fixed across
genres, the frequencies of longer n-grams are much more indicative of the
text genre. Furthermore, the "most frequent" 10-grams may only appear a
handful of times in the whole of a corpus, making it harder to be sure that
the "frequency" is really significant.
If you're looking for frequent 10-grams in a specific text genre (eg epa.gov
documents???) then you're probably better off counting them yourself.
If you really want genre-independent n-grams characteristic of English
as a whole, why not use a dictionary? The Collins English Dictionary and the
COBUILD dictionary, for example, include more multi-word lexical entries
than "singletons".
What's your application?
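
For the "count them yourself" route, a minimal sketch in Python might look
like the following. The function names, the crude regular-expression
tokeniser and the min_count threshold are all illustrative assumptions,
not part of any particular corpus toolkit; the threshold is only a rough
guard against reporting n-grams that occur too rarely to mean much.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive word tokens in text."""
    # Crude tokenisation (an assumption): lowercase the text and keep
    # runs of letters, digits and apostrophes as word tokens.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of width n over the token list.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


def frequent_ngrams(text, n, min_count=2):
    """Return (phrase, count) pairs seen at least min_count times,
    most frequent first."""
    counts = ngram_counts(text, n)
    return [(" ".join(gram), count)
            for gram, count in counts.most_common()
            if count >= min_count]


if __name__ == "__main__":
    # Tiny made-up sample; a real run would read the text of your own genre.
    sample = ("In the beginning was the word, and the word was with God. "
              "In the beginning was the light.")
    for phrase, count in frequent_ngrams(sample, 4):
        print(count, phrase)
```

On a real collection you would read the documents from disk and probably
raise min_count, since for long n-grams most counts will be 1.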
Eric

Eric Atwell, Senior Lecturer in Artificial Intelligence, SOCRATES Coordinator,
and Director, Centre for Computer Analysis of Language And Speech (CCALAS)
School of Computer Studies, University of Leeds, LEEDS LS2 9JT, England
EMAIL: eric@scs.leeds.ac.uk TEL: (44)113-2335761 FAX: (44)113-2335468
WWW: http://www.scs.leeds.ac.uk/scs/public/staff/eric.html