I think this varies greatly depending on the type of text: whereas the list of
individual words which appear frequently is comparatively stable across
genres, the frequencies of longer n-grams are much more indicative of genre.
Furthermore, the "most frequent" 10-grams may appear only a handful of times
in the whole of a corpus, making it hard to be sure that the "frequency"
is really significant.
If you're looking for frequent 10-grams in a specific text genre (e.g. epa.gov
documents?) then you're probably better off counting them yourself.
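
Counting them yourself need not be much work. Here is a minimal sketch in
Python, assuming plain-text input and a crude regular-expression tokeniser;
the function name count_ngrams, the sample sentence and the threshold of 2
are only illustrative, and you would want a proper tokeniser and a much
larger sample for real work.

```python
import re
from collections import Counter


def count_ngrams(text, n):
    """Count word n-grams in text (lowercased, naive tokenisation)."""
    # Assumption: a token is any run of letters, digits or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of n tokens across the token list.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Illustrative sample only; substitute the text of your own corpus.
sample = "the cat sat on the mat and the cat sat on the rug"
counts = count_ngrams(sample, 3)

# Keep only n-grams seen at least twice, to drop one-off sequences.
top = [(gram, c) for gram, c in counts.most_common() if c >= 2]
for gram, c in top:
    print(c, " ".join(gram))
```

With n=10 on a real corpus, expect most counts to be 1, which is exactly
the significance problem mentioned above.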
If you really want genre-independent n-grams characteristic of English
as a whole, why not use a dictionary? The Collins English Dictionary and the
COBUILD dictionary, for example, include more multi-word lexical entries
than "singletons".
What's your application?
Eric
Eric Atwell, Senior Lecturer in Artificial Intelligence, SOCRATES Coordinator,
and Director, Centre for Computer Analysis of Language And Speech (CCALAS)
School of Computer Studies, University of Leeds, LEEDS LS2 9JT, England
EMAIL: eric@scs.leeds.ac.uk TEL: (44)113-2335761 FAX: (44)113-2335468
WWW: http://www.scs.leeds.ac.uk/scs/public/staff/eric.html