Lemmatised frequency lists

Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Wed, 12 Jun 1996 10:01:43 +0100 (BST)

Lemmatised BNC frequency list available
=======================================

Following various requests, particularly from workers in English
Language teaching, I have prepared a lemmatised frequency list from
the BNC. This is a single list giving word frequencies for the 6,318
words with more than 800 occurrences in the whole 100M-word BNC. The
definition of a 'word' approximates to a headword in an EFL dictionary
such as the Longman Dictionary of Contemporary English: so, eg, nominal
and verbal "help" are listed separately, and the count for verbal
"help" is the sum of counts for verbal 'help', 'helps', 'helping',
'helped'.
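
As a rough illustration of the counting rule just described, here is a
minimal Python sketch. The form-to-lemma table (LEMMA_OF) and the
per-form counts are invented for the example; they are not BNC figures,
and this is not the procedure actually used to build the list.

```python
from collections import Counter

# Hypothetical table mapping (word form, word class) to (lemma, word class).
# Only a handful of entries, purely for illustration.
LEMMA_OF = {
    ("help", "v"): ("help", "v"),
    ("helps", "v"): ("help", "v"),
    ("helping", "v"): ("help", "v"),
    ("helped", "v"): ("help", "v"),
    ("help", "n"): ("help", "n"),
}


def lemmatise_counts(form_counts):
    """Sum per-form counts into per-lemma counts, keeping word classes apart."""
    totals = Counter()
    for key, count in form_counts.items():
        # Forms missing from the table are treated as their own lemma.
        totals[LEMMA_OF.get(key, key)] += count
    return totals


# Invented counts, NOT taken from the BNC.
example = {
    ("help", "v"): 10,
    ("helps", "v"): 3,
    ("helping", "v"): 4,
    ("helped", "v"): 5,
    ("help", "n"): 7,
}
totals = lemmatise_counts(example)
print(totals[("help", "v")], totals[("help", "n")])
```

Keying on (form, word class) rather than on the form alone is what
keeps nominal and verbal "help" as separate entries.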

The list is available over the net in directory

ftp://ftp.itri.bton.ac.uk/pub/bnc/

The lemmatised list is called 'lemma' and is available in four forms:
ordered alphabetically or by frequency, and compressed (using gzip) or
uncompressed, so the four files are:

lemma.al (124 KB)
lemma.al.gz (55 KB)
lemma.num (124 KB)
lemma.num.gz (55 KB)
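
For anyone reading the files from a program, the sketch below opens
either form of a file in Python once it has been fetched. The helper
name open_list is made up for this example, and the ASCII encoding is
an assumption, not something the files are documented to guarantee.

```python
import gzip


def open_list(path):
    """Open a list file as text, decompressing when the name ends in .gz."""
    # Assumption: the files are plain ASCII text.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "r", encoding="ascii")
```

With this, open_list("lemma.num") and open_list("lemma.num.gz") can be
used interchangeably in a "with" statement.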

The format for the list is:

sort-order, frequency, word, word-class

and a sample from the top of the alphabetically-ordered list is:

5 2186369 a det
2107 4249 abandon v
5204 1110 abbey n
966 10468 ability n
321 30454 able a

The first line reads: the fifth most common word, with 2,186,369
occurrences in the BNC, is "a" as a determiner.
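
A minimal Python sketch of reading this format follows. It assumes the
four fields are separated by whitespace, as in the sample lines; the
names Entry and parse_line are invented for the example.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    rank: int        # sort-order: position in the frequency ordering
    frequency: int   # occurrences in the BNC
    word: str
    word_class: str


def parse_line(line):
    """Parse one 'sort-order frequency word word-class' line."""
    # Assumption: fields are whitespace-separated, as in the sample.
    rank, frequency, word, word_class = line.split()
    return Entry(int(rank), int(frequency), word, word_class)


# The five sample lines quoted in the announcement.
SAMPLE = """\
5 2186369 a det
2107 4249 abandon v
5204 1110 abbey n
966 10468 ability n
321 30454 able a
"""

entries = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]
# Re-sort the alphabetical sample into frequency order using the rank field.
by_frequency = sorted(entries, key=lambda e: e.rank)
print([e.word for e in by_frequency])
```

Sorting on the rank field should give the same order as the
corresponding lines in the frequency-ordered file.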

The list-creation process replicated that used at Longman for marking
dictionary frequencies in LDOCE 3rd edition, a process described in

Kilgarriff, A. Putting Frequencies in the Dictionary.
International Journal of Lexicography (to appear). Available
electronically (gzipped postscript) as:
ftp://ftp.itri.bton.ac.uk/pub/bnc/ijl.ps.gz

Numbers, names, and items that would usually be capitalised are
excluded. Only simple words (eg containing no spaces) were
considered. The following set of word classes is used:

conj (conjunction) 34 items
adv (adverb) 530
v (verb) 1652
det (determiner) 48
pron (pronoun) 50
interjection 17
a (adjective) 1585
n (noun) 4258
prep (preposition) 75
modal 13
infinitive-marker 1

A word like "right" has four list entries, for adjective, adverb,
interjection and noun. (Just ten words have more than three list
entries.)
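
Words with several entries can be found by grouping entries on the word
and collecting the word classes. The Python sketch below does this for a
few (word, word-class) pairs mentioned in this announcement; the
function name classes_by_word is invented for the example.

```python
from collections import defaultdict


def classes_by_word(pairs):
    """Map each word to the set of word classes it is listed under."""
    grouped = defaultdict(set)
    for word, word_class in pairs:
        grouped[word].add(word_class)
    return grouped


# (word, word-class) pairs mentioned in this announcement.
pairs = [
    ("right", "a"),
    ("right", "adv"),
    ("right", "interjection"),
    ("right", "n"),
    ("help", "n"),
    ("help", "v"),
]
grouped = classes_by_word(pairs)
# Words with more than three list entries.
multi = sorted(word for word, classes in grouped.items() if len(classes) > 3)
print(multi)
```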

Unlike the Longman list, only the BNC was used (so the lists only
reflect British, not American, frequencies); spoken and written
frequencies are not separated; spelling variants are not counted as a
single word; manual checking was less extensive.

The raw lists, from which the lemmatised list was generated (and which
are, consequently, a less theory-dependent form of data), are also
available from the same FTP site: see

ftp://ftp.itri.bton.ac.uk/pub/bnc/README

This file is available as ftp://ftp.itri.bton.ac.uk/pub/bnc/lemma.doc

The work was undertaken under EPSRC grant 'SEAL'.

I'd appreciate it if users of the list acknowledged the preparation
work (as well as the BNC itself) in any publications.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Research Fellow                              tel:   (44) 1273 642919
Information Technology Research Institute           (44) 1273 642900
University of Brighton                       fax:   (44) 1273 642908
Lewes Road
Brighton BN2 4AT                             email: Adam.Kilgarriff@itri.bton.ac.uk
UK                                           http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%