Corpora: Summary: Graded English Vocabulary

Pete Whitelock (pete@sharp.co.uk)
Tue, 13 Apr 1999 15:26:02 +0100

Last week I posted the following query:

> Can anyone point me to public domain lists of English vocabulary items
> graded for English learners. I know of the work on Basic English, and
> Adam Kilgarriff's frequency lists for the BNC, but I'm interested in a
> finer-grained classification for the commonest 3-5,000 words, as well as
> gradings for the commonest multi-word expressions. Any help would be
> greatly appreciated.

Several people expressed an interest in this topic, so below I give the
replies I received that included substantive pointers:
--------------------------------------------------------------------------------------
From: "Antoinette Renouf" <ant@rdues.liv.ac.uk>

In 1984-5, I ran a lexicographic team project at Birmingham (which
worked in parallel with the main Cobuild dictionary project which I had
previously run), in which we analysed corpus data for the 650 commonest
words of the language, our criterion for grading being primarily
frequency (with a bit of utility thrown in). This analysis built the
lexical syllabus for an English language learner's course which was
written by Dave and Jane Willis. It was experimental and did not
entirely work as a teaching tool, because the constraint of
high-frequency words made it hard to make the book interesting. But the
experiment was interesting. The idea was that they had to incorporate
not just the commonest words but the commonest uses of the commonest
words.
An account of this is `The Lexical Syllabus' in the Longman book 1986:
Vocab and Lang Teaching, by Carter and McCarthy, and there is a book by
Dave Willis called something like `The Lexical Syllabus', but I can't
see my copy. Also the course itself, by HarperCollins, which has a tape
with some info about the wordlist on it. A full representation of the
word list and the contextualised commonest uses was never produced, but
I don't know why we didn't; it would have been useful.

--------------------------------------------------------------------------------------

From: "Przemyslaw Kaszubski" <kprzemek@ifa.amu.edu.pl>

Your query very much coincides with my own interests (I posted a
similar question some six months ago). Try David Lee
D.Lee@lancaster.ac.uk (he's got an LDV list which he kindly allowed
me to use). Basic English is also on the Web (forget the URL at the
moment). Still, I'm afraid I don't know any finer grained lists
(recently made my own on the basis of Kilgarriff's BNC lemmatised
list; one could do the same with CELEX, I think). I would dream of an
updated and electronically available version of West's GSL, or any more
pedagogically (rather than merely statistically) motivated material.
A good list of MWUs has long been in demand. Please let
me know of any further and useful feedback you get.
--------------------------------------------------------------------------------------
My colleague Phil Edmonds (phil@sharp.co.uk) pointed me towards the
MRC psycholinguistic database containing a variety of frequency
information:

ftp://ota.ox.ac.uk/pub/ota/public/dicts/1054/readme.

The work I mentioned in my original posting can be found as follows:

Basic English:

http://web.marshallnet.com/~manor/basiceng/

Adam Kilgarriff's frequency lists from the BNC:

http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/bnc-readme.html
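
As Przemyslaw suggests above, a rough finer-grained grading can be derived
from a lemmatised frequency list by cutting the ranked list into bands. The
following is only a minimal sketch of that idea, not anyone's actual method.
It assumes whitespace-separated lines of the form "rank frequency lemma pos";
that layout, the function names, the band size and the toy counts are all
assumptions made for illustration, so check the readme of whichever list you
use before relying on the column positions.

```python
from typing import Iterable, List, Tuple


def parse_frequency_lines(
    lines: Iterable[str], freq_col: int = 1, word_col: int = 2
) -> List[Tuple[str, int]]:
    """Read (lemma, frequency) pairs from whitespace-separated lines.

    The default column positions are an assumption about the file layout.
    Lines that are too short or have a non-numeric frequency are skipped.
    """
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) <= max(freq_col, word_col):
            continue
        try:
            freq = int(fields[freq_col])
        except ValueError:
            continue
        entries.append((fields[word_col], freq))
    return entries


def grade_by_rank(
    entries: List[Tuple[str, int]], band_size: int = 500, max_words: int = 5000
) -> List[Tuple[str, int, int]]:
    """Sort by descending frequency and assign grade 1, 2, 3, ... per band."""
    ranked = sorted(entries, key=lambda e: e[1], reverse=True)[:max_words]
    return [
        (word, freq, index // band_size + 1)
        for index, (word, freq) in enumerate(ranked)
    ]


# Toy data with invented counts, purely to show the shape of the result.
sample = [
    "1 1000 the det",
    "2 800 be v",
    "3 600 of prep",
    "4 500 and conj",
]
graded = grade_by_rank(parse_frequency_lines(sample), band_size=2)
for word, freq, grade in graded:
    print(grade, word, freq)
```

Banding by rank alone is purely statistical; it says nothing about utility
or about the commonest uses of each word, which is the gap the replies
above point to.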

Other useful links I found include the following:

University of Wales at Swansea, Centre for Applied Language Studies
Vocabulary Research Group:
http://www.swan.ac.uk/cals/calsres.htm

Reading University, Dept. of Linguistic Science
Lexicon Research Group
http://www.linguistics.reading.ac.uk/research/lexicon/

KU Leuven, European English Teaching Project
http://onyx.arts.kuleuven.ac.be/~depling/nl/led/zap/mgoe_eet.htm

The last page in particular contains a good list of sources for graded
vocabulary, though not links to all of them.

I didn't find anything on multi-word unit frequencies, I'm afraid. I
know many dictionary publishers have a lot of info of this kind, but
little is publicly available in a form suitable for NLP. You can get a
CD from COBUILD with this sort of info - see links at:

http://titania.cobuild.collins.co.uk/

Pete Whitelock

-- 
E-mail: pete@sharp.co.uk          \ Pete Whitelock
 Internet: http://www.sharp.co.uk  \ Sharp Laboratories of Europe Ltd
  phone: +44 (0)1865 747711         \ Oxford Science Park
   fax: +44 (0)1865 714170           \ Oxford, OX4 4GA, England

The Law of Detail: Nothing is so simple that there is not a stupid way to do it.