Re: Corpora: Size of a representative corpus

Jon Mills (jon.mills@luton.ac.uk)
Thu, 20 Aug 1998 15:27:42 GMT

Tony Berber Sardinha writes:

> 'A representative corpus should include the majority of the types
> in the language as recorded in a comprehensive dictionary.
> Thus:
> (a) assuming that a dictionary entry is analogous to a type;
> (b) dictionary x is comprehensive
> (c) dictionary x has 100,000 entries
> (d) a majority is 1/2 + 1
> A representative corpus would need to have as many tokens
> as necessary to include 50,001 types.'

A dictionary entry more usually corresponds to a lexeme, and
a lexeme may be realised by a number of types (e.g. "run",
"runs", "ran", "running"). One also has to consider how the
dictionary being used treats derivatives (as run-ons or as
separate entries).
There is also a sort of circularity in the notion of
"comprehensive dictionary". Isn't a "comprehensive
dictionary" one that includes entries for the majority
of lexical items found in the corpus?
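
To make the type/lexeme distinction concrete, here is a minimal
sketch in Python. The five-token corpus, the LEMMAS table and the
function names are all invented for illustration; a real study
would need a proper lemmatiser and a decision on how derivatives
are grouped. It counts how many tokens must be read before a
target number of distinct items has been seen, first counting
types and then counting lexemes.

```python
from typing import Callable, Hashable, Iterable, Optional

# Hypothetical toy lemma table: maps a word form (type) to its lexeme.
# A real dictionary or lemmatiser would replace this.
LEMMAS = {
    "dog": "dog", "dogs": "dog",
    "run": "run", "runs": "run", "ran": "run", "running": "run",
}


def lexeme_of(token: str) -> str:
    """Return the lexeme for a token, falling back to the token itself."""
    return LEMMAS.get(token, token)


def tokens_needed(
    tokens: Iterable[str],
    target: int,
    key: Callable[[str], Hashable] = lambda t: t,
) -> Optional[int]:
    """Number of tokens read before `target` distinct keys have been seen.

    With the default key each distinct word form counts (types);
    with key=lexeme_of, forms of the same lexeme count once.
    Returns None if the corpus runs out first.
    """
    seen = set()
    for position, token in enumerate(tokens, start=1):
        seen.add(key(token))
        if len(seen) >= target:
            return position
    return None


# Invented toy corpus, already tokenised and lower-cased.
corpus = "dog dogs runs ran the".split()

types_count = len(set(corpus))
lexemes_count = len({lexeme_of(t) for t in corpus})
```

On this toy corpus, tokens_needed(corpus, 3) is reached after
three tokens when counting types, but tokens_needed(corpus, 3,
key=lexeme_of) needs all five tokens, because "dog"/"dogs" and
"runs"/"ran" each collapse to one lexeme. So a target such as
"50,001 dictionary entries" implies a different corpus size
depending on whether entries are matched to types or to lexemes.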


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jon Mills
Faculty of Humanities, University of Luton,
75 Castle Street, Luton, Bedfordshire, LU1 3AJ, UK
Tel: +44 (0)1582 489025 Fax: +44 (0)1582 489014
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~