Re: Corpora: Size of a representative corpus

Michael Klotz - englische Sprachwissenschaf (Mklotz@phil.uni-erlangen.de)
Fri, 21 Aug 1998 09:23:48 +0100

>Tony Berber Sardinha wrote:

>The question of how large (in tokens) a representative corpus
>must be came up in our classes and one of the possibilities
>we came up with would be to think about this issue as follows:

>'A representative corpus should include the majority of the types in
>the language as recorded in a comprehensive dictionary. Thus, assuming
>that (a) a dictionary entry is analogous to a type, (b) dictionary x
>is comprehensive, (c) dictionary x has 100,000 entries, and (d) a
>majority is 1/2 + 1, a representative corpus would need to have as
>many tokens as necessary to include 50,001 types.'

>Since there are no references to this hypothesis in the literature
>(or are there?), we would like to know people's reactions to it: Would
>this be a proper criterion? What are the possible flaws in the
>argument?

It seems to me that the basic type-unit is not the lemma but what
Cruse calls the lexical unit, i.e. "a lexical form with a single
sense". This is all the more important because different lexical units
that share a lexical form can behave differently, e.g. with regard to
subcategorisation. For example, there is "be friendly to" (i.e.
behave in a friendly way) and "be friendly with" (i.e. be friends
with). In a representative corpus you would want to make sure that
both senses of "friendly" are covered. Once you take meaning into
account, the number of types to be covered, and with it the estimate
of the required corpus size, will of course be much higher.
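
To make the difference concrete, here is a minimal sketch in Python. It
is only an illustration under stated assumptions: the eight-token sample
and its sense labels are invented by hand, and the helper names
majority_threshold and tokens_needed are hypothetical, not taken from
any corpus tool. It counts types once as lexical forms and once as
lexical units (form plus sense), and reports how many tokens have to be
read before every type has been seen.

```python
from typing import Hashable, Iterable, Optional


def majority_threshold(inventory_size: int) -> int:
    """Majority as defined in the quoted proposal: half the inventory plus one."""
    return inventory_size // 2 + 1


def tokens_needed(tokens: Iterable[Hashable], target_types: int) -> Optional[int]:
    """Return how many tokens must be read before `target_types` distinct
    types have been seen, or None if the sample never reaches the target."""
    seen = set()
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        if len(seen) >= target_types:
            return position
    return None


# Toy sample: each token is a (lexical form, sense label) pair.
# The sense labels are hypothetical hand annotations, not real corpus data.
sample = [
    ("be", "copula"),
    ("friendly", "behaving kindly"),  # "be friendly to"
    ("to", "prep"),
    ("go", "move"),
    ("with", "prep"),
    ("be", "copula"),
    ("friendly", "being friends"),    # "be friendly with"
    ("with", "prep"),
]

# The same sample with the sense annotation thrown away.
forms = [form for form, _sense in sample]

form_types = len(set(forms))   # types counted as lexical forms
unit_types = len(set(sample))  # types counted as lexical units

print("threshold for 100,000 entries:", majority_threshold(100_000))
print("form types:", form_types, "covered after", tokens_needed(forms, form_types), "tokens")
print("unit types:", unit_types, "covered after", tokens_needed(sample, unit_types), "tokens")
```

In this toy sample all five forms have occurred after five tokens, but
the sixth lexical unit (the second sense of "friendly") only turns up
at token seven. The same effect on a real dictionary inventory would
raise both the 50,001 threshold and the number of tokens needed to
reach it.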

_________________________________________________
Dr. Michael Klotz (mklotz@phil.uni-erlangen.de)
Institut f. Anglistik und Amerikanistik
Lehrstuhl f. engl. Sprachwissenschaft
Friedrich-Alexander-Universität Erlangen-Nürnberg
Bismarckstr. 1
91054 Erlangen
GERMANY
Tel: 09131-852938 Fax: 09131-859362