Corpora: representativeness

Michael Rundell (ae345@dial.pipex.com)
Fri, 21 Aug 1998 11:31:33 +0100

Michael Klotz writes
>It seems to me that the basic type-unit is not the lemma but what
>Cruse calls the lexical unit, i.e. "a lexical form with a single
>sense".
This is absolutely right - the earlier focus on "number of types" alone seems
way too simplistic. In any case, once you get past the first 10-15K most
common words, frequency statistics become unreliable and vary widely
across corpora of similar size but different content.
Consider a type like "bond": if you have a corpus made up of the Wall Street
Journal, you will have 1000s of instances of bond - but they will *all* be
about government bonds, junk bonds etc. If your corpus is chemical abstracts
you will also have 1000s of bonds - but this time "covalent bonds",
"molecular bonds" etc; similarly if your corpus is legal texts - more bonds,
but just of one specific type.
None of these corpora will have instances of the *other* kinds of bond, and
none will have instances of more metaphorical uses either ("lifetime bonds
of friendship" etc - you might need a fiction corpus to collect more of
those). This is why lexicographers are suspicious of the type of large
corpus (typically news text) that is cheap and easy to collect in volume
but can't give a very balanced picture of the full semantic/grammatical
spectrum. Each of the separate corpora mentioned above is representative -
to a degree - of its own world of discourse, but not of the language as a
whole. Most dictionary people now accept that representativeness of the
whole language isn't a realistic goal, but achieving a reasonable balance
of text-types and registers is still worth aiming for, and for this you have
to have some sort of top-down approach.
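
To make the "bond" point concrete, here is a rough sketch in Python. Everything
in it is an assumption for illustration: the three mini-corpora are invented
sentences (not real corpus data), and left_collocates is a hypothetical helper
that uses the word immediately before "bond" as a crude stand-in for its sense.
The point is only that the raw count of the type can be identical across
corpora while the uses behind that count do not overlap at all.

```python
from collections import Counter
import re

# Invented toy sentences standing in for three single-domain corpora.
# These are NOT real corpus data; they only illustrate the argument.
CORPORA = {
    "finance": [
        "Investors sold government bonds as yields rose.",
        "The fund bought junk bonds last quarter.",
    ],
    "chemistry": [
        "A covalent bond forms between the two atoms.",
        "Heating breaks the molecular bonds in the sample.",
    ],
    "fiction": [
        "They shared a lifelong bond of friendship.",
        "Nothing could break the family bond between them.",
    ],
}


def left_collocates(sentences, node="bond"):
    """Count the word immediately before each occurrence of node or node+'s'."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and keep alphabetic tokens (hyphenated words stay whole).
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())
        for i, token in enumerate(tokens):
            if token in (node, node + "s") and i > 0:
                counts[tokens[i - 1]] += 1
    return counts


# One collocate profile per corpus.
profiles = {name: left_collocates(sents) for name, sents in CORPORA.items()}

for name, profile in profiles.items():
    # Total hits for the type, followed by the collocates behind that total.
    print(name, sum(profile.values()), sorted(profile))
```

On real data you would want proper sense tagging or at least a wider
collocation window, but even this crude profile shows why a bare frequency
figure for a type says little about which lexical units a corpus covers.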

***************************************
Michael Rundell
Dictionary and Corpus Consultant
michael.rundell@dial.pipex.com
(44) 1227 766571
252 Wincheap Canterbury Kent CT1 3TY UK
Lexicography MasterClass:
http://ds.dial.pipex.com/town/lane/ae345/
***************************************