Corpora: Corpus vs collection: dictionaries defended

Lou Burnard (lou.burnard@computing-services.oxford.ac.uk)
Fri, 4 Dec 1998 12:41:38 +0000 (GMT)

This topic is discussed at the start of an excellent book (jointly
written by myself and Guy Aston) from which I quote:

.... [some discussion of different ideas about the gaining of lexical
knowledge omitted]

While only experience can tell us what a word "is understood to mean",
such analytic methods tell us what a word "ought to mean". A modern
dictionary combines the strengths of both methods, by organizing evidence of
usage into an analytic framework of senses.

What, then, does the word `corpus' actually mean? We might do worse
than consider the five distinct senses listed in the second edition of
the Oxford English Dictionary as a starting point (see figure on
preceding page). Of these, two particularly refer to language. The
first is that of "A body or collection of writings or the like; the
whole body of literature on any subject". Thus we may speak of the
`Shakespearean corpus', meaning the entire collection of texts by
Shakespeare. The second is that of "the body of written or spoken
material upon which a linguistic analysis is based". This is the sense
of the word from which the phrase `corpus linguistics' derives, and in
which we use it throughout this book. The two senses can, of course,
overlap — as when, for example, the entire collection of a
particular author's work is subjected to linguistic analysis. But a
key distinction remains. In the words of John Sinclair, the linguist's
corpus is "a collection of pieces of language, selected and ordered
according to explicit linguistic criteria in order to be used as a
sample of the language" (Sinclair 1996). It is an object designed for
the purpose of linguistic analysis, rather than an object defined by
accidents of authorship or history.

As such, corpora can be contrasted with *archives* or *collections*
whose components are unlikely to have been assembled with such goals
in mind (see further Atkins et al 1992). Given this emphasis on
intended function, the composition of a corpus will depend on the
scope of the investigation. It may be chosen to characterize a
particular historical state or a particular variety of a particular
language, or it may be selected to enable comparison of a number of
historical states, varieties or languages. Varieties may be selected
on geographical (for example, British, American, or Indian English),
sociological (for example, by gender, social class, or age group), or
generic bases (for example, written vs. spoken; legal or medical;
technical or popular; private or public correspondence). Generally the
texts to be included in a corpus are defined according to criteria
which are *external* to the texts themselves, relating to the
situation of their production or reception rather than any intrinsic
property they may have. Discovery of such intrinsic properties (if
any) may, indeed, be the purpose of the exercise.

....

[Did I forget to mention the title of this excellent book? It's called
"The BNC Handbook", has ISBN 0-7486-1055-3, and is published by
Edinburgh University Press, at a price I won't mention for fear of
being accused of commercial advertising.]

----------------------------------------------------------------
Lou Burnard http://users.ox.ac.uk/~lou
----------------------------------------------------------------