Corpora: Balancing Acts

Jem Clear (jem@cobuild.collins.co.uk)
Thu, 9 Oct 1997 10:06:00 +0100

Jean Hudson asks the CORPORA list where she can find software that
will compare relative word frequencies from subsections of her corpus
and enable her to test whether the corpus is balanced and
representative of the English language as a whole.

Here again is that hoary old chestnut about the "balance" of one's corpus,
and I can't resist commenting on this topic once more. I have a favourite
analogy for corpus linguistics: it's like studying the sea. The output of
a language like English has much in common with the sea; e.g.
- both are very very large...
- and difficult to define precisely,
- subject to continual flux, currents and influences; never constant,
- part of everyday human and social reality.

Our corpus building is analogous to collecting bucketfuls of sea water
and carrying them back to the lab. It is not physically possible to
take measurements and make observations in vivo about all the aspects
of the sea we are interested in, so we collect samples to study in vitro.

The ideal sampling methodology would take account of **all** the
relevant factors of the population we want to study. But the blunt
truth is that we do not know, nor can we quantify, all the relevant
factors: not for the sea and certainly not for the English language.
I might ask "how much magnesium is there in sea water?" and I might
analyse my bucketfuls and tabulate the parts per million figures for
magnesium found, to answer that question. But if I ask "Are the
measurements I obtained representative of sea water?" I cannot
expect to get a satisfactory answer, since the express purpose of
collecting the bucketfuls in the first place was to attempt to
discover the characteristics (general or particular) of sea water,
and there is no other source of measurement of the magnesium
levels in sea water than the bucketfuls we have in the lab. I may,
of course, compare the magnesium levels from different buckets that
I have in my lab or from buckets that you have in your lab. Or I
may want to go out in a boat and collect one more bucketload to test
and compare. But I will never have the satisfaction of proving that
my bucketfuls of sea water are a facsimile of all sea water.

I hope that this little allegory is clear, though like all allegories
it may simply confuse the reader. When we measure word frequencies in
a corpus we are measuring just one feature of our sample. What sort of
evidence would Jean Hudson like to help her in her task? If Text
Sample A has the word "government" at rank 843 and Text Sample B (same
no. of tokens) has "government" at rank 1286, how does that help us to
balance the corpus? If we knew that the God's-truth, absolute,
definitive rank of the word "government" in the English language were
842 then we might be able to whoop with delight that Text Sample A
seems to be a more representative sample, but unfortunately we don't
have access to the tablets of stone on which the truth is recorded.
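
To make the point concrete, here is a minimal sketch of the kind of
rank comparison described above. The two toy samples, the function
name word_ranks and the alphabetical tie-breaking rule are all
assumptions made for illustration; they do not come from any real
corpus or from any particular software package.

```python
from collections import Counter

def word_ranks(tokens):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so that ties
    # are broken in a deterministic (but arbitrary) way.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

# Two invented samples with the same number of tokens (8 each).
sample_a = "the government said the budget would rise again".split()
sample_b = "the sea is the sea near the government".split()

rank_a = word_ranks(sample_a)["government"]
rank_b = word_ranks(sample_b)["government"]

print(f"Sample A: rank {rank_a}; Sample B: rank {rank_b}")
```

All such a comparison can report is that the two samples differ from
each other. Nothing in the output says which rank, if either, is
closer to the "true" one.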

I wonder a great deal about this balance and representativeness
issue. How did it rise to such prominence as a litmus test for
corpora? The UK dictionary publishers are very much to blame. When the
British National Corpus was put together by OUP, Longman, and Chambers
there was an intensification of the "balance" issue. In 1991, at one
of the OED/Waterloo conferences sponsored by OUP, a debate was
organized with the motion "A corpus should consist of a balanced and
representative selection of texts". Randolph Quirk and Geoff Leech
proposed the motion and John Sinclair and Willem Meijs opposed it. The
motion was defeated. Jean, like many users of the Birmingham/Cobuild
Bank of English corpus, wants to find out how to ensure that a
corpus will be balanced, and she hopes to compare word frequencies,
ranks, correlations, best-fits, smoothed approximations, chi-squares,
log-likelihood, ... but she is pursuing a chimera.

There are so many, many features of the English language that we
know virtually nothing about, and our current corpora are so, so
small. Single word frequencies are sort-of interesting as an
obvious and preliminary investigation for any corpus. But if
you want to look at even simple two-word combinations then it is
already clear that you will need much more data than is currently
readily available for analysis. In my opinion, the best thing Jean
could do is go and fetch another bucketful to slosh into the tank.

Jem Clear, Cobuild Ltd
Westmere, 50 Edgbaston Park Rd,
Birmingham, B15 2RX, UK
phone: +44 (0)121-414-3925
fax: +44 (0)121-414-6203
email: jem@cobuild.collins.co.uk