Re: Corpora: history of corpora

Oliver Mason (oliver@clg.bham.ac.uk)
Tue, 1 Dec 1998 14:29:58 +0000

> > Thesis and I would like to know when the following electronic corpora
> > were compiled:
>
> >"The Oxford Text Archive";
> >"International Computer Archive of Modern English".

I don't want to split hairs or start an ideological flame war, but I
personally wouldn't call those two `electronic corpora'. They're (as
implied by the name) archives, which *contain* (amongst other data)
corpora. A corpus is a special collection of textual material
collected according to a certain set of criteria, like the BNC or the
BoE, or Brown, COLT, FLOB, LOB, whatever. Their compilers all made
decisions about the composition of the data in advance and selected it
accordingly.

Also, they are homogeneous in the way they are stored/accessed. For
the BNC you have got SARA, there's Lookup for the BoE, and CUP probably
have their own special software for their corpus.

Now, correct me if I'm wrong, but does the OTA do the same? Again, I
DON'T want to criticise anything here, it's just a terminological
distinction. I am worried that the term `corpus' gets watered down too
much if it is used in basically the same way as `archive'. An archive is
less focussed on doing things with its data, and mainly concerned with
storage, archival, and retrieval of its elements. If I want an
electronic copy of a certain book I would use the OTA, but for
concordance lines of some word I wouldn't.

Does anybody else agree or disagree?

Oliver

-- 
//\\ computer officer | corpus research | department of english | school of  -
//\\ humanities | university of birmingham | edgbaston | birmingham b15 2tt  -
\\// united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668/\  -
\\// mobile 07050 104504 | http://www-clg.bham.ac.uk | o.mason@bham.ac.uk\/  -