Re: Corpora: Corpus of scientific texts

GCW (williams@ensinfo.univ-nantes.fr)
Mon, 26 Oct 1998 07:47:47 +0100 (MET)

I'll take advantage of the 'reply' function to answer both David Lee and
Adam Kilgarriff.

First, as regards the BNC, I do not doubt its usefulness both as a full
corpus and for subcorpora.

However, all depends on the definition of corpus adopted. If we take that
of EAGLES, then the selection criteria are all-important. If we take that
of the Corpus Encoding Standard, we have a looser definition. Personally,
in 'sublanguage' research I prefer the former. The BNC is clearly a
well-balanced corpus, but we cannot say that, in extracting one chunk, the
balance is maintained in that chunk. I accept that these are individual
texts, but they represent only the publications in a given journal for a
given period. If I take an entire year's 'Lancet', I would not call that a
corpus; it represents only the wide variety of publications within one
journal. The many discourse communities (DCs) involved are busy discussing
the theme in other journals; to represent any one DC it is necessary to
look at all the publications of that DC within a specified genre. It is
sufficient to read the work of Myers (Writing Biology. Univ. of Wisconsin
Press. 1990) to get an idea of the problems of publication and the
stylistic differences involved. My own teaching experience tells me that I
must clearly differentiate, say, biochemistry from molecular biology; it
is not only the lexis that changes.

To come on to Adam Kilgarriff's contribution:

On Fri, 23 Oct 1998, Adam Kilgarriff wrote:

>
> Aren't 'technical scientific corpora' the easiest of all to produce?
> Increasingly, all the material is available online in a manner which
> invites you to download it, for free, direct, without a publisher
> intervening to create copyright problems.

In this case, who controls the input? If you take what happens to be
available on the net, then you have little control over the selection
process. Also, are we talking about 'technical' science in the sense of
technical how-to-do-it manuals, or about learned research papers? The
latter are rarely available on-line for copyright reasons. Some scientists
do put texts on their websites, but this is for self-publicity purposes,
'creating a research space' in the terminology of Swales. You cannot cover
a sublanguage in this way.

> At an average article length
> of, say, 15,000 words, it will only take 55 downloads to get a
> million-word corpus, with as fine-grained a definition of sublanguage

Fine, but biology research articles average 3000-4000 words, so covering a
discourse community in a given genre is likely to be a bigger enterprise.
>
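
As a rough back-of-envelope check on these figures, here is a minimal
sketch. The function name is purely illustrative, and the average article
lengths are simply the ones quoted in this exchange, not measured values.
It suggests roughly 67 articles at 15,000 words each, against 250 to 334
articles at 3000-4000 words each, before any selection criteria are
applied.

```python
import math


def articles_needed(target_words: int, avg_article_words: int) -> int:
    """Smallest number of articles whose combined length reaches the target."""
    return math.ceil(target_words / avg_article_words)


TARGET = 1_000_000  # a million-word corpus

# Average lengths quoted in the thread (assumed, not measured).
for avg in (15_000, 4_000, 3_000):
    print(f"{avg:>6} words/article -> {articles_needed(TARGET, avg)} articles")
```

The count only says how many texts are needed to reach the word total; it
says nothing about whether those texts cover the discourse community.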

Coming back to David Lee's comments, building a specialised corpus is not
reinventing the wheel. If no one gets their hands dirty building highly
specialised corpora along strict selection criteria, then ELRA is quickly
going to find itself short of material. I've nothing against re-using,
but if you are not reinventing the wheel, at least put new tyres on it.

Best wishes

Geoffrey

williams@ensinfo.univ-nantes.fr

COLEX-Centre Ouest Lexique
Faculte des Sciences et des Techniques
2, rue de la HOUSSINIERE
44322 NANTES Cedex 3
France