Re: Corpora: Corpus of scientific texts

Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Fri, 23 Oct 1998 10:27:37 +0100

Aren't 'technical scientific corpora' the easiest of all to produce?
Increasingly, all the material is available online in a manner which
invites you to download it, for free, direct, without a publisher
intervening to create copyright problems.

I'm referring to archives such as CMP-LG and comparable resources; see
http://xxx.lanl.gov (which covers physics, maths and computer science
and gives pointers to lots of other sites). Ha! you say, but
all the data is in PostScript. Not true. Authors are encouraged to
submit (at least for CMP-LG) in LaTeX rather than PostScript; the
truly stupendous CMP-LG software runs LaTeX on the submitted document,
engages in a dialogue with the submitter about any LaTeX errors, and
reports progress. The document is then available
to the world as text or as PostScript. At an average article length
of, say, 15,000 words, it will only take about 67 downloads to get a
million-word corpus, with as fine-grained a definition of sublanguage
as you could wish for (e.g., xxx.lanl.gov has one archive for
Mathematics/Nonlinear sciences/cellular automata and lattice gases).
And the data will be about as clean as you could hope for, LaTeX being
a relatively sensible, all-ASCII formatting language.
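
To make the sizing concrete, here is a rough Python sketch of how one might
count the words in a downloaded LaTeX source and estimate how many articles
a corpus of a given size needs. The function names (strip_latex, count_words,
articles_needed) are made up for illustration and have nothing to do with the
CMP-LG software; the markup stripping is deliberately crude (regular
expressions only) and will miscount on display math, macros with text
arguments, and the like. With the 15,000-word average assumed above,
1,000,000 / 15,000 rounds up to 67 articles.

```python
import math
import re


def strip_latex(source: str) -> str:
    """Crude LaTeX-to-text pass (an assumption, not a real LaTeX parser)."""
    text = re.sub(r"(?<!\\)%.*", "", source)                  # drop comments, keep \%
    text = re.sub(r"\$[^$]*\$", " ", text)                    # drop inline math
    text = re.sub(r"\\(begin|end)\{[^}]*\}", " ", text)       # drop environment markers
    text = re.sub(r"\\[a-zA-Z]+\*?(\[[^\]]*\])?", " ", text)  # drop command names and [options]
    return re.sub(r"[{}~\\]", " ", text)                      # drop leftover markup characters


def count_words(source: str) -> int:
    """Count alphabetic tokens remaining after the crude strip."""
    return len(re.findall(r"[A-Za-z][A-Za-z'-]*", strip_latex(source)))


def articles_needed(target_words: int, avg_words_per_article: int) -> int:
    """Number of articles needed to reach target_words, rounded up."""
    return math.ceil(target_words / avg_words_per_article)


if __name__ == "__main__":
    sample = r"\section{Intro} We study $x^2$ lattice gases. % a comment"
    print(count_words(sample))
    print(articles_needed(1_000_000, 15_000))
```

Running count_words over each downloaded source and summing the results
would replace the guessed average with a measured one.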

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow tel: (44) 1273 642919
Information Technology Research Institute (44) 1273 642900
University of Brighton fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ email: Adam.Kilgarriff@itri.bton.ac.uk
UK http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%