Corpora: Do you REALLY want to help corpus users?

Diana Maria de Sousa Marques Pinto dos Santos (Diana.Santos@informatics.sintef.no)
Thu, 30 Jul 1998 10:57:53 +0200

Dear corpora friends,

I wholly subscribe to Geoffrey Sampson's opinion: one should definitely
learn some programming in order to be able to do corpus linguistics,
precisely because nobody can know _in advance_ what sorts of tests,
measures, and empirical work they will use in the course of their
research.

Having spent some time of my life giving support precisely to linguists
with no programming skills, I can confirm that in 90% of the cases you
need to do more than (or something different from) what a standard program
has to offer, although in many cases the programming is easy. But it is
also almost always the case that people come back to you and say: I thought
that after all what I wanted was instead... or I had not thought about this
problem, but NOW I see... Of course, anyone who has done some work in
empirical linguistics knows that work proceeds this way. And that's why one
should be able to do it without depending on others.

While I am sympathetic to "old" linguists, or people who are already
engaged in heavy jobs (cf. previous mails), I cannot but be astonished to
hear people say that they need more than two years to learn to write a
Perl script! I would say one semester of programming would be enough. I
believe it is a must that all programs in linguistics (or at least in
corpus/empirical linguistics) include one semester of programming. I mean
undergraduate, but of course also graduate. So I disagree with Marco Rocha
about PhD programs: in my view, learning to do the programming he needed
afterwards should have been an integral part of his.

But the reason for my writing this note is different. It is this: why are
Oliver Mason and Ylva Berglund asking about "user needs regarding corpus
processing software"? Hopefully not to engage in reinventing the wheel!

There are several programs out there that took years to write, have been
tested, are in wide use, and perform what are (apparently) the most common
kinds of corpus manipulation.

I myself use CQP, from the IMS Corpus Workbench, at
http://www.ims.uni-stuttgart.de/CorpusToolbox/Features.html,
but there are all sorts of other programs running on diverse platforms and
with varying degrees of "user-friendliness". In addition, there have been
studies evaluating corpus tools, e.g.:

Schulze et al. 1993 "Comparative State-of-the-Art Survey and Assessment
Study of General Interest Corpus-oriented Tools", Deliverable D-1b, DECIDE
M-LAP-Project 93-19. Available from the Xerox publications pages, somewhere
down
http://www.xrce.xerox.com/research/...:

So, if Oliver Mason and Ylva Berglund really want to help users, and want
to know what kinds of things users really need, I suggest they open a Web
site where they solve the problems of such users (with underlying standard
tools, like CQP, MonoConc and some Perl scripting). People would send them
their corpus (or a piece of it) and the problem they want a program to
address. The resulting programs could then be useful to other researchers
as well.

Please do not create yet another big, monolithic piece of software that
will take years to polish and will not help many users, who would still
have to go to their computer-literate friends in order to be able to use
the software!

Diana

**************************************************************************
Diana Santos Computational processing of Portuguese

SINTEF Telecom and Informatics Tel. (direct line) +47 22 06 73 12
Forskningsveien 1 Tel. +47 22 06 73 00
Box 124 Blindern Fax. +47 22 06 73 50
N-0314 Oslo Email: Diana.Santos@informatics.sintef.no
Norway http://www.informatics.sintef.no
**************************************************************************