RE: Corpora: Corpus Linguistics User Needs

Oliver Christ (oli@trados.com)
Thu, 30 Jul 1998 11:35:38 +0200

Hi,

in the general case, I would consider it a waste of skills to have linguists
and lexicographers program their own corpus query tools. It's more efficient to
have this task done (with lots of feedback on requirements etc. from the
prospective users) by decent hackers, perhaps with a CL or corpus linguistics
background. Of course, you can do _something_ with perl or whatever - but when
it comes to processing really large corpora such as the BNC or BoE, working
with perl on the text files is, in my eyes, a waste of time and space (other
aspects are data compression, virtually merged corpora, GUI, dynamic corpus
expansion, efficient client/server access, web-based access to remotely stored
linguistic resources, high-level query language with interpreter, support for
bidirectional and far-east languages, Unicode, support for all those character
sets... - does a linguist want to hack all that?). Or say that you want to add
some information to an existing corpus - e.g. lemmatize the BNC or add
additional parts of speech using a different tagset - do you want to modify the
text files accordingly, messing around with the SGML markup, filling up your
disks with different (textual) versions etc? A professionally developed corpus
processing toolset can handle all these aspects while still providing the
necessary flexibility and efficiency during query processing.
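
To make the point about adding information to an existing corpus a bit more
concrete, here is a minimal sketch in Python. It is an illustration only: the
Corpus class and its methods (add_layer, find, concordance) are made-up names,
not the API of any existing toolset, and a real system would keep compressed
indexes on disk rather than Python lists in memory. The idea shown is that each
annotation (word form, lemma, part of speech) lives in its own layer aligned by
token position, so a new layer can be added without touching the original text.

```python
from collections import defaultdict


class Corpus:
    """Toy token-indexed corpus with parallel annotation layers (hypothetical)."""

    def __init__(self, tokens):
        # The "word" layer holds the original tokens and is never modified.
        self.layers = {"word": list(tokens)}
        self.indexes = {}
        self._build_index("word")

    def _build_index(self, layer):
        # Inverted index: annotation value -> list of token positions.
        index = defaultdict(list)
        for pos, value in enumerate(self.layers[layer]):
            index[value].append(pos)
        self.indexes[layer] = dict(index)

    def add_layer(self, name, values):
        # A new annotation is a separate list aligned with the tokens.
        values = list(values)
        if len(values) != len(self.layers["word"]):
            raise ValueError("layer must have one value per token")
        self.layers[name] = values
        self._build_index(name)

    def find(self, layer, value):
        # Look up positions via the index instead of scanning the text.
        return self.indexes[layer].get(value, [])

    def concordance(self, layer, value, width=2):
        # Show each hit with `width` words of context on either side.
        words = self.layers["word"]
        return [
            " ".join(words[max(0, pos - width): pos + width + 1])
            for pos in self.find(layer, value)
        ]


corpus = Corpus("the cats sat on the mat while a cat slept".split())

# Lemmatize after the fact: the word layer stays exactly as it was.
corpus.add_layer(
    "lemma",
    ["the", "cat", "sit", "on", "the", "mat", "while", "a", "cat", "sleep"],
)

for line in corpus.concordance("lemma", "cat"):
    print(line)
```

A second tagset would simply be another add_layer call, with no second textual
version of the corpus lying around on disk.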

I am sure that designing such a toolset is exactly the goal of Oliver and Ylva.
The more researchers of the "corpus linguistics community" contribute to their
call for design suggestions and requirements (and, perhaps, in the ideal case,
later on contribute to the development), the less need there will be to
customize the tools. Just look at the amazing success of the PSQL (PostGres)
DBMS or the Apache Web Server - perhaps it would also be possible to
concentrate design and development efforts in the corpus linguistics community
to design and build "The Toolset", similarly to PSQL or Apache. I think it's
about time to bundle all the efforts.

It's probably always important that users are able to customize their tools to
accommodate tasks which were not envisaged at the time of design. It's a
good compromise to have a well-designed, modular query and data management
system with GUI and query language, but also with an
API in C or perl or whatever on several levels (and perhaps even an interface
for plug-ins?) so that it can be customized by those researchers who want to do
so and have the programming skills (or the people to do it for them ;-) ).

It's surprising that - until now - the only reaction to Oliver's and Ylva's
call was the question of "do we need or want that" ;-) However, it's a fact
that all of the more or less freely available corpus query systems did indeed
have a certain user group in the research (and commercial) environment and also
had (and still have) a certain success (especially those equipped with a GUI),
which proves that not every corpus linguistics group or individual researcher
is actually willing to implement these tools on their own ;-)

Best,

Oli

TRADOS GmbH * Hacklaenderstr. 17 * D-70184 Stuttgart
TRADOS S.A./N.V. * 303 av de Tervueren * B-1150 Bruxelles
mailto:oli@trados.com * http://www.trados.com