Corpora: Re: CL User Needs, back to the original question

Hans van Halteren (hvh@let.kun.nl)
Wed, 05 Aug 1998 14:52:27 +0200

Yesterday, Oliver attempted to steer the discussion back to the original
question. We (the TOSCA group) would like to support this attempt, both
with reasons and with a (partial) wish list.

First, then, the reasons. We think that the availability of a non-trivial
corpus workbench would be A Good Thing. Obviously, it is quite easy to whip
up a word-count program when you need one, but if your wishes become a bit
more ambitious (see below) you are faced with rather more work. In fact,
building such a non-trivial corpus workbench (whether it is one big system
or a bag of co-operating small ones) is probably beyond the capacity of
any one research group and best done in wider co-operation (and I know there
are groups out there with plans for this, e.g. the group we are part of
ourselves, which is working on the already mentioned COSMAS system). However,
before the workbench can be built, it must be clear what should go into it,
which is exactly what the original post was about. So, if you want to benefit
from a future co-operatively built corpus workbench, send in your need/wish
lists.

On to our own needs and wishes. From the wish lists so far, it appears that
corpus linguistics is still preoccupied with words: what people seem to be
counting are words, or at most collocating words. In a corpus workbench we
would like to see attention to other aspects as well, e.g. wordclass
tagging, syntactic structure, discourse structure and speech signals.
- the system must be able to handle more than one annotation scheme
- more specifically, it should be able to handle multiple levels of
  annotation at the same time
- there should be insightful presentation of all levels of annotation
  (possibly in separate windows)
- search/retrieval actions must be possible both for individual levels and
  for combinations of levels
- it should be possible to switch each level on and off (as not everybody is
  interested in all levels)
- preferably it should also be possible for the user to mark/build new
  structures within the system
- the actual underlying annotation used to encode all these structures
  should be handled by the system (e.g. the user should express his searches
  in terms of linguistic notions rather than in the SGML tags used
  internally)
And, more generally:
- the system should be user-friendly to both inexperienced and experienced
  users
- even with all of this functionality, the system must be open-ended, i.e.
  a user wanting more should be able to access external programming
  facilities
All of this in no way implies that we do not want to hear more about your
wishes on the word level, as those can all be generalized to other
annotation levels.
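
To make the multi-level wishes above a little more concrete, here is a
minimal sketch in Python. It is an illustration only, not a description of
any existing system: the names Span and Corpus, the layer names, the tag
labels (ART, N, NP, VP, ...) and the choice to store each annotation level as
a separate list of token spans are all assumptions made for the example. It
shows several levels held at the same time, levels switched on and off, and a
search that combines two levels without the user seeing the underlying
encoding.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    """A labelled stretch of tokens: start inclusive, end exclusive."""
    start: int
    end: int
    label: str


@dataclass
class Corpus:
    """Hypothetical container: one token list, any number of annotation levels."""
    tokens: list
    layers: dict = field(default_factory=dict)  # level name -> list of Span
    active: set = field(default_factory=set)    # levels currently switched on

    def add_layer(self, name, spans):
        """Register a new annotation level and switch it on."""
        self.layers[name] = list(spans)
        self.active.add(name)

    def switch(self, name, on):
        """Switch a level on or off without deleting its annotation."""
        if on:
            self.active.add(name)
        else:
            self.active.discard(name)

    def find(self, layer, label):
        """Search one level: all spans on `layer` that carry `label`."""
        if layer not in self.active:
            return []
        return [s for s in self.layers.get(layer, []) if s.label == label]

    def find_within(self, inner, outer):
        """Search a combination of levels: `inner` hits lying inside an `outer` hit.

        Both arguments are (layer, label) pairs.
        """
        outer_spans = self.find(*outer)
        return [s for s in self.find(*inner)
                if any(o.start <= s.start and s.end <= o.end for o in outer_spans)]

    def text(self, span):
        """Return the words covered by a span."""
        return " ".join(self.tokens[span.start:span.end])


corpus = Corpus("the old man saw the boat".split())
corpus.add_layer("wordclass", [
    Span(0, 1, "ART"), Span(1, 2, "ADJ"), Span(2, 3, "N"),
    Span(3, 4, "V"), Span(4, 5, "ART"), Span(5, 6, "N"),
])
corpus.add_layer("syntax", [Span(0, 3, "NP"), Span(3, 6, "VP"), Span(4, 6, "NP")])

# Nouns that occur inside a verb phrase: a query over two levels at once.
nouns_in_vp = corpus.find_within(("wordclass", "N"), ("syntax", "VP"))
print([corpus.text(s) for s in nouns_in_vp])
```

A query such as "nouns inside a verb phrase" is phrased here in linguistic
terms; a real workbench would additionally need presentation, editing of new
structures and hooks for external programs, none of which this sketch
attempts.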

All the best,
Hans van Halteren

on behalf of the TOSCA group (tosca@let.kun.nl, to which please reply)