Corpora: Re: Corpus Linguistics User Needs

Oliver Mason (oliver@clg.bham.ac.uk)
Tue, 4 Aug 1998 19:15:44 +0100

While Ylva and I seem to have caused a rather controversial debate on
whether or not corpus linguists should be able to program, this is not
at all what we intended. To get the discussion back on the track we
originally had in mind, let me briefly recap what I think has been said
so far:

As others have already mentioned, there are two basic camps: those who
think corpus linguists should be able to program and those who disagree.
There are of course varying degrees, and I think we all agree that nobody
ought to start from scratch and reinvent the concordance wheel for the
umpteenth time. It does, however, seem necessary to understand what
currently available tools are doing, just as it is necessary to understand
what statistical procedures can be used for before actually applying them.

A lot of minor day-to-day tasks require some basic computer literacy, e.g.
reformatting data files. For this, some knowledge of perl/awk/sed would be
useful.
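
To give an idea of the scale of task meant here, the following is a minimal
sketch of one such reformatting job, written in Python rather than
perl/awk/sed. The input format (tokens written as word_TAG on one line) and
the function name to_vertical are assumptions made for illustration only.

```python
def to_vertical(line, sep="_"):
    """Turn 'the_DT cat_NN' into one 'word<TAB>tag' row per token.

    The word_TAG input format is an assumed example, not a standard.
    """
    rows = []
    for item in line.split():
        word, found, tag = item.rpartition(sep)
        if not found:
            # No separator in this token: keep it as a word with an empty tag.
            word, tag = item, ""
        rows.append(f"{word}\t{tag}")
    return "\n".join(rows)


print(to_vertical("the_DT cat_NN sat_VBD"))
```

Splitting on the last separator means that a word which itself contains an
underscore still keeps its tag intact.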

Now, what we were really after is a list of what other contributors have
called the ``bag of tricks'': the things an ideal corpus toolbox should
contain. This is an attempt to get an idea of what kinds of procedures are
in use at the cutting edge of corpus linguistics.

To seed the discussion, here are some things I can think of off the top of
my head:
- corpus tokenisation/indexing
- part-of-speech tagging
- simple phrase recognition
- concordancing (with possible tags/phrases)
- sorting/filtering of concordance lines
- computation of collocations using different statistical methods
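
As a rough illustration of two items from the list above (concordancing and
sorting of concordance lines), here is a minimal sketch in Python. The
tokenisation rule, the function names and the sample text are all
assumptions for the sake of the example; it does not describe any existing
tool.

```python
import re


def tokenise(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, width=3):
    """Return (left, node, right) triples for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def sort_by_right(lines):
    """Sort concordance lines by the words following the node word."""
    return sorted(lines, key=lambda line: line[2])


# Invented sample text, for illustration only.
text = "The cat sat on the mat. The dog sat on the cat."
tokens = tokenise(text)
for left, node, right in sort_by_right(concordance(tokens, "sat")):
    print(f"{' '.join(left):>20}  {node}  {' '.join(right)}")
```

Keeping the left and right contexts as token lists rather than strings makes
it straightforward to sort or filter on any context position later.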

Some of these are already available in a variety of implementations, but
that does not really matter for the purpose of this discussion.

So, what are YOU using, or what would you like to use if it were available?

Oliver

PS
Maybe people would also be interested in talking about what a corpus
linguistics curriculum should look like, but that would be a different
discussion...

-- 
//\\ computer officer | corpus research | department of english | school of  -
//\\ humanities | university of birmingham | edgbaston | birmingham b15 2tt  -
\\// united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668/\  -
\\// mobile 07050 104504 | http://www-clg.bham.ac.uk | o.mason@bham.ac.uk\/  -