As others have mentioned, there are two basic camps: those who think
corpus linguists should be able to program, and those who disagree.
There are of course varying degrees, and I think we all agree that
nobody ought to start from scratch and reinvent the concordance wheel
for the umpteenth time (it does, however, seem necessary to understand
what currently available tools are doing, just as it is necessary to
understand what statistical procedures can be used for before actually
applying them).
A lot of minor day-to-day tasks require some basic computer literacy,
e.g. reformatting data files. For these, some knowledge of perl/awk/sed
would be useful.
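To illustrate the kind of one-off reformatting task meant here, a small
sketch (in Python rather than perl/awk/sed, but the same job could be a
one-liner in any of those). The word/tag input format and the function
name are made-up examples, not any particular corpus standard:

```python
# Hypothetical task: convert tab-separated "word<TAB>tag" lines into
# the word_TAG style used by some tagged-corpus tools. Format and
# function name are illustrative assumptions only.

def reformat_line(line):
    """Turn 'word\ttag' into 'word_TAG'; pass blank lines through unchanged."""
    line = line.rstrip("\n")
    if not line:
        return line
    word, tag = line.split("\t")
    return f"{word}_{tag.upper()}"

lines = ["the\tdt", "cat\tnn", "sat\tvbd"]
print(" ".join(reformat_line(l) for l in lines))
# prints: the_DT cat_NN sat_VBD
```

In practice this is exactly the sort of five-minute script that basic
computer literacy makes possible, and that no ready-made concordancer
will do for you.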
Now, what we were really after is a list of what other contributors
have called the ``bag of tricks'': some idea of what an ideal
corpus toolbox should contain, and of what kinds of procedures are in
use at the cutting edge of corpus linguistics.
To seed the discussion, here are some things I can think of off the top of
my head:
- corpus tokenisation/indexing
- part-of-speech tagging
- simple phrase recognition
- concordancing (with possible tags/phrases)
- sorting/filtering of concordance lines
- computation of collocations using different statistical methods
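To make a few of these concrete, here is a toy sketch (in Python; any
scripting language would do) of tokenisation, KWIC concordancing and
raw collocation counts. Everything here is deliberately naive: a real
tokeniser needs far more care, and real collocation work would replace
the raw counts with the statistical measures mentioned above:

```python
import re
from collections import Counter

def tokenise(text):
    """Naive tokeniser: lowercase alphabetic runs (a real one needs more care)."""
    return re.findall(r"[A-Za-z']+", text.lower())

def concordance(tokens, node, width=3):
    """Return KWIC lines: `width` tokens of context either side of `node`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append(f"{left} [{node}] {right}")
    return hits

def collocates(tokens, node, window=2):
    """Raw co-occurrence counts within +/- `window` tokens of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

toks = tokenise("the cat sat on the mat and the cat saw the dog")
for line in concordance(toks, "cat"):
    print(line)
print(collocates(toks, "cat"))
```

Sorting and filtering of the concordance lines, tagging, and proper
association measures would all layer on top of primitives like these.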
Some of these are already available in a variety of implementations, but
that does not really matter for the purpose of this discussion.
So, what are YOU using, or what would you like to use if it were available?
Oliver
PS
Maybe people would also be interested in talking about what a corpus
linguistics curriculum should look like, but that would be a different
discussion...
--
computer officer | corpus research | department of english
school of humanities | university of birmingham | edgbaston | birmingham b15 2tt
united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668
mobile 07050 104504 | http://www-clg.bham.ac.uk | o.mason@bham.ac.uk