Re: Corpora: Corpus Linguistics User Needs

Bill Teahan (wjt@cs.waikato.ac.nz)
Thu, 30 Jul 1998 11:44:07 +1200

Coming up with ideas is easy, but sometimes writing specialized software to test
them out is not. I've just spent the last four years of my Ph.D. research doing
literally thousands of experiments on modelling English text (primarily for text
compression). In most cases, the problems could be solved easily using GAWK
(which is very simple to learn and use, much easier than Perl).

However, some ideas are much more difficult to test, and require substantial
effort to code. I'm currently designing an application programming interface
(API) for statistical models which will make it much easier to write your own
code. For example, you will be able to write a program to identify the language
of a text (e.g. whether it is French, English etc.) or even compress it with
only a few lines of code. The API will be based on state-of-the-art compression
modelling techniques, but it could be based on any statistical modelling
methods. I'll also be extending it to include Viterbi-based algorithms, so that
it will be fairly simple to write programs that do spelling correction, OCR text
correction, part-of-speech tagging etc. (Let me know if anyone is interested in
this API, and I can post it to the list for discussion.)
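
To give a rough idea of what "identify the language in a few lines" might look
like, here is a minimal sketch in Python. It is not the API described above,
which has not been posted: the names CharBigramModel and identify_language and
the toy training sentences are all hypothetical, and a plain character bigram
model with add-one smoothing stands in for a real compression model. The text
is assigned to whichever language model would encode it in the fewest bits per
character.

```python
import math
from collections import Counter


class CharBigramModel:
    """Character bigram model with add-one smoothing.

    A deliberately simple stand-in for a compression model; the class
    name and interface are assumptions made for illustration only.
    """

    def __init__(self, training_text, alphabet_size=256):
        self.alphabet_size = alphabet_size
        # Count each adjacent character pair and each preceding character.
        self.pair_counts = Counter(zip(training_text, training_text[1:]))
        self.context_counts = Counter(training_text[:-1])

    def cross_entropy(self, text):
        """Average bits per character pair needed to encode `text`."""
        pairs = list(zip(text, text[1:]))
        if not pairs:
            return 0.0
        total_bits = 0.0
        for prev, cur in pairs:
            # Add-one smoothing so unseen pairs get a non-zero probability.
            p = (self.pair_counts[(prev, cur)] + 1) / (
                self.context_counts[prev] + self.alphabet_size
            )
            total_bits -= math.log2(p)
        return total_bits / len(pairs)


def identify_language(text, models):
    """Return the name of the model that encodes `text` most cheaply."""
    return min(models, key=lambda name: models[name].cross_entropy(text))


# Toy training data (hypothetical); real models would need far more text.
models = {
    "English": CharBigramModel(
        "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
    ),
    "French": CharBigramModel(
        "le renard brun saute par-dessus le chien paresseux et le chat est sur le tapis"
    ),
}

print(identify_language("the dog sat on the mat", models))
print(identify_language("le chat est sur le chien", models))
```

Once the models are trained, the identification step itself is a single call.
Swapping in a stronger model would only require another class with the same
cross_entropy method.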

The same approach could be used to make it easier for linguists to write their
own software, i.e. by designing an API specifically tailored for corpus-based
research. How much interest would there be out there in this? And what functions
would people find useful to put in this API?

Bill Teahan
Department of Computer Science
University of Waikato
Hamilton, New Zealand