Re: Corpora: Corpus Linguistics User Needs

Bill Teahan (wjt@cs.waikato.ac.nz)
Sat, 01 Aug 1998 11:44:13 +1200

A few people have asked about the API for statistical modelling I'm
currently writing. It's still in the development stages, but I can
give a short example to illustrate how it works. (See below. If some
linguist-non-programmer can understand this, let me know, because then
I'll know we are on the right track).

I'm not sure how useful this might be to a linguist however (although
it does give you an idea of what an API designed for linguists might
look like). It would be quite easy to extend this API (or develop a
new one) to encompass routines specifically tailored for analyzing
text corpora - it wouldn't take much work to add routines to return
statistical word-based information, for example (e.g. how many times a
word appears in a text database, how many times it appears with
another word or words, how many times it appears within n words of
another word etc. etc.). Let me know if anyone would be interested in
this and I might consider writing it.
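As a sketch only (none of these routines exist yet, and the names and
signatures are made up purely for illustration), such word-based
extensions might be declared along these lines:

/* Hypothetical word-based extensions -- these do not exist in the
   current API; the names and signatures are illustrative only. */

/* How many times `word' appears in the text database behind `model'. */
unsigned long SMI_word_count (unsigned int model, const char *word);

/* How many times `word1' appears immediately adjacent to `word2'. */
unsigned long SMI_word_pair_count (unsigned int model, const char *word1,
                                   const char *word2);

/* How many times `word1' appears within `n' words of `word2'. */
unsigned long SMI_word_within_count (unsigned int model, const char *word1,
                                     const char *word2, unsigned int n);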

Here's a brief description of the API.

There are three main types of objects:

1) model
   A number associated with a statistical model e.g. one model might
   have been trained on French text, another on English text etc.
2) symbol
   A number associated with a symbol in the alphabet (the API treats
   the symbols as unsigned integers; it doesn't know anything about
   what the symbols stand for i.e. whether they are ASCII characters,
   letters in the English alphabet, hieroglyphics, or whatever).
3) context
   A number associated with the prediction context; this context is
   updated after each symbol has been processed, and is used to make
   a prediction for subsequent symbols.

In the API, there are routines to load a static model (from a file on
disk), create dynamic models, create a context associated with a model,
step through the probability distribution for predicting the next symbol
given the current context etc.
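
To make this more concrete, here is a rough sketch of what the
corresponding declarations might look like. Only SMI_create_context,
SMI_entropy_symbol, SMI_release_context and SMI_numberof_models appear
in the example further down; SMI_load_model and SMI_create_model are
just guesses at what the load/create routines might be called.

/* A sketch only, not the actual header file: models, symbols and
   contexts are all plain unsigned integer handles. */

/* Load a static model from a file on disk; returns a model handle.
   (Hypothetical name.) */
unsigned int SMI_load_model (const char *filename);

/* Create a new dynamic model; returns a model handle.
   (Hypothetical name.) */
unsigned int SMI_create_model (void);

/* Return the number of models currently available. */
unsigned int SMI_numberof_models (void);

/* Create a context associated with a model; returns a context handle. */
unsigned int SMI_create_context (unsigned int model);

/* Return the entropy for predicting `symbol' given the current context,
   and update the context to include the symbol. */
float SMI_entropy_symbol (unsigned int context, unsigned int symbol);

/* Release the context for re-use. */
void SMI_release_context (unsigned int context);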

It's easier to see how this works from a short example, rather than
laboriously describing each routine separately. The following is an
extract of C code that can be used to identify the language of a
string of text. To do this, you first need to train a whole lot of
models using text from different languages. Then to identify the
language, the program chooses the model which best predicts the string
of text (i.e. has the lowest entropy).

Here's the code. To a non-programmer, this may look complicated at first,
but hopefully it should become obvious once plenty of examples are
provided with the API. (In the code below, routines that form part of the
API have been prefixed by "SMI_", which is short for "Statistical
Modelling Interface"):

float
entropy_text (unsigned int model, char *text)
/* Returns the entropy for predicting the text using the model. */
{
    unsigned int context;
    float ent, entropy;
    int p;

    entropy = 0.0;
    context = SMI_create_context (model); /* creates a context associated
                                             with the model */

    /* Now calculate the entropy for predicting each symbol in the text
       based on the current context */
    for (p = 0; text [p]; p++) /* for each symbol */
    {
        /* cast to unsigned char avoids negative symbol values for
           non-ASCII characters */
        ent = SMI_entropy_symbol (context, (unsigned char) text [p]);
        entropy += ent;
    }
    SMI_release_context (context); /* release the context for re-use */

    return (entropy);
}

unsigned int
identify_models (char *text)
/* Returns the model number associated with the model that predicts the
   text best (i.e. has the lowest entropy). */
{
    unsigned int model, min_model, models;
    float entropy, min_ent = 0.0;

    min_model = 1;
    models = SMI_numberof_models (); /* returns the number of models */
    for (model = 1; model <= models; model++)
    {
        entropy = entropy_text (model, text);
        if ((min_ent == 0.0) || (entropy < min_ent))
        {
            min_ent = entropy;
            min_model = model;
        }
    }
    return (min_model);
}
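
For completeness, here is a sketch of how identify_models might be called
from a small driver program. It assumes the models (say, one per language)
have already been loaded or trained before this point; that part is not
shown here.

#include <stdio.h>

int
main (void)
{
    char *sample = "The cat sat on the mat";
    unsigned int best;

    /* Assumes the models have already been loaded or trained. */
    best = identify_models (sample);
    printf ("Model %u predicts the sample text best.\n", best);
    return (0);
}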

Experiments I did for my Ph.D. thesis using PPM text
compression models showed that the language could be identified using
this approach with almost 100% accuracy (it was even able to detect
the difference between samples of American and British English text
with 100% accuracy by training two models on text contained in the
Brown and LOB corpora).

Bill Teahan
Department of Computer Science
University of Waikato
Hamilton, New Zealand