Re: Corpora: Adequate size for a specialist corpora -- again

Ted E. Dunning (ted@hncais.com)
Tue, 7 Sep 1999 15:03:50 -0700 (PDT)

>>>>> "gc" == Gordon and Pam Cain <gpcain@rivernet.com.au> writes:

gc> I am doing research with a very small specialist corpus ...
gc> ~7700 tokens over 7 essays ... ~8500 tokens over 7 essays

Woof. That is small.

gc> 1. Is this sample size adequate for anything?

Yes. Absolutely.

gc> 2. If I focus on high-frequency lexis that is found across
gc> most of the texts (or most of the texts in one of the
gc> subcorpora) can I then use a smaller corpus to get valid
gc> results than for less-frequent items?

That depends very much on the phenomena that you are looking at. The
significance depends on more than just the number of observations. In
general, you should notice effects to do with high-frequency lexical
items more easily than with low-frequency items, but if the difference
in the high-frequency items is subtle and the difference in the
low-frequency items is gross, then you may find the low-frequency
effect first.

gc> 3. At what point should I stop thinking that there may be some
gc> validity in my results?

:-)

The key thing here is to use good statistical technique. Since you
are in exploratory mode, you *really* need to keep a held-out set to
test any hypotheses that you come up with. The combination of small
data sets and a large number of potential features arises commonly in
machine learning and modelling applications, and the experience from
that domain should be applied.

The problem can be approached in several ways:

a) You can do an exploratory analysis and combine the tests for
individual features with an overall test for significance. The idea
here is that if you test 1000 features for significant difference, you
would expect one or more to show up as significant at the p < 0.001
level. If you find one such item, then you have to accept that the
result is unsurprising. On the other hand, if 20/1000 features show
up at that significance level, then something is screwy and you can
guess that most of the features that you found are in fact real. You
still have a noise problem, but you also know that you have found
something.
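
To make the arithmetic concrete, here is a rough sketch in Python.
The counts are the ones from the example above, and treating the
tests as independent is only an approximation, since real features
are correlated:

    from math import comb

    def binomial_tail(n, k, p):
        """Probability of k or more successes out of n trials at rate p."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    n_features = 1000     # features tested
    alpha = 0.001         # per-feature significance level

    # Expected number of spurious hits if nothing at all is going on:
    print(n_features * alpha)                    # -> 1.0

    # Chance of 20 or more hits at that level arising purely by accident:
    print(binomial_tail(n_features, 20, alpha))  # vanishingly small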

This approach is seriously dependent on understanding your models,
features and data idiosyncrasies. Such understanding is rare, and
thus this approach is dangerous. There is also the tremendous
temptation to run this sort of analysis a number of times with
different feature sets without combining all of the results.

b) You can use a straightforward training/test paradigm. You search
for features on the training set and verify them on the test set.
Often, this paradigm is extended to a three-way split where you have
training, test and verify sets. This allows features to be selected
using the training set, validated on the test set, and the overall
predictive value of those features to be evaluated on the verify
set.
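
As a minimal sketch (Python, with made-up essay identifiers standing
in for your 14 texts), the three-way split might look like:

    import random

    essays = [("essay_%02d" % i, "... text ...") for i in range(14)]

    random.seed(0)              # fix the seed so the split is reproducible
    random.shuffle(essays)

    train  = essays[:8]         # hunt for candidate features here
    test   = essays[8:11]       # check the candidates here
    verify = essays[11:]        # score the survivors here, once, at the end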

You may have difficulty using this method effectively with your data
set since it is sooo small. The splitting ratio will be critical to
the quality of your results, especially given the size of your data.

c) A third option is to use multiple training/test splits and combine
the results. This can be done in many different ways. For instance,
you can remove each element from the training set in turn. Or you
might do an 80:20 split a number of times and combine the results.
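
A rough sketch of both variants, assuming the same list of (id, text)
pairs as in the sketch above:

    import random

    def leave_one_out(essays):
        """Yield (training set, held-out essay) pairs, one per essay."""
        for i in range(len(essays)):
            yield essays[:i] + essays[i+1:], essays[i]

    def repeated_splits(essays, n_repeats=10, train_fraction=0.8, seed=0):
        """Yield n_repeats random 80:20 (train, test) splits."""
        rng = random.Random(seed)
        cut = int(len(essays) * train_fraction)
        for _ in range(n_repeats):
            shuffled = list(essays)
            rng.shuffle(shuffled)
            yield shuffled[:cut], shuffled[cut:]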

This general class of techniques is very powerful in many cases, but
there are substantial subtleties in actually applying it and then
understanding the exact significance of the results that you get. This
subtlety arises because the individual splits do not produce
independent results. In many ways, this option is the most desirable
since you get to make more use of the data that you have. In
practice, this sort of technique is widely used, but what often happens
is that you get a good model which is difficult to evaluate properly.

gc> 4. As a last resort, should I collect more data from a
gc> different but similar assignment, and run parallel analyses on
gc> that data too?

This should not be considered a last resort. Without a separate
evaluation on a different assignment, the results that you derive
can't be said to be anything but very specific to the exact test
conditions.

You might try several splits between these two corpora in order to get
an idea of how well a technique is working. For instance, training on
one corpus and testing on the other would be interesting. Training on
half of each corpus and testing on the two remaining halves would also
be good.
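
A quick sketch, with corpus_a and corpus_b standing in for your two
assignments (illustrative names only):

    corpus_a = [("a_%d" % i, "... text ...") for i in range(7)]   # first assignment
    corpus_b = [("b_%d" % i, "... text ...") for i in range(7)]   # second assignment

    # Train on one corpus, test on the other, in both directions:
    cross_corpus_splits = [(corpus_a, corpus_b), (corpus_b, corpus_a)]

    # Or train on half of each corpus, test on the remaining halves:
    ha, hb = len(corpus_a) // 2, len(corpus_b) // 2
    train = corpus_a[:ha] + corpus_b[:hb]
    test  = corpus_a[ha:] + corpus_b[hb:]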

gc> 5. Anyone know any brilliant articles on this topic for a
gc> self-educated corpus fan like me?

I like to plug my article on low-count statistical tests in corpus
analysis (Dunning, CL vol 19 no 1), but it isn't going to help you
much with the cross-validation problems. Dietterich has written
extensively and lucidly on this problem, as have many others.

Key topics for you to investigate include cross-validation, bootstrap
and jack-knife techniques. The machine learning literature is very
helpful in this area, but the nomenclature will be somewhat foreign.
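
As one concrete example of the bootstrap, here is a rough sketch that
resamples essays to put an interval around the difference in the rate
of a single word between your two corpora. The count_word helper and
the (id, text) corpus layout follow the illustrative conventions of
the sketches above; they are not anything standard:

    import random

    def count_word(word, essays):
        """Return (occurrences of word, total tokens) over (id, text) essays."""
        tokens = [t for _, text in essays for t in text.lower().split()]
        return tokens.count(word), len(tokens)

    def bootstrap_rate_difference(word, corpus_a, corpus_b, n_boot=1000, seed=0):
        """Resample essays with replacement; return a crude 95% interval
        for the difference in the word's rate between the two corpora."""
        rng = random.Random(seed)
        diffs = []
        for _ in range(n_boot):
            sample_a = [rng.choice(corpus_a) for _ in corpus_a]
            sample_b = [rng.choice(corpus_b) for _ in corpus_b]
            hits_a, tokens_a = count_word(word, sample_a)
            hits_b, tokens_b = count_word(word, sample_b)
            diffs.append(hits_a / tokens_a - hits_b / tokens_b)
        diffs.sort()
        return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]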

I hope this helps.