Re: Corpora: Statistical significance of tagging differences

Steve Finch (steve.finch@thomson.com)
Thu, 18 Mar 1999 15:49:27 -0500

"Ted E. Dunning" writes:
>
>
>
>>>>>> "sf" == Steve Finch <steve.finch@thomson.com> writes:
>
> sf> ... The problem is that we would like to generalise to other
> sf> corpora of X and Ys from our experiment, but what we are
> sf> actually measuring is a significance due to the compilation of
> sf> our corpus, rather than tagging performance.
>
[snip]
>
>On the other hand, when we are in a very pragmatic and commercial
>mode, then it is very often true that we most want to evaluate
>performance on a very well characterized corpus. For instance, a
>company that routes newswire to readers knows that they will be
>routing very similar newswire in the foreseeable future. Similarly, a
>software vendor knows that their FAQ database is not likely to change
>all that very much over a reasonably short time period. Thus, either
>of these users of NLP technology can compare the performance of
>alternative approaches on their own data with reasonable confidence.
>

Agreed. For a particular corpus for a particular purpose,
within-corpus statistical differences can be important. I would also
argue that, naively applied, such tests can be misleading. For
example, a common way of generating a test set is to take a set of
*documents* and manually tag those, and compare taggers using a
statistical test over *tags*. This introduces not only the
unavoidable (and uncharacterised) dependencies between tags (and
tagging errors) within sentences, but also uncharacterised
dependencies between (tagging errors for) sentences within documents
(due perhaps to a uniform style, reused linguistic constructs, lexical
uniformity and so on). All of these dependencies violate (often
grossly, I believe, but it's an empirical matter) the independence
assumptions to which even nonparametric statistical tests are very
sensitive. Any p-values which standard algorithms calculate
consequently give a *false* sense of scientific validity, even if we
are dealing with a particular corpus for a particular purpose.
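
To make the point concrete, here is a minimal simulation sketch in
Python (the corpus size, disagreement rate and document-level bias are
hypothetical numbers, not measurements): two taggers with identical
average accuracy whose disagreements merely cluster by document, and a
naive per-tag McNemar test which then rejects the "no difference"
hypothesis well above its nominal 5% level.

    import numpy as np

    rng = np.random.default_rng(0)

    N_DOCS, TOKENS_PER_DOC = 50, 400   # hypothetical test-set shape
    P_DISCORDANT = 0.05                # tokens on which the taggers disagree
    DOC_BIAS_SD = 0.15                 # strength of per-document favouritism
    CHI2_CRIT = 3.841                  # chi-square(1 df), alpha = 0.05

    def one_test_set():
        # b: only tagger A right, c: only tagger B right
        b = c = 0
        for _ in range(N_DOCS):
            # Each document leans towards one tagger, but the lean
            # averages out to zero over the test set (H0 is true).
            favour_a = np.clip(0.5 + rng.normal(0.0, DOC_BIAS_SD), 0.0, 1.0)
            discordant = rng.random(TOKENS_PER_DOC) < P_DISCORDANT
            a_wins = rng.random(TOKENS_PER_DOC) < favour_a
            b += int(np.sum(discordant & a_wins))
            c += int(np.sum(discordant & ~a_wins))
        return b, c

    def mcnemar_rejects(b, c):
        if b + c == 0:
            return False
        stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected McNemar
        return stat > CHI2_CRIT

    trials = 2000
    rejections = sum(mcnemar_rejects(*one_test_set()) for _ in range(trials))
    print(f"false-positive rate: {rejections / trials:.3f} (nominal: 0.05)")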

Of course, there are ways to give greater precision to the
significance tests. For example, we could judge only whether a
tagger gets an entire sentence (or clause) right or wrong, and take
care to compose the test set of randomly selected sentences or
clauses. But then scores drop from 95%+ per tag to 10-30% per
sentence: not so nice, and it becomes much harder to obtain
statistically significant differences.
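
Concretely, here is a minimal sketch of such a sentence-level
comparison (the taggers, the gold-standard data structure and the
sample size are all hypothetical assumptions): sample sentences at
random, score each tagger on whole-sentence correctness, and apply an
exact McNemar (sign) test to the discordant sentences.

    import math
    import random

    def exact_mcnemar_p(b, c):
        # Exact (binomial) McNemar: under H0 the discordant sentences
        # favouring A are Binomial(b + c, 0.5).  Two-sided p-value.
        n, k = b + c, min(b, c)
        if n == 0:
            return 1.0
        tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    def compare_on_sentences(gold_sentences, tagger_a, tagger_b,
                             sample_size=500, seed=0):
        """gold_sentences: list of (tokens, gold_tags) pairs;
        tagger_a, tagger_b: callables mapping a token list to a tag list."""
        random.seed(seed)
        sample = random.sample(gold_sentences,
                               min(sample_size, len(gold_sentences)))
        b = c = 0   # b: only A gets the whole sentence right; c: only B
        for tokens, gold in sample:
            a_ok = tagger_a(tokens) == gold
            b_ok = tagger_b(tokens) == gold
            if a_ok and not b_ok:
                b += 1
            elif b_ok and not a_ok:
                c += 1
        return b, c, exact_mcnemar_p(b, c)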

>If somebody does a test and finds that their results are *not*
>statistically significant, then their results are almost certainly
>unimportant. This happens far more often than might be imagined and
>thus tests of significance should always be done on experimental
>results. At the very least, good arguments should be made as to why
>such tests would be superfluous.

Or their test is not powerful enough (e.g., they don't have a large
enough test corpus). Good tests are never superfluous, but a bad test
may be worse than no test at all. I think that engineers should be
somewhat conservative about saying what works better than what, even
if they don't need to be as conservative as scientists. I think that
a vanilla McNemar test applied to per-tag errors from a test set
composed of documents is a bad test for tagger superiority, even for
a particular corpus for a particular purpose.
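
As a rough indication of the power problem, a back-of-the-envelope
sketch (all the numbers plugged in below are hypothetical): McNemar
reduces to a sign test on the discordant sentences, so the usual
sample-size formula for testing a proportion against 0.5 tells us
roughly how many independently sampled sentences are needed before a
non-significant result is actually informative.

    import math

    Z_ALPHA = 1.96   # two-sided alpha = 0.05
    Z_BETA = 0.84    # power = 0.80

    def sentences_needed(p_discordant, share_favouring_a):
        """p_discordant: fraction of sentences where exactly one tagger
        gets the whole sentence right; share_favouring_a: among those,
        the fraction won by tagger A (0.5 means no real difference)."""
        p = share_favouring_a
        delta = abs(p - 0.5)
        # Standard sample-size formula for a test of a proportion vs 0.5.
        n_disc = ((Z_ALPHA * 0.5 + Z_BETA * math.sqrt(p * (1 - p)))
                  / delta) ** 2
        return math.ceil(n_disc / p_discordant)

    # e.g. the taggers disagree (at sentence level) on 20% of sentences,
    # and the "better" one wins 55% of those disagreements:
    print(sentences_needed(0.20, 0.55))   # roughly 4,000 sentences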

Cheers,

Steve

------------------------------------------------------------------------
Steven Finch
Thomson Labs/NLP | steve.finch@thomson.com
1375, Piccard Drive, | +1 301 548 4093 (voice)
Rockville, MD, 20850 | +1 301 527 4080 (pager)
------------------------------------------------------------------------
When you steal from one person, it's called plagiarism;
When you steal from many, it's research. -- Wilson Mizner
------------------------------------------------------------------------