> I was wondering if anyone knows of the appropriate statistical tests which
> could be applied to determine whether the differences in tagging performance
> are statistically significant?
In circumstances like the ones you describe, I usually use McNemar's
chi-squared test to check significance. To compare two taggers, you run
both on the same test data and cross-tabulate, token by token, whether
tagger A was correct against whether tagger B was correct:
                        tagger A
                     Wrong | Correct
           Wrong     n_00  |  n_01
tagger B            ------------------
           Correct   n_10  |  n_11
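Then you just feed this table to some statistics package that knows how to
do McNemar's test (S-Plus does) ;-)
The test only looks at the off-diagonal cells n_01 and n_10, i.e. the tokens
where exactly one of the two taggers is right.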
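
If you have no such package at hand, the procedure can be sketched in a few
lines of Python. This is only a minimal sketch under some assumptions: the
function names mcnemar_counts and mcnemar_test are made up for illustration,
the two taggers are assumed to have tagged the same tokens in the same order,
and the statistic is the common continuity-corrected variant,
(|n_01 - n_10| - 1)^2 / (n_01 + n_10), compared against a chi-squared
distribution with 1 degree of freedom.

```python
import math


def mcnemar_counts(gold, tags_a, tags_b):
    """Cross-tabulate per-token correctness of two taggers.

    Returns n with n[b_correct][a_correct], matching the table layout:
    rows are tagger B (0 = wrong, 1 = correct), columns are tagger A.
    Assumes all three sequences are aligned token by token.
    """
    n = [[0, 0], [0, 0]]
    for g, a, b in zip(gold, tags_a, tags_b):
        n[int(b == g)][int(a == g)] += 1
    return n


def mcnemar_test(n01, n10):
    """Continuity-corrected McNemar statistic and its p-value.

    Only the discordant counts are used: n01 (A correct, B wrong)
    and n10 (A wrong, B correct).
    """
    if n01 + n10 == 0:
        # The taggers never disagree in correctness: no evidence of a difference.
        return 0.0, 1.0
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Upper tail of chi-squared with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # Made-up toy data, purely to show the calling convention.
    gold = ["N", "V", "D", "N"]
    tags_a = ["N", "V", "N", "V"]
    tags_b = ["N", "N", "D", "V"]
    n = mcnemar_counts(gold, tags_a, tags_b)
    print(n, mcnemar_test(n[0][1], n[1][0]))
```

Note that the chi-squared approximation gets unreliable when n_01 + n_10 is
small; in that case an exact binomial test on the two discordant counts is the
usual fallback.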
If you want to know more, read:
Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing
Supervised Classification Learning Algorithms. Neural Computation,
10(7), 1895-1924. Postscript preprint (revised December 30, 1997)
available from: http://www.cs.orst.edu/~tgd/cv/pubs.html
I'm interested to know if people think this is a good test to compare taggers.
I haven't found any other reasonable one.
Regards,
--Jakub
------------------------------------------------------------------------------
Jakub Zavrel, B 330, Tilburg University, POBox 90153, 5000 LE Tilburg, NL
http://ilk.kub.nl/~zavrel/ tel/fax: +31-13-4663163/3110
------------------------------------------------------------------------------