Re: Corpora: Statistical significance of tagging differences

Ted Pedersen (tpederse@falcon.csc.calpoly.edu)
Fri, 19 Mar 1999 09:03:12 -0800 (PST)

> Jim,
>
> > it seems pertinent to note that chi-square is designed for checking
> > arrays with SMALL numbers (say, under about 100 per cell, if memory
> > serves). Furthermore, as Chris indicates too tenderly, with much larger
>
> Can you direct me to a book or article which says chi-square is designed for
> small numbers?
>
> Thanks,
> Paul Rayson.

I've found the following very useful in fighting through the ins and outs
of statistical tests of significance:

@book{ReadC88,
author={Read, T. and Cressie, N.},
title={Goodness-of-Fit Statistics for Discrete Multivariate Data},
year = {1988},
address = {New York, NY},
publisher = {Springer-Verlag}}

They detail the long and storied history of Pearson's test, the log-
likelihood ratio, etc., and discuss the conditions (including the cell-count
issue alluded to above) under which these are likely to be valid (and when
not). They give both formal statistical arguments and helpful rules of
thumb. This is the best treatment of these issues that I've run across, and
I've looked around at least a little bit. Any further pointers would be of
great interest.
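
For anyone who wants to see the two statistics side by side, here is a
minimal sketch in plain Python. It is not taken from the book: the function
names and the example counts are made up for illustration. It computes
Pearson's X^2 and the log-likelihood ratio G^2 for a contingency table, along
with the smallest expected cell count, which is the quantity that the usual
rules of thumb about validity look at (a commonly cited one asks for expected
counts of roughly 5 or more).

```python
import math


def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_x2(table):
    """Pearson's statistic: sum over cells of (O - E)^2 / E."""
    exp = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
    )


def log_likelihood_g2(table):
    """Log-likelihood ratio: 2 * sum over cells of O * ln(O / E).

    Cells with an observed count of zero contribute nothing to the sum.
    """
    exp = expected_counts(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )


def min_expected(table):
    """Smallest expected count, the figure most rules of thumb refer to."""
    return min(min(row) for row in expected_counts(table))


if __name__ == "__main__":
    # Hypothetical counts, e.g. tag A vs. other tags in two corpora.
    counts = [[10, 20], [30, 40]]
    print("X^2          =", pearson_x2(counts))
    print("G^2          =", log_likelihood_g2(counts))
    print("min expected =", min_expected(counts))
```

Both statistics are referred to the same chi-square distribution (one degree
of freedom for a 2x2 table), and how well that approximation holds for a
given table is exactly the question the book addresses.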

Best Regards,
Ted

-- 
# Ted Pedersen                      http://www.csc.calpoly.edu/~tpederse #
# Department of Computer Science                tpederse@csc.calpoly.edu #
# California Polytechnic State University                                #
# San Luis Obispo, CA  93407                              (805) 756-6133 #