Re: Corpora: Statistical significance of tagging differences

Ted E. Dunning (ted@aptex.com)
Tue, 23 Mar 1999 14:20:49 -0800 (PST)

A more principled approach might be to specify that you want to find
effects which are statistically significant *and* which are larger
than some minimum size. I think that this is closer to what is desired.

This can be expressed in several different ways. One fairly natural
way is to use a scaled sum of the chi-squared score and average mutual
information. Another method is to use average mutual information with
an error bound determined by \alpha / (2 N), where \alpha is
the chi-squared cutoff for the desired significance. This uses the
relationship

\chi^2 ~ 2 N MI = generalized log-likelihood ratio
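
For example, at the 5% level with one degree of freedom \alpha = 3.84,
so with N = 10^6 observations the bound on MI would be
3.84 / (2 x 10^6), or roughly 1.9e-6.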

Here MI is *average* mutual information,

MI = H(X) + H(Y) - H(X,Y) = \sum_ij \pi_ij log (\pi_ij / \mu_ij)

where \pi_ij = k_ij / N and \mu_ij = (k_i* / N) (k_*j / N).
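
For concreteness, here is a minimal sketch of these quantities in
plain Python (no external libraries); the counts in the table are
invented purely for illustration, and log is the natural log, which
is the convention under which 2 N MI is the log-likelihood ratio.

import math

def average_mi(table):
    # Average mutual information of a table of counts k_ij.
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, k in enumerate(row):
            if k == 0:
                continue                 # 0 log 0 is taken as 0
            pi = k / n                                  # \pi_ij
            mu = (row_sums[i] / n) * (col_sums[j] / n)  # \mu_ij
            mi += pi * math.log(pi / mu)
    return mi

def pearson_chi2(table):
    # Pearson chi-squared statistic for the same table.
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return sum((k - row_sums[i] * col_sums[j] / n) ** 2
               / (row_sums[i] * col_sums[j] / n)
               for i, row in enumerate(table)
               for j, k in enumerate(row))

table = [[110, 2442], [111, 29114]]    # invented counts
n = sum(sum(row) for row in table)
print(2 * n * average_mi(table))       # generalized log-likelihood ratio
print(pearson_chi2(table))             # asymptotically comparable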

In fact, the value \phi mentioned by Mr. Demetriou is just the square
root of twice the average mutual information, since
\phi^2 = \chi^2 / N ~ 2 MI.
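
Continuing the sketch above, this identity is easy to check
numerically:

phi = math.sqrt(pearson_chi2(table) / n)
print(phi, math.sqrt(2 * average_mi(table)))   # nearly equal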

>>>>> "gd" == George Demetriou <g.demetriou@dcs.shef.ac.uk> writes:

gd> George C. Canavos (1984), Applied Probability and Statistical
gd> Methods, Little, Brown & Co.

...

gd> "However, it can be shown that for extremely large sample
gd> sizes, it is almost certain to reject the null hypothesis
gd> because one would not be able to specify H0 close enough to
gd> the true distribution. Thus the application of chi-square is
gd> questionable when extremely large sample sizes are involved."

...

gd> As a remedy, several statistics books propose the (not widely
gd> used) phi coefficient which compensates for the sample size:

gd> phi=square_root(chi-square/N) (N=sample size)
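
To see that compensation concretely, here is a small check reusing
the functions from the sketch above (counts again invented):
multiplying every cell by 10 multiplies \chi^2 by 10, while \phi is
unchanged because the factor of N divides back out.

small = [[10, 20], [30, 40]]
big   = [[100, 200], [300, 400]]   # same proportions, 10x the data
for t in (small, big):
    n = sum(sum(row) for row in t)
    chi2 = pearson_chi2(t)
    print(chi2, math.sqrt(chi2 / n))   # chi2 grows with N; phi does not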