Re: Chi-Square

Ted Pedersen (pedersen@seas.smu.edu)
Wed, 12 Mar 1997 09:49:07 -0600 (CST)

>
> In order to show correlations of this type on a statistically sound
> basis I thought that the chi-square test might be appropriate. I
> would be grateful for any hints as to the appropriateness of this
> test for my purposes. Any information about literature describing
> the use of the chi-square test in connection with lexical
> distributions would be most welcome, as well.

There are extensive guidelines on the appropriateness of chi-squared
tests in

@book{ReadC88,
author={Read, T. and Cressie, N.},
title={Goodness-of-Fit Statistics for Discrete Multivariate Data},
year = {1988},
address = {New York, NY},
publisher = {Springer-Verlag}}

They make many interesting points, among them that the log-likelihood
ratio G^2 and the chi-squared statistic X^2 are both subject to
breakdowns under different circumstances. It is not always true
that G^2 is more reliable than X^2 or vice versa.

Given that, I'd suggest the following general methodology when deciding
which test to use.

1) Compute values for both X^2 and G^2. If you don't see too much
difference between them then you are "probably" safe using either one.

2) If you do note a substantial difference between X^2 and G^2, do not
automatically assume that G^2 is the valid one.

3) As a tie breaker, perform Fisher's exact test (if the data are
represented in a 2x2 or other small table) or an exact conditional
test (if the data are in a larger table). The value of the exact test
should be the most accurate, since these methods make no asymptotic
assumptions.
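
The three steps above can be sketched for a single 2x2 table as follows.
This is only an illustration under stated assumptions: it uses plain
Python, the function names and the example counts are hypothetical, the
asymptotic p-values assume 1 degree of freedom, and the two-sided Fisher
p-value sums the probabilities of all tables no more likely than the
observed one (one common convention among several). It assumes all row
and column totals are nonzero and is meant for small tables.

```python
from math import comb, erfc, log, sqrt

def expected_counts(table):
    """Expected 2x2 counts under independence (margins assumed nonzero)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

def pearson_x2(table):
    """Pearson's chi-squared statistic X^2."""
    exp = expected_counts(table)
    return sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

def likelihood_g2(table):
    """Log-likelihood ratio statistic G^2; empty cells contribute zero."""
    exp = expected_counts(table)
    return 2.0 * sum(table[i][j] * log(table[i][j] / exp[i][j])
                     for i in range(2) for j in range(2)
                     if table[i][j] > 0)

def chi2_pvalue_1df(stat):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return erfc(sqrt(stat / 2.0))

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small tolerance so ties with the observed table are included.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))

# Hypothetical counts, for illustration only.
table = [[3, 1], [1, 3]]
print("X^2  :", pearson_x2(table), chi2_pvalue_1df(pearson_x2(table)))
print("G^2  :", likelihood_g2(table), chi2_pvalue_1df(likelihood_g2(table)))
print("exact:", fisher_exact_two_sided(table))
```

With counts this small, the asymptotic p-values and the exact p-value
can differ noticeably, which is the situation step 3 is meant to catch.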

Overall, given the choice between the above, I would rely on the value of
either of the exact tests. However, you can also use the exact test to
see which of the tests, G^2 or X^2, came closer to this value. This can give
you some insight into which test is appropriate for your data.
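
As a small sketch of that comparison, assuming you already have the
three p-values in hand: the helper name below is hypothetical, and using
the absolute difference between p-values as the measure of "closer" is
my assumption here, not something prescribed above.

```python
def closer_to_exact(p_x2, p_g2, p_exact):
    """Report which asymptotic p-value lies nearer the exact one.

    Closeness is measured by absolute difference (an assumption).
    """
    d_x2 = abs(p_x2 - p_exact)
    d_g2 = abs(p_g2 - p_exact)
    if d_x2 == d_g2:
        return "tie"
    return "X^2" if d_x2 < d_g2 else "G^2"

# Hypothetical p-values, for illustration only.
print(closer_to_exact(0.04, 0.01, 0.05))
```

Repeating this over a sample of tables from your own data gives a rough
picture of which statistic tracks the exact test better there.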

Using the exact conditional test for identifying interesting bigrams is
discussed further in

@inproceedings{PedersenKB96,
author = {Pedersen, T. and Kayaalp, M. and Bruce, R.},
title = {Significant Lexical Relationships},
booktitle = {Proceedings of the 13th National Conference on
Artificial Intelligence},
address = {Portland, OR},
month = {August},
year = {1996},
pages = {455--460}}

and using Fisher's exact test for the same task is discussed in

@inproceedings{Pedersen96,
author = {Pedersen, T.},
title = {Fishing For Exactness},
booktitle = {Proceedings of the South Central SAS User's Group
(SCSUG-96) Conference},
year = {1996},
pages = {188--200},
month = {October},
address = {Austin, TX}}

Both are available at http://www.seas.smu.edu/~pedersen/

Good luck!
Ted

-- 
* Ted Pedersen                     pedersen@seas.smu.edu              * 
*                                  http://www.seas.smu.edu/~pedersen/ *
* Department of Computer Science and Engineering,                     *
* Southern Methodist University, Dallas, TX 75275      (214) 768-3712 *