Re: [Corpora-List] Chi-Square

From: Jin-Dong Kim (jdkim@is.s.u-tokyo.ac.jp)
Date: Sun Sep 17 2006 - 17:16:27 MET DST

    One reason not to use chi-square for text processing is its
    requirement that each event be observed at least five times
    to give reliable statistics, which is not always the case in
    text processing.
    Dunning's log-likelihood ratio is a kind of approximation of chi-square
    that is known to perform reasonably well for infrequently observed events.
    It is also known to approach chi-square when each event is observed
    frequently enough.
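
    To make the comparison concrete, here is a minimal sketch that computes
    both statistics for a 2x2 contingency table (for example, a word pair
    counted with and without each word). The function names and the example
    counts are made up for illustration; the formulas are the standard
    Pearson chi-square and the log-likelihood ratio statistic (often written
    G2), both using expected counts under independence.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]


def pearson_chi_square(table):
    """Pearson chi-square: sum of (O - E)^2 / E over the four cells."""
    exp = expected_counts(table)
    return sum(
        (table[i][j] - exp[i][j]) ** 2 / exp[i][j]
        for i in range(2)
        for j in range(2)
    )


def log_likelihood_ratio(table):
    """Log-likelihood ratio G2: 2 * sum of O * ln(O / E).

    Cells with an observed count of zero contribute nothing to the sum.
    """
    exp = expected_counts(table)
    return 2.0 * sum(
        table[i][j] * math.log(table[i][j] / exp[i][j])
        for i in range(2)
        for j in range(2)
        if table[i][j] > 0
    )


# Hypothetical counts: a rare pair seen once in 100,000 observations,
# and a frequent pair with a mild association.
rare = [[1, 9], [9, 99981]]
frequent = [[110, 90], [90, 110]]

for name, table in (("rare", rare), ("frequent", frequent)):
    print(name, pearson_chi_square(table), log_likelihood_ratio(table))
```

    With the rare-event counts, chi-square comes out far larger than the
    log-likelihood ratio, while for the frequent counts the two values are
    nearly the same.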

    Regards,

    Jin-Dong

    On 9/17/06, Marco Baroni <baroni@sslmit.unibo.it> wrote:
    > You can see the comparison of chi-square and log-likelihood ratio in this
    > famous paper, which I think was very influential in giving the chi-square
    > test a bad name:
    >
    > T. Dunning, "Accurate Methods for the Statistics of Surprise and
    > Coincidence," Computational Linguistics 19(1), 1993.
    > http://citeseer.ist.psu.edu/dunning93accurate.html
    >
    > The paper is quite mathematical, but the basic idea and the empirical
    > comparison part should be quite clear... (although the alternative to
    > chi-square should be something like the log-likelihood ratio test, not MI,
    > that has the same problem of overestimation of the significance of the
    > co-occurrence of rare words that the chi-square test has...)
    >
    >
    > Regards,
    >
    > Marco
    >
    >