Re: [Corpora-List] Chi-Square

From: ted pedersen (tpederse@d.umn.edu)
Date: Sun Sep 17 2006 - 17:44:36 MET DST


    On Mon, 18 Sep 2006, Jin-Dong Kim wrote:

    > One of the reasons for not using chi-square for text processing would
    > be its requirement that each event has to be observed at least five
    > times to get reliable statistics, which is not always the case in
    > text processing.
    > Dunning's log-likelihood is a kind of approximation of chi-square which
    > is known to perform reasonably well for infrequently observed events.
    > It is also known to approach chi-square when each event is observed
    > frequently enough.
    >
    > Regards,
    >
    > Jin-Dong
    >

    Greetings collocationalists,

    Just to elaborate a little, log-likelihood also has the "requirement"
    that each event be observed 5 times, although there are other requirements
    that both must adhere to as well (like the distribution of counts should
    not be too skewed, etc.). Of course we typically violate these with
    reckless abandon in NLP. :)
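
    To make the point about small counts concrete, here is a minimal
    sketch (not taken from any particular package) that computes Pearson's
    X^2 and the log-likelihood ratio G^2 for a 2x2 contingency table and
    flags cells that fall below the rule of thumb. It assumes the "5 times"
    guideline is applied to the expected cell counts under independence,
    which is how it is commonly stated; the function names and the two
    example tables are hypothetical, chosen only for illustration.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence for a 2-D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_x2(table):
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    exp = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for row_o, row_e in zip(table, exp)
        for o, e in zip(row_o, row_e)
    )


def log_likelihood_g2(table):
    """Log-likelihood ratio statistic: 2 * sum of O * ln(O / E).

    Cells with an observed count of zero contribute nothing to the sum.
    """
    exp = expected_counts(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for row_o, row_e in zip(table, exp)
        for o, e in zip(row_o, row_e)
        if o > 0
    )


def min_expected(table):
    """Smallest expected cell count, to compare against the rule of thumb."""
    return min(e for row in expected_counts(table) for e in row)


# Hypothetical bigram tables, laid out as
# [[n(w1, w2), n(w1, not w2)], [n(not w1, w2), n(not w1, not w2)]].
frequent_pair = [[110, 890], [890, 8110]]  # every expected count is large
rare_pair = [[3, 1], [7, 9989]]            # one expected count is far below 5

for name, table in [("frequent", frequent_pair), ("rare", rare_pair)]:
    print(
        name,
        "min expected = %.3f" % min_expected(table),
        "X2 = %.3f" % pearson_x2(table),
        "G2 = %.3f" % log_likelihood_g2(table),
    )
```

    On the first table the two statistics nearly coincide; on the second
    they differ by well over an order of magnitude. That disagreement is a
    sign that the usual chi-squared reference distribution should not be
    trusted for either of them, not that one of them has been rescued.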

    Chi-squared and log-likelihood are quite closely related (members of the
    same family of tests) so when one works reasonably well the other probably
    does too, and when one is unreliable the other might be too. Some of this
    is summarized in an earlier note to this list, and in fact some of the
    preceding and following messages are also quite relevant:

    http://torvald.aksis.uib.no/corpora/1997-1/0160.html

    BTW, there is a URL mentioned in that note that no longer exists;
    it has been replaced by http://www.d.umn.edu/~tpederse/pubs.html should
    that seem relevant.

    I strongly encourage anyone interested in these issues to look carefully
    at Read and Cressie (1988), which is cited more fully in the note above.
    Among other things, this lays out the history of the log-likelihood
    ratio and the Chi-squared test, and actually tells a rather dramatic
    story of how they have been in competition since the 1920s or so!

    I think Read and Cressie are in some ways trying to mend the rift between
    the two measures, and show that rather than these measures being enemies
    they are in fact members of the same family, and you can tell a lot about
    one by looking at the other. Anyway, it's a nice book, highly recommended
    both for the technical content and the historical perspective it provides.
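
    For readers who want to see the family relationship written out, the
    power-divergence statistic as it is usually presented looks like the
    following. The notation (O_i for observed counts, E_i for expected
    counts, lambda for the family parameter) is an assumption made here
    for illustration and may differ from the book's own.

```latex
% Power-divergence family, indexed by a real parameter \lambda
2nI^{\lambda} = \frac{2}{\lambda(\lambda+1)}
                \sum_{i} O_i \left[ \left( \frac{O_i}{E_i} \right)^{\lambda} - 1 \right]

% \lambda = 1 gives Pearson's chi-squared
% (using \sum_i O_i = \sum_i E_i = n):
X^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}

% The limit \lambda \to 0 gives the log-likelihood ratio:
G^2 = 2 \sum_{i} O_i \ln \frac{O_i}{E_i}
```

    Seen this way, the two statistics are two settings of a single
    parameter rather than rival tests.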

    Cordially,
    Ted

    --
    Ted Pedersen
    http://www.d.umn.edu/~tpederse
    


