RE: [Corpora-List] Chi-Square

From: Adam Kilgarriff (adam@lexmasterclass.com)
Date: Sun Sep 17 2006 - 23:33:44 MET DST


    Crayton,

    I've had a go at explaining just this to non-mathematicians in a recent
    paper called "Language is never ever ever random", see
    http://www.kilgarriff.co.uk/publications.htm

    Here's the core reason (taken from the abstract):

    Language users never choose words randomly, and language is essentially
    non-random. Statistical hypothesis testing [e.g. chi-square] uses a null
    hypothesis, which posits randomness. Hence, when we look at linguistic
    phenomena in corpora,
    the null hypothesis will never be true. Moreover, where there is enough
    data, we shall (almost) always be able to establish that it is not true. In
    corpus studies, we frequently do have enough data, so the fact that a
    relation between two phenomena is demonstrably non-random, does not support
    the inference that it is not arbitrary.
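
    To make the "enough data" point concrete, here is a minimal sketch in
    Python. It is not taken from the paper: the function name, the variable
    names and the counts are all invented for illustration. It computes the
    Pearson chi-square statistic (no continuity correction) for a 2x2
    co-occurrence table, then multiplies every cell by the same factor, so
    the proportions stay identical and only the corpus size changes.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Critical value of chi-square with 1 degree of freedom at the 0.05 level.
CRITICAL_05 = 3.841

# Hypothetical counts for a word pair (w1, w2), invented for illustration:
#   a = w1 with w2, b = w1 without w2, c = w2 without w1, d = neither.
# Independence would predict 10 co-occurrences here; we "observe" 11.
base = (11, 989, 989, 98011)

for k in (1, 10, 100):
    scaled = tuple(k * cell for cell in base)  # same proportions, k times the data
    stat = chi_square_2x2(*scaled)
    verdict = "reject" if stat > CRITICAL_05 else "do not reject"
    print(f"corpus x{k:<3}  chi-square = {stat:8.3f}  -> {verdict} the null")
```

    Because every cell is multiplied by k, the statistic is multiplied by k
    as well. The same slight departure from independence that is nowhere
    near significance at the original size therefore crosses the 0.05
    threshold once the corpus is a hundred times larger, although nothing
    about the strength of the association has changed.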

    Adam

    Crayton Walker wrote:

    > A simple question about statistical measures.
    >
    > Could someone explain in very simple terms why we don't normally use
    > Chi-square as a measure of collocational significance? We tend to use
    > t-score and MI and not Chi-square. Why not? I am not a mathematician
    > so would appreciate it if you could keep it simple.
    >
    > Many thanks
    >
    > Crayton Walker
    >
    > University of Birmingham
    >


