[Corpora-List] chi square

From: Tina Waldman (wald@macam.ac.il)
Date: Tue Feb 13 2007 - 21:19:09 MET


    Dear all,

    I was asked to post the responses I received to my request about comparing corpora using chi square.

    I want to thank Professor Butler and Gaetanelle Guilquin whose responses are posted below.

    You have another problem, which is that chi-square should be used only on RAW frequencies, not on normalised data. One way of getting around your problems might be to take the raw data and calculate the values in the cells of the following 2 x 2 table:

                                      Corpus A    Corpus B

    Number of running words              N1          N2
    involved in collocations

    Number of running words              N3          N4
    not involved in collocations

    Then:
    (N1 + N2) will be the total number of running words involved in collocations in the two corpora
    (N3 + N4) will be the total number of running words not involved in collocations in the two corpora
    (N1 + N3) will be the total number of running words in corpus A
    (N2 + N4) will be the total number of running words in corpus B
    (N1 + N2 + N3 + N4) will be the total number of running words in both corpora taken together
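
    As a small illustration of the totals listed above, here is a sketch in Python. The function name and the example counts are made up for illustration; they are not from any real corpus.

```python
def marginal_totals(n1, n2, n3, n4):
    """Return the row, column and grand totals of the 2 x 2 table.

    n1, n2: running words involved in collocations (corpus A, corpus B)
    n3, n4: running words not involved in collocations (corpus A, corpus B)
    All four values must be RAW counts, not normalised frequencies.
    """
    return {
        "in_collocations": n1 + n2,      # N1 + N2
        "not_in_collocations": n3 + n4,  # N3 + N4
        "corpus_a": n1 + n3,             # N1 + N3
        "corpus_b": n2 + n4,             # N2 + N4
        "grand_total": n1 + n2 + n3 + n4,
    }


# Hypothetical raw counts, chosen only to show the arithmetic.
totals = marginal_totals(1200, 900, 48800, 49100)
print(totals)
```

    A quick sanity check is that the two row totals and the two column totals each add up to the grand total.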

    You then calculate chi-square on the 2 x 2 table, remembering that strictly speaking Yates' correction is needed for such tables, though it is more important where frequencies are small, and so may make little difference in your case.
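
    For anyone who prefers to compute this directly rather than use a web calculator, the following is a minimal sketch in plain Python of chi-square on a 2 x 2 table, with Yates' correction as an option. The function name and the example counts are hypothetical, and this is only one way of writing the calculation.

```python
def chi_square_2x2(n1, n2, n3, n4, yates=True):
    """Chi-square statistic for the 2 x 2 table [[n1, n2], [n3, n4]].

    The inputs must be raw frequencies. With yates=True, 0.5 is
    subtracted from each |observed - expected| before squaring
    (Yates' continuity correction).
    """
    observed = [[n1, n2], [n3, n4]]
    row_totals = [n1 + n2, n3 + n4]
    col_totals = [n1 + n3, n2 + n4]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            diff = abs(observed[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    return chi2


# Hypothetical raw counts, for illustration only.
corrected = chi_square_2x2(1200, 900, 48800, 49100, yates=True)
uncorrected = chi_square_2x2(1200, 900, 48800, 49100, yates=False)
print(corrected, uncorrected)
```

    The table has one degree of freedom, so the result is compared against 3.84 for significance at the 0.05 level. With counts this large the corrected and uncorrected values differ only slightly, which is the point made above about Yates' correction mattering mainly for small frequencies.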

    You can calculate chi-square with one of the following online calculators:

    http://www.georgetown.edu/faculty/ballc/webtools/web_chi.html

    http://www.psych.ku.edu/preacher/chisq/chisq.htm


