Re: [Corpora-List] calculation problem

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Thu Oct 20 2005 - 16:07:14 MET DST

    Dear Helene,

    Your choice is based on a reasonable "point" estimate of the "true"
    proportion of occurrences of the expression in the population (known as the
    maximum likelihood estimate).
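
    As a minimal sketch of that point estimate in R (the variable names are
    only illustrative, and the counts are the ones from your question):

```r
# Counts taken from the question: 500 hits in a 5-million-word corpus.
k <- 500        # occurrences observed
n <- 5000000    # corpus size in words

# The maximum likelihood estimate of the proportion is simply k / n.
p_hat <- k / n

# Expected number of occurrences in a 1-million-word corpus.
expected_1m <- p_hat * 1000000
print(expected_1m)
```

    This is just your own calculation (500 / 5 = 100) written out as a
    proportion scaled to the size of the smaller corpus.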

    You could also obtain a range of "plausible" values for the estimate by
    running a binomial test (available in most statistical packages) with
    parameters 500 for k (successes) and 5000000 for N (trials). You would then
    get a confidence interval (typically, by default, the 95% confidence
    interval) for the plausible values that the proportion can have in the
    population.

    Multiplying these proportions by 1M would give you a range of plausible
    frequencies of occurrences in the smaller corpus.

    Concretely, using the statistical package R (http://www.r-project.org/):

    > binom.test(500,5000000)

            Exact binomial test

    data: 500 and 5e+06
    number of successes = 500, number of trials = 5e+06, p-value < 2.2e-16
    alternative hypothesis: true probability of success is not equal to 0.5
    (ignore this line and the p-value above: they refer to a test against
    the default null value of 0.5, which is irrelevant here)
    95 percent confidence interval:
      0.0000914261 0.0001091614
    sample estimates:
    probability of success
                      1e-04

    > 0.0000914261*1000000
    [1] 91.4261

    > 0.0001091614*1000000
    [1] 109.1614

    Thus, you could say that you are 95% confident that the expected
    frequency in the smaller corpus lies between approximately 91 and 109.
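
    If you need to repeat this for several expressions, the steps above can
    be wrapped in a small function. This is only a sketch: scaled_interval
    and its argument names are my own invention, not part of R; the only
    real R function involved is binom.test, whose result stores the
    interval in its conf.int component.

```r
# Hypothetical helper: scale the exact binomial confidence interval
# from a corpus of n words to a corpus of target_size words.
scaled_interval <- function(k, n, target_size, conf_level = 0.95) {
  test <- binom.test(k, n, conf.level = conf_level)
  ci <- as.numeric(test$conf.int)   # drop the conf.level attribute
  # Lower bound, point estimate and upper bound, all as frequencies
  # in a corpus of target_size words.
  c(lower = ci[1], estimate = k / n, upper = ci[2]) * target_size
}

res <- scaled_interval(500, 5000000, 1000000)
print(round(res, 1))
```

    With the counts from your question this should reproduce the range of
    roughly 91 to 109 obtained by hand above.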

    Of course, in all cases you have to assume that the two corpora can be seen
    as random samples from the same population, which is almost never the
    case, but there can be more or less serious violations of the assumption.

    Hth,

    Marco

    STENGERS, Helene wrote:
    >
    >
    > Hello dear list members,
    >
    >
    > I have an arithmetic question. If a particular expression occurs let's
    > say 500 times in a 5 million word corpus, can I assume that there will
    > be 100 of these expressions in a one-million-word corpus or is there a
    > statistical (probability) formula which I should apply?
    >
    > Cheers,
    >
    > Helene Stengers
    >
    >

    -- 
    Marco Baroni
    SSLMIT, University of Bologna
    http://sslmit.unibo.it/~baroni
    


