Re: [Corpora-List] calculation problem

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Thu Oct 20 2005 - 16:07:14 MET DST

Next message: Timothy Baldwin: "[Corpora-List] COLING/ACL 2006: First Call for Workshop Proposals"

Previous message: Alexander Osherenko: "Re: [Corpora-List] calculation problem"
In reply to: STENGERS, Helene: "[Corpora-List] calculation problem"
Next in thread: radev@umich.edu: "Re: [Corpora-List] calculation problem"

Dear Helene,

Your choice is based on a reasonable "point" estimate of the "true"
proportion of occurrences of the expression in the population: the observed
proportion, 500/5000000 = 0.0001 (known as the maximum likelihood estimate).
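
As a minimal sketch of that point estimate in R (the variable names are
mine, chosen only for illustration; the counts come from your question):

```r
# Counts from the original question; variable names are illustrative.
k <- 500            # occurrences observed in the larger corpus
N <- 5000000        # size of the larger corpus, in words
n_small <- 1000000  # size of the smaller corpus, in words

p_hat <- k / N               # maximum likelihood estimate of the proportion
expected <- p_hat * n_small  # point estimate of the count in the smaller corpus

print(p_hat)
print(expected)
```

This is just the proportional scaling you already did by hand; it says
nothing yet about how uncertain the estimate is.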

You could also obtain a range of "plausible" values for the proportion by
running a binomial test (available in most statistical packages) with
parameters 500 for k (successes) and 5000000 for N (trials). You would then
get a confidence interval (typically, by default, the 95% confidence
interval) for the proportion in the population.

Multiplying these proportions by 1M would give you a range of plausible
frequencies of occurrences in the smaller corpus.

Concretely, using the statistical package R (http://www.r-project.org/):

> binom.test(500,5000000)

Exact binomial test

data: 500 and 5e+06
number of successes = 500, number of trials = 5e+06, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
  <- (ignore this line and the p-value above)
95 percent confidence interval:
0.0000914261 0.0001091614
sample estimates:
probability of success
1e-04

> 0.0000914261*1000000
[1] 91.4261

> 0.0001091614*1000000
[1] 109.1614

Thus, you could say that you are 95% confident that the value in the
smaller corpus ranges between approximately 91 and 109.
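
If you would rather not copy the bounds by hand, the following sketch pulls
them out of the test result and rescales them in one go (the names res, ci
and scaled are just illustrative):

```r
# Same test as above; the result object stores the interval in $conf.int.
res <- binom.test(500, 5000000)
ci <- as.numeric(res$conf.int)  # lower and upper bound of the 95% interval

# Scale both bounds to a corpus of 1M words.
scaled <- ci * 1000000
print(round(scaled, 1))
```

A different confidence level can be requested through the conf.level
argument of binom.test, e.g. conf.level = 0.99 for a wider interval.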

Of course, in all cases you have to assume that the two corpora can be seen
as random samples from the same population. This is almost never strictly
the case, but the assumption can be violated more or less seriously.
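
To get a feeling for how much counts vary through random sampling alone,
even when the assumption holds exactly, here is a rough sketch. It treats
the point estimate 0.0001 as if it were the true proportion (an assumption
made purely for illustration) and asks between which counts the middle 95%
of random 1M-word samples would fall. Note that this is a different quantity
from the confidence interval above, and real corpora may well vary more.

```r
# Illustrative assumption: the true proportion equals the point estimate.
p_assumed <- 0.0001
n_small <- 1000000  # size of the smaller corpus, in words

# 2.5% and 97.5% quantiles of the binomial count in a random 1M-word sample.
count_range <- qbinom(c(0.025, 0.975), size = n_small, prob = p_assumed)
print(count_range)
```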

Hth,

Marco

STENGERS, Helene wrote:
>
>
> Hello dear list members,
>
>
> I have an arithmetic question. If a particular expression occurs let's
> say 500 times in a 5 million word corpus, can I assume that there will
> be 100 of these expressions in a one million corpus or is there a
> statistical (probability)formula which I should apply?
>
> Cheers,
>
> Helene Stengers
>
>

-- 
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni

This archive was generated by hypermail 2b29 : Thu Oct 20 2005 - 16:37:35 MET DST