Re: [Corpora-List] calculation problem

From: radev@umich.edu
Date: Thu Oct 20 2005 - 18:43:13 MET DST

Next message: Alexander Osherenko: "Re: [Corpora-List] calculation problem"

Previous message: Juan Huerta: "Re: [Corpora-List] calculation problem"
In reply to: STENGERS, Helene: "[Corpora-List] calculation problem"

Here is the basic idea: let p_i be a parameter of your model giving the
probability that a token drawn from the underlying distribution is the
word w_i. The likelihood of your observation, P(data|p_i), where the
data is 500 occurrences out of 5 million, is then a function of p_i.
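
As a rough illustration of the likelihood as a function of p_i, here is
a minimal Python sketch. It assumes a binomial model for the count of a
single word; the function name and the candidate values are only
illustrative.

```python
import math

def binomial_log_likelihood(p, k=500, n=5_000_000):
    """Log of P(data | p) for k occurrences in n tokens (binomial model)."""
    if not 0.0 < p < 1.0:
        return float("-inf")
    # log of the binomial coefficient C(n, k), via log-gamma to avoid overflow
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

# A few hypothetical candidate values of p_i to compare.
candidates = [0.00005, 0.00009, 0.0001, 0.00011, 0.0002]
for p in candidates:
    print(f"p = {p:.5f}  log-likelihood = {binomial_log_likelihood(p):.2f}")
```

Among these candidates the log-likelihood is largest at p = 0.0001 and
falls off on either side.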

Different values of p_i could have generated the data that you
observed. You need to compute the probability of the data given each
possible value of p_i and combine it with a prior over p_i. You will
therefore obtain a probability distribution p(p_i|data) for p_i over
the interval [0..1]. To get the distribution of the number of
occurrences p' of w_i in the new corpus, you will have to integrate
pmf = p(p'|p_i) p(p_i|data) over p_i from 0 to 1.
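
One way to carry out this integration is sketched below in Python. The
sketch assumes a binomial model with a uniform prior on p_i, in which
case the posterior is Beta(k+1, n-k+1) and the integral has a closed
form (the beta-binomial distribution). The function names and the
cutoff of 400 are my own choices for illustration.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def predictive_pmf(m, new_size, k=500, n=5_000_000):
    """P(m occurrences in new_size tokens | k occurrences in n tokens)."""
    # Uniform prior Beta(1, 1) + binomial likelihood -> posterior Beta(a, b).
    a, b = k + 1, n - k + 1
    log_choose = (math.lgamma(new_size + 1) - math.lgamma(m + 1)
                  - math.lgamma(new_size - m + 1))
    return math.exp(log_choose + log_beta(m + a, new_size - m + b) - log_beta(a, b))

new_size = 1_000_000
# Counts above 400 carry negligible mass here, so the sum is truncated there.
pmf = [predictive_pmf(m, new_size) for m in range(401)]
mean = sum(m * q for m, q in enumerate(pmf))

# Central range of counts holding at least 95% of the predictive mass.
cumulative, low, high = 0.0, None, None
for m, q in enumerate(pmf):
    cumulative += q
    if low is None and cumulative >= 0.025:
        low = m
    if high is None and cumulative >= 0.975:
        high = m

print(f"predictive mean = {mean:.1f}, 95% range = [{low}, {high}]")
```

The predictive mean comes out close to 100, but the range shows how
far the count in a one-million-word corpus could plausibly stray from
that value.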

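In the case of a multinomial distribution with a uniform prior over
[0..1], the posterior peaks at one particular value of p_i, equal to
500/5000000 = 0.0001, which is also the maximum likelihood estimate
p_i_ML of p_i. Under that point estimate, the expected count in a
one-million-word corpus is 0.0001 x 1000000 = 100.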
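
The arithmetic for the point estimate can be sketched in a few lines of
Python. The standard deviation here is a plug-in binomial figure that I
am adding for illustration; it treats p_i_ML as if it were the true
value and so understates the real uncertainty somewhat.

```python
import math

k, n = 500, 5_000_000          # observed count and corpus size
new_size = 1_000_000           # size of the corpus we want to predict for

p_ml = k / n                   # maximum likelihood estimate of p_i
expected = k * new_size / n    # expected count under p_ml
# Plug-in binomial standard deviation; it ignores uncertainty about p_i itself.
std_dev = math.sqrt(new_size * p_ml * (1.0 - p_ml))

print(f"p_ml = {p_ml}, expected count = {expected:.0f}, std dev = {std_dev:.2f}")
```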

STENGERS, Helene wrote:
>
> Hello dear list members,
>
> I have an arithmetic question. If a particular expression occurs let's
> say 500 times in a 5 million word corpus, can I assume that there will
> be 100 of these expressions in a one million word corpus or is there a
> statistical (probability) formula which I should apply?
>
> Cheers,
>
> Helene Stengers

-- 
Dragomir R. Radev                                         radev@umich.edu
Associate Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225   Fax: 734-764-2475    http://www.si.umich.edu/~radev


This archive was generated by hypermail 2b29 : Fri Oct 21 2005 - 00:51:08 MET DST