Re: [Corpora-List] calculation problem

From: radev@umich.edu
Date: Thu Oct 20 2005 - 18:43:13 MET DST


    Here is the basic idea: let p_i be the parameter of your model that
    gives the probability of the word w_i in the underlying distribution.
    The likelihood of your observation (500 occurrences in 5 million
    words), P(data|p_i), is then a function of p_i.
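    As a rough illustration of the likelihood as a function of p_i, here
    is a minimal sketch. It assumes a simple binomial model (each token is
    an independent draw), which is a simplification for real text; the
    function name and the candidate values are made up for the example.

```python
import math

def binomial_log_likelihood(p, k, n):
    """Log of P(data | p) for k occurrences in n tokens (binomial model, assumed)."""
    # log of the binomial coefficient, via lgamma to avoid huge factorials
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log1p(-p)

K, N = 500, 5_000_000  # observed count and corpus size from the question

# A few hypothetical candidate values of p_i around the observed rate
candidates = [0.00008, 0.00009, 0.0001, 0.00011, 0.00012]
log_liks = {p: binomial_log_likelihood(p, K, N) for p in candidates}

# The candidate under which the observed data is most probable
best = max(log_liks, key=log_liks.get)

for p in candidates:
    print(f"p_i = {p:.5f}   log P(data|p_i) = {log_liks[p]:.2f}")
print("most likely candidate:", best)
```

    Every candidate gives the data a nonzero probability; some are just
    more plausible than others.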

    Different values of p_i could have generated the data that you
    observed. You need to compute the probability of the data given each
    possible value of p_i. Combined with a prior, this gives you a
    probability distribution for p_i over the interval [0..1]. To get the
    distribution of occurrences p' of w_i in the new corpus, you then
    integrate p(p'|p_i), weighted by that distribution, over p_i from
    0 to 1.
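    Written out, the steps look roughly like this. The notation is
    assumed for illustration and is not from the original question:
    k = 500 and n = 5,000,000 are the observed count and corpus size,
    k' and n' are the count and size for the new corpus, and the
    sampling model is taken to be binomial.

```latex
% Assumed notation: k, n = observed count and corpus size;
% k', n' = count and size in the new corpus; P(p_i) = prior on p_i.
\begin{align*}
P(\text{data} \mid p_i) &= \binom{n}{k}\, p_i^{k} (1 - p_i)^{n-k} \\
P(p_i \mid \text{data}) &= \frac{P(\text{data} \mid p_i)\, P(p_i)}
                                {\int_0^1 P(\text{data} \mid q)\, P(q)\, dq} \\
P(k' \mid \text{data}) &= \int_0^1 \binom{n'}{k'}\, p_i^{k'} (1 - p_i)^{n'-k'}\,
                          P(p_i \mid \text{data})\, dp_i
\end{align*}
```

    With a uniform prior, the second line is a Beta(k+1, n-k+1)
    distribution.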

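    In the case of a multinomial distribution with a uniform prior over
    [0..1], the most probable value of p_i is 500/5000000 = 0.0001, which
    is also the maximum likelihood estimate p_i_ML of p_i. Under that
    estimate, the expected count in a one million word corpus is
    0.0001 * 1000000 = 100, but the actual count will vary around that
    value.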
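    To get a feel for the numbers, here is a minimal sketch under the
    same simplifying assumptions (binomial sampling, uniform Beta(1,1)
    prior). The function names are hypothetical; the mean and variance
    are the standard beta-binomial formulas for the predictive count.

```python
import math

def mle_rate(count, corpus_size):
    """Maximum likelihood estimate of p_i: the relative frequency."""
    return count / corpus_size

def posterior_params(count, corpus_size, prior_a=1.0, prior_b=1.0):
    """Beta posterior for p_i; Beta(1, 1) is the uniform prior on [0, 1]."""
    return prior_a + count, prior_b + corpus_size - count

def predictive_mean_sd(a, b, new_size):
    """Mean and standard deviation of the beta-binomial predictive count."""
    total = a + b
    mean = new_size * a / total
    var = new_size * a * b * (total + new_size) / (total ** 2 * (total + 1))
    return mean, math.sqrt(var)

K, N = 500, 5_000_000   # observed count and corpus size from the question
NEW_N = 1_000_000       # size of the new corpus

p_ml = mle_rate(K, N)
a, b = posterior_params(K, N)
mean, sd = predictive_mean_sd(a, b, NEW_N)

print("p_i_ML:", p_ml)
print("expected count (ML plug-in):", p_ml * NEW_N)
print(f"predictive mean: {mean:.2f}, predictive sd: {sd:.2f}")
```

    So 100 is a sensible point estimate, but a count anywhere within
    a couple of standard deviations of it would not be surprising,
    and real corpora typically vary more than this idealized model
    suggests.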

    D.

    STENGERS, Helene wrote:
    > Hello dear list members,
    >
    >
    > I have an arithmetic question. If a particular expression occurs, let's
    > say, 500 times in a 5 million word corpus, can I assume that there will
    > be 100 of these expressions in a one million word corpus, or is there a
    > statistical (probability) formula which I should apply?
    >
    > Cheers,
    >
    > Helene Stengers

    -- 
    Dragomir R. Radev                                         radev@umich.edu
    Associate Professor of Information, Electrical Engineering and
    Computer Science, and Linguistics, the University of Michigan, Ann Arbor
    Phone: 734-615-5225   Fax: 734-764-2475    http://www.si.umich.edu/~radev
    


