Re: [Corpora-List] calculation problem

From: Juan Huerta (huerta@us.ibm.com)
Date: Thu Oct 20 2005 - 17:50:00 MET DST


    The answer is correct, but I'd like to offer a slightly different
    explanation:

     
    The maximum likelihood estimate of the occurrence rate of that word
    in corpus 1 is 500/5,000,000 = 0.0001 = rate_0

    Assuming that the distribution of words and expressions is similar in
    both corpora,
    the maximum likelihood estimate of the number of occurrences of that
    word in corpus 2 is rate_0 * 1,000,000 = 100

    This holds regardless of any particular word distribution assumptions.
    The only condition is that corpus 1 (the 5 million words) and corpus 2
    (the 1 million words) follow the same distribution (i.e., they are
    more or less of the same nature).
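
    The arithmetic above can be sketched in a few lines of Python. This
    is only an illustration: the function name and its parameters are
    made up for this example, and it assumes, as stated above, that both
    corpora follow the same distribution.

```python
def scaled_expected_count(count_1, size_1, size_2):
    """Scale an observed count from corpus 1 to a corpus of size_2 words.

    Hypothetical helper for illustration; it assumes both corpora
    follow the same distribution.
    """
    # Maximum likelihood estimate of the per-word occurrence rate.
    rate_0 = count_1 / size_1
    # Expected number of occurrences in the second corpus.
    return rate_0 * size_2


expected = scaled_expected_count(500, 5_000_000, 1_000_000)
print(f"expected occurrences in corpus 2: {expected:.0f}")
```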

    -Juan

    Sent by: owner-corpora@lists.uib.no
    To: CORPORA@UIB.NO
    cc:
    Subject: Re: [Corpora-List] calculation problem

    Hello Helene,

    if you assume that occurrences in your corpus are distributed uniformly
    (actually the simplest probability distribution ever), you can take this
    number of 100.

    Otherwise, if you use another distribution that better describes the
    behaviour of the occurrences, it will influence the number of occurrences
    in the 1 million corpus, which will probably not be 100.
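
    To give a rough feel for how far the count might stray from 100, here
    is a small sketch under one possible model. It assumes each word
    position is an independent trial with the same occurrence rate (a
    binomial model), which is an assumption of this example and not
    something real text is guaranteed to satisfy. The function name is
    hypothetical.

```python
import math


def binomial_count_summary(rate, corpus_size):
    """Mean and standard deviation of the count under a binomial model.

    Assumption for illustration: every word position is an independent
    trial with the same probability `rate` of being the expression.
    """
    mean = rate * corpus_size
    sd = math.sqrt(corpus_size * rate * (1.0 - rate))
    return mean, sd


mean, sd = binomial_count_summary(500 / 5_000_000, 1_000_000)
print(f"expected count: {mean:.1f}, standard deviation: {sd:.2f}")
```

    Under this assumption the expected count is still 100, but an
    observed count some distance away from 100 would not be surprising.
    A distribution that models clustering of occurrences would generally
    give a wider spread.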

    Cheers,

    Alexander

    > --- Original Message ---
    > From: "STENGERS, Helene" <Helene.Stengers@ehb.be>
    > To: CORPORA@UIB.NO
    > Subject: [Corpora-List] calculation problem
    > Date: Wed, 19 Oct 2005 14:14:55 +0200 (Romance (summer time))
    >
    >
    >
    >
    > Hello dear list members,
    >
    >
    > I have an arithmetic question. If a particular expression occurs, let's
    > say, 500 times in a 5 million word corpus, can I assume that there will
    > be 100 of these expressions in a one million word corpus, or is there a
    > statistical (probability) formula which I should apply?
    >
    > Cheers,
    >
    > Helene Stengers
    >
    >

    



    This archive was generated by hypermail 2b29 : Thu Oct 20 2005 - 22:02:46 MET DST