RE: [Corpora-List] Dice coefficient

From: Piao, Songlin (s.piao@lancaster.ac.uk)
Date: Wed Apr 19 2006 - 11:46:26 MET DST


    Hi Markus,

    You must be working on word alignment, but I am not sure whether you are using sentence-aligned corpora.

    >that frequency count is used instead, which is problematic
    >in word alignment since that would presuppose that Ns=Nt

    If you are using sentence-aligned corpora, you can get the frequencies for ws and wt by counting the aligned sentence pairs in which each of them occurs. In this case, Ns=Nt=total_number_of_aligned_sentence_pairs. As to the co-occurrence frequency for (ws, wt), you can get it by counting the aligned sentence pairs in which both of them occur.
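    To spell out why the counts are enough: assuming all three probabilities are estimated as relative frequencies over the same N aligned sentence pairs (f(.) is my shorthand for "number of aligned pairs containing", not notation from your message), N cancels and the Dice coefficient can be computed from the counts directly:

```latex
\mathrm{Dice}(w_s, w_t)
  = \frac{2\,p(w_s, w_t)}{p(w_s) + p(w_t)}
  = \frac{2\,f(w_s, w_t)/N}{f(w_s)/N + f(w_t)/N}
  = \frac{2\,f(w_s, w_t)}{f(w_s) + f(w_t)}
```

    So the frequency-based and the probability-based formulations agree, provided the counts are taken over aligned pairs rather than over running words on each side.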

    If you are not using sentence-aligned corpora, you can substitute other corresponding text segments, such as paragraphs or sections, for the aligned sentence pairs.
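    Either way, the counting could look roughly like the sketch below. It is in Python rather than Java just to keep it short, and it is only an illustration under my own assumptions: the function name dice_scores, the input format (a list of tokenised source/target segment pairs) and the toy data are all made up for the example. Each word is counted at most once per segment pair, as described above.

```python
from collections import Counter
from itertools import product


def dice_scores(pairs):
    """Dice coefficient for every co-occurring (source word, target word) pair.

    `pairs` is an iterable of (source_tokens, target_tokens) tuples, one per
    aligned segment pair (hypothetical input format chosen for this sketch).
    """
    src_freq = Counter()  # f(ws): segment pairs whose source side contains ws
    tgt_freq = Counter()  # f(wt): segment pairs whose target side contains wt
    co_freq = Counter()   # f(ws, wt): segment pairs containing both

    for src_tokens, tgt_tokens in pairs:
        # Use sets so a word repeated within one segment is counted only once.
        src_types = set(src_tokens)
        tgt_types = set(tgt_tokens)
        src_freq.update(src_types)
        tgt_freq.update(tgt_types)
        co_freq.update(product(src_types, tgt_types))

    # Dice = 2 * f(ws, wt) / (f(ws) + f(wt)); pairs that never co-occur
    # are simply absent from the result (their Dice would be 0).
    return {
        (ws, wt): 2.0 * f / (src_freq[ws] + tgt_freq[wt])
        for (ws, wt), f in co_freq.items()
    }


# Toy aligned pairs, invented purely for illustration.
pairs = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a house".split(), "ein haus".split()),
]
scores = dice_scores(pairs)
```

    On this toy data, "house"/"haus" occur in the same two pairs and nowhere else, so their score is 1.0, while "the"/"haus" share only one pair out of two each and get 0.5.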

    Hope this helps.

    Scott Piao
    --------------------
    Computing Department
    Lancaster University
    UK

    -----Original Message-----
    From: owner-corpora@lists.uib.no on behalf of Markus Saers
    Sent: Wed 19/04/2006 08:50
    To: CORPORA@uib.no
    Subject: [Corpora-List] Dice coefficient
     
    Hello,

    My name is Markus Saers, and I am currently implementing an alignment tool
    as part of a course in Java for NLP. When trying to implement the Dice
    coefficient, I ran into some problems that I was hoping someone could help
    me with.

    The only definition of the Dice coefficient that I have seen looks like
    this:

    Dice = 2 * p(ws, wt) / ( p(ws) + p(wt) )

    Where p(ws, wt) is the probability of the source word co-occurring with the
    target word, p(ws) is the probability of the source word and p(wt) is the
    probability of the target word.

    Although it is stated as probabilities, some info that I gathered on the net
    seems to suggest that frequency count is used instead, which is problematic
    in word alignment since that would presuppose that Ns=Nt (where Ns is the
    number of source words and Nt is the number of target words).

    The second problem arises when probabilities ARE used. p(ws) and p(wt) are
    easy to estimate, but how is p(ws, wt) estimated?

    Best regards
    Markus Saers
    PhD student, Uppsala University


