Re: [Corpora-List] question about Wordsmith tools (log-likelihood)

From: Stefan Evert (stefan.evert@uos.de)
Date: Fri Sep 22 2006 - 23:31:51 MET DST


    Dear Luciana,

    calculating 'd' precisely for span-based collocations is a tricky
    problem indeed, especially if you want to do it in a mathematically
    sound way. I've tried to work this out in my PhD thesis (which you
    can get from my homepage purl.org/stefan.evert - look under
    "Publications"), but the description has become fairly technical and
    complicated.

    A reasonably good approximation is achieved by the following
    procedure, which calculates the four entries of the contingency table
    from the number of cooccurrences (a), the marginals (r1 = first row
    and c1 = first column), and the sample size (n).

    a = number of cooccurrences of w1 and w2 within the chosen span size
    c1 = first column marginal = unigram frequency of w2

    The next two values may be different from what you would do intuitively:

    r1 = first row marginal = number of "slots" where w2 could cooccur
    with w1 = span size (left + right positions) * unigram frequency
    of w1
    n = sample size = total number of tokens in the corpus (yes, you were
    right, d will be close to the total number of tokens)

    I think that the calculation of r1 merits some further explanation.
    In your case, where a 1:1 span is used, there are two positions
    around each instance of w1 where an instance of w2 could in principle
    cooccur with it, so the total number of "slots" is 2 * f(w1). If you
    increase the span size, the number of slots increases
    correspondingly, so for a 3:3 span, it would be 6 * f(w1); for a
    one-sided 0:5 span, it would be 5 * f(w1).
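
    As a small illustration of the slot count described above, here is
    a minimal Python sketch. The function name cooccurrence_slots and
    its parameters are made up for this example; it assumes the span is
    given as the number of positions to the left and right of w1.

```python
def cooccurrence_slots(f_w1: int, left: int, right: int) -> int:
    """Approximate r1: number of slots around w1 for a left:right span."""
    if f_w1 < 0 or left < 0 or right < 0:
        raise ValueError("frequency and span widths must be non-negative")
    # Each instance of w1 offers (left + right) positions for w2.
    return (left + right) * f_w1


print(cooccurrence_slots(100, 1, 1))  # 1:1 span -> 2 * f(w1) = 200
print(cooccurrence_slots(100, 3, 3))  # 3:3 span -> 6 * f(w1) = 600
print(cooccurrence_slots(100, 0, 5))  # 0:5 span -> 5 * f(w1) = 500
```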

    Once you've got all this information, it's straightforward to
    calculate the contingency table:

    a = a (as defined above)
    b = r1 - a
    c = c1 - a
    d = n - r1 - c1 + a
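
    The four equations above can be sketched in Python as follows. All
    names (ContingencyTable, contingency_table, log_likelihood) and the
    example counts are made up for illustration. The log_likelihood
    function is the textbook log-likelihood ratio statistic
    G2 = 2 * sum(O * ln(O / E)) for a 2x2 table; I am assuming, not
    asserting, that this matches what the Wordsmith Tools manual
    describes.

```python
import math
from typing import NamedTuple


class ContingencyTable(NamedTuple):
    a: int  # w1 and w2 cooccur
    b: int  # slots around w1 not filled by w2
    c: int  # occurrences of w2 outside the span of w1
    d: int  # everything else


def contingency_table(a: int, f_w1: int, f_w2: int, n: int,
                      span_size: int) -> ContingencyTable:
    """Approximate table from cooccurrences, unigram counts and corpus size."""
    r1 = span_size * f_w1  # first row marginal: slots around w1
    c1 = f_w2              # first column marginal: frequency of w2
    b = r1 - a
    c = c1 - a
    d = n - r1 - c1 + a
    if min(a, b, c, d) < 0:
        raise ValueError("inconsistent counts: a cell would be negative")
    return ContingencyTable(a, b, c, d)


def log_likelihood(t: ContingencyTable) -> float:
    """Standard G2 statistic for a 2x2 table (assumed, see lead-in)."""
    n = sum(t)
    rows = (t.a + t.b, t.c + t.d)
    cols = (t.a + t.c, t.b + t.d)
    observed = ((t.a, t.b), (t.c, t.d))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n  # expected count under independence
            if o > 0:                  # a zero cell contributes nothing
                g2 += o * math.log(o / e)
    return 2.0 * g2


# Made-up counts for a 1:1 span (span size 2).
table = contingency_table(a=30, f_w1=100, f_w2=200, n=1_000_000, span_size=2)
print(table)
print(sum(table))  # the four cells add up to n by construction
print(log_likelihood(table))
```

    Note that a + b + c + d = n always holds with these definitions,
    which is a handy sanity check on your own numbers.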

    Hope this helps to clarify things a little,
    Stefan

    PS: If you look closely at these equations, you'll notice that
    changing the span size will also change d, but only by a
    comparatively small amount. r1 and b, on the other hand, are much
    more sensitive to span size.
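
    To see this effect with numbers, here is a minimal Python sketch
    using made-up counts. The function name table_for_span is
    hypothetical, and a is held fixed purely to isolate the effect of
    the span size (in a real corpus, a would usually grow with the span
    as well).

```python
def table_for_span(a, f_w1, f_w2, n, span_size):
    """Return (a, b, c, d) using the approximation r1 = span_size * f(w1)."""
    r1 = span_size * f_w1
    return a, r1 - a, f_w2 - a, n - r1 - f_w2 + a


# Made-up counts, not taken from any real corpus.
a, f_w1, f_w2, n = 30, 100, 200, 1_000_000

for span_size in (2, 6, 10):  # e.g. 1:1, 3:3 and 5:5 spans
    _, b, _, d = table_for_span(a, f_w1, f_w2, n, span_size)
    print(f"span size {span_size:2d}: b = {b:4d}, d = {d}")
```

    With these numbers, b grows from 170 to 970 between span sizes 2
    and 10, while d only drops from 999630 to 998830, i.e. by less than
    0.1%.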

    On 20 Sep 2006, at 22:50, Luciana Diniz wrote:

    > I'm trying to make sense of the log likelihood formula (in the
    > Wordsmith Tools manual), and I'm not sure what "d" means in:
    >
    > "d := frequency of pairs involving neither w1 nor w2"
    >
    > Does it mean the frequency of all possible collocates (with span
    > 1:1) minus the frequency of the word 1 (isolated frequency) minus the
    > frequency of word 2 (isolated frequency)?
    > If this is the case, would "d" be very close to the total number of
    > words in the corpus?
    >
    > Also, if this is the case, what if I choose a different span?
    > Would this change the value of "d"?
    >
    > I'm very confused and I'd really appreciate it if somebody could
    > help me :)
    >
    > Thank you!
    > Luciana.
    >


