Re: [Corpora-List] question about Wordsmith tools (log-likelihood)

From: Stefan Evert (stefan.evert@uos.de)
Date: Fri Sep 22 2006 - 23:31:51 MET DST

In reply to: Luciana Diniz: "[Corpora-List] question about Wordsmith tools (log-likelihood)"

Dear Luciana,

Calculating 'd' precisely for span-based collocations is a tricky
problem indeed, especially if you want to do it in a mathematically
sound way. I've tried to work this out in my PhD thesis (which you
can get from my homepage, purl.org/stefan.evert, under
"Publications"), but the description there is fairly technical and
complicated.

A reasonably good approximation is achieved by the following
procedure, which calculates the four entries of the contingency table
from the number of cooccurrences (a), the marginals (r1 = first row
and c1 = first column), and the sample size (n).

a = number of cooccurrences of w1 and w2 within the chosen span size
c1 = first column marginal = unigram frequency of w2

The next two values may be different from what you would do intuitively:

r1 = first row marginal = number of "slots" where w2 could cooccur
with w1 = span size * unigram frequency of w1
n = sample size = total number of tokens in the corpus (yes, you were
right, d will be close to the total number of tokens)

I think that the calculation of r1 merits some further explanation.
In your case, where a 1:1 span is used, there are two positions
around each instance of w1 where an instance of w2 could in principle
cooccur with it, so the total number of "slots" is 2 * f(w1). If you
increase the span size, the number of slots increases
correspondingly, so for a 3:3 span, it would be 6 * f(w1); for a one-
sided 0:5 span, it would be 5 * f(w1).
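The slot count is simple enough to write down as code. Here is a minimal
Python sketch; the function name slot_count and its parameter names are
illustrative assumptions, not anything taken from WordSmith Tools.

```python
def slot_count(f_w1: int, left: int, right: int) -> int:
    """Approximate first row marginal r1 for a left:right span.

    f_w1  -- unigram frequency of w1
    left  -- number of positions considered to the left of w1
    right -- number of positions considered to the right of w1
    """
    if f_w1 < 0 or left < 0 or right < 0:
        raise ValueError("frequency and span sizes must be non-negative")
    # Every instance of w1 opens (left + right) positions for w2.
    return (left + right) * f_w1


# The three spans mentioned above, for a hypothetical f(w1) = 100:
r1_span_1_1 = slot_count(100, 1, 1)  # 2 * f(w1)
r1_span_3_3 = slot_count(100, 3, 3)  # 6 * f(w1)
r1_span_0_5 = slot_count(100, 0, 5)  # 5 * f(w1)
```

Note that this ignores edge effects (instances of w1 near the start or end
of a text, or overlapping spans), which is one reason why it is only an
approximation.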

Once you've got all this information, it's straightforward to
calculate the contingency table:

a = a (as defined above)
b = r1 - a
c = c1 - a
d = n - r1 - c1 + a
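
As a rough illustration, the four equations can be put into a small Python
function. This is only a sketch of the approximation described here; the
names Table and contingency_table and the frequencies in the example are
made up for illustration.

```python
from typing import NamedTuple


class Table(NamedTuple):
    a: int  # w1 and w2 cooccur
    b: int  # slots around w1 not filled by w2
    c: int  # instances of w2 outside the span of w1
    d: int  # everything else


def contingency_table(cooc: int, f_w1: int, f_w2: int, n: int,
                      left: int, right: int) -> Table:
    """Approximate contingency table for a left:right span."""
    r1 = (left + right) * f_w1  # first row marginal (number of slots)
    c1 = f_w2                   # first column marginal
    a = cooc
    b = r1 - a
    c = c1 - a
    d = n - r1 - c1 + a
    if min(a, b, c, d) < 0:
        raise ValueError("inconsistent frequencies: negative cell")
    return Table(a, b, c, d)


# Hypothetical frequencies: f(w1) = 1000, f(w2) = 500, 30 cooccurrences
# within a 1:1 span, corpus of 1,000,000 tokens.
table = contingency_table(cooc=30, f_w1=1000, f_w2=500,
                          n=1_000_000, left=1, right=1)
# The four cells always add up to the sample size n.
assert sum(table) == 1_000_000
```

With these invented numbers, d comes out at 997530, which is indeed close
to the total number of tokens.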

Hope this helps to clarify things a little,
Stefan

PS: If you look closely at these equations, you'll notice that
changing the span size will also change d, but only by a
comparatively small amount. r1 and b, on the other hand, are much
more sensitive to span size.
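
To see this effect with numbers, here is a small Python sketch that
compares a 1:1 and a 3:3 span. All frequencies are invented, and the number
of cooccurrences is held fixed for simplicity (in a real corpus, a would
usually grow with the span as well).

```python
def cells(cooc: int, f_w1: int, f_w2: int, n: int, span: int):
    """Return (r1, b, d) for a total span of `span` positions around w1."""
    r1 = span * f_w1
    b = r1 - cooc
    d = n - r1 - f_w2 + cooc
    return r1, b, d


# Hypothetical frequencies, identical for both spans.
freqs = dict(cooc=30, f_w1=1000, f_w2=500, n=1_000_000)

r1_narrow, b_narrow, d_narrow = cells(span=2, **freqs)  # 1:1 span
r1_wide, b_wide, d_wide = cells(span=6, **freqs)        # 3:3 span

# Relative change of each value when going from 1:1 to 3:3.
change_r1 = (r1_wide - r1_narrow) / r1_narrow
change_b = (b_wide - b_narrow) / b_narrow
change_d = (d_wide - d_narrow) / d_narrow

print(f"r1: {r1_narrow} -> {r1_wide} ({change_r1:+.1%})")
print(f"b:  {b_narrow} -> {b_wide} ({change_b:+.1%})")
print(f"d:  {d_narrow} -> {d_wide} ({change_d:+.1%})")
```

In this made-up example r1 triples and b roughly triples, while d shrinks
by less than half a percent.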

On 20 Sep 2006, at 22:50, Luciana Diniz wrote:

> I'm trying to make sense of the log likelihood formula (in the
> Wordsmith Tools manual), and I'm not sure what "d" means in:
>
> "d := frequency of pairs involving neither w1 nor w2"
>
> Does it mean the frequency of the all possible collocates (with span
> 1:1) minus the frequency of the word 1 (isolated frequency) minus the
> frequency of word 2 (isolated frequency)?
> If this is the case, would "d" be very close to the total number of
> words in the corpus?
>
> Also, if this is the case, what if I choose a different span? Would
> this change the value of "d"?
>
> I'm very confused and I'd really appreciate it if somebody could
> help me :)
>
> Thank you!
> Luciana.
>


This archive was generated by hypermail 2b29 : Fri Sep 22 2006 - 23:29:49 MET DST