Re: Corpora: Baum-training impossible for a large corpus?

Ted E. Dunning (ted@aptex.com)
Fri, 25 Sep 1998 13:28:18 -0700

>>>>> "vV" == van Veenendaal R <s0650692@let.rug.nl> writes:

vV> Hello,

vV> To finish my studies here in Groningen (Humanities Computing)
vV> I'm working on a POS-Tagger. ... I have to train a Hidden
vV> Markov Model for Dutch using the Eindhoven corpus. Because of
vV> the size of the corpus, Prolog encounters fractions that are
vV> too small (< 4.9e-324) and returns "+nan" (not a number) after
vV> working through about 10 sentences (prob_word_1 * prob_word_2
vV> * ... * prob_word_n).

...

vV> -I've tried to use -log values, but there are problems when
vV> you've got to sum the normal probabilities (what to do with
vV> the -log values when you can't un-log them temporarily because
vV> of the +nan problem?).

The basic answer to these questions is that

a) you have to use log probabilities if you hope to get anything to
work.

b) you have to write an addition routine. The basic outline is:

    add(x, y) = if x < y - log(10^30) then y
                else if x - log(10^30) > y then x
                else
                    let base = min(x, y)
                    in  base + log(exp(x - base) + exp(y - base))

This is generally reasonably fast, since the cases where either x or
y dominates the sum are relatively common.
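
For concreteness, here is a sketch of that routine in C (the names
log_add and LOG_THRESHOLD are mine, not from any particular library;
the 10^30 cutoff follows the outline above):

    #include <math.h>

    /* log(10^30): beyond this gap the smaller operand cannot
       affect the sum at double precision. */
    #define LOG_THRESHOLD 69.0775527898

    /* Compute log(exp(x) + exp(y)) without leaving log space,
       where x and y are natural-log probabilities. */
    double log_add(double x, double y)
    {
        if (x < y - LOG_THRESHOLD)      /* y dominates */
            return y;
        if (x - LOG_THRESHOLD > y)      /* x dominates */
            return x;
        {
            double base = (x < y) ? x : y;   /* min(x, y), as above */
            return base + log(exp(x - base) + exp(y - base));
        }
    }

Note that with the 10^30 cutoff, exp(x - base) never exceeds about
1e30, which is comfortably inside double range, so taking the minimum
as the base is safe here.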

If you are adding very many items then you need to worry about
round-off. There are several ways to deal with this. One effective
strategy is to sort the values to be added in ascending order and then
add them up in pairs. Repeat the pairwise adding until you have
only one element left. There are other strategies which are cheaper,
but most strategies which truly minimize round-off are either more
complex, less general, or only suited to particular problem structures.
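
A minimal sketch of that pairwise scheme, reusing the hypothetical
log_add above (one plausible arrangement, not the only one):

    #include <stdlib.h>

    double log_add(double x, double y);   /* from the sketch above */

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Sum n log-probabilities with reduced round-off: sort them in
       ascending order, then repeatedly combine adjacent pairs until
       a single value remains.  Modifies v in place; assumes n >= 1. */
    double log_sum(double *v, size_t n)
    {
        qsort(v, n, sizeof *v, cmp_double);
        while (n > 1) {
            size_t i, j = 0;
            for (i = 0; i + 1 < n; i += 2)
                v[j++] = log_add(v[i], v[i + 1]);
            if (i < n)                    /* odd element carried over */
                v[j++] = v[i];
            n = j;
        }
        return v[0];
    }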

I hope that this helps.