Re: Corpora: Baum-training impossible for a large corpus?

Ted E. Dunning (ted@aptex.com)
Fri, 25 Sep 1998 13:28:18 -0700

>>>>> "vV" == van Veenendaal R <s0650692@let.rug.nl> writes:

vV> Hello,

vV> To finish my studies here in Groningen (Humanities Computing)
vV> I'm working on a POS-Tagger. ... I have to train a Hidden
vV> Markov Model for Dutch using the Eindhoven corpus. Because of
vV> the size of the corpus, Prolog encounters fractions that are
vV> too small (< 4.9e-324) and returns "+nan" (not a number) after
vV> working through about 10 sentences (prob_word_1 * prob_word_2
vV> * ... * prob_word_n).

...

vV> -I've tried to use -log values, but there are problems when
vV> you've got to sum the normal probabilities (what to do with
vV> the -log values when you can't un-log them temporarily because
vV> of the +nan problem?).

The basic answer to these questions is that

a) you have to use log probabilities if you hope to get anything to
work.

b) you have to write an addition routine. The basic outline is:

    add(x, y) = if x < y - log(10^30) then y
                else if x - log(10^30) > y then x
                else
                    let base = min(x, y)
                    in  base + log(exp(x - base) + exp(y - base))

This is generally reasonably fast, since the cases where either x or
y dominates the sum are relatively common.
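
For concreteness, here is a sketch of that routine in C (the names
log_add and LOG_THRESHOLD are mine, not from any particular library;
the 10^30 cutoff follows the outline above):

    #include <math.h>

    /* log(10^30): beyond this gap the smaller operand cannot
       affect the sum at double precision. */
    #define LOG_THRESHOLD 69.0775527898

    /* Compute log(exp(x) + exp(y)) without leaving log space,
       where x and y are natural-log probabilities. */
    double log_add(double x, double y)
    {
        if (x < y - LOG_THRESHOLD)      /* y dominates */
            return y;
        if (x - LOG_THRESHOLD > y)      /* x dominates */
            return x;
        {
            double base = (x < y) ? x : y;   /* min(x, y), as above */
            return base + log(exp(x - base) + exp(y - base));
        }
    }

Note that with the 10^30 cutoff, exp(x - base) never exceeds about
1e30, which is comfortably inside double range, so taking the minimum
as the base is safe here.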

If you are adding very many items then you need to worry about
round-off. There are several ways to deal with this. One effective
strategy is to sort the values to be added in ascending order and then
add them up in pairs. Repeat the pairwise adding until you have
only one element left. There are other strategies which are cheaper,
but most strategies which truly minimize round-off are either more
complex, less general, or only suited to particular problem structures.
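
A minimal sketch of that pairwise scheme, reusing the hypothetical
log_add above (one plausible arrangement, not the only one):

    #include <stdlib.h>

    double log_add(double x, double y);   /* from the sketch above */

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Sum n log-probabilities with reduced round-off: sort them in
       ascending order, then repeatedly combine adjacent pairs until
       a single value remains.  Modifies v in place; assumes n >= 1. */
    double log_sum(double *v, size_t n)
    {
        qsort(v, n, sizeof *v, cmp_double);
        while (n > 1) {
            size_t i, j = 0;
            for (i = 0; i + 1 < n; i += 2)
                v[j++] = log_add(v[i], v[i + 1]);
            if (i < n)                    /* odd element carried over */
                v[j++] = v[i];
            n = j;
        }
        return v[0];
    }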

I hope that this helps.