Re: Corpora: Baum-training impossible for a large corpus?

Bernard Merialdo (merialdo@eurecom.fr)
Fri, 25 Sep 1998 11:49:05 +0200 (MET DST)

there are two "classic" answers to that:

- you can normalize the alpha coefficients of the forward pass
so that they sum to one at each time step. then it is enough
to apply the same normalization factors to the betas on the way back
to get the proper computations. the log of the total probability
is then the sum of the logs of the normalization factors.

- you can use log values and compute the log of the sum
by the equation:
if a >= b, log(a+b) = log(a) + log(1 + exp(log(b) - log(a)))
(swap a and b otherwise). you just need to estimate log(1+exp(t))
with t = log(b) - log(a) <= 0, so exp(t) can never overflow.
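
as an illustration only, here is a minimal Python sketch of both answers
on a toy HMM. everything in it is an assumption made for the example and
not taken from the original question: the function names, the two-state
model (INIT, TRANS, EMIT), and the observation sequence OBS. it assumes
all model probabilities are strictly positive. the backward pass divides
by the forward factor of the next time step, which is one common
convention among several.

```python
import math

def log_add(log_a, log_b):
    """Return log(a + b) from log(a) and log(b), staying in the log domain."""
    if log_a < log_b:
        log_a, log_b = log_b, log_a          # make log_a the larger one
    if log_b == -math.inf:                   # b == 0 contributes nothing
        return log_a
    # t = log_b - log_a <= 0, so exp(t) is in (0, 1] and cannot overflow
    return log_a + math.log1p(math.exp(log_b - log_a))

def forward_scaled(init, trans, emit, obs):
    """Scaled forward pass. Returns (alphas, scales, log P(obs))."""
    n = len(init)
    alphas, scales, log_prob = [], [], 0.0
    for t, o in enumerate(obs):
        if t == 0:
            alpha = [init[j] * emit[j][o] for j in range(n)]
        else:
            prev = alphas[-1]
            alpha = [sum(prev[i] * trans[i][j] for i in range(n)) * emit[j][o]
                     for j in range(n)]
        c = sum(alpha)                       # normalization factor, assumed > 0
        alphas.append([a / c for a in alpha])
        scales.append(c)
        log_prob += math.log(c)              # log P(obs) = sum of log factors
    return alphas, scales, log_prob

def backward_scaled(trans, emit, obs, scales):
    """Backward pass reusing the forward normalization factors."""
    n, T = len(trans), len(obs)
    betas = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        o_next = obs[t + 1]
        betas[t] = [sum(trans[i][j] * emit[j][o_next] * betas[t + 1][j]
                        for j in range(n)) / scales[t + 1]
                    for i in range(n)]
    return betas

def forward_log(init, trans, emit, obs):
    """Forward pass carried out entirely on log values. Returns log P(obs)."""
    n = len(init)
    log_alpha = [math.log(init[j]) + math.log(emit[j][obs[0]]) for j in range(n)]
    for o in obs[1:]:
        new = []
        for j in range(n):
            acc = -math.inf                  # log(0), the empty sum
            for i in range(n):
                acc = log_add(acc, log_alpha[i] + math.log(trans[i][j]))
            new.append(acc + math.log(emit[j][o]))
        log_alpha = new
    total = -math.inf
    for v in log_alpha:
        total = log_add(total, v)
    return total

# Hypothetical toy model: 2 states, 2 output symbols, 4000 observations.
INIT = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]
OBS = [0, 1] * 2000

alphas, scales, log_prob = forward_scaled(INIT, TRANS, EMIT, OBS)
betas = backward_scaled(TRANS, EMIT, OBS, scales)
print(log_prob, forward_log(INIT, TRANS, EMIT, OBS))
```

with this scaling convention, alphas[t][i] * betas[t][i] is directly the
posterior probability of state i at time t, so the counts needed for
re-estimation never have to pass through the tiny unscaled values.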

van Veenendaal R. writes:
> I have to construct a POS-Tagger, so training for 1's and 0's isn't enough. I
> have to train a Hidden Markov Model for Dutch using the Eindhoven corpus.
> Because of the size of the corpus Prolog encounters too small fractions ( <
> 4.9e-324) and returns "+nan" (not a number) after working through about 10
> sentences (prob_word_1 * prob_word_2 * ... * prob_word_n).
>
> Training a Hidden Markov Model for a language has been done before, so my
> question is: "What is it that I do wrong?"
>
> -I've tried to use -log values, but there are problems when you've got to sum
> the normal probabilities (what to do with the -log values when you can't un-log
> them temporarily because of the +nan problem?).
> -I've also tried to split the corpus, but how do I combine the training results
> of the parts into fully 'new' estimates?
>
> Please help me?!
>
> Remco van Veenendaal
> email: s0650692@let.rug.nl
> www: hagen.let.rug.nl/remco
>
>

-- 
____________________________________________________________________________
                                   |
   Bernard Merialdo                |    e-mail : merialdo@eurecom.fr
   Professor                       |
   Multimedia Communications Dept  |
   Institut EURECOM                |    tel : +33 (0)4 93 00 26 29
   2229 Route des Cretes           |    sec : +33 (0)4 93 00 26 26
   B.P. 193                        |    fax : +33 (0)4 93 00 26 27
   06904 Valbonne Cedex - FRANCE   |
           http://www.eurecom.fr/Multimedia/Staff/merialdo.html
____________________________________________________________________________