Corpora: Baum-Welch training impossible for a large corpus?

van Veenendaal R. (s0650692@let.rug.nl)
Thu, 24 Sep 1998 21:50:58 +0200 (METDST)

Hello,

To finish my studies here in Groningen (Humanities Computing) I'm working on a
POS tagger. I've implemented the algorithms from Jelinek, "Statistical Methods
for Speech Recognition", 1997 (Viterbi, Baum-Welch/Forward-Backward) and have
tested the program (written entirely in SICStus Prolog) on the small example
from Charniak, "Statistical Language Learning", 1993, page 64, figure 4.9.
There everything works fine.

BUT:
I have to construct a real POS tagger, so training on 1's and 0's isn't
enough: I have to train a Hidden Markov Model for Dutch using the Eindhoven
corpus. Because of the size of that corpus, the products of probabilities
(prob_word_1 * prob_word_2 * ... * prob_word_n) become too small for Prolog
to represent ( < 4.9e-324), and after working through about 10 sentences the
program returns "+nan" (not a number).
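
To make the underflow concrete, here is a minimal sketch (plain Prolog,
separate from my tagger; the predicate name is mine): it multiplies N copies
of the probability 1.0e-5 and, alongside, sums their natural logs.

% prod_and_logsum(+N, -Prod, -LogSum): multiply N copies of 1.0e-5
% and, in parallel, sum their natural logs.
prod_and_logsum(N, Prod, LogSum) :-
    prod_and_logsum(N, 1.0, 0.0, Prod, LogSum).

prod_and_logsum(0, Prod, LogSum, Prod, LogSum).
prod_and_logsum(N, P0, L0, Prod, LogSum) :-
    N > 0,
    P1 is P0 * 1.0e-5,        % underflows to 0.0 around N = 65
    L1 is L0 + log(1.0e-5),   % stays finite, just grows linearly
    N1 is N - 1,
    prod_and_logsum(N1, P1, L1, Prod, LogSum).

Querying ?- prod_and_logsum(200, P, L). gives P = 0.0 but L = -2302.58...:
the product is gone long before the corpus is, while the log sum is a
perfectly ordinary number.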

Training a Hidden Markov Model for a language has been done before, so my
question is: what am I doing wrong?

- I've tried working with -log values, but I run into trouble wherever the
algorithm has to sum ordinary probabilities: what do I do with the -log values
when I can't temporarily un-log them because of the same underflow/+nan
problem? (See the first sketch below.)
- I've also tried splitting the corpus into parts, but how do I combine the
training results of the parts into a single set of new estimates? (See the
second sketch below.)
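
For the first point, the only rearrangement I can think of is to stay in log
space even for the sums: when log(A) and log(B) are known and log(A + B) is
needed, factor out the larger of the two so that the argument of exp/1 is at
most 0 and the exponentiation can never blow up. A sketch (the predicate name
is mine, and I use natural logs here rather than -log values, but the idea is
the same):

% logsum(+LA, +LB, -LC): given LA = log(A) and LB = log(B), compute
% LC = log(A + B) without ever forming A or B themselves.
% Min - Max =< 0, so exp(Min - Max) lies in (0, 1]; if it underflows
% to 0.0, the result correctly degrades to Max.
logsum(LA, LB, LC) :-
    (   LA >= LB
    ->  Max = LA, Min = LB
    ;   Max = LB, Min = LA
    ),
    LC is Max + log(1.0 + exp(Min - Max)).

Is this the right way to handle the forward/backward sums, or is there a
better trick (per-observation scaling factors, as some of the speech
literature seems to use)?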
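
For the second point, my understanding is that the Baum-Welch re-estimates
are ratios of expected counts summed over all training sentences, and such
sums are additive across parts of the corpus; so the parts should be combined
by adding their numerator and denominator counts and dividing only once, not
by averaging the per-part probabilities. A sketch for a single parameter (the
Num-Den pair representation and predicate names are mine):

% reestimate(+Parts, -Prob): Parts is a list of Num-Den pairs, one per
% corpus part; Num is that part's expected count for the event being
% re-estimated, Den the expected count of its conditioning context.
% Assumes the summed Den is non-zero.
reestimate(Parts, Prob) :-
    sum_counts(Parts, 0.0, 0.0, Num, Den),
    Prob is Num / Den.

sum_counts([], Num, Den, Num, Den).
sum_counts([N0-D0|Rest], NAcc0, DAcc0, Num, Den) :-
    NAcc1 is NAcc0 + N0,
    DAcc1 is DAcc0 + D0,
    sum_counts(Rest, NAcc1, DAcc1, Num, Den).

Is that correct, or does splitting the corpus bias the estimates in some way
I'm not seeing?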

Please help me?!

Remco van Veenendaal
email: s0650692@let.rug.nl
www: hagen.let.rug.nl/remco