Corpora: Baum-Welch training impossible for a large corpus?

van Veenendaal R. (s0650692@let.rug.nl)
Thu, 24 Sep 1998 21:50:58 +0200 (METDST)

Hello,

To finish my studies here in Groningen (Humanities Computing) I'm working on a
POS tagger. I've implemented the algorithms from Jelinek, "Statistical Methods
for Speech Recognition", 1997 (Viterbi, Baum-Welch/Forward-Backward) and have
tested the program (written entirely in SICStus Prolog) on the small example
from Charniak, "Statistical Language Learning", 1993, page 64, figure 4.9.
There everything works fine.

BUT:
I have to construct a real POS tagger, so training on 1's and 0's isn't
enough: I have to train a Hidden Markov Model for Dutch using the Eindhoven
corpus. Because of the size of that corpus, the products of probabilities
(prob_word_1 * prob_word_2 * ... * prob_word_n) become too small for Prolog
to represent ( < 4.9e-324), and after working through about 10 sentences the
program returns "+nan" (not a number).
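
To make the underflow concrete, here is a minimal sketch (plain Prolog,
separate from my tagger; the predicate name is mine): it multiplies N copies
of the probability 1.0e-5 and, alongside, sums their natural logs.

% prod_and_logsum(+N, -Prod, -LogSum): multiply N copies of 1.0e-5
% and, in parallel, sum their natural logs.
prod_and_logsum(N, Prod, LogSum) :-
    prod_and_logsum(N, 1.0, 0.0, Prod, LogSum).

prod_and_logsum(0, Prod, LogSum, Prod, LogSum).
prod_and_logsum(N, P0, L0, Prod, LogSum) :-
    N > 0,
    P1 is P0 * 1.0e-5,        % underflows to 0.0 around N = 65
    L1 is L0 + log(1.0e-5),   % stays finite, just grows linearly
    N1 is N - 1,
    prod_and_logsum(N1, P1, L1, Prod, LogSum).

Querying ?- prod_and_logsum(200, P, L). gives P = 0.0 but L = -2302.58...:
the product is gone long before the corpus is, while the log sum is a
perfectly ordinary number.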

Training a Hidden Markov Model for a language has been done before, so my
question is: what am I doing wrong?

- I've tried working with -log values, but I run into trouble wherever the
algorithm has to sum ordinary probabilities: what do I do with the -log values
when I can't temporarily un-log them because of the same underflow/+nan
problem? (See the first sketch below.)
- I've also tried splitting the corpus into parts, but how do I combine the
training results of the parts into a single set of new estimates? (See the
second sketch below.)
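
For the first point, the only rearrangement I can think of is to stay in log
space even for the sums: when log(A) and log(B) are known and log(A + B) is
needed, factor out the larger of the two so that the argument of exp/1 is at
most 0 and the exponentiation can never blow up. A sketch (the predicate name
is mine, and I use natural logs here rather than -log values, but the idea is
the same):

% logsum(+LA, +LB, -LC): given LA = log(A) and LB = log(B), compute
% LC = log(A + B) without ever forming A or B themselves.
% Min - Max =< 0, so exp(Min - Max) lies in (0, 1]; if it underflows
% to 0.0, the result correctly degrades to Max.
logsum(LA, LB, LC) :-
    (   LA >= LB
    ->  Max = LA, Min = LB
    ;   Max = LB, Min = LA
    ),
    LC is Max + log(1.0 + exp(Min - Max)).

Is this the right way to handle the forward/backward sums, or is there a
better trick (per-observation scaling factors, as some of the speech
literature seems to use)?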
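
For the second point, my understanding is that the Baum-Welch re-estimates
are ratios of expected counts summed over all training sentences, and such
sums are additive across parts of the corpus; so the parts should be combined
by adding their numerator and denominator counts and dividing only once, not
by averaging the per-part probabilities. A sketch for a single parameter (the
Num-Den pair representation and predicate names are mine):

% reestimate(+Parts, -Prob): Parts is a list of Num-Den pairs, one per
% corpus part; Num is that part's expected count for the event being
% re-estimated, Den the expected count of its conditioning context.
% Assumes the summed Den is non-zero.
reestimate(Parts, Prob) :-
    sum_counts(Parts, 0.0, 0.0, Num, Den),
    Prob is Num / Den.

sum_counts([], Num, Den, Num, Den).
sum_counts([N0-D0|Rest], NAcc0, DAcc0, Num, Den) :-
    NAcc1 is NAcc0 + N0,
    DAcc1 is DAcc0 + D0,
    sum_counts(Rest, NAcc1, DAcc1, Num, Den).

Is that correct, or does splitting the corpus bias the estimates in some way
I'm not seeing?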

Please help me?!

Remco van Veenendaal
email: s0650692@let.rug.nl
www: hagen.let.rug.nl/remco