Re: Corpora: Baum-training impossible for a large corpus?

David Elworthy (dahe@cre.canon.co.uk)
Fri, 25 Sep 1998 10:19:16 +0100

van Veenendaal R. wrote:
>
> Hello,
>
> To finish my studies here in Groningen (Humanities Computing) I'm working on a
> POS-Tagger. I've implemented the algorithm from Jelinek, "Statistical Methods
> for Speech Recognition", 1997 (Viterbi, Baum-Welch/Forward-Backward) and have
> tested the program (written entirely in Sicstus Prolog) with the small example
> of Charniak, "Statistical Language Learning", 1993, page 64 figure 4.9. There
> are no problems and everything works fine.
>
> BUT:
> I have to construct a POS-Tagger, so training for 1's and 0's isn't enough. I
> have to train a Hidden Markov Model for Dutch using the Eindhoven corpus.
> Because of the size of the corpus, Prolog encounters fractions that are too
> small ( < 4.9e-324) and returns "+nan" (not a number) after working through
> about 10 sentences (prob_word_1 * prob_word_2 * ... * prob_word_n).
>
> Training a Hidden Markov Model for a language has been done before, so my
> question is: "What is it that I do wrong?"
>
> -I've tried to use -log values, but there are problems when you've got to sum
> the normal probabilities (what to do with the -log values when you can't un-log
> them temporarily because of the +nan problem?).
> -I've also tried to split the corpus, but how do I combine the training results
> of the parts into fully 'new' estimates?
>
> Please help me?!
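
On the -log question: probabilities can be summed without un-logging them
individually, by factoring out the larger term first. The following is only a
minimal sketch in Python rather than Prolog; the name log_add is made up for
illustration, and it works on natural-log probabilities (flip the signs if you
store -log values).

```python
import math

def log_add(log_a, log_b):
    """Return log(exp(log_a) + exp(log_b)) without exponentiating either
    argument on its own, so very small probabilities do not underflow."""
    # log(0) is represented as -inf; adding zero leaves the other term.
    if log_a == -math.inf:
        return log_b
    if log_b == -math.inf:
        return log_a
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    # exp(lo - hi) lies in (0, 1], so it is safe to compute.
    return hi + math.log1p(math.exp(lo - hi))
```

For example, log_add(-1000.0, -1000.0) gives -1000 + log 2, although
math.exp(-1000.0) on its own underflows to 0.0.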

You can apply a technique for numerical stabilisation which keeps the
magnitude within a reasonable range, without affecting the underlying
statistical process. For details, have a look at Cutting, Kupiec,
Pedersen and Sibun's paper "A practical part-of-speech tagger" in the
proceedings of the Third Conference on Applied Natural Language
Processing (ANLP), 1992.
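
To give a rough idea of what such stabilisation can look like, here is a
minimal sketch of a forward pass with per-position scaling, the common
textbook form of the trick. It is written in Python rather than Prolog, the
function name and the toy model in the usage note are made up for
illustration, and it is not claimed to match the cited paper in detail.

```python
import math

def forward_scaled(init, trans, emit, obs):
    """Scaled forward pass; returns (alphas, log_likelihood).

    init[i]     : P(state i at position 0)
    trans[i][j] : P(state j at t+1 | state i at t)
    emit[i][o]  : P(observation o | state i)
    obs         : list of observation indices
    """
    n = len(init)
    alphas = []
    log_likelihood = 0.0
    prev = None
    for t, o in enumerate(obs):
        if t == 0:
            raw = [init[i] * emit[i][o] for i in range(n)]
        else:
            raw = [sum(prev[i] * trans[i][j] for i in range(n)) * emit[j][o]
                   for j in range(n)]
        c = sum(raw)  # scaling factor for this position
        if c == 0.0:
            raise ValueError("sequence has zero probability under the model")
        prev = [x / c for x in raw]  # rescale so the values sum to 1
        alphas.append(prev)
        # The product of the scaling factors is P(obs); accumulate its log.
        log_likelihood += math.log(c)
    return alphas, log_likelihood
```

With a toy two-state model such as init = [0.6, 0.4],
trans = [[0.7, 0.3], [0.4, 0.6]] and emit = [[0.9, 0.1], [0.2, 0.8]], a
sequence of several thousand observations still yields a finite
log-likelihood, because no individual quantity ever gets close to the
floating-point limit. The same scaling factors are typically reused to scale
the backward pass, so that they cancel in the re-estimation formulas.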

_______________________________________________________________________
David Elworthy <dahe@cre.canon.co.uk>
Canon Research Centre Europe Ltd., Guildford, Surrey, UK
URL: http://www.cre.canon.co.uk/
Phone: +44 1483 448844; Fax: +44 1483 448845