Corpora: Part of Speech Tagging<unknown-words>

VASUPRADA KANDRAKONT (css073s@uohyd.ernet.in)
Fri, 5 Nov 1999 09:29:59 -0500 (GMT)

Hi everybody,
I'm doing a project on POS tagging, using statistical methods. I've built
a Hidden Markov Model from the SUSANNE corpus and am using the Viterbi
algorithm to find the best tag sequence. But I have a sparse-data problem.
Can anyone tell me what should be done with unknown words <words not found
in the corpus>? One method is to use features like word endings and an
initial capital letter. But what about the state transition matrix?
If anyone knows any literature on the net about this, please let me know.
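
To make the question concrete, here is a minimal sketch of the word-ending
and capital-letter idea mentioned above. It is only an illustration under
stated assumptions, not a tested tagger: the function names
(train_unknown_model, unknown_word_scores, smoothed_transitions), the
maximum suffix length of 3, and the add-k smoothing constant are all
hypothetical choices. It assumes the training data is a list of (word, tag)
pairs. Because the HMM states are tags rather than words, an unknown word
only affects the emission scores; the transition matrix can be estimated as
usual, and the sketch simply adds add-k smoothing so that unseen tag
bigrams do not get zero probability.

```python
from collections import Counter, defaultdict


def train_unknown_model(tagged_words, max_suffix=3):
    """Count tags per word ending and per capitalization flag.

    tagged_words: iterable of (word, tag) pairs (assumed input format).
    """
    suffix_tags = defaultdict(Counter)   # suffix -> tag counts
    cap_tags = defaultdict(Counter)      # starts-with-capital? -> tag counts
    tag_counts = Counter()               # overall tag counts (last back-off)
    for word, tag in tagged_words:
        tag_counts[tag] += 1
        cap_tags[word[:1].isupper()][tag] += 1
        for n in range(1, min(max_suffix, len(word)) + 1):
            suffix_tags[word[-n:].lower()][tag] += 1
    return suffix_tags, cap_tags, tag_counts


def unknown_word_scores(word, model, max_suffix=3, k=1.0):
    """Score each tag for an unseen word from its ending and capitalization.

    Backs off from the longest seen suffix to shorter ones, and finally to
    the overall tag counts. Returns scores normalized to sum to 1.
    """
    suffix_tags, cap_tags, tag_counts = model
    tags = list(tag_counts)
    counts = tag_counts
    for n in range(min(max_suffix, len(word)), 0, -1):
        suffix = word[-n:].lower()
        if suffix in suffix_tags:
            counts = suffix_tags[suffix]
            break
    cap = cap_tags[word[:1].isupper()]
    scores = {}
    for tag in tags:
        # Add-k smoothing on both feature estimates.
        p_suffix = (counts[tag] + k) / (sum(counts.values()) + k * len(tags))
        p_cap = (cap[tag] + k) / (sum(cap.values()) + k * len(tags))
        scores[tag] = p_suffix * p_cap
    total = sum(scores.values())
    return {tag: s / total for tag, s in scores.items()}


def smoothed_transitions(tag_sequences, tags, k=1.0):
    """Estimate P(current tag | previous tag) with add-k smoothing."""
    bigrams = defaultdict(Counter)
    for seq in tag_sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[prev][cur] += 1
    return {
        prev: {
            cur: (bigrams[prev][cur] + k)
            / (sum(bigrams[prev].values()) + k * len(tags))
            for cur in tags
        }
        for prev in tags
    }
```

In the Viterbi step, the scores from unknown_word_scores would stand in for
the emission column whenever the current word was not seen in training,
while the smoothed transition table is used unchanged for every word.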

I'm planning to upgrade my system by using a larger corpus. The corpus
I'm using right now has 1,30,000 words. Can anyone tell me where I can
get a downloadable corpus (free of cost)?

Thank you,
Vasuprada Kandrakota
Dept. of Computer Science,
University of Hyderabad,
Hyderabad-INDIA 500 046