RE: Corpora: Part of Speech Tagging<unknown-words>

Christopher Tribble (ctribble@sri.lanka.net)
Fri, 5 Nov 1999 14:25:45 +0530

Re: Vasuprada Kandrakota's request for sources of free corpus data

If you subscribe to the UK's Guardian International you can build a good
newspaper corpus for free by registering for the email edition. You'll be
sent the full text of the following sections each week:
international-news, us-news, uk-news, features, culture, and sport. With
around 50,000 words an issue you will soon accumulate a useful set of texts
(already blocked into quite useful thematic groups).
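
To keep track of how quickly the corpus is growing, something like the
following rough Python sketch could tally words per section. It assumes
(purely for illustration) that each emailed section has been saved by hand
as a plain-text file named <section>_<date>.txt in one directory; the
file-naming scheme, the function names, and the simple word pattern are
all hypothetical and not anything the email edition itself provides.

```python
import re
from collections import Counter
from pathlib import Path

# Section names as listed for the email edition.
SECTIONS = ("international-news", "us-news", "uk-news",
            "features", "culture", "sport")


def section_of(filename):
    """Return the section for a file named '<section>_<date>.txt', else None.

    The naming scheme is an assumption made for this sketch.
    """
    stem = Path(filename).stem
    section = stem.rsplit("_", 1)[0]
    return section if section in SECTIONS else None


def count_words(text):
    """Count word-like tokens (letters, digits, apostrophes); a crude measure."""
    return len(re.findall(r"[A-Za-z0-9']+", text))


def words_per_section(directory):
    """Sum word counts of all recognised section files in a directory."""
    totals = Counter()
    for path in Path(directory).glob("*.txt"):
        section = section_of(path.name)
        if section is not None:
            text = path.read_text(encoding="utf-8", errors="replace")
            totals[section] += count_words(text)
    return totals
```

Calling words_per_section("guardian") on such a directory (the directory
name is again only an example) returns a Counter keyed by section, so the
thematic groups can be sized separately as the weekly issues accumulate.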

Bestest

Chris Tribble

--
		Dr Christopher Tribble
Sri Lanka	21 Wijerama Mawatha, Colombo 7
		TEL  +94 75 332 309
UK		122, Queen Alexandra Mansions, Judd Street
		London WC1 H 9DQ
		TEL +44 171 833 4271
UK Mailing	c/o FCO (Sri Lanka)
		The British Council, Sri Lanka
		King Charles Street, London SW1A 2AH
E-mail		ctribble@sri.lanka.net
Home Page	http://ourworld.compuserve.com/homepages/Christopher_Tribble

> -----Original Message-----
> From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no]On
> Behalf Of VASUPRADA KANDRAKONTA(98MCMT04)
> Sent: Friday, November 05, 1999 8:00 PM
> To: corpus list
> Subject: Corpora: Part of Speech Tagging<unknown-words>
>
>
> Hi everybody,
> I'm doing a project in POS tagging.For this I'm using the statistical
> methods. I've built a Hidden Markov Model using the SUSANNE corpus and am
> using the Viterbi Algorithm to find out the best tag sequence.But I have a
> problem of sparse data. Can anyone tell me what should be done with the
> unknown words<words not found in the corpus>. One method is to use the
> features like word endings and capital letter starting. But what about the
> state transition matrix.
> If anyone knows any literature on the net about this, please let me know.
>
> I'm in a plan to upgrade my system,using a corpus of larger size.The
> corpus I'm using right now is of size 1,30,000words. Can anyone tell me
> where I can get a downloadable corpus(free of cost).
>
> Thankyou,
> Vasuprada Kandrakota
> Dept. of Computer Science,
> University of Hyderabad,
> Hyderabad-INDIA 500 046