Re: Corpora: POS tagger for Tamil

Gregory Aist (aist+@andrew.cmu.edu)
Wed, 24 Mar 1999 10:24:55 -0500 (EST)

Excerpts from mail: 23-Mar-99 Corpora: POS tagger for Tamil by Arash
Zeini@uni-koeln.de
> I am a student of Indology and Tamil Studies and I am trying to find out
> how I could create a POS tagger for Tamil. To answer this question
> theoretically or practically is part of my M.A. thesis.
> We have or will have a corpus consisting of modern Tamil literature very
> soon. Currently we are encoding the texts according to the CES with the
> level 1 encoding, which encodes the overall structure of the texts.
> There hasn't been much done for Tamil in this direction as far as I know
> and we don't have any already annotated corpus that we could use as
> training corpus.
> I have written a little macro that can recognize Tamil verbs in their
> easiest and simplest conjugation to some extent.

I don't know any Tamil, or anything about the morphology of Tamil. But
I infer from the above statement that Tamil has somewhat complicated
verb conjugation, and would therefore require some morphological
analysis prior to tagging.

Kemal Oflazer has done work on part of speech tagging in Turkish, and
some of the tools he's used may be appropriate. From
http://www.cs.bilkent.edu.tr/~ko/pubs.html I find:

Kemal Oflazer, "Morphological Analysis", chapter in Syntactic Wordclass
Tagging, Hans van Halteren (editor), Kluwer Academic Publishers, 1998.

Kemal Oflazer and Gökhan Tür, "Morphological Disambiguation by Voting
Constraints", in Proceedings of ACL'97/EACL'97, the 35th Annual Meeting of
the Association for Computational Linguistics, July 7-12, 1997, Madrid,
Spain.

Another possibility to look at is PC-KIMMO, a two-level morphological
analyser. See http://www.sil.org/pckimmo/.

> And I would like to limit
> the question of POS tagging currently only to the verbs.

One question I have is: How ambiguous is (written) Tamil with respect to
part of speech? That is, are there frequent cases of words such as
English "present", which can be (for example) both a noun and a verb? If
written Tamil is unambiguous in this respect, you may not need a
statistical disambiguation step -- just the morphological analysis.
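
To make the question concrete, here is a rough sketch (in Python) of how
one might estimate the ambiguity once some morphological analyser exists.
Everything in it is an assumption for illustration: the ANALYSES table
stands in for the output of a hypothetical analyser, the tag names are
made up, and the data is toy English rather than Tamil.

```python
# Hypothetical analyser output: each word form maps to the set of
# part-of-speech tags its analyses allow. Toy English data, not Tamil.
ANALYSES = {
    "present": {"NOUN", "VERB", "ADJ"},
    "walked": {"VERB"},
    "books": {"NOUN", "VERB"},
    "the": {"DET"},
    "teacher": {"NOUN"},
}


def ambiguity_rate(tokens, analyses):
    """Return (share of analysed tokens with more than one tag, number analysed).

    Tokens the analyser does not know are skipped rather than counted.
    """
    known = [t for t in tokens if t in analyses]
    if not known:
        return 0.0, 0
    ambiguous = sum(1 for t in known if len(analyses[t]) > 1)
    return ambiguous / len(known), len(known)


tokens = "the teacher walked the books present".split()
rate, n = ambiguity_rate(tokens, ANALYSES)
print(f"{rate:.2f} of {n} analysed tokens are ambiguous")
```

If a count like this over a sample of the corpus comes out close to zero,
that would suggest the morphological analysis alone is enough; if not, a
disambiguation step of some kind is probably needed.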

Best wishes,
Greg

Gregory Aist, aist@cs.cmu.edu Ph.D. student, LTI, Carnegie Mellon
Project LISTEN: kids read, computer listens. http://www.cs.cmu.edu/~listen
Postal address: LTI, CMU, 4910 Forbes Ave., Pittsburgh PA 15213-3720 USA