Re: [Corpora-List] POS Tagger for German / Java

From: Yannick Versley (versley@sfs.uni-tuebingen.de)
Date: Wed Jan 10 2007 - 09:58:11 MET

Next message: ben dbabis samira: "[Corpora-List] redundancy removal techniques"

Previous message: Niladri Sekhar Dash: "Re: [Corpora-List] history of corpus linguistics"
In reply to: Michael Sonntag: "[Corpora-List] POS Tagger for German / Java"
Next in thread: Ciarán Ó Duibhín: "Re: [Corpora-List] POS Tagger for German / Java"

Hi,

> I am currently working on a system for toponym recognition in natural
> german (web-based) text documents, as my master thesis.
> The system uses a POS tagger for extracting good NE candidates for a
> gazetteer.

Based on my experience (also with a system for toponym resolution, but not in
Java), I think the easiest approach is to use tnt (or any other existing
POS tagger) as an external program: write the input to a file, run tnt over
it, and read back tnt's output.
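
A minimal Java sketch of that write/run/read round trip, in case it helps.
Everything specific in it is an assumption rather than something taken from
the tnt documentation: the class name ExternalTagger is made up, the command
line is passed in by the caller, the input is written one token per line, the
output is assumed to be "token<whitespace>tag" lines with "%%" comment lines,
and UTF-8 is assumed (older models may well expect Latin-1).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bridge to an external tagger; all names here are illustrative. */
final class ExternalTagger {

    /** Writes one token per line to a temporary file (assumed input layout). */
    static Path writeTokens(List<String> tokens) throws IOException {
        Path input = Files.createTempFile("tagger-input", ".txt");
        Files.write(input, tokens, StandardCharsets.UTF_8);
        return input;
    }

    /** Runs the given command with the input file appended; returns its stdout lines. */
    static List<String> run(List<String> command, Path input)
            throws IOException, InterruptedException {
        List<String> fullCommand = new ArrayList<>(command);
        fullCommand.add(input.toString());
        Process process = new ProcessBuilder(fullCommand)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // pass tagger messages through
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("tagger exited with status " + exit);
        }
        return lines;
    }

    /** Returns the tokens tagged NE; skips blank lines and "%%" comment lines. */
    static List<String> neCandidates(List<String> taggedLines) {
        List<String> result = new ArrayList<>();
        for (String line : taggedLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("%%")) {
                continue;
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length >= 2 && fields[1].equals("NE")) {
                result.add(fields[0]);
            }
        }
        return result;
    }
}
```

The command list given to run() (tagger binary, model name, options) has to
be adapted to the local installation; the sketch only appends the input file
as the last argument.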
If you want to train your own tagger, either with qtag or with another toolkit
(e.g. the Stanford POS tagger, which is available at
http://nlp.stanford.edu/software/tagger.shtml ),
you will want to make sure that you
1. use a large corpus, e.g. Negra or TiGer (the qtag page says that it uses
25k tokens of training data; Negra has 400k tokens and TiGer probably has
around 1M).
2. use a large lexicon. This is especially important for the NE/NN
distinction, which is hard to make from surface forms alone (German
capitalizes common nouns as well as names).
If you can, take a large full-form lexicon (you could try to use the lexicon
data from the WCDG parser, freely available at
http://nats-www.informatik.uni-hamburg.de/view/CDG/DownloadPage ,
or any other that you are able to get your hands on).
You should also try to get most of the information you have in your gazetteer
into the tagger lexicon, but you need to be careful with ambiguous names
(e.g. Sonntag/NN and Sonntag/NE, Sommer/NN,NE or Bush/NN,NE in English).
A large lexicon also helps if you use a pre-trained tagger like tnt,
where you can add more lexical entries.
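
A rough sketch of the gazetteer merge, again in Java and with made-up names
(LexiconMerger, merge, ambiguous). It assumes the lexicon is held in memory
as a map from word form to its set of tags and that gazetteer names are
single tokens; reading and writing an actual tagger lexicon file is left
out, since that format depends on the tagger.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper for adding gazetteer names to a tagger lexicon. */
final class LexiconMerger {

    /** Returns a copy of the lexicon in which every gazetteer name also has the NE tag. */
    static Map<String, Set<String>> merge(Map<String, Set<String>> lexicon,
                                          Collection<String> gazetteerNames) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : lexicon.entrySet()) {
            merged.put(entry.getKey(), new TreeSet<>(entry.getValue()));
        }
        for (String name : gazetteerNames) {
            // Keep existing readings (e.g. NN) and add NE next to them.
            merged.computeIfAbsent(name, key -> new TreeSet<>()).add("NE");
        }
        return merged;
    }

    /** Gazetteer names that have another reading besides NE and deserve a manual look. */
    static Set<String> ambiguous(Map<String, Set<String>> merged,
                                 Collection<String> gazetteerNames) {
        Set<String> result = new TreeSet<>();
        for (String name : gazetteerNames) {
            Set<String> tags = merged.get(name);
            if (tags != null && tags.contains("NE") && tags.size() > 1) {
                result.add(name);
            }
        }
        return result;
    }
}
```

With Sonntag/NN already in the lexicon and Sonntag in the gazetteer, merge()
keeps both readings and ambiguous() reports Sonntag, so such names can be
checked by hand instead of silently becoming NE-only.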

Cheers,
Yannick Versley
> Now, here my question arises
> 1. Do you know of any good POS tagger for German language, best Java-based?
> (I need only the NE-tagged tokens.)
> 2. I used tnt, but that one is based on perl/C, and it is not easy to
> integrate into my java framework.
> 3. I also used qtag. But it comes only with a, for my task too small data
> base (lexicon and matrix).
>
> So, is there any POS tagger out there that is easy to use and up for the
> task?
>
> Cheers & thx for listening in, yours
> Mike Sonntag

-- 
Yannick Versley
Seminar für Sprachwissenschaft, Abt. Computerlinguistik
Wilhelmstr. 19, 72074 Tübingen
Tel.: (07071) 29 77352

