Re: [Corpora-List] POS Tagger for German / Java

From: Yannick Versley (versley@sfs.uni-tuebingen.de)
Date: Wed Jan 10 2007 - 09:58:11 MET


    Hi,

    > I am currently working on a system for toponym recognition in German
    > natural-language (web-based) text documents for my master's thesis.
    > The system uses a POS tagger to extract good NE candidates for a
    > gazetteer.

    Based on my experience (also with a system for toponym resolution, though not
    in Java), I think the easiest approach is to use tnt (or any other existing
    POS tagger) as an external program: write the input to a file, run tnt over
    it, and read back tnt's output.
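
A minimal Java sketch of this write/run/read-back round trip follows. Everything
specific in it is an assumption rather than something stated above: the class
and method names are made up for illustration, the tagger command is supplied
by the caller, the tagger is assumed to take the input file as its last argument
and to print one "token <whitespace> tag" pair per line on standard output, and
lines starting with "%%" are assumed to be comments. Check your tagger's
documentation for the real invocation and file encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper that calls an external POS tagger through temporary files. */
class ExternalTagger {

    /** A token together with the tag the external tagger assigned to it. */
    static final class TaggedToken {
        final String token;
        final String tag;
        TaggedToken(String token, String tag) { this.token = token; this.tag = tag; }
    }

    /**
     * Parses tagger output. ASSUMPTION: one "token<whitespace>tag" pair per line;
     * empty lines and lines starting with "%%" are skipped.
     */
    static List<TaggedToken> parseOutput(List<String> lines) {
        List<TaggedToken> result = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("%%")) continue;
            String[] fields = trimmed.split("\\s+");
            if (fields.length >= 2) result.add(new TaggedToken(fields[0], fields[1]));
        }
        return result;
    }

    /**
     * Writes one token per line to a temp file, runs the tagger on it, and reads
     * the tagged output back. ASSUMPTION: the input path goes last on the command
     * line and the result is written to stdout. UTF-8 is used here; adjust the
     * charset if your tagger expects something else (e.g. ISO-8859-1).
     */
    static List<TaggedToken> tag(List<String> tokens, List<String> command)
            throws IOException, InterruptedException {
        Path in = Files.createTempFile("tagger-in", ".txt");
        Path out = Files.createTempFile("tagger-out", ".txt");
        try {
            Files.write(in, tokens, StandardCharsets.UTF_8);
            List<String> fullCommand = new ArrayList<>(command);
            fullCommand.add(in.toString());
            Process process = new ProcessBuilder(fullCommand)
                    .redirectOutput(out.toFile())                 // capture stdout in a file
                    .redirectError(ProcessBuilder.Redirect.INHERIT) // show tagger diagnostics
                    .start();
            int exit = process.waitFor();
            if (exit != 0) throw new IOException("tagger exited with status " + exit);
            return parseOutput(Files.readAllLines(out, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(in);
            Files.deleteIfExists(out);
        }
    }
}
```

If only the NE-tagged tokens are needed, the list returned by tag() can simply be
filtered on the tag field afterwards.
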
    If you want to train your own tagger, either with qtag or with another toolkit
    (e.g. the Stanford POS tagger, which is available at
    http://nlp.stanford.edu/software/tagger.shtml ),
    make sure that you
    1. use a large corpus, e.g. Negra or TiGer (the qtag page says that it uses
    25k tokens of training data, whereas Negra has 400k tokens and TiGer probably
    has around 1M).
    2. use a large lexicon. This is especially important for the NE/NN
    distinction, which is hard to make from surface forms alone (German
    capitalizes all nouns, so capitalization does not help).
    If you can, take a large full-form lexicon (you could try to use the lexicon
    data from the WCDG parser, freely available at
    http://nats-www.informatik.uni-hamburg.de/view/CDG/DownloadPage ,
    or any other that you are able to get your hands on).
    You should also try to get most of the information you have in your gazetteer
    into the tagger lexicon, but you need to be careful with ambiguous names
    (e.g. Sonntag/NN and Sonntag/NE, Sommer/NN,NE or Bush/NN,NE in English).
    A large lexicon also helps if you use a pre-trained tagger such as tnt,
    where you can add more lexical entries.
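
One possible way to handle the ambiguous names when merging the gazetteer into
the tagger lexicon is sketched below in Java. It is an illustration under
assumptions, not any particular tagger's lexicon format: the lexicon is modelled
as an in-memory map from word form to its set of possible tags, the class and
method names are invented, and gazetteer names are treated as single tokens.
The idea is to add NE as an additional tag instead of overwriting an existing
NN reading, so that the tagger can still decide from context.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper that merges gazetteer names into a word-to-tags lexicon. */
class LexiconMerger {

    /**
     * Returns a copy of the lexicon in which every gazetteer name also carries
     * the tag NE. Tags that were already present (e.g. NN) are kept, so a form
     * like "Sonntag" ends up with both readings.
     */
    static Map<String, Set<String>> merge(Map<String, Set<String>> lexicon,
                                          Iterable<String> gazetteer) {
        Map<String, Set<String>> merged = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : lexicon.entrySet()) {
            merged.put(entry.getKey(), new TreeSet<>(entry.getValue()));
        }
        for (String name : gazetteer) {
            merged.computeIfAbsent(name, key -> new TreeSet<>()).add("NE");
        }
        return merged;
    }

    /** Lists the gazetteer names that have another reading besides NE after merging. */
    static Set<String> ambiguousNames(Map<String, Set<String>> merged,
                                      Iterable<String> gazetteer) {
        Set<String> ambiguous = new TreeSet<>();
        for (String name : gazetteer) {
            Set<String> tags = merged.get(name);
            if (tags != null && tags.size() > 1) ambiguous.add(name);
        }
        return ambiguous;
    }
}
```

The names reported by ambiguousNames() are the ones worth checking by hand
before the merged lexicon is written out in whatever format the tagger expects.
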

    Cheers,
    Yannick Versley
    > Now, here is my question:
    > 1. Do you know of any good POS tagger for German, preferably Java-based?
    > (I need only the NE-tagged tokens.)
    > 2. I used tnt, but that one is based on Perl/C, and it is not easy to
    > integrate into my Java framework.
    > 3. I also used qtag, but the data (lexicon and matrix) it comes with is
    > too small for my task.
    >
    > So, is there any POS tagger out there that is easy to use and up for the
    > task?
    >
    > Cheers & thx for listening in, yours
    > Mike Sonntag

    -- 
    Yannick Versley
    Seminar für Sprachwissenschaft, Abt. Computerlinguistik
    Wilhelmstr. 19, 72074 Tübingen
    Tel.: (07071) 29 77352
    


