Re: [Corpora-List] Phrase extraction

From: Diana Maynard (d.maynard@dcs.shef.ac.uk)
Date: Tue Oct 25 2005 - 12:03:59 MET DST


    Hi Helge
    I am sure there are some Norwegian taggers out there somewhere, but I don't
    know if any of them are free.

    If you don't have a suitable training corpus and don't want to create one
    manually, then, depending on how ambiguous the language is with respect to
    POS and how accurate you need your results to be, you might be able to
    generate a rough-and-ready POS tagger using just a monolingual (or
    bilingual) online Norwegian dictionary and a tagger such as the Brill
    tagger. I've done this for various languages by simply replacing the
    tagger's lexicon with a lexicon of the target language (using a few scripts
    to reformat it to match the Brill one) and using the default ruleset for
    the language closest to your target (in terms of family and behaviour).
    Then just run the tagger as usual on your corpus. You won't get perfect
    results, but you might get something good enough for your purposes,
    depending on what you ultimately want to do.
    I've generated a Hindi tagger with around 70% accuracy in this way (using GATE
    and the Hepple tagger, which is like the Brill tagger) with nothing more than
    a basic Hindi-English bilingual dictionary. I've done the same for Western
    languages and got much better results.
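
    As a rough illustration of the "few scripts to reformat it" step, here is
    a minimal Python sketch. It assumes the dictionary has been exported as
    tab-separated "headword<TAB>part-of-speech" lines, and that the target
    lexicon has one word per line followed by its possible tags, most likely
    first, as in the Brill lexicon. The POS labels, the tag mapping and the
    function names are hypothetical and would need adapting to your actual
    dictionary and tagset.

```python
# Sketch only: the input format, POS labels and tag mapping are assumptions.
# Hypothetical mapping from dictionary POS labels to Penn Treebank-style tags.
POS_MAP = {"noun": "NN", "verb": "VB", "adj": "JJ", "adv": "RB"}


def parse_dictionary(lines):
    """Yield (headword, pos_label) pairs from tab-separated dictionary lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) >= 2:
            yield parts[0], parts[1]


def build_lexicon(entries, pos_map=POS_MAP):
    """Collect the tags seen for each word, in dictionary order.

    A dictionary carries no frequency information, so the first tag listed
    is simply the first one encountered, not necessarily the most likely.
    """
    lexicon = {}
    for word, pos in entries:
        tag = pos_map.get(pos.strip().lower())
        if tag is None:
            continue  # POS label we have no mapping for
        tags = lexicon.setdefault(word.strip(), [])
        if tag not in tags:
            tags.append(tag)
    return lexicon


def format_lexicon(lexicon):
    """Return lexicon lines of the form 'word TAG1 TAG2 ...'."""
    return [f"{word} {' '.join(tags)}" for word, tags in sorted(lexicon.items())]


if __name__ == "__main__":
    sample = ["hus\tnoun", "lys\tnoun", "lys\tadj", "løpe\tverb", "og\tconj"]
    for entry in format_lexicon(build_lexicon(parse_dictionary(sample))):
        print(entry)
```

    Words whose POS label has no mapping (like "og" above) are dropped, so the
    tagger would fall back on its unknown-word handling for them; extending
    the mapping to closed-class words is probably worth the effort.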

    See http://www.dcs.shef.ac.uk/~diana/publications.html for a paper which
    discusses using this technique to adapt an English NE system to the
    Cebuano language:

    D. Maynard, V. Tablan, K. Bontcheva, H. Cunningham and Y. Wilks.
    Rapid customisation of an Information Extraction system for surprise
    languages. Special issue of ACM Transactions on Asian Language Information
    Processing: Rapid Development of Language Capabilities: The Surprise
    Languages, 2003.

    Of course there are lots of other ways, most of which will probably be more
    time-consuming but will get you better results.

    Regards
    Diana

    Helge Thomas Karset Hellerud wrote:
    > Hello,
    >
    > PoS (Part of Speech) tagging is often used to extract phrases from text
    > (like Noun Phrases). But that approach assumes you have a PoS tagger
    > available. My document collection is in Norwegian, but I don't have a
    > Norwegian tagger.
    >
    > 1) Is there a way to create a simple PoS tagger to recognize verbs,
    > nouns and adjectives (in Norwegian)?
    >
    > 2) If not, does anyone have other approaches to extract phrases (like a
    > statistical approach?)
    >
    > Thanks in advance.
    >
    > Helge
    >


