Re: [Corpora-List] Phrase extraction

From: Diana Maynard (d.maynard@dcs.shef.ac.uk)
Date: Wed Oct 26 2005 - 10:29:37 MET DST

    Apologies to those who noticed the broken link - I accidentally reset the
    permissions - it should be fixed now!
    I should emphasise that the solutions proposed in this paper were very ad hoc
    - more a sneaky way of getting results fast rather than a "nice" solution! But
    useful as a means to an end.
    Diana

    Anna Feldman wrote:
    > Dear Diana,
    >
    > I'm very interested in the kind of work you are doing, but
    > unfortunately, the link to your publications page is broken. Could you
    > please check?
    >
    > Thanks,
    >
    > Anna Feldman
    >
    >
    >
    > On Tue, 25 Oct 2005, Diana Maynard wrote:
    >
    >> Hi Helge
    >> I am sure there are some Norwegian taggers out there somewhere, but I
    >> don't know if any of them are free.
    >>
    >> If you don't have a suitable training corpus, and don't want to create
    >> one manually, then, depending on how ambiguous the language in question
    >> is with respect to
    >> POS, and how accurate you need your results, you might be able to
    >> generate a rough and ready POS tagger using just a monolingual (or
    >> bilingual) online Norwegian dictionary and a tagger such as the Brill
    >> tagger. I've done this for various languages by simply replacing the
    >> tagger's lexicon with a lexicon of the target language (using a few
    >> scripts to reformat it appropriately to match the Brill one) and using
    >> the default ruleset for the closest language to your target (in terms
    >> of family and behaviour). Then just run the tagger as usual on your
    >> corpus. You won't get perfect results but you might get something good
    >> enough for your purposes, depending what you want to do ultimately.
    >> I've generated a Hindi tagger with around 70% accuracy in this way
    >> (using GATE and the Hepple tagger, which is like the Brill tagger)
    >> with nothing more than a basic Hindi-English bilingual dictionary.
    >> I've done the same for Western languages and got much better results.
    >>
    >> See http://www.dcs.shef.ac.uk/~diana/publications.html
    >> for a paper which discusses using this technique to adapt an English
    >> NE system to the Cebuano language.
    >>
    >> D. Maynard, V. Tablan, K. Bontcheva, H. Cunningham and Y. Wilks.
    >> Rapid customisation of an Information Extraction system for surprise
    >> languages. Special issue of ACM Transactions on Asian Language
    >> Information Processing: Rapid Development of Language Capabilities:
    >> The Surprise Languages, 2003.
    >>
    >> Of course there are lots of other ways, most of which will probably be
    >> more time-consuming but will get you better results.
    >>
    >> Regards
    >> Diana
    >>
    >>
    >>
    >> Helge Thomas Karset Hellerud wrote:
    >>
    >>> Hello,
    >>>
    >>> PoS (Part of Speech) tagging is often used to extract phrases from text
    >>> (like Noun Phrases). But that approach assumes you have a PoS tagger
    >>> available. My document collection is in Norwegian, but I don't have a
    >>> Norwegian tagger.
    >>>
    >>> 1) Is there a way to create a simple PoS tagger to recognize verbs,
    >>> nouns and adjectives (in Norwegian)?
    >>>
    >>> 2) If not, does anyone have other approaches to extract phrases (like a
    >>> statistical approach?)
    >>>
    >>> Thanks in advance.
    >>>
    >>> Helge
    >>>
    >>
    >>
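
The lexicon-replacement step described in the quoted message (reformatting a target-language dictionary so a Brill-style tagger can load it) might look roughly like the sketch below. It is not taken from the thread or the cited paper. It assumes the dictionary has already been reduced to (headword, part-of-speech label) pairs, that the tagger expects one line per word of the form "word TAG1 TAG2 ...", and that the labels can be mapped onto the tagset used by the borrowed ruleset. The POS_TO_TAG table, the function names and the sample entries are all hypothetical.

```python
from collections import OrderedDict

# Hypothetical mapping from dictionary part-of-speech labels to the tagset
# of the borrowed ruleset (Penn Treebank-style tags are assumed here).
POS_TO_TAG = {
    "noun": "NN",
    "verb": "VB",
    "adj": "JJ",
    "adv": "RB",
}


def build_lexicon(entries):
    """Collect the tags seen for each headword, in first-seen order.

    entries: iterable of (headword, pos_label) pairs from the dictionary.
    Labels with no mapping and multi-word headwords are skipped, because
    the assumed lexicon format holds one whitespace-free word per line.
    """
    lexicon = OrderedDict()
    for word, pos in entries:
        word = word.strip()
        tag = POS_TO_TAG.get(pos.strip().lower())
        if tag is None or not word or " " in word:
            continue
        tags = lexicon.setdefault(word, [])
        if tag not in tags:
            tags.append(tag)
    return lexicon


def format_lexicon(lexicon):
    """Render the lexicon as 'word TAG1 TAG2 ...' lines (assumed format)."""
    return "\n".join(
        "{} {}".format(word, " ".join(tags)) for word, tags in lexicon.items()
    )


# Made-up sample entries standing in for a real Norwegian dictionary.
entries = [
    ("hus", "noun"),
    ("løpe", "verb"),
    ("rød", "adj"),
    ("løp", "noun"),
    ("løp", "verb"),
    ("og", "conj"),        # no mapping in POS_TO_TAG, so it is dropped
    ("i dag", "adv"),      # multi-word headword, so it is dropped
]

lexicon = build_lexicon(entries)
lexicon_text = format_lexicon(lexicon)
print(lexicon_text)
```

The order of tags on each line simply follows the order in which the dictionary lists them. A tagger that treats the first tag as the most likely one would benefit from a better ordering, but a plain dictionary gives no frequency information to base that on.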