Re: [Corpora-List] Phrase extraction

From: Diana Maynard (d.maynard@dcs.shef.ac.uk)
Date: Wed Oct 26 2005 - 10:29:37 MET DST

Next message: Lou Burnard: "Re: [Corpora-List] Wordsmith Collocation-EQUO"

Previous message: Linda Bawcom: "[Corpora-List] Wordsmith Collocation-EQUO"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Apologies to those who noticed the broken link - I accidentally reset the
permissions - it should be fixed now!
I should emphasise that the solutions proposed in this paper were very ad hoc
- more a sneaky way of getting results fast rather than a "nice" solution! But
useful as a means to an end.
Diana

Anna Feldman wrote:
> Dear Diana,
>
> I'm very interested in the kind of work you are doing, but
> unfortunately, the link to your publications page is broken. Could you
> please check?
>
> Thanks,
>
> Anna Feldman
>
>
>
> On Tue, 25 Oct 2005, Diana Maynard wrote:
>
>> Hi Helge
>> I am sure there are some Norwegian taggers out there somewhere, but I
>> don't know if any of them are free.
>>
>> If you don't have a suitable training corpus, and don't want to create
>> one manually, then
>> depending how ambiguous the language in question is with respect to
>> POS, and how accurate you need your results, you might be able to
>> generate a rough and ready POS tagger using just a monolingual (or
>> bilingual) online Norwegian dictionary and a tagger such as the Brill
>> tagger. I've done this for various languages by simply replacing the
>> tagger's lexicon with a lexicon of the target language (using a few
>> scripts to reformat it appropriately to match the Brill one) and using
>> the default ruleset for the closest language to your target (in terms
>> of family and behaviour). Then just run the tagger as usual on your
>> corpus. You won't get perfect results but you might get something good
>> enough for your purposes, depending what you want to do ultimately.
>> I've generated a Hindi tagger with around 70% accuracy in this way
>> (using GATE and the Hepple tagger, which is like the Brill tagger)
>> with nothing more than a basic Hindi-English bilingual dictionary.
>> I've done the same for Western languages and got much better results.
>>
>> See http://www.dcs.shef.ac.uk/~diana/publications.html
>> for a paper which discusses using this technique to adapt an English
>> NE system to the Cebuano language.
>>
>> D. Maynard and V. Tablan and K. Bontcheva and H. Cunningham and Y. Wilks.
>> Rapid customisation of an Information Extraction system for surprise
>> languages.
>> Special issue of ACM Transactions on Asian Language Information
>> Processing: Rapid Development of Language Capabilities: The Surprise
>> Languages,
>> 2003.
>>
>> Of course there are lots of other ways, most of which will probably be
>> more time-consuming but will get you better results.
>>
>> Regards
>> Diana
>>
>>
>>
>> Helge Thomas Karset Hellerud wrote:
>>
>>> Hello,
>>>
>>> PoS (Part of Speech) tagging is often used to extract phrases from text
>>> (like Noun Phrases). But that approach assumes you have a PoS tagger
>>> available. My document collection is in Norwegian, but I don't have a
>>> Norwegian tagger.
>>>
>>> 1) Is there a way to create a simple PoS tagger to recognize verbs,
>>> nouns and adjectives (in Norwegian)?
>>>
>>> 2) If not, does anyone have other approaches to extract phrases (like a
>>> statistical approach?)
>>>
>>> Thanks in advance.
>>>
>>> Helge
>>>
>>
>>
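
Editorial sketch of the lexicon-reformatting step described in the 25 Oct message above (replacing the tagger's lexicon with one built from a dictionary). It is not the scripts used in the paper. Assumptions, all hypothetical: the dictionary is a tab-separated file of headword, part-of-speech label and gloss; the labels in POS_MAP are invented for illustration; the target is a Brill-style lexicon with one line per word, in the form "word TAG1 TAG2 ...", with Penn-style tags.

```python
from collections import OrderedDict

# Hypothetical mapping from dictionary part-of-speech labels to the tags
# expected by the borrowed ruleset (Penn-style tags are assumed here).
POS_MAP = {"subst": "NN", "verb": "VB", "adj": "JJ", "adv": "RB"}


def build_lexicon(entries):
    """Collect the tags seen for each headword, in first-seen order.

    `entries` is an iterable of lines of the assumed form
    "headword<TAB>pos-label<TAB>gloss".
    """
    lexicon = OrderedDict()
    for line in entries:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed entries
        word = fields[0].strip()
        tag = POS_MAP.get(fields[1].strip().lower())
        if tag is None:
            continue  # label not covered by the mapping
        tags = lexicon.setdefault(word, [])
        if tag not in tags:
            tags.append(tag)
    return lexicon


def format_lexicon(lexicon):
    """Render each entry as one 'word TAG1 TAG2 ...' line."""
    return [" ".join([word] + tags) for word, tags in lexicon.items()]


if __name__ == "__main__":
    sample = [
        "hus\tsubst\thouse",
        "løpe\tverb\trun",
        "stor\tadj\tbig",
        "lys\tsubst\tlight",
        "lys\tadj\tbright",
        "og\tkonj\tand",  # label not in POS_MAP, so it is skipped
    ]
    for entry in format_lexicon(build_lexicon(sample)):
        print(entry)
```

A dictionary gives no tag frequencies, so the order of tags for an ambiguous word such as "lys" simply follows the order of the dictionary entries. That is one reason to expect only rough accuracy from this shortcut.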

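Editorial sketch of the use case in Helge's original question (extracting noun phrases once some PoS tags are available, however rough). The pattern used here, zero or more adjectives followed by one or more nouns, is an assumption chosen for illustration, as are the Penn-style tag prefixes "JJ" and "NN"; a real Norwegian grammar would need a richer pattern.

```python
def extract_noun_phrases(tagged):
    """Return maximal runs matching: adjectives* nouns+.

    `tagged` is a list of (word, tag) pairs, for example the output of
    a tagger. Tags starting with "JJ" count as adjectives and tags
    starting with "NN" count as nouns (an assumption about the tagset).
    """
    phrases = []
    current = []       # words of the candidate phrase being built
    has_noun = False   # True once the candidate contains a noun
    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag.startswith("JJ") and not has_noun:
            current.append(word)
        else:
            # The candidate ends here; keep it only if it has a noun.
            if has_noun:
                phrases.append(" ".join(current))
            # An adjective following a noun starts a new candidate.
            current = [word] if tag.startswith("JJ") else []
            has_noun = False
    if has_noun:
        phrases.append(" ".join(current))
    return phrases


if __name__ == "__main__":
    sentence = [("et", "DT"), ("stort", "JJ"), ("hus", "NN"),
                ("står", "VB"), ("ved", "IN"), ("sjøen", "NN")]
    print(extract_noun_phrases(sentence))
```

Because the pattern only looks at tags, tagging errors from a rough tagger pass straight through to the extracted phrases, so the output is best treated as a candidate list to be filtered.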
