Re: [Corpora-List] frequent meanings of a word

From: Saif Mohammad (uvgotsaif@gmail.com)
Date: Fri Mar 17 2006 - 20:49:16 MET

  • Next message: Santos Diana: "RE: [Corpora-List] Re: [Corpora-list] Incidence of MWEs"

    Hi Mimi,

    While manually determining the intended senses (and thereby sense
    dominance) may be more accurate, it is time-intensive. I would like to
    bring to your attention the following automatic methods that determine
    word sense dominance from unannotated text:

    (1) "Finding predominant senses in untagged text" McCarthy, D.,
    Koeling, R., Weeds, J. and Carroll, J. In Proceedings of the 42nd
    Annual Meeting of the Association for Computational Linguistics. 2004,
    Barcelona, Spain. pp 280-287.

    (2) "Determining Word Sense Dominance Using a Thesaurus", Saif
    Mohammad and Graeme Hirst, to appear in Proceedings of the 11th
    Conference of the European Chapter of the Association for
    Computational Linguistics (EACL-2006), April 2006, Trento, Italy.

    I am not sure what sense inventory you are using, but note that
    both of these approaches are somewhat tied to specific sense
    inventories. The McCarthy et al. method combines distributional and
    WordNet-based semantic measures of similarity to determine sense
    dominance, and so relies on WordNet. The second approach (proposed
    by Graeme Hirst and me) relies on a set of ambiguous words that
    together unambiguously represent a sense; we therefore use a
    published thesaurus as the sense inventory (its categories roughly
    correspond to coarse senses). Both approaches can also be used to
    obtain domain-specific sense dominance.

    If you are using WordNet as the sense inventory, and all you need
    is the domain-free predominant sense for each word, or a rough
    ranking by frequency, then that can be obtained directly from
    WordNet itself: the senses of a word are listed in order of their
    dominance in the SemCor corpus. Note, however, that senses not
    found in SemCor are listed in an arbitrary order, and that SemCor
    is relatively small (about 250,000 words).
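
    For illustration, here is a minimal sketch of reading that ranking
    programmatically, assuming the NLTK interface to WordNet (the
    ordering itself is a property of the WordNet database, not of
    NLTK); the word "bank" is just an example:

    ```python
    import nltk

    # Fetch the WordNet data if it is not already installed.
    nltk.download("wordnet", quiet=True)
    from nltk.corpus import wordnet as wn

    # NLTK returns a word's synsets in WordNet's stored order, which is
    # decreasing SemCor frequency, so the first synset is the
    # predominant sense.
    for synset in wn.synsets("bank", pos=wn.NOUN):
        # lemma.count() is the SemCor tag count for this word/sense
        # pair; 0 means the sense was never tagged in SemCor.
        count = sum(l.count() for l in synset.lemmas()
                    if l.name() == "bank")
        print(synset.name(), count)
    ```

    The zero counts one sees for the later senses are exactly the
    "not found in SemCor" cases mentioned above.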

    Good luck,
    -Saif

    On 3/17/06, Ziwei Huang <aexzh1@nottingham.ac.uk> wrote:
    > Hello, I have a methodological question that needs your kind help:
    >
    > I need to look at the frequently used meanings of about 200 different words in the BNC corpus, and wonder whether there is a quick/easy way to do that?
    >
    > My intended way is to randomly select 500 instances of a word from the whole corpus, then
    > look at 50 of those instances (say, the 1st, 11th, 21st ... 491st) and list the meanings (and their frequencies) of the word; then move to the next 50 instances (the 2nd, 12th ... 492nd) to see whether any new meanings have come up; if so, move on to the next 50 instances, and repeat until no more new meanings emerge.
    >
    > Can someone kindly tell me whether this approach is acceptable for describing the frequently used meanings of a word (or the 'default' meanings of a word in actual use), and whether there is any reference/source for this (or any other easy and quick) methodology?
    >
    > Many thanks!
    >
    > Mimi
    >
    >
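
    For what it is worth, the batched-inspection scheme Mimi describes
    can be sketched in a few lines of Python. Here `label_sense` stands
    in for the manual sense judgement, and all names are illustrative
    rather than any standard API:

    ```python
    import random

    def saturation_sample(instances, label_sense,
                          n_sample=500, n_batches=10, seed=0):
        """Inspect batches of instances until a batch adds no new senses.

        `instances` is a list of concordance lines for the target word;
        `label_sense` maps an instance to a sense label (in practice a
        manual judgement).
        """
        rng = random.Random(seed)
        pool = rng.sample(instances, min(n_sample, len(instances)))
        seen = set()
        for k in range(n_batches):
            # Batch k inspects the (k+1)st, (k+11)th, (k+21)st, ...
            # sampled instances, i.e. every 10th starting at offset k.
            new = {label_sense(pool[i])
                   for i in range(k, len(pool), n_batches)}
            if seen and new <= seen:   # batch yielded no new senses
                break
            seen |= new
        return seen

    # Toy demonstration with three artificial "senses".
    toy = ["sense%d" % (i % 3) for i in range(500)]
    print(saturation_sample(toy, lambda x: x))
    ```

    Note that this stopping rule says nothing about rare senses the
    sample happens to miss, which is one reason the automatic methods
    cited above may be worth considering alongside it.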

    --
    Saif Mohammad
    University of Toronto
    http://www.cs.toronto.edu/~smm
    



    This archive was generated by hypermail 2b29 : Fri Mar 17 2006 - 20:49:10 MET