RE: [Corpora-List] SemCor: extrapolating Brown data to larger corpora

From: Adam Kilgarriff (adam@lexmasterclass.com)
Date: Tue Feb 14 2006 - 07:25:53 MET


    Mark,

    I'd be very skeptical of any such extrapolation. The senses that happen to
    come up when the numbers are so small (usually single figures) are just
    arbitrary, and don't sustain extrapolation, even before we agitate about the
    match between SemCor and big-corpus text type.

    And we should assume everything is Zipfian. I've been puzzling over the
    implications of this for years and have done some modeling: see "How
    dominant is the commonest sense of a word" at
    http://lexmasterclass.com/people/Publications/2004-K-TSD-CommonestSense.pdf

    (In: Text, Speech, Dialogue 2004. Lecture Notes in Artificial Intelligence
    Vol. 3206. Sojka, Kopecek and Pala, Eds. Springer Verlag: 103-112.)

    Diana McCarthy and colleagues explore the issue in their ACL paper,
    "Finding Predominant Word Senses in Untagged Text" (best paper award,
    ACL 2004, Barcelona). The premise of their work is that you're better
    off establishing what domain you are in, and assigning all instances
    of a word to the sense associated with that domain, than trying to do
    local-context-based WSD.

    Of course, everything depends on how similar the two corpora are. Let's
    make that the big research question for the new half-decade!

     Regards,

      Adam

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Mark Davies
    Sent: 13 February 2006 23:06
    To: corpora@hd.uib.no
    Subject: [Corpora-List] SemCor: extrapolating Brown data to larger corpora

    A graduate student here is working with SemCor
    (http://multisemcor.itc.it/semcor.php), and she's looking at how well
    the sense distributions from the Brown-based SemCor corpus might carry
    over to a larger corpus, like the BNC.

    For example, [crack] as a verb has 17 tokens in SemCor, distributed
    among the seven different WordNet senses as follows (if I'm reading the
    cntlist files from SemCor 1.6 correctly):

    WordNet sense   Tokens
    -------------   ------
          1            5
          2            4
          3            2
          4            2
          5            2
          6            1
          7            1
    -------------   ------
        TOTAL          17
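
    (To make the smallness of these numbers concrete, here is a rough
    back-of-the-envelope sketch, entirely my own illustration: 95% Wilson
    score intervals for each sense proportion, given only 17 tokens. The
    intervals overlap heavily, so the observed ranking is fragile.)

        # Wilson score intervals for the [crack] sense proportions;
        # a rough illustration of how little 17 tokens pin down.
        from math import sqrt

        def wilson_interval(count, n, z=1.96):
            # 95% Wilson score interval for a binomial proportion.
            p = count / n
            denom = 1 + z * z / n
            centre = (p + z * z / (2 * n)) / denom
            half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
            return centre - half, centre + half

        counts = {1: 5, 2: 4, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1}  # [crack] in SemCor
        n = sum(counts.values())
        for sense, c in counts.items():
            lo, hi = wilson_interval(c, n)
            print(f"sense {sense}: {c}/{n}  95% CI [{lo:.2f}, {hi:.2f}]")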

    The question is whether in a 100 million word corpus, we would get more
    or less the same distribution. For example, might Senses 6-7
    (hypothetically) be the most common, even though they each only occur
    once in the Brown/SemCor corpus?
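
    One way to put that question to the test, once a sample of BNC tokens
    had been hand-tagged, would be a goodness-of-fit test. With expected
    counts this small the usual chi-square approximation is shaky, so a
    Monte Carlo version seems safer. A sketch (the BNC counts below are
    invented for illustration):

        # Monte Carlo goodness-of-fit: are the (hypothetical) BNC sense
        # counts consistent with the SemCor proportions? Note this
        # treats the 17-token SemCor estimates as fixed probabilities,
        # ignoring their own sampling error.
        import random

        def chi_sq(observed, expected):
            return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

        def monte_carlo_gof(observed, probs, n_sim=10000):
            # P-value for observed counts under a multinomial with given probs.
            n = sum(observed)
            expected = [p * n for p in probs]
            stat = chi_sq(observed, expected)
            extreme = 0
            for _ in range(n_sim):
                draws = random.choices(range(len(probs)), weights=probs, k=n)
                sim = [draws.count(k) for k in range(len(probs))]
                if chi_sq(sim, expected) >= stat:
                    extreme += 1
            return extreme / n_sim

        semcor = [5, 4, 2, 2, 2, 1, 1]             # [crack] counts in SemCor
        probs = [c / sum(semcor) for c in semcor]  # treated as the "model"
        bnc_sample = [20, 30, 5, 10, 15, 12, 8]    # invented counts, 100 BNC tokens
        print("Monte Carlo p-value:", monte_carlo_gof(bnc_sample, probs))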

    Has anyone attempted to compare the results of SemCor with a
    randomly-selected subset of tokens from a much larger corpus, such as
    the BNC -- even for just a small subset of words (particularly verbs)?
    Also, are there any statistical tests that might be used to see whether
    we have a sufficiently robust sample for a given word to do WSD with
    SemCor? (It's obviously a function of frequency -- you'd probably get
    more reliable results with a high-frequency word like [break] than with
    a lower-frequency word like [smear].)
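
    (As a rough quantification of that frequency point -- the token counts
    below are invented -- the margin of error on a sense proportion shrinks
    roughly as 1/sqrt(n), so a well-attested word supports much tighter
    estimates than a rare one.)

        # Normal-approximation 95% margin of error for a sense proportion,
        # at two invented token counts.
        from math import sqrt

        def margin(p, n, z=1.96):
            return z * sqrt(p * (1 - p) / n)

        for word, n_tokens in [("break", 400), ("smear", 8)]:  # invented counts
            print(f"[{word}]: n={n_tokens}, "
                  f"margin on a 50% sense: +/-{margin(0.5, n_tokens):.2f}")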

    Also, we're not really looking for basic articles on WSD (or literature
    on Senseval, etc), but rather just the issue at hand -- the
    extrapolatability (??) of SemCor to a larger corpus.

    Sorry if this is an FAQ-like question. If so, simple references to
    existing literature would be appreciated.

    Thanks,

    Mark Davies

    =================================================

    Mark Davies
    Assoc. Prof., Linguistics
    Brigham Young University
    (phone) 801-422-9168 / (fax) 801-422-0906

    http://davies-linguistics.byu.edu

    ** Corpus design and use // Linguistic databases **
    ** Historical linguistics // Language variation **
    ** English, Spanish, and Portuguese **

    =================================================


