[Corpora-List] SemCor: extrapolating Brown data to larger corpora

From: Mark Davies (Mark_Davies@byu.edu)
Date: Tue Feb 14 2006 - 00:06:15 MET


    A graduate student here is working with SemCor
    (http://multisemcor.itc.it/semcor.php), and she's looking at how well
    the sense data from the Brown-based SemCor corpus compares with that
    of a larger corpus, like the BNC.

    For example, [crack] as a verb has 17 tokens in SemCor, distributed
    among the seven different WordNet senses as follows (if I'm reading the
    cntlst files from SemCor 1.6 correctly):

    WordNet sense   Tokens
    -------------   ------
    1               5
    2               4
    3               2
    4               2
    5               2
    6               1
    7               1
    -------------   ------
    TOTAL           17

    The question is whether in a 100 million word corpus, we would get more
    or less the same distribution. For example, might Senses 6-7
    (hypothetically) be the most common, even though they each only occur
    once in the Brown/SemCor corpus?
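
    To make that worry concrete, here is a rough sketch (not an
    established procedure) that puts an approximate 95% Wilson score
    interval around each sense's share of the 17 [crack] tokens listed
    above. The function name wilson_interval and the choice of the Wilson
    method are assumptions made for illustration. It treats the 17 tokens
    as independent random draws and handles each sense separately, both
    of which are simplifications for corpus data.

```python
import math

# Sense counts for [crack] (verb) as listed in the table above.
SEMCOR_CRACK = {1: 5, 2: 4, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1}


def wilson_interval(k, n, z=1.96):
    """Approximate Wilson score interval for a proportion k/n.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


n_tokens = sum(SEMCOR_CRACK.values())
intervals = {
    sense: wilson_interval(count, n_tokens)
    for sense, count in SEMCOR_CRACK.items()
}

if __name__ == "__main__":
    for sense, (lo, hi) in intervals.items():
        share = SEMCOR_CRACK[sense] / n_tokens
        print(f"sense {sense}: observed {share:.2f}, "
              f"approx. 95% interval {lo:.2f} to {hi:.2f}")
```

    With only 17 tokens, the interval for Sense 6 overlaps the interval
    for Sense 1, so on this rough measure the SemCor counts alone do not
    fix the ranking of the senses.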

    Has anyone attempted to compare the results of SemCor with a
    randomly-selected subset of tokens from a much larger corpus, such as
    the BNC -- even for just a small subset of words (particularly verbs)?
    Also, are there any statistical tests that might be used to see whether
    we have a sufficiently robust sample for a given word for WSD with
    SemCor? (It's obviously a function of frequency - you'd probably get
    more reliable results with a high-frequency word like [break] than
    with a lower-frequency word like [smear].)
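
    As one possible illustration of the kind of test being asked about,
    here is a minimal sketch of a Monte Carlo permutation test on the
    Pearson chi-square statistic, comparing the SemCor sense counts for a
    word with sense counts from a hand-tagged random sample of a larger
    corpus. This is an assumption about what might suit, not a
    recommendation from the literature. The BNC counts below are invented
    placeholders (no such tagged sample is described here), and the
    function names are hypothetical. A permutation test is used because
    the expected cell counts are too small for the usual chi-square
    approximation.

```python
import random

# SemCor counts for [crack] (verb), senses 1..7, from the table above.
SEMCOR_CRACK = [5, 4, 2, 2, 2, 1, 1]
# INVENTED placeholder: counts a hand-tagged BNC sample might yield.
PLACEHOLDER_BNC = [20, 30, 10, 10, 10, 10, 10]


def chi_square(a, b):
    """Pearson chi-square statistic for a 2 x k table of sense counts."""
    n_a, n_b = sum(a), sum(b)
    total = n_a + n_b
    stat = 0.0
    for x, y in zip(a, b):
        col = x + y
        if col == 0:
            continue  # sense unseen in both samples contributes nothing
        for obs, row in ((x, n_a), (y, n_b)):
            expected = row * col / total
            stat += (obs - expected) ** 2 / expected
    return stat


def permutation_p_value(a, b, rounds=10000, seed=0):
    """Estimate how often a random re-split of the pooled tokens gives a
    chi-square statistic at least as large as the observed one."""
    rng = random.Random(seed)
    k = len(a)
    pooled = [s for s in range(k) for _ in range(a[s] + b[s])]
    observed = chi_square(a, b)
    n_a = sum(a)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        perm_a = [0] * k
        for s in pooled[:n_a]:
            perm_a[s] += 1
        perm_b = [a[s] + b[s] - perm_a[s] for s in range(k)]
        if chi_square(perm_a, perm_b) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (rounds + 1)


if __name__ == "__main__":
    p = permutation_p_value(SEMCOR_CRACK, PLACEHOLDER_BNC)
    print(f"estimated p-value: {p:.3f}")
```

    A large p-value would only mean the two samples are not detectably
    different at this sample size. With 17 tokens the test has little
    power, which is itself one way of saying the SemCor sample may be too
    small for a given word.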

    Also, we're not really looking for basic articles on WSD (or literature
    on Senseval, etc), but rather just the issue at hand -- the
    extrapolatability (??) of SemCor to a larger corpus.

    Sorry if this is an FAQ-like question. If so, simple references to
    existing literature would be appreciated.

    Thanks,

    Mark Davies

    =================================================

    Mark Davies
    Assoc. Prof., Linguistics
    Brigham Young University
    (phone) 801-422-9168 / (fax) 801-422-0906

    http://davies-linguistics.byu.edu

    ** Corpus design and use // Linguistic databases **
    ** Historical linguistics // Language variation **
    ** English, Spanish, and Portuguese **

    =================================================
