RE: [Corpora-List] SemCor: extrapolating Brown data to larger corpora

From: Ramesh Krishnamurthy (r.krishnamurthy@aston.ac.uk)
Date: Tue Feb 14 2006 - 12:25:46 MET


    Hi Mark,
    There's also the problem of comparing
    USA-1962 (Brown) written data with UK-1994 (BNC) written and spoken data,
    collected according to different design criteria...

    Best
    Ramesh

    At 06:25 14/02/2006, you wrote:
    >Mark,
    >
    >I'd be very skeptical of any such extrapolation. The senses that happen to
    >come up when the numbers are so small (usually single figures) are just
    >arbitrary, and don't sustain extrapolation, even before we agitate about the
    >match between SEMCOR and big-corpus text type.
    >
    >And we should assume everything is Zipfian. I've been puzzling over the
    >implications of this for years and have done some modeling: see "How
    >dominant is the commonest sense of a word" at
    >http://lexmasterclass.com/people/Publications/2004-K-TSD-CommonestSense.pdf
    >
    >(In: Text, Speech, Dialogue 2004. Lecture Notes in Artificial Intelligence
    >Vol. 3206. Sojka, Kopecek and Pala, Eds. Springer Verlag: 103-112.)
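The small-numbers point can be illustrated with a toy simulation. The sketch below is not the model from the paper cited above. It simply assumes a word has seven senses whose true probabilities are proportional to 1/rank (one simple Zipf-like choice), draws independent samples of a given size, and reports how often sense 1 comes out as the unique most frequent sense in the sample. The function names, the exponent, the number of trials and the seed are all assumptions made for illustration.

```python
import random
from collections import Counter


def zipf_weights(num_senses, exponent=1.0):
    """Unnormalised Zipf-like weights: the sense at rank r gets 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, num_senses + 1)]


def top_sense_recovery_rate(sample_size, num_senses=7, trials=2000, seed=0):
    """Fraction of simulated samples in which sense 1 (the true commonest
    sense) is the unique most frequent sense. Ties count as failures."""
    rng = random.Random(seed)
    senses = list(range(1, num_senses + 1))
    weights = zipf_weights(num_senses)
    hits = 0
    for _ in range(trials):
        sample = rng.choices(senses, weights=weights, k=sample_size)
        counts = Counter(sample)
        best = max(counts.values())
        winners = [s for s, c in counts.items() if c == best]
        if winners == [1]:
            hits += 1
    return hits / trials


# 17 matches the number of [crack] verb tokens quoted further down the thread;
# 100 is an arbitrary larger sample size for comparison.
for size in (17, 100):
    print(size, top_sense_recovery_rate(size))
```

Even under this idealised assumption of independent draws, the rate for a 17-token sample falls well short of the rate for larger samples, and real corpus tokens are clustered by document, which makes matters worse rather than better.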
    >
    >Diana McCarthy and colleagues explore the issue in their ACL paper (best
    >paper award, ACL 2004 Barcelona). The premise for their work is that you're
    >better off establishing what domain you are in, and assigning all instances
    >of a word to the sense associated with that domain, than trying to do
    >local-context-based WSD.
    >
    >Of course, everything depends on how similar the two corpora are. Let's
    >make that the big research question for the new half-decade!
    >
    > Regards,
    >
    > Adam
    >
    >-----Original Message-----
    >From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    >Behalf Of Mark Davies
    >Sent: 13 February 2006 23:06
    >To: corpora@hd.uib.no
    >Subject: [Corpora-List] SemCor: extrapolating Brown data to larger corpora
    >
    >A graduate student here is working with SemCor
    >(http://multisemcor.itc.it/semcor.php), and she's looking at how well
    >the data from the Brown-based SemCor corpus might potentially compare
    >with that of a larger corpus, like the BNC.
    >
    >For example, [crack] as a verb has 17 tokens in SemCor, distributed
    >among the seven different WordNet senses as follows (if I'm reading the
    >cntlst files from SemCor 1.6 correctly):
    >
    >WordNet sense   Tokens
    >-------------   ------
    >1               5
    >2               4
    >3               2
    >4               2
    >5               2
    >6               1
    >7               1
    >-------------   ------
    >TOTAL           17
    >
    >The question is whether in a 100 million word corpus, we would get more
    >or less the same distribution. For example, might Senses 6-7
    >(hypothetically) be the most common, even though they each only occur
    >once in the Brown/SemCor corpus?
    >
    >Has anyone attempted to compare the results of SemCor with a
    >randomly-selected subset of tokens from a much larger corpus, such as
    >the BNC -- even for just a small subset of words (particularly verbs)?
    >Also, are there any statistical tests that might be used to see whether
    >we have a sufficiently robust sample for a given word for WSD with SemCor?
    >(It's obviously a function of frequency - you'd probably get more
    >reliable results with a high-frequency word like [break] than a lower
    >frequency word like [smear]).
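One rough way to see how little 17 tokens can support is to put a confidence interval around each sense's share. The sketch below uses the Wilson score interval for a binomial proportion, applied to the [crack] counts quoted above. It is one possible check, not an established test for SemCor: it treats each sense separately and assumes the tokens are an independent random sample, which corpus data generally are not. The function name and the 95% level (z = 1.96) are choices made for illustration.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n.
    z = 1.96 corresponds to an approximate 95% interval."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# SemCor token counts for [crack] as a verb, by WordNet sense (from the table above).
counts = {1: 5, 2: 4, 3: 2, 4: 2, 5: 2, 6: 1, 7: 1}
n = sum(counts.values())

for sense, k in counts.items():
    lo, hi = wilson_interval(k, n)
    print(f"sense {sense}: {k}/{n}  interval {lo:.3f} to {hi:.3f}")
```

With these counts the interval for sense 1 overlaps the interval for sense 6, so on this evidence alone the ordering of the senses is not settled. A higher-frequency word would give narrower intervals, in line with the [break] versus [smear] contrast.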
    >
    >Also, we're not really looking for basic articles on WSD (or literature
    >on Senseval, etc), but rather just the issue at hand -- the
    >extrapolatability (??) of SemCor to a larger corpus.
    >
    >Sorry if this is an FAQ-like question. If so, simple references to
    >existing literature would be appreciated.
    >
    >Thanks,
    >
    >Mark Davies
    >
    >=================================================
    >
    >Mark Davies
    >Assoc. Prof., Linguistics
    >Brigham Young University
    >(phone) 801-422-9168 / (fax) 801-422-0906
    >
    >http://davies-linguistics.byu.edu
    >
    >** Corpus design and use // Linguistic databases **
    >** Historical linguistics // Language variation **
    >** English, Spanish, and Portuguese **
    >
    >=================================================

    Ramesh Krishnamurthy
    Lecturer in English Studies
    School of Languages and Social Sciences
    Aston University, Birmingham B4 7ET, UK
    Tel: +44 (0)121-204-3812
    Fax: +44 (0)121-204-3766
    http://www.aston.ac.uk/lss/english/


