RE: [Corpora-List] Comparing learner frequencies with native frequencies

From: Adam Kilgarriff (adam@lexmasterclass.com)
Date: Tue Mar 07 2006 - 07:38:32 MET


    Dominic, Abdou,

    First, the problem is analogous to collocate-finding, so the same range of
    stats such as MI and log likelihood can be used. As with collocate-finding,
    there's a balance to be struck between pure, mathematical surprisingness,
    and the fact that commoner phenomena are, all else being equal, more
    interesting than rarer ones. Not-too-technical survey available at
    http://www.kilgarriff.co.uk/Publications/1996-K-AISB.pdf
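    To make the trade-off concrete, here is a minimal Python sketch of two such
    scores for a word occurring a times in a corpus of n1 tokens and b times in
    a corpus of n2 tokens: a Dunning-style log-likelihood (in the two-term form
    popularised by Rayson & Garside) and a simple observed/expected log ratio in
    the spirit of MI. The function names are illustrative, not from the survey.

    ```python
    import math

    def log_likelihood(a, b, n1, n2):
        """Dunning-style log-likelihood (G2) for a word with count a in a
        corpus of n1 tokens and count b in a corpus of n2 tokens."""
        e1 = n1 * (a + b) / (n1 + n2)  # expected count in corpus 1
        e2 = n2 * (a + b) / (n1 + n2)  # expected count in corpus 2
        g2 = 0.0
        if a > 0:
            g2 += a * math.log(a / e1)
        if b > 0:
            g2 += b * math.log(b / e2)
        return 2 * g2

    def mi_score(a, b, n1, n2):
        """MI-style score: log2 of observed over expected count in corpus 1.
        Rewards surprisingness but, unlike G2, ignores absolute frequency."""
        e1 = n1 * (a + b) / (n1 + n2)
        return math.log2(a / e1) if a > 0 else float("-inf")
    ```

    The contrast between the two illustrates the balance mentioned above: a
    word with counts (200, 100) and one with counts (2, 1) get the same MI
    score, but the commoner word gets a much higher log-likelihood.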

    Second, "burstiness" - words occurring frequently in particular documents
    but not much otherwise. If you don't make provision for it, many of the
    words thrown up will be 'topic' words used a lot in a few texts but not
    interestingly different between the two text types. There are plenty of
    ways to address it: the survey above describes an "adjusted frequency"
    metric, and I compared Brown and LOB using document counts and the
    non-parametric Mann-Whitney test
    http://www.kilgarriff.co.uk/Publications/1996-K-CHumBergen-Chisq.txt.
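    The point of the Mann-Whitney approach is that it compares per-document
    frequencies rather than pooled counts, so one bursty document cannot
    dominate. A self-contained sketch (toy numbers, plain Python rather than a
    stats package; for real work you would also want the p-value, e.g. from
    scipy.stats.mannwhitneyu):

    ```python
    def mann_whitney_u(xs, ys):
        """U statistic by direct pairwise comparison: count pairs where
        x > y, with ties counting half. O(n*m), fine for illustration."""
        u = 0.0
        for x in xs:
            for y in ys:
                if x > y:
                    u += 1.0
                elif x == y:
                    u += 0.5
        return u

    # per-document relative frequencies of one word in each corpus (toy data)
    docs_a = [0.0, 0.0, 0.0010, 0.0020, 0.0, 0.0015]
    docs_b = [0.0030, 0.0040, 0.0035, 0.0020, 0.0050, 0.0045]
    u = mann_whitney_u(docs_a, docs_b)
    # u near 0 or near len(docs_a)*len(docs_b) indicates the word's
    # document-level frequencies differ systematically between the corpora
    ```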

    Where the docs are all different lengths, it's trickier; an elegant general
    solution is given by

    P. Savicky, J. Hlavacova. Measures of Word Commonness. Journal of
    Quantitative Linguistics, Vol. 9, No. 3, 2002, pp. 215-231.

    They (1) divide the corpus into same-length "pseudodocuments", and count the
    document frequency of the term in each pseudodoc; (2) to avoid problems
    caused by the arbitrary cuts between docs, they consider all possible start-
    and end-points for the pseudodocs, and average. We're implementing the
    approach for text-type comparison in the Sketch Engine
    http://www.sketchengine.co.uk (and would be interested to use your data as a
    test set).
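    A brute-force sketch of that averaging idea (this is not their closed-form
    formula, and the handling of the partial pseudodoc at the corpus edge is my
    simplification, not theirs):

    ```python
    def avg_pseudodoc_frequency(positions, chunk_len):
        """For each of the chunk_len possible start offsets, cut the corpus
        into pseudodocuments of chunk_len tokens, count how many contain the
        word at least once, and average over offsets.
        positions: 0-based token positions of the word's occurrences."""
        total = 0
        for offset in range(chunk_len):
            # map each occurrence to the index of the pseudodoc it falls in
            docs = {(p - offset) // chunk_len for p in positions}
            total += len(docs)
        return total / chunk_len

    # a bursty word and an evenly spread word, both occurring 4 times
    # in the same corpus:
    bursty = avg_pseudodoc_frequency([0, 1, 2, 3], chunk_len=10)
    spread = avg_pseudodoc_frequency([0, 5, 10, 15], chunk_len=10)
    # the bursty word scores lower despite identical raw frequency
    ```

    The averaging over offsets is what removes the arbitrariness of where the
    cuts fall; without it, two adjacent occurrences could land in one
    pseudodoc or two depending purely on the chosen start point.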

    Third, "other differences between the subcorpora": unless the two corpora
    are very well matched in all ways but the text-type distinction you are
    interested in, what very often happens is that the stats identify some
    different dimension of difference between the corpora and that aspect swamps
    out the one you wanted to find. LOB/Brown was a nice test set because the
    corpora were carefully set up to be matched. Even so, non-linguistic US vs UK
    differences like cricket vs baseball were nicely thrown up by the stats!

    All the best,

    Adam

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Dominic Glennon
    Sent: 06 March 2006 11:09
    To: corpora@lists.uib.no
    Subject: [Corpora-List] Comparing learner frequencies with native
    frequencies

    Dear corporistas,

    I'm trying to compare word frequencies in our native speaker corpus and
    our learner corpus. Having normalised the frequencies in both corpora to
    frequencies per 10 million words, a simple subtraction still heavily skews
    the results towards high-frequency words. I've tried taking the log of
    both normalised frequencies before subtracting to get around the Zipfian
    nature of word frequency distribution - this gives better results, but is
    it well-motivated? I'd be grateful for any help you could give me, or any
    pointers to previous work done in this area. Many thanks,

    Dom

    Dominic Glennon
    Systems Manager
    Cambridge University Press
    01223 325595

    Search the web's favourite learner dictionaries for free at Cambridge
    Dictionaries Online:
    <http://dictionary.cambridge.org>
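    For reference, the normalise-then-log-difference approach Dom describes can
    be sketched as below (the function name and parameters are illustrative).
    One observation that bears on whether it is well-motivated: once you take
    logs, the per-10-million constant cancels, so the score is simply the log
    of the frequency ratio between the two corpora.

    ```python
    import math

    def log_freq_diff(count_a, size_a, count_b, size_b, per=10_000_000):
        """Normalise both counts to frequency per `per` tokens, then take
        the difference of logs. Equivalent to log(freq_a / freq_b), so the
        choice of `per` has no effect on the result."""
        fa = count_a * per / size_a
        fb = count_b * per / size_b
        return math.log(fa) - math.log(fb)
    ```

    Note this is undefined when either count is zero, which is exactly where
    the significance-based measures discussed above (log-likelihood,
    Mann-Whitney) earn their keep.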



    This archive was generated by hypermail 2b29 : Tue Mar 07 2006 - 07:37:54 MET