RE: [Corpora-List] Comparing learner frequencies with native frequencies

From: Rayson, Paul (rayson@exchange.lancs.ac.uk)
Date: Tue Mar 07 2006 - 11:20:12 MET


    Hi Dominic,

    To add a few more pointers to what Adam and Bayan have already posted:
    I'd recommend the log-likelihood measure. There is arguably a case for
    using Fisher's exact test when dealing with two corpora of very
    different sizes, or when comparing very low-frequency words, but the
    practical significance of those comparisons (i.e. what conclusions you
    can draw anyway) should be weighed over their statistical significance.
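
    As a rough illustration, such a test is easy to run in Python; a
    minimal sketch with hypothetical counts, assuming scipy is available
    (the 2x2 table is this word vs. all other words, in each corpus):

        from scipy.stats import fisher_exact

        # Hypothetical low-frequency word: 2 hits in a 1M-word learner
        # corpus vs. 35 hits in a 10M-word native corpus.
        table = [[2, 35],
                 [1_000_000 - 2, 10_000_000 - 35]]
        _, p = fisher_exact(table)  # two-sided by default
        print(f"Fisher's exact p = {p:.4f}")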

    As Adam says, there are a variety of statistics that can be used. For a
    comparison of chi-squared and log-likelihood see:

    Rayson, P., Berridge, D. and Francis, B. (2004). Extending the Cochran
    rule for the comparison of word frequencies between corpora. In Volume
    II of Purnelle, G., Fairon, C. and Dister, A. (eds.) Le poids des mots:
    Proceedings of the 7th International Conference on Statistical Analysis
    of Textual Data (JADT 2004), Louvain-la-Neuve, Belgium, March 10-12,
    2004. Presses universitaires de Louvain, pp. 926-936.
    http://www.comp.lancs.ac.uk/computing/users/paul/publications/rbf04_jadt.pdf

    and for a description of the method used in Wmatrix, see:

    Rayson, P. and Garside, R. (2000). Comparing corpora using frequency
    profiling. In Proceedings of the Workshop on Comparing Corpora, held in
    conjunction with the 38th Annual Meeting of the Association for
    Computational Linguistics (ACL 2000), 1-8 October 2000, Hong Kong,
    pp. 1-6.
    http://www.comp.lancs.ac.uk/computing/users/paul/publications/rg_acl2000.pdf

    and

    http://ucrel.lancs.ac.uk/llwizard.html
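
    The two-corpus calculation described on that page is easy to script
    for quick experiments; a minimal Python sketch (the function name and
    the counts in the example are mine, and hypothetical):

        import math

        def log_likelihood(a, b, c, d):
            # a, b: frequency of the word in corpus 1 and corpus 2
            # c, d: total word counts of corpus 1 and corpus 2
            e1 = c * (a + b) / (c + d)  # expected count in corpus 1
            e2 = d * (a + b) / (c + d)  # expected count in corpus 2
            ll = 0.0
            if a > 0:
                ll += a * math.log(a / e1)
            if b > 0:
                ll += b * math.log(b / e2)
            return 2 * ll

        # 120 hits in a 1M-word learner corpus vs. 300 hits in a
        # 10M-word native corpus: G2 of about 130, well above the
        # 3.84 critical value for significance at p < 0.05.
        print(log_likelihood(120, 300, 1_000_000, 10_000_000))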

    The log-likelihood statistic is also used in Mike Scott's WordSmith
    Tools to find keywords:
    http://www.lexically.net/wordsmith/

    I agree with Adam about taking account of word 'burstiness': either
    you need to incorporate range/dispersion in an adjusted frequency
    measure, or examine the keywords you identify by hand.
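
    One simple way to do the former is to split each corpus into
    equal-sized parts and compute, per word, its range (how many parts it
    occurs in) and a dispersion coefficient such as Juilland's D; a
    minimal Python sketch with hypothetical per-part counts:

        import math
        import statistics

        def juilland_d(subfreqs):
            # subfreqs: the word's frequency in each of n equal-sized
            # corpus parts. D is 1 for a perfectly even spread and 0
            # when all occurrences fall in a single part.
            n = len(subfreqs)
            mean = statistics.mean(subfreqs)
            if mean == 0:
                return 0.0
            cv = statistics.pstdev(subfreqs) / mean
            return 1 - cv / math.sqrt(n - 1)

        print(juilland_d([10, 12, 9, 11, 8]))  # evenly spread: ~0.93
        print(juilland_d([50, 0, 0, 0, 0]))    # bursty word: 0.0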

    Finally, in your case, you need to consider differences in spelling
    (due to learner errors), which will affect any comparison you do - but
    perhaps that is what you are looking to find from such a comparison
    anyway?

    Regards,
    Paul.

    Dr. Paul Rayson
    Director of UCREL
    Computing Department, Infolab21, South Drive, Lancaster University,
    Lancaster, LA1 4WA, UK.
    Web: http://www.comp.lancs.ac.uk/computing/users/paul/
    Tel: +44 1524 510357 Fax: +44 1524 510492

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Dominic Glennon
    Sent: 06 March 2006 11:09
    To: corpora@lists.uib.no
    Subject: [Corpora-List] Comparing learner frequencies with native
    frequencies

    Dear corporistas,

    I'm trying to compare word frequencies in our native speaker corpus
    and our learner corpus. Having normalised the frequencies in both
    corpora to frequencies per 10 million words, a simple subtraction
    still heavily skews the results towards high-frequency words. I've
    tried taking the log of both normalised frequencies before
    subtracting to get around the Zipfian nature of word frequency
    distribution - this gives better results, but is it well-motivated?
    I'd be grateful for any help you could give me, or any pointers to
    previous work done in this area. Many thanks,

    Dom

    Dominic Glennon
    Systems Manager
    Cambridge University Press
    01223 325595

    Search the web's favourite learner dictionaries for free at Cambridge
    Dictionaries Online:
    <http://dictionary.cambridge.org>
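
    A minimal Python sketch of the normalise-then-subtract-logs
    comparison described above, with hypothetical counts. Subtracting
    the logs is equivalent to taking the log of the ratio of the two
    normalised frequencies, so zero counts need smoothing first (the
    0.5 constant below is one common but arbitrary choice):

        import math

        def log_ratio(freq_a, size_a, freq_b, size_b, smoothing=0.5):
            # Normalise both frequencies to a per-10M-word scale, then
            # take the difference of their logs (= log of the ratio).
            # The smoothing constant keeps zero counts finite.
            norm_a = (freq_a + smoothing) / size_a * 10_000_000
            norm_b = (freq_b + smoothing) / size_b * 10_000_000
            return math.log2(norm_a / norm_b)

        # 120 hits in a 1M-word learner corpus vs. 300 hits in a
        # 10M-word native corpus; positive values are learner-heavy.
        print(log_ratio(120, 1_000_000, 300, 10_000_000))  # ~2.0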


