Corpora: corpus of IT

From: Kim Tan (kimmy1003@hotmail.com)
Date: Sat Dec 23 2000 - 05:00:41 MET

    Hi all,

    I know everyone's in the festive mood, but I just had to thank the
    following people for responding personally to my query before 2001...

    Adam Kilgarriff
    Paul Rayson
    Michael Oakes
    John Sinclair
    Jilani Warsi
    Khairul

    Among the possible ways of identifying words that are characteristic of a
    text (given 2 corpora, one specialized and the other general) are the
    non-parametric Mann-Whitney test and the log-likelihood or G-square
    statistic.

    Adam suggested chopping both corpora into same-size chunks, producing a
    word freq. list for each chunk, and then using the Mann-Whitney test to
    find words with the most consistently different frequencies.

    The log-likelihood or G-square can be calculated automatically using Mike
    Scott's WordSmith package (I use Excel with the formulas provided in the
    article "Comparing Corpora using Frequency Profiling" by Rayson and
    Garside). They suggested producing a freq. list for each corpus and
    calculating the log-likelihood statistic for each word in the 2 freq.
    lists. The word with the largest LL has the most significant relative
    freq. difference and is the most indicative (or characteristic) of one
    corpus as compared to the other.

    I'm still surveying the statistical methods, and one of the main problems
    I encountered was matching the words in the two corpora (the IT-specific
    corpus and the general ME corpus) before actually applying the statistical
    measures (so far I've tried log-likelihood). I did the matching manually,
    but there must be an automatic way of going about it. Someone suggested
    using Dbase 3...
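
    As a rough illustration of how the matching and the two statistics could
    be automated, here is a minimal Python sketch. It assumes each corpus has
    already been reduced to a word freq. list, matches the two lists on the
    word itself (a word missing from one list counts as 0), and applies the
    two-corpus log-likelihood formula from the Rayson and Garside article. It
    also computes the Mann-Whitney U statistic for one word's per-chunk counts
    (the statistic only, with no significance test). The function names
    log_likelihood, rank_by_ll and mann_whitney_u and the toy data are
    hypothetical, chosen only for this example.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """LL (G-square) for a word seen a times in corpus 1 and b times in
    corpus 2, where c and d are the total token counts of the two corpora."""
    e1 = c * (a + b) / (c + d)  # expected count in corpus 1
    e2 = d * (a + b) / (c + d)  # expected count in corpus 2
    ll = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def rank_by_ll(spec_freq, gen_freq):
    """Match two freq. lists on the word itself and rank the words by LL."""
    c = sum(spec_freq.values())
    d = sum(gen_freq.values())
    rows = []
    for word in set(spec_freq) | set(gen_freq):  # union = automatic matching
        a = spec_freq.get(word, 0)
        b = gen_freq.get(word, 0)
        overused = a / c > b / d  # True if relatively more frequent in corpus 1
        rows.append((word, a, b, log_likelihood(a, b, c, d), overused))
    return sorted(rows, key=lambda r: r[3], reverse=True)


def mann_whitney_u(xs, ys):
    """U statistic for xs versus ys, using average ranks for tied values."""
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1 = len(xs)
    return rank_sum_x - n1 * (n1 + 1) / 2


# Toy freq. lists standing in for the specialized and the general corpus.
spec_freq = Counter("the server stores the file on the server disk".split())
gen_freq = Counter("the cat sat on the mat and the dog sat on the rug".split())
ranked = rank_by_ll(spec_freq, gen_freq)
for word, a, b, ll, overused in ranked[:3]:
    print(f"{word:10s} {a:3d} {b:3d} {ll:8.3f} {'+' if overused else '-'}")

# Counts of one word in three same-size chunks of each corpus (made up).
u = mann_whitney_u([5, 6, 7], [1, 2, 3])
print("U =", u)
```

    A U value close to 0 or close to the product of the two chunk counts means
    the word's frequency is consistently lower or higher in one corpus. For
    real data an existing routine such as scipy.stats.mannwhitneyu would also
    give a p-value.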

    Can I also draw your attention to the work by Yang Hui-Zhong, whose article
    was published in the journal Literary and Linguistic Computing (I'm still
    trying to locate the article myself). I was told that Yang compared the
    frequency of words across a range of texts and established 2 measures, the
    "peak ratio" (PR) and the "range ratio" (RR). A word with a high PR in
    certain texts and a low RR is almost certain to be a technical term. A high
    RR and low PR indicates a word of general utility, etc. It's worth looking
    into...

    I also find Adam's downloadable technical report "Comparing corpora"
    (Report ITRI-96-08) most useful; it includes a survey of statistical
    approaches...

    Happy Christmas and New Year

    KIM
    Nat. Univ. of Malaysia



