RE: [Corpora-List] Comparing learner frequencies with native frequencies

From: Bayan Shawar (bshawar@yahoo.com)
Date: Tue Mar 07 2006 - 10:29:20 MET


    Dear Dominic,
         Paul Rayson, at Lancaster University, developed the
    Wmatrix tool, which compares two files or corpora at
    three levels: POS, semantic, and lexical. It accepts
    files of unequal size and produces log-likelihood
    scores.

    Rayson, P. (2003). Matrix: A statistical method and
    software tool for linguistic analysis through corpus
    comparison. Ph.D. thesis, Lancaster University.
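
    [For concreteness, the log-likelihood score Wmatrix reports
    is, as I understand it, the G2 statistic described in the
    thesis above; here is a minimal Python sketch with made-up
    counts and corpus sizes:]

        import math

        def log_likelihood(a, b, n1, n2):
            """G2 log-likelihood for one word: a occurrences in a
            corpus of n1 tokens vs. b occurrences in a corpus of
            n2 tokens."""
            e1 = n1 * (a + b) / (n1 + n2)  # expected count, corpus 1
            e2 = n2 * (a + b) / (n1 + n2)  # expected count, corpus 2
            ll = 0.0
            if a > 0:
                ll += a * math.log(a / e1)
            if b > 0:
                ll += b * math.log(b / e2)
            return 2 * ll

        # Made-up example: 150 hits in a 1M-token learner corpus
        # vs. 800 hits in a 10M-token native corpus.
        print(log_likelihood(150, 800, 1_000_000, 10_000_000))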

    I also used this tool in my own research for
    comparison purposes:
    Abu Shawar, B. and Atwell, E. (2005). A chatbot system
    as a tool to animate a corpus. ICAME Journal, Vol. 29,
    pp. 5-23.

    Hopefully this is useful,
    Bayan Abu Shawar

    --- Adam Kilgarriff <adam@lexmasterclass.com> wrote:

    > Dominic, Abdou,
    >
    > First, the problem is analogous to
    > collocate-finding, so the same range of
    > stats such as MI and log likelihood can be used. As
    > with collocate-finding,
    > there's a balance to be struck between pure,
    > mathematical surprisingness,
    > and the fact that commoner phenomena are, all else
    > being equal, more
    > interesting than rarer ones. Not-too-technical
    > survey available at
    >
    > http://www.kilgarriff.co.uk/Publications/1996-K-AISB.pdf
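
    [A matching sketch for the MI-style score: the log of the
    observed count over the count expected if the two corpora
    were pooled. With made-up counts it illustrates the balance
    described above, i.e. a rare word can score higher than a
    common one on much less evidence:]

        import math

        def mi_score(a, b, n1, n2):
            """MI-style score for one word: log2 of its observed
            count in corpus 1 over the count expected if the two
            corpora were pooled."""
            expected = n1 * (a + b) / (n1 + n2)
            return math.log2(a / expected) if a > 0 else float("-inf")

        # Made-up counts: a rare word (5 vs. 1 occurrences) scores
        # higher than a common word (500 vs. 2000) despite far
        # less evidence behind it.
        print(mi_score(5, 1, 1_000_000, 10_000_000))
        print(mi_score(500, 2000, 1_000_000, 10_000_000))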
    >
    >
    > Second, "burstiness" - words occurring frequently in
    > particular documents
    > but not much otherwise. If you don't make provision
    > for it, many of the
    > words thrown up will be 'topic' words used a lot in
    > a few texts but not
    > interestingly different between the two text types.
    > There are plenty of
    > ways to address it; survey above describes an
    > "adjusted frequency" metric, I
    > compared Brown and LOB using document counts and the
    > non-parametric
    > Mann-Whitney test
    >
    > http://www.kilgarriff.co.uk/Publications/1996-K-CHumBergen-Chisq.txt.
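
    [A rough sketch of the document-count / Mann-Whitney idea
    using scipy; the per-document rates below are invented, and
    in practice you would build one such pair of lists per word:]

        from scipy.stats import mannwhitneyu

        # Invented per-document rates (occurrences per 1,000
        # tokens) of one word in each document of the two corpora.
        rates_corpus_1 = [0.0, 1.2, 0.4, 0.0, 3.1, 0.9]
        rates_corpus_2 = [0.0, 0.0, 0.2, 0.1, 0.0, 0.0]

        # Two-sided test: do the per-document distributions differ?
        u_stat, p_value = mannwhitneyu(rates_corpus_1, rates_corpus_2,
                                       alternative="two-sided")
        print(u_stat, p_value)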
    >
    >
    > Where the docs are all different lengths, it's
    > trickier; an elegant general
    > solution is given by
    >
    > P. Savicky, J. Hlavacova. Measures of Word
    > Commonness. Journal of
    > Quantitative Linguistics, Vol. 9, 2003, No. 3, pp.
    > 215-231.
    >
    > They (1) divide the corpus into same-length
    > "pseudodocuments", and count the
    > document frequency of the term in each pseudodoc;
    > (2) to avoid problems
    > caused by the arbitrary cuts between docs, they
    > consider all possible start-
    > and end-points for the pseudodocs, and average.
    > We're implementing the
    > approach for text-type comparison in the Sketch
    > Engine
    > http://www.sketchengine.co.uk (and would be
    > interested to use your data as a
    > test set).
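
    [My naive reading of the pseudodocument averaging, as a
    sketch: for every possible start offset, cut the token
    stream into equal-length chunks, count the chunks that
    contain the word, and average over the offsets. The chunk
    length is arbitrary and the loop is slow on a large corpus:]

        def averaged_doc_frequency(tokens, word, chunk_len=2000):
            """Average, over all chunk start offsets, of the number
            of equal-length pseudodocuments containing `word`.
            Tokens before the current offset are ignored in that
            pass."""
            total = 0
            for offset in range(chunk_len):
                chunks = [tokens[i:i + chunk_len]
                          for i in range(offset, len(tokens), chunk_len)]
                total += sum(1 for chunk in chunks if word in chunk)
            return total / chunk_len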
    >
    > Third, "other differences between the subcorpora":
    > unless the two corpora
    > are very well matched in all ways but the text-type
    > distinction you are
    > interested in, what very often happens is that the
    > stats identify some
    > different dimension of difference between the
    > corpora and that aspect swamps
    > out the one you wanted to find. LOB/Brown was a
    > nice test set because the
    > corpora are carefully set up to be matched. Even so,
    > non-linguistic US vs UK
    > differences like cricket vs baseball were nicely
    > thrown up by the stats!
    >
    > All the best,
    >
    > Adam
    >
    >
    > -----Original Message-----
    > From: owner-corpora@lists.uib.no
    > [mailto:owner-corpora@lists.uib.no] On
    > Behalf Of Dominic Glennon
    > Sent: 06 March 2006 11:09
    > To: corpora@lists.uib.no
    > Subject: [Corpora-List] Comparing learner
    > frequencies with native
    > frequencies
    >
    > Dear corporistas,
    >
    > I'm trying to compare word frequencies in our native
    > speaker corpus and
    > our learner corpus. Having normalised the
    > frequencies in both corpora to
    > frequencies per 10 million words, a simple
    > subtraction still heavily skews
    > the results towards high-frequency words. I've tried
    > taking the log of
    > both normalised frequencies before subtracting to
    > get around the Zipfian
    > nature of word frequency distribution - this gives
    > better results, but is
    > it well-motivated? I'd be grateful for any help you
    > could give me, or any
    > pointers to previous work done in this area. Many
    > thanks,
    >
    > Dom
    >
    > Dominic Glennon
    > Systems Manager
    > Cambridge University Press
    > 01223 325595
    >
    > Search the web's favourite learner dictionaries for
    > free at Cambridge
    > Dictionaries Online:
    > <http://dictionary.cambridge.org>
    >
    >
    >
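
    [And on Dominic's original question: subtracting the logs of
    the normalised frequencies amounts to taking a log ratio. A
    small sketch; the smoothing constant is my own addition,
    purely to avoid log(0) for words unseen in one corpus:]

        import math

        def log_freq_diff(count_learner, size_learner,
                          count_native, size_native,
                          per=10_000_000, smoothing=0.5):
            """Difference of log normalised frequencies (per 10M
            words); `smoothing` avoids log(0) for unseen words."""
            f_learner = (count_learner + smoothing) * per / size_learner
            f_native = (count_native + smoothing) * per / size_native
            return math.log(f_learner) - math.log(f_native)

        # Made-up counts: 95 hits in a 2M-word learner corpus vs.
        # 420 hits in a 17M-word native corpus.
        print(log_freq_diff(95, 2_000_000, 420, 17_000_000))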

                    


