Re: [Corpora-List] Questions about collocations and collocation extraction tools

From: Serge HEIDEN (Slh@ens-lsh.fr)
Date: Wed Aug 02 2006 - 11:08:19 MET DST

    Nicholas,

    On Tuesday, August 01, 2006 11:15 PM [GMT+1=CET],
    Nicholas Anagnostou <nanagnos@cis.strath.ac.uk> wrote:

    >> I am a student at Strathclyde University, Graduate School of
    >> Informatics, and I am working on a dissertation project titled "Using
    >> collocation frequencies in determining the relative reading
    >> complexity of texts". A core part of my project is extracting
    >> collocations from a corpus, in this case the BNC Baby. I have some
    >> questions regarding collocations and I would be more than grateful
    >> if you could share your expertise.
    >>
    >> 1. I wish to compare software tools that can be used for
    >> collocation extraction. I wanted to include QWICK and TACT but I
    >> haven't been able to locate them on the Internet. Are they publicly
    >> available anymore or not? If not, is there a way to get them?

    I suggest that you have a look at:
    - TAPoRware (http://taporware.mcmaster.ca/), which may have been designed
    as a continuation of TACT (I am not sure of that);
    - http://www.collocations.de/software.html

    >> 2. I've found that the de facto standard for measuring the
    >> statistical association between words, in order to discover
    >> collocations, is the log-likelihood. Do you agree with that? Can the
    >> log-likelihood be used for collocations consisting of more than two
    >> words?

    Again, http://www.collocations.de/ should give you a good starting point for
    a panorama of the whole bestiary of available measures.
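
    To make the pairwise case concrete, here is a minimal sketch of the
    log-likelihood ratio (G2) computed from a 2x2 contingency table of
    bigram counts. The function name and the cell names o11..o22 are my own
    illustrative choices, not taken from any particular tool, and the sketch
    covers word pairs only; it does not answer the question of longer
    collocations.

```python
import math


def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 contingency table.

    Hypothetical cell layout for a candidate pair (w1, w2):
      o11: w1 followed by w2        o12: w1 followed by another word
      o21: another word before w2   o22: neither w1 nor w2
    """
    n = o11 + o12 + o21 + o22
    if n == 0:
        return 0.0
    row_totals = (o11 + o12, o21 + o22)
    col_totals = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            # Expected count under independence of w1 and w2.
            e = row_totals[i] * col_totals[j] / n
            if o > 0:  # a zero cell contributes nothing to the sum
                total += o * math.log(o / e)
    return 2.0 * total
```

    A table whose cells match the independence expectation gives a score of
    zero; the more the observed counts depart from it, the higher the score.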

    >> 3. I need to compile a collocation frequency list as general (not
    >> genre- or sublanguage- specific) as possible. Do you consider the BNC
    >> Baby to be a corpus general enough for this task or do I need to use
    >> another corpus?

    Although the BNC Baby doesn't claim to be representative of the whole
    BNC, it may suffer from the same 'text type' bias analyzed by
    David Lee in his PhD dissertation. The article http://llt.msu.edu/vol5num3/lee/default.html
    should give you an idea of how he analyzes the metadata of the BNC
    texts to discuss the genre, register, text type, domain and style
    representativeness of the BNC. He designed the "BNC Index" to reclassify
    all the BNC texts from a didactic perspective.
    This could help you design a corpus more oriented toward a "representativeness"
    or "genericity/specificity" tradeoff, especially if your "reading complexity"
    analysis also has a didactic goal.
    Finally, I would suggest keeping in mind that corpus compilation is time consuming
    and that your corpus design strategy should include an "available time to
    complete the work" component to constrain your choices, independently of any
    "soundness" principle.

    >> 4. I need to specify frequency thresholds for the collocations (or
    >> the collocation candidates to be more precise). Is f >= 3 considered
    >> to be an adequate cut-off? I know that I have to filter out the
    >> hapax and dis legomena, but from which frequency onwards does a
    >> collocation become statistically significant?

    I would suggest deriving that kind of threshold from the reading complexity
    question itself. Statistical significance is difficult to handle in
    corpus linguistics. If you use it, I would suggest tying it to the
    ultimate question you are asking of the data.

    >> I won't ask if there is a generally acceptable definition of a
    >> collocation, because it would be like sending flame mail to the
    >> list. :) Please forgive any signs of ignorance in the questions, I
    >> am taking my first steps in the field.

    Each application context has its own definition. I would justify this
    by the fact that a "distance" between two "words" is too simple and
    too biased a notion (what is the significance of a distance in linguistics?
    what is a word in linguistics? what is a context for two "words" to meet
    in linguistics?) to grasp any particular linguistic phenomenon. It
    probably grasps a combination of MANY dependent phenomena.
    I would suggest using the definition of collocation given by the field
    your reading complexity measure comes from.

    Best,

        [Serge]

    _____________________________________________________________
    Serge Heiden, slh@ens-lsh.fr, https://weblex.ens-lsh.fr
    ENS-LSH/CNRS - ICAR UMR5191, Institut de Linguistique Française
    15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883


