[Corpora-List] Questions about collocations and collocation extraction tools

From: Nicholas Anagnostou (nanagnos@cis.strath.ac.uk)
Date: Tue Aug 01 2006 - 23:15:20 MET DST

    Dear all,

    I am a student at Strathclyde University, Graduate School of
    Informatics, and I am working on a dissertation project titled "Using
    collocation frequencies in determining the relative reading complexity
    of texts". A core part of my project is extracting collocations from a
    corpus, in this case the BNC Baby. I have some questions regarding
    collocations and I would be more than grateful if you could share your
    expertise.

      1. I wish to compare software tools that can be used for collocation
    extraction. I wanted to include QWICK and TACT but I haven't been able
    to locate them on the Internet. Are they still publicly available? If
    not, is there a way to get them?

      2. I've found that the de facto standard for measuring the statistical
    association between words, in order to discover collocations, is the
    log-likelihood. Do you agree with that? Can the log-likelihood be used
    for collocations consisting of more than two words?

      3. I need to compile a collocation frequency list as general (not
    genre- or sublanguage-specific) as possible. Do you consider the BNC
    Baby to be a corpus general enough for this task or do I need to use
    another corpus?

      4. I need to specify frequency thresholds for the collocations (or the
    collocation candidates to be more precise). Is f >= 3 considered to be
    an adequate cut-off? I know that I have to filter out the hapax and dis
    legomena, but from which frequency onwards does a collocation become
    statistically significant?
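
    To make questions 2 and 4 concrete, the following is a minimal Python
    sketch of the kind of computation being asked about. It assumes
    Dunning's log-likelihood ratio (G2) over a 2x2 contingency table of
    adjacent word pairs, and applies the f >= 3 cut-off to candidates
    before scoring. The function names, the toy token list and the
    restriction to adjacent pairs are illustrative assumptions, not taken
    from QWICK, TACT or any other tool.

```python
import math
from collections import Counter


def log_likelihood(o11, o12, o21, o22):
    """G2 for a 2x2 table of observed counts (assumed formulation).

    o11: w1 followed by w2        o12: w1 followed by another word
    o21: another word before w2   o22: neither w1 first nor w2 second
    """
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a cell with count 0 contributes nothing
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


def score_bigrams(tokens, min_freq=3):
    """Score adjacent word pairs occurring at least min_freq times."""
    first = Counter(tokens[:-1])    # counts of words in first position
    second = Counter(tokens[1:])    # counts of words in second position
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens) - 1             # number of bigram positions
    scores = {}
    for (w1, w2), f in bigrams.items():
        if f < min_freq:            # frequency threshold from question 4
            continue
        o11 = f
        o12 = first[w1] - f
        o21 = second[w2] - f
        o22 = n - o11 - o12 - o21
        scores[(w1, w2)] = log_likelihood(o11, o12, o21, o22)
    return scores


# Toy data, invented for illustration only (not drawn from the BNC Baby).
toy = "strong tea and strong tea or strong tea with weak coffee".split()
for pair, g2 in score_bigrams(toy).items():
    print(pair, round(g2, 2))
```

    In this formulation the score depends on the whole contingency table,
    not on the pair frequency alone, so the frequency cut-off and the
    significance of the score are two separate filters. G2 is usually
    compared against chi-squared critical values with one degree of
    freedom (3.84 for p < 0.05).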

    I won't ask if there is a generally acceptable definition of a
    collocation, because it would be like sending flame mail to the list. :)
    Please forgive any signs of ignorance in the questions; I am taking my
    first steps in the field.

    Thanks in advance and kind regards
    Nicholas Anagnostou