Re: [Corpora-List] question as to MI and t score

From: Serge HEIDEN (slh@ens-lsh.fr)
Date: Tue Dec 20 2005 - 12:31:58 MET

    Sorry to pick up, belatedly, a thread that has gone tepid.

    I agree with Stefan that being able to talk about
    the strength of a collocation doesn't give you much
    insight into what a collocation may be, or at least
    what it may be useful for.
    May I suggest broadening the model you are working with a bit:
    - in the things observed and counted: your collocation
    candidate words seem to be plain lexical items. You know
    that word frequencies vary a lot with the linguistic role the
    words play in texts. It may be useful to place each candidate word
    on a continuum from, say, grammatical items to lexical items.
    Even if all the words are on the lexical side, they may occupy
    interestingly different positions on this virtual axis, and this
    can give you information to better interpret your model.
    Alas, today I don't know of any formal model that takes
    that kind of information into account. Any ideas?
    - in the contexts in which they are counted together: you give
    us no information on the type of texts involved. This, too,
    can drastically change the frequencies and their interpretation.
    Maybe it would be useful to place the actual contexts observed
    on a continuum from, say, a media/genre/style/register type
    of context taken as a whole to a phrase/syntagm type.
    Varying the size of the context may be another way to focus
    on a specific interpretation of a collocation model. You
    can make words meet in windows of varying size, moving
    through the texts, or build contexts from typographic
    heuristics such as "hard" punctuation ('.', '!', etc.).
    For example, in textometric tools, we generally start with small
    contexts (window sizes or word-based n-grams)
    to analyze candidate syntagms. After this, we can reconsider
    what were initially two candidate words as one candidate
    compound word in a more 'discourse'-oriented cooccurrence analysis.
    - in the way you define being together: in the French
    scientific literature, we call collocates things that are
    somewhat in proximity on the syntagmatic axis, and
    cooccurrents things that occur together without anything
    being known about their proximity. You could also take into
    account the orientation of the meetings: X before Y being
    counted, or taken into account, differently from X
    after Y on the syntagmatic axis.
    - finally, in the way different pairs could be compared:
    this takes us back to your initial question, how to compare
    two pairs? May I suggest trying to compare ALL
    pairs together at the same time? This way of doing
    things is what some optimists call 'semantic maps'.
    Today, I have only some very pragmatic proposals
    to offer in that area (see
    http://weblex.ens-lsh.fr/biblio/slh/SergeHeidenCooccurrencesJADT2004Final.htm
    for an example introduction. Sorry, it is in French). There
    are so many different parameters involved in building a specific
    cooccurrence graph that we try to analyze all of it before
    changing a single parameter.
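
To make the parameters listed above a little more concrete, here is a
minimal Python sketch. It is only an illustration under stated
assumptions, not the code of any textometric tool: the function names
(segments, directed_cooccurrences, cooccurrence_graph), the regular
expressions used for tokenizing and for "hard" punctuation, the default
window size and the sample text are all hypothetical choices made for
the example. It cuts a text into contexts at hard punctuation, counts
ordered pairs (X before Y) inside a moving window, and collects all
pairs into one directed graph so that they can be looked at together.

```python
import re
from collections import Counter

# Assumption: '.', '!' and '?' are treated as "hard" punctuation.
HARD_PUNCT = re.compile(r"[.!?]+")


def segments(text):
    """Yield one list of lowercase word tokens per punctuation-delimited context."""
    for chunk in HARD_PUNCT.split(text):
        tokens = re.findall(r"\w+", chunk.lower())
        if tokens:
            yield tokens


def directed_cooccurrences(text, window=2):
    """Count ordered pairs (x, y) where y follows x by at most `window`
    tokens inside the same context; (x, y) and (y, x) are kept apart."""
    counts = Counter()
    for tokens in segments(text):
        for i, x in enumerate(tokens):
            for y in tokens[i + 1 : i + 1 + window]:
                counts[(x, y)] += 1
    return counts


def cooccurrence_graph(counts, min_count=1):
    """Turn pair counts into a directed graph: node -> {successor: count}."""
    graph = {}
    for (x, y), n in counts.items():
        if n >= min_count:
            graph.setdefault(x, {})[y] = n
    return graph


# Hypothetical sample text, chosen to echo the examples in the thread.
text = "They play a role. They fight a battle! A role to play."
counts = directed_cooccurrences(text, window=2)
graph = cooccurrence_graph(counts)
```

Changing `window`, the punctuation pattern or `min_count` gives a
different graph from the same text, which is why it helps to look at
the whole graph before varying one of these parameters.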

    Best,

        [Serge]

    Stefan Evert wrote:
    >> Working out exactly what
    >> upper bounds on this difference one can assume with how much
    >> confidence is almost as difficult as a mathematical problem as
    >> interpreting the differences is as a linguistic problem (what does
    >> it really mean if the difference in collocational strength is at
    >> most "1.7"??).
    >>>
    >>> Imagine you have called up collocation listings for the node word
    >>> lemmas "play" and "fight". In both lists, the association with for
    >>> example the collocates "role" and "battle" has exactly the same
    >>> MI / t score. Can I assume that both collocations, i.e. "play a
    >>> role" and "fight a battle" have the same "collocational strength",
    >>> or is that a wrong assumption?
    >>>
    >>> Thanks,
    >>> Helene

    _____________________________________________________________
    Serge Heiden, slh@ens-lsh.fr, https://weblex.ens-lsh.fr
    ENS-LSH/CNRS - ICAR UMR5191, Institut de Linguistique Française
    15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)632010638



    This archive was generated by hypermail 2b29 : Tue Dec 20 2005 - 13:12:18 MET