Re: [Corpora-List] Incidence of MWEs

From: Gaël Dias (ddg@di.ubi.pt)
Date: Tue Mar 14 2006 - 22:49:40 MET


    Dear David,

    If you are interested in MWEs in parsing, you can read the work by
    J. Nivre and J. Nilsson:

    J. Nivre and J. Nilsson (2004) Multiword Units in Syntactic Parsing.
    Workshop on Methodologies and Evaluation of Multiword Units in
    Real-world Applications (MEMURA Workshop) associated with the 4th
    International Conference on Language Resources and Evaluation. Dias,
    G., Lopes, J.G.L. & Vintar, S. (eds), Lisbon, Portugal. May 25. pp.
    17-24. ISBN: 2951740816. EAN: 0782951740815.

    You can get it at
    http://memura2004.di.ubi.pt/main-memura-proceedings-vInternet.pdf

    If you know French, a very good book on MWEs is:

    G. Gross (1996) Les expressions figées en Français. Ophrys. Paris.

    Best,

    Gaël.

    David Brooks wrote:

    > Adam Kilgarriff wrote:
    >
    >>> I was wondering if anyone has estimated the incidence of multi-word
    >>> expressions in language.
    >>
    >>
    >> Wonderful, enormous, bottomless question!
    >
    >
    > In a fit of ignorance I figured on reaping whatever information was
    > retrievable from the query. However, I should certainly be more specific.
    >
    > I'm interested in the effect of MWEs on parser evaluation. Specifically,
    > I want to describe the problems it poses for grammar induction.
    >
    > I presume that idiosyncratic MWEs are somehow treated differently to
    > compositional MWEs, in that the latter could easily be incorporated into
    > a treebank. Phrasal verbs also seem to be tagged in treebanks, but I'm
    > intrigued as to the treatment of phrases like "kick the bucket", and
    > perhaps more importantly: "at first", "of course", and other MWEs that
    > almost represent "stop-phrases" (as opposed to stop-words). Some of
    > these are syntactically valid, but does that mean they would be
    > annotated (in phrase-structure terms) in a compositional manner, or
    > would the idiomatic reading be preferred?
    >
    >> * are you counting types or tokens? (Exercise: what is the proportion
    >> of multiwords in the mini-corpus comprising the single sentence,
    >> "Apple pie is apple pie.")
    >> * what sublanguages do you include - all, some, none? ("mid off" is a
    >> MWE for anyone who knows cricket but not for anyone who doesn't)
    >> * how much variation (morphological, syntactic, lexical, modifiers)
    >> can there be, with it still being the same MWE (or, an MWE at all)
    >> (Rosamund Moon's example, are "shake in one's shoes", "quake in one's
    >> boots" and "quake in one's Doc Marten's" all the same MWE?)
    >> * is non-compositionality a part of the definition?
    >> * are frequencies or statistics part of the definition? (Theorists
    >> might not want them to be, but without statistics and thresholds, you
    >> won't be able to compute a useful answer, and if you do use them, the
    >> answer you get will depend critically on which statistics and which
    >> thresholds you use, so you had better make principled decisions about
    >> them)
    >
    >
    > In answer to those questions:
    > 1) I'd count tokens;
    > 2) I'd include all sublanguages (since they will presumably be annotated
    > correctly);
    > 3) the notion of variation is presumably intrinsically linked with
    > non-compositionality;
    > 4) non-compositionality is a requirement in my definition;
    > 5) from an inductive standpoint, I assume that statistics are necessary
    > to identify these phrases in a corpus. I further assume that statistics
    > are used in parsing, so should also be used in MWE identification.
    >
    > Cheers,
    > D
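
For readers who want to try the types-versus-tokens exercise quoted above, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not anyone's actual method from this thread: the one-entry lexicon {"apple pie"}, the greedy longest-match segmentation, and the function names `segment` and `mwe_proportions` are all hypothetical choices made for the example.

```python
import re
from fractions import Fraction


def segment(text, lexicon, fold_case=True):
    """Greedy longest-match segmentation into lexical units.

    `lexicon` is a set of lower-cased multiword entries (an assumption of
    this sketch). Returns a list of (unit, is_mwe) pairs.
    """
    words = re.findall(r"[A-Za-z']+", text)
    max_len = max((len(entry.split()) for entry in lexicon), default=1)
    units, i = [], 0
    while i < len(words):
        # Try the longest candidate first, falling back to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            key = candidate.lower()
            if n == 1 or key in lexicon:
                units.append((key if fold_case else candidate, n > 1))
                i += n
                break
    return units


def mwe_proportions(units):
    """Return (share of MWE tokens, share of MWE types) as exact fractions."""
    token_share = Fraction(sum(is_mwe for _, is_mwe in units), len(units))
    types = set(units)
    type_share = Fraction(sum(is_mwe for _, is_mwe in types), len(types))
    return token_share, type_share


sentence = "Apple pie is apple pie."
lexicon = {"apple pie"}  # hypothetical one-entry MWE lexicon

folded = mwe_proportions(segment(sentence, lexicon))
unfolded = mwe_proportions(segment(sentence, lexicon, fold_case=False))
print("case-folded:  tokens", folded[0], "types", folded[1])
print("case-kept:    tokens", unfolded[0], "types", unfolded[1])
```

Under these assumptions the sentence has three lexical units, two of which are MWE tokens, so the token proportion is 2/3. The type proportion is 1/2 if "Apple pie" and "apple pie" are folded together and 2/3 if they are kept apart, which shows how much the answer depends on the counting decisions.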

    -- 
    ---------------------------------------------------------
    Gaël Harry Dias, PhD            | Assistant Professor
    Human Language Technology Group | Vice Chair of the Dept.
    Computer Science Department     | [www.di.ubi.pt/~ddg]
    Beira Interior University       | [ddg@di.ubi.pt]
    6201-001 - Covilhã              | [Tel: +351 275 319 891]
    PORTUGAL                        | [Fax: +351 275 319 899]
    ---------------------------------------------------------
    


