Re: [Corpora-List] Incidence of MWEs

From: David Brooks (D.J.Brooks@cs.bham.ac.uk)
Date: Tue Mar 14 2006 - 14:39:00 MET

    Adam Kilgarriff wrote:
    >>I was wondering if anyone has estimated the incidence of multi-word
    >>expressions in language.
    >
    > Wonderful, enormous, bottomless question!

    In my ignorance I had hoped simply to gather whatever information the
    query turned up. However, I should certainly be more specific.

    I'm interested in the effect of MWEs on parser evaluation. Specifically,
    I want to describe the problems they pose for grammar induction.

    I presume that idiosyncratic MWEs are somehow treated differently to
    compositional MWEs, in that the latter could easily be incorporated into
    a treebank. Phrasal verbs also seem to be tagged in treebanks, but I'm
    intrigued as to the treatment of phrases like "kick the bucket", and
    perhaps more importantly: "at first", "of course", and other MWEs that
    almost represent "stop-phrases" (as opposed to stop-words). Some of
    these are syntactically valid, but does that mean they would be
    annotated (in phrase-structure terms) in a compositional manner, or
    would the idiomatic reading be preferred?

    > * are you counting types or tokens? (Exercise: what is the proportion
    > of multiwords in the mini-corpus comprising the single sentence, "Apple pie
    > is apple pie." )
    > * what sublanguages do you include - all, some, none? ("mid off" is a
    > MWE for anyone who knows cricket but not for anyone who doesn't)
    > * how much variation (morphological, syntactic, lexical, modifiers)
    > can there be, with it still being the same MWE (or, an MWE at all)
    > (Rosamund Moon's example, are "shake in one's shoes", "quake in one's boots"
    > and "quake in one's Doc Marten's" all the same MWE?)
    > * is non-compositionality a part of the definition?
    > * are frequencies or statistics part of the definition? (Theorists
    > might not want them to be, but without statistics and thresholds, you won't
    > be able to compute a useful answer, and if you do use them, the answer you
    > get will depend critically on which statistics and which thresholds you use
    > so you had better make principled decisions about them)

    In answer to those questions:
    1) I'd count tokens;
    2) I'd include all sublanguages (since they will presumably be annotated
    correctly);
    3) the notion of variation is presumably intrinsically linked with
    non-compositionality;
    4) non-compositionality is a requirement in my definition;
    5) from an inductive standpoint, I assume that statistics are necessary
    to identify these phrases in a corpus. I further assume that statistics
    are used in parsing, so they should also be used in MWE identification.
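
    To make the types-versus-tokens point concrete, here is a rough sketch
    of the "Apple pie is apple pie." exercise. It rests on assumptions of my
    own rather than any agreed definition: the MWE lexicon is given in
    advance (here just "apple pie"), text is lowercased and stripped of
    punctuation, matching is greedy longest-match, and each matched MWE
    counts as a single unit alongside the remaining single words. The names
    segment and mwe_incidence are hypothetical.

```python
from collections import Counter


def segment(words, lexicon):
    """Greedy longest-match segmentation into single words and MWEs.

    `lexicon` is an assumed, hand-supplied set of word tuples.
    """
    max_len = max((len(m) for m in lexicon), default=1)
    units = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, down to length 2.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in lexicon:
                units.append(candidate)
                i += n
                break
        else:
            # No MWE starts here, so emit the single word.
            units.append((words[i],))
            i += 1
    return units


def mwe_incidence(words, lexicon):
    """Return (token proportion, type proportion) of MWE units."""
    units = segment(words, lexicon)
    if not units:
        return 0.0, 0.0
    counts = Counter(units)
    mwe_tokens = sum(c for unit, c in counts.items() if len(unit) > 1)
    mwe_types = sum(1 for unit in counts if len(unit) > 1)
    return mwe_tokens / len(units), mwe_types / len(counts)


words = "apple pie is apple pie".split()
lexicon = {("apple", "pie")}  # assumed lexicon for illustration only
token_rate, type_rate = mwe_incidence(words, lexicon)
print(f"tokens: {token_rate:.2f}, types: {type_rate:.2f}")
```

    Under these assumptions the sentence has three units, two of them MWE
    tokens, but only two unit types, one of them an MWE. Counting covered
    words instead of units (four of five) would give yet another figure, so
    the counting scheme needs to be stated alongside any estimate.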

    Cheers,
    D

    -- 
    David Brooks
    http://www.cs.bham.ac.uk/~djb
    


