Re: [Corpora-List] Incidence of MWEs

From: Ken Litkowski (ken@clres.com)
Date: Tue Mar 14 2006 - 20:09:46 MET


    This discussion has the earmarks of the theoretical, speculative, and
    the hyperstatistical. I think a practical and newly available method
    can be used.

    Lexicographers, most notably those at OUP, have compiled lists of what
    they think constitute MWEs. In the electronic XML version of the Oxford
    Dictionary of English (ODE), there is an NLP element that incorporates a
    quite thorough list of all the variants, including placeholders in
    phrases like "give [someone] a hard time". James McCracken, in a fit of
    genius, has created an Online ODE with a rudimentary disambiguation
    of all content (non-boring) words in the dictionary. To do this, James
    first created an index of all the variants, "squeezing" the phrases
    together (e.g., "byandlarge"). As the first step in disambiguating, he
    searches for longest phrases, starting from 5 words and continuing down
    to 2 words, under the assumption that a phrasal reading is preferred to
    a compositional reading. Upon walking through the Perl script that does
    this (in about a half hour for the entire dictionary), my first reaction
    after "wow" was to wonder what proportion of the definitions consists of
    these phrases. I haven't counted this yet, but it is simple, requiring
    just a couple of modifications to the script to make the counts. The
    Perl script also is written in such a way that the same subroutines can
    be applied to free text. This is all available to interested
    researchers who would like to investigate these issues. (And also, it's
    important to say that Adam Kilgarriff was a guiding spirit to James'
    initial forays.)
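
    For readers who want to try the idea on their own text, here is a
    minimal sketch of the longest-phrase-first lookup and the proportion
    count described above. It is written in Python, not Perl, and it is
    not James's script: the function names, the three-entry index, and the
    sample sentence are all hypothetical. It assumes an already tokenized
    text and ignores placeholders such as "[someone]".

```python
def squeeze(words):
    """Join words into one lowercase key without spaces, e.g. 'byandlarge'."""
    return "".join(w.lower() for w in words)


def build_index(phrases):
    """Build a set of squeezed keys from a list of phrase strings."""
    return {squeeze(p.split()) for p in phrases}


def find_phrases(tokens, index, max_len=5, min_len=2):
    """Return (start, length) spans of phrases found in tokens.

    Longer phrases are tried first (5 words down to 2), so a phrasal
    reading wins over any shorter phrase it contains. Tokens already
    claimed by a longer match are not reused.
    """
    taken = [False] * len(tokens)
    matches = []
    for n in range(max_len, min_len - 1, -1):
        for i in range(len(tokens) - n + 1):
            if any(taken[i:i + n]):
                continue
            if squeeze(tokens[i:i + n]) in index:
                matches.append((i, n))
                taken[i:i + n] = [True] * n
    return sorted(matches)


def mwe_token_share(tokens, matches):
    """Proportion of tokens that fall inside a matched phrase."""
    if not tokens:
        return 0.0
    return sum(n for _, n in matches) / len(tokens)


# Hypothetical index and sample text, for illustration only.
index = build_index(["by and large", "a hard nut to crack", "hard nut"])
tokens = "by and large the plan was a hard nut to crack".split()
matches = find_phrases(tokens, index)
share = mwe_token_share(tokens, matches)
```

    In this sample, "hard nut" is not counted separately because the
    five-word phrase has already claimed those tokens. Counting tokens
    rather than phrases is only one possible way to define the proportion.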

    Based strictly on a casual perusal of the resulting Online ODE, I would
    say that a 2% figure is much more likely than a 30% (or even 70%) figure.

            Ken

    David Brooks wrote:

    > Dear Corpora-folk,
    >
    > I was wondering if anyone has estimated the incidence of multi-word
    > expressions in language. I know that empirical estimates are tied to
    > particular corpora, but does anyone have an account of MWEs for
    > particular corpora, so that "ball-park" figures of the proportion of
    > MWEs can be estimated?
    >
    > Better yet, can anyone give me a good reference for the incidence of MWEs?
    >
    > Regards,
    > David

    -- 
    Ken Litkowski                     TEL.: 301-482-0237
    CL Research                       EMAIL: ken@clres.com
    9208 Gue Road
    Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com
    


