RE: [Corpora-List] Incidence of MWEs

From: Amsler, Robert (Robert.Amsler@hq.doe.gov)
Date: Wed Mar 15 2006 - 16:11:21 MET


    I have found published dictionaries' judgments as to what constitutes an
    MWE to be both dated and biased against declaring MWEs to exist. Until I
    actually went through a number of texts to extract MWEs by hand and
    compared the MWEs I found against those listed in dictionaries, I used
    to think that lexicographic coverage was adequate and that the rule
    "if you can predict its meaning from its constituent parts, it
    doesn't need a separate entry" was correct. What I found was that not
    only did the rule not seem to be applied consistently, but MWEs
    appeared to be a much neglected area of lexicography with many more
    undocumented MWEs being used in text than were in the dictionaries. It
    was as though dictionaries reviewed their MWE entries far less often and
    less diligently than they did their isolated word entries.

    There are probably good reasons against dictionary publishers declaring
    MWEs to exist. Namely, MWEs greatly increase the size of a dictionary
    for a small gain in clarity, perhaps only useful to speakers of English
    as a foreign language (and practitioners of computational linguistics,
    information retrieval and artificial intelligence). The "prediction"
    rule used to discount MWEs needing entries seems to raise the question
    of what algorithm can predict these and what that algorithm predicts.
    There is a big difference between believing you are excluding MWEs
    because they are understandable without definitions and having an
    algorithm that can generate the definition you would have written from
    the separate dictionary entries for the component words.

    Take an MWE such as "pencil sharpener". Most dictionaries don't define
    this since according to the prediction rule, it could be assumed to be
    just "a sharpener for pencils". However, that denies the fact that we
    all know pencil sharpeners are a specific category of manufactured
    product, and if you look for a photo of a pencil sharpener it will show
    one of several distinct models. We also know details about how pencil
    sharpeners work. In contrast, things like a "stick sharpener" or a
    "crayon sharpener" are novel creations without long-standing precedent
    (I just checked the web, and, sigh, they both exist, but a "stick
    sharpener" isn't a tool for sharpening sticks, it is a knife sharpener
    whose shape resembles a stick, i.e., a thin cylindrical file).

    A definition of "pencil sharpener" would be something like "an
    electrical, mechanical or manual device with sharpened blades into which
    pencils can be inserted and which, when operated, creates a tapered
    conical pointed tip on the pencil, which initializes or renews its
    ability to be used as a writing implement".

    Here is where I would say computational linguistics has to take its
    leave of lexicography (or at least published lexicographic practice) and
    declare "pencil sharpener" to be a useful and necessary MWE. I would
    even go so far as to say that every MWE for which an explicit definition
    can be written should have an explicit definition, and that ONLY when
    the explicit definitions show no differentiation should they be
    eliminated in favor of entries for the separate word elements. That is,
    REVERSE the "prediction" rule to assume you cannot predict the meaning
    of an MWE until you fail to find anything to say in its definition that
    is not formulaic.
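
    As a rough illustration of that reversed rule, here is a minimal Python
    sketch. Everything in it is an assumption made for illustration: the
    template "a HEAD for MODIFIERs", the tiny stop list, and the function
    names are all hypothetical, and a real system would need a far better
    notion of "formulaic" than bag-of-words overlap. It keeps an MWE entry
    whenever the explicit definition says something the template gloss does
    not.

```python
import re

# Hypothetical, deliberately tiny stop list; an assumption for this sketch.
STOPWORDS = {
    "a", "an", "the", "for", "of", "or", "and", "to", "which", "into",
    "on", "be", "as", "its", "with", "can", "when",
}


def formulaic_gloss(modifier, head):
    """Template gloss for a noun-noun compound, e.g. 'a sharpener for pencils'."""
    return f"a {head} for {modifier}s"


def content_words(text):
    """Lower-cased alphabetic tokens of text, minus the stop list."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def extra_content(modifier, head, explicit_definition):
    """Content words in the explicit definition that the template gloss lacks."""
    gloss_words = content_words(formulaic_gloss(modifier, head))
    return content_words(explicit_definition) - gloss_words


def needs_own_entry(modifier, head, explicit_definition):
    """Reversed rule: keep the MWE unless its definition adds nothing to the gloss."""
    return bool(extra_content(modifier, head, explicit_definition))


PENCIL_SHARPENER = (
    "an electrical, mechanical or manual device with sharpened blades into "
    "which pencils can be inserted and which when operated creates a tapered "
    "conical pointed tip on the pencil"
)

keep_pencil = needs_own_entry("pencil", "sharpener", PENCIL_SHARPENER)
keep_crayon = needs_own_entry("crayon", "sharpener", "a sharpener for crayons")
```

    Under these assumptions the hand-written "pencil sharpener" definition
    is kept, because it contributes words such as "blades" and "conical",
    while a definition that merely restates the template is dropped.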

    I don't believe published dictionaries contain sufficient information
    for a reader to correctly understand the MWEs they fail to explicitly
    list. I don't believe dictionary publishers actually think about MWEs
    consistently or conscientiously.


