RE: [Corpora-List] Incidence of MWEs

From: Adam Kilgarriff (adam@lexmasterclass.com)
Date: Thu Mar 16 2006 - 18:30:11 MET

    Bob Amsler says:

    > I have found published dictionaries' judgments as to what constitutes an MWE
    > to be both dated and biased against declaring MWEs to exist.
    > ...
    > Take an MWE such as "pencil sharpener". Most dictionaries don't ...

    UK dictionaries on my shelf do list "pencil sharpener" (Oxford D of E 98,
    LDOCE 95, Macmillan E D 02). US ones (Random House 1987, M-W online) don't.
    The moral is clear.

    US dictionaries are ***way, way*** behind UK dictionaries in corpus use. UK
    dictionary publishers lead the world in corpus development and use (with NLP
    lagging behind). OUP and Longman were prime movers in developing the BNC,
    and OUP is now on the point of launching its billion-word corpus of English.
    Collins-COBUILD was the great pioneer in the 1980s. Macmillan was the first
    user of my very own word sketches (corpus analysis software).

    That's all English: for German, Langenscheidt have been working with Uli
    Heid's group at Univ Stuttgart to improve MWE coverage in their
    dictionaries.

    There are theoretical limitations to paper dictionaries - they cannot
    usefully convey complex rules to their users. (To do so requires a
    sophisticated metalanguage. Dictionary-user research is conclusive:
    ordinary dictionary users don't read the manual. So there is no point
    offering a sophisticated metalanguage. Worse, it confuses or scares.)

    > I don't believe published dictionaries actually think about MWEs
    > consistently or conscientiously.

    Bob, I hope you don't believe it any longer!

    Adam

    PS - I have just been pointed to a recent and excellent thesis-length
    treatment of the original question:
    Begoña Villada Moirón, "Data-driven identification of fixed expressions and
    their modifiability" http://odur.let.rug.nl/~begona/

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Amsler, Robert
    Sent: 15 March 2006 15:11
    To: Corpora List
    Subject: RE: [Corpora-List] Incidence of MWEs

    I have found published dictionaries' judgments as to what constitutes an
    MWE to be both dated and biased against declaring MWEs to exist. Until I
    actually went through a number of texts to extract MWEs by hand and
    compared the MWEs I found against those listed in dictionaries, I used
    to think that lexicographic coverage was adequate and that the rule
    "if you can predict its meaning from its constituent parts, it
    doesn't need a separate entry" was correct. What I found was that not
    only did the rule not seem to be applied consistently, but MWEs
    appeared to be a much neglected area of lexicography, with many more
    undocumented MWEs being used in text than were in the dictionaries. It
    was as though dictionaries reviewed their MWE entries far less often and
    less diligently than they did their isolated word entries.
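
    As a rough illustration of the comparison described above (not the
    procedure actually used), the check can be sketched as a set difference
    between hand-extracted MWEs and a dictionary's headword list. The
    function name and the sample word lists are hypothetical.

```python
def mwe_coverage(extracted, headwords):
    """Return (coverage ratio, sorted MWEs missing from the dictionary).

    `extracted` is a hand-built list of MWEs found in text; `headwords`
    is the dictionary's headword list. Both are assumed inputs.
    """
    def norm(s):
        # Lowercase and collapse whitespace so trivial variants match.
        return " ".join(s.lower().split())

    found = {norm(m) for m in extracted}
    listed = {norm(h) for h in headwords}
    missing = sorted(found - listed)
    coverage = 1 - len(missing) / len(found) if found else 1.0
    return coverage, missing


# Hypothetical sample data, for illustration only.
coverage, missing = mwe_coverage(
    ["pencil sharpener", "Stick  sharpener", "ice cream"],
    ["ice cream", "pencil"],
)
```

    A low coverage ratio on real text would be one way to quantify the
    neglect described here; exact string matching is a simplifying
    assumption and would miss inflected or hyphenated variants.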

    There are probably good reasons against dictionary publishers declaring
    MWEs to exist. Namely, MWEs greatly increase the size of a dictionary
    for a small gain in clarity, perhaps only useful to speakers of English
    as a foreign language (and practitioners of computational linguistics,
    information retrieval and artificial intelligence). The "prediction"
    rule used to discount MWEs needing entries seems to beg the question of
    what algorithm can predict these and what that algorithm predicts.
    There is a big difference between believing you are excluding MWEs
    because they are understandable without definitions and having an
    algorithm that can generate the definition you would have written from
    the separate dictionary entries for the component words.

    Take an MWE such as "pencil sharpener". Most dictionaries don't define
    this since according to the prediction rule, it could be assumed to be
    just "a sharpener for pencils". However, that denies the fact that we
    all know pencil sharpeners are a specific category of manufactured
    product and if you look for a photo of a pencil sharpener it will have
    one of several distinct models. We also know details about how pencil
    sharpeners work. In contrast, things like a "stick sharpener" or a
    "crayon sharpener" are novel creations without long-standing precedent
    (I just checked the web, and, sigh, they both exist, but a "stick
    sharpener" isn't a tool for sharpening sticks, it is a knife sharpener
    whose shape resembles a stick, i.e., a thin cylindrical file).

    A definition of a pencil sharpener would be something like "an
    electrical, mechanical or manual device with sharpened blades into which
    pencils can be inserted and which when operated creates a tapered
    conical pointed tip on the pencil which initializes or renews its
    ability to be used as a writing implement".

    Here is where I would say computational linguistics has to take its
    leave of lexicography (or at least published lexicographic practice) and
    declare "pencil sharpener" to be a useful and necessary MWE. I would
    even go so far as to say that every MWE for which an explicit definition
    can be written should have an explicit definition, and that ONLY when
    the explicit definitions show no differentiation should they be
    eliminated in favor of entries for the separate word elements. That is,
    REVERSE the "prediction" rule to assume you cannot predict the meaning
    of an MWE until you fail to find anything to say in its definition that
    is not formulaic.
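
    A minimal sketch of the reversed rule, under loud assumptions: the
    "formulaic" definition is taken to be "a HEAD for MODIFIERs", and an
    explicit definition counts as differentiated if it contains any content
    word beyond that gloss. The stopword list, the gloss template and all
    function names are hypothetical simplifications.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "for", "or", "and", "to",
             "which", "that", "is", "with", "into", "on", "be", "can"}


def formulaic_gloss(modifier, head):
    """The gloss the "prediction" rule would assume, e.g. 'a sharpener for pencils'."""
    return f"a {head} for {modifier}s"


def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def needs_own_entry(modifier, head, definition):
    """Keep the MWE unless its definition says nothing beyond the formula."""
    extra = content_words(definition) - content_words(formulaic_gloss(modifier, head))
    return bool(extra)


keep = needs_own_entry(
    "pencil", "sharpener",
    "a device with sharpened blades into which pencils can be inserted",
)
drop = needs_own_entry("pencil", "sharpener", "a sharpener for pencils")
```

    Here the default is to keep the entry; it is dropped only when the
    written definition adds nothing to the formula, which is the direction
    of the burden of proof argued for above.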

    I don't believe published dictionaries contain sufficient information to
    correctly understand the MWEs they fail to explicitly list. I don't
    believe published dictionaries actually think about MWEs consistently or
    conscientiously.


