Re: [Corpora-List] Incidence of MWEs

From: Yorick Wilks (yorick@dcs.shef.ac.uk)
Date: Tue Mar 14 2006 - 15:03:51 MET


    A student of mine, Chun-Yu Kit, who did a thesis at Sheffield about
    six years ago, applied Rissanen's MDL (Minimum Description Length)
    algorithm to English corpora as the first stage in a machine learning
    project to derive grammars. What MDL does is decide which selection
    of English phrases, taken from a large corpus and placed in a phrase
    lexicon, will minimise the length of the whole object (corpus +
    lexicon) taken together. The algorithm is extraordinarily effective
    at selecting, unsupervised and unseeded, a plausible phrase
    inventory for the language, using nothing but co-occurrence in the
    corpus.
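
    To make the corpus + lexicon trade-off concrete, here is a minimal
    sketch in Python. It is not Kit's actual procedure: the greedy
    longest-match segmentation, the flat per-character lexicon cost
    (bits_per_char) and the greedy search over candidate n-grams are
    all simplifying assumptions, and every function name is
    illustrative only.

```python
import math
from collections import Counter


def segment(words, phrases):
    """Greedy left-to-right longest-match segmentation into lexicon units."""
    max_len = max((len(p) for p in phrases), default=1)
    units, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            cand = tuple(words[i:i + n])
            # Single words are always allowed; longer units must be in the lexicon.
            if n == 1 or cand in phrases:
                units.append(cand)
                i += n
                break
    return units


def description_length(words, phrases, bits_per_char=5.0):
    """Two-part cost in bits: encoded corpus + spelled-out lexicon.

    Assumption: each unit token costs -log2(relative frequency) bits, and each
    lexicon entry costs a flat bits_per_char per character plus a terminator.
    """
    units = segment(words, phrases)
    counts = Counter(units)
    total = len(units)
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    lexicon_bits = sum(bits_per_char * (len(" ".join(u)) + 1) for u in counts)
    return corpus_bits + lexicon_bits


def ngrams(words, n):
    """All distinct word n-grams in the corpus, used as candidate phrases."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def greedy_phrase_lexicon(words, candidates):
    """Add the candidate that shrinks the total most; stop when none helps."""
    chosen = set()
    best = description_length(words, chosen)
    while True:
        scored = [(description_length(words, chosen | {c}), c)
                  for c in candidates if c not in chosen]
        if not scored:
            break
        cost, cand = min(scored)
        if cost >= best:
            break
        chosen.add(cand)
        best = cost
    return chosen, best


if __name__ == "__main__":
    # Toy corpus (made up for illustration) with one heavily recurring phrase.
    words = "in spite of rain in spite of wind in spite of snow".split() * 10
    print("words only :", description_length(words, set()))
    print("with phrase:", description_length(words, {("in", "spite", "of")}))
    chosen, best = greedy_phrase_lexicon(words, ngrams(words, 2) | ngrams(words, 3))
    print("selected   :", sorted(chosen), best)
```

    On this toy corpus, storing "in spite of" as a single lexicon
    entry shortens the total, because the saving in the encoded
    corpus outweighs the cost of spelling the phrase out once in the
    lexicon. A phrase that recurs too rarely would not repay its entry.
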
    Yorick Wilks

    On 14 Mar 2006, at 13:30, Chris Butler wrote:

    > Dear David,
    >
    > As Adam Kilgarriff makes clear, the answer depends crucially on
    > exactly what you're looking for, and the decisions you make about
    > what to include. For estimates and discussion, you might like to
    > look at the following:
    >
    > Altenberg, B (1998) On the phraseology of spoken English: the
    > evidence of recurrent word combinations. In A P Cowie (ed.)
    > Phraseology. (Oxford Studies in Lexicography and Lexicology),
    > pp. 101-122. Oxford: Oxford University Press.
    >
    > Biber, D et al (1999) Longman Grammar of Spoken and Written English,
    > pp. 990-1024.
    >
    > Butler, C S (1997) Repeated word combinations in spoken and written
    > text: some implications for Functional Grammar. In C S Butler,
    > J H Connolly, R A Gatward and R M Vismans (eds.) A Fund of Ideas:
    > Recent Developments in Functional Grammar. (Studies in Language and
    > Language Use 31), pp. 60-77. Amsterdam: IFOTT, University of
    > Amsterdam.
    >
    > Wray, A M (2002) Formulaic Language and the Lexicon. Cambridge:
    > Cambridge University Press, especially Chapters 2 and 3.
    >
    > Best wishes,
    >
    > Chris Butler
    > Honorary Professor, Centre for Applied Language Studies,
    > University of Wales Swansea
    >
    > ----- Original Message -----
    > From: "David Brooks" <D.J.Brooks@cs.bham.ac.uk>
    > To: "Corpora List" <corpora@uib.no>
    > Sent: Tuesday, March 14, 2006 12:42 PM
    > Subject: [Corpora-List] Incidence of MWEs
    >
    >
    >
    >> Dear Corpora-folk,
    >>
    >> I was wondering if anyone has estimated the incidence of multi-word
    >> expressions in language. I know that empirical estimates are tied to
    >> particular corpora, but does anyone have an account of MWEs for
    >> particular corpora, so that "ball-park" figures of the proportion of
    >> MWEs can be estimated?
    >>
    >> Better yet, can anyone give me a good reference for the incidence
    >> of MWEs?
    >>
    >> Regards,
    >> David
    >> --
    >> David Brooks
    >> http://www.cs.bham.ac.uk/~djb
    >>


