Re: [Corpora-List] Re: [Corpora-list] Incidence of MWEs

From: Kit Chun Yu (ctckit@cityu.edu.hk)
Date: Sat Mar 18 2006 - 02:39:15 MET


    why not think about this kind of issue from the perspective of
    tokenization for NLP?
    (a very old paper: Webster & Kit, "Tokenization as the initial phase in
    NLP", COLING-92, pp. 1106-1110.)
    a very simple idea: anything that is not to be further decomposed into
    smaller fragments is simply treated as a token.
    what is a token (or atomic text unit, which may have its own internal
    structure) seems to be application-dependent.
    we may have mono-word and multi-word tokens, incl. continuous and
    discontinuous (or noncontiguous) ones (or MWEs).
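
    to make the idea concrete, here is a minimal sketch (not the method of
    the cited paper) that groups words into mono-word and contiguous
    multi-word tokens by greedy longest match. the function name tokenize
    and the small ATOMIC_UNITS lexicon are assumptions made up for
    illustration; in practice the inventory of atomic units would depend on
    the application. discontinuous units are not handled here.

```python
def tokenize(words, atomic_units):
    """Group a list of words into tokens (tuples of words).

    atomic_units is a set of word tuples that should not be decomposed
    further. Any word not covered by a listed unit becomes a mono-word token.
    """
    # Longest multi-word unit we ever need to try (1 if the lexicon is empty).
    max_len = max((len(unit) for unit in atomic_units), default=1)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, shrinking down to one word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            span = tuple(words[i:i + n])
            if n == 1 or span in atomic_units:
                tokens.append(span)
                i += n
                break
    return tokens


# Hypothetical, application-dependent lexicon of multi-word tokens.
ATOMIC_UNITS = {("in", "spite", "of"), ("New", "York")}

if __name__ == "__main__":
    sentence = "He moved to New York in spite of the cost".split()
    for token in tokenize(sentence, ATOMIC_UNITS):
        print(" ".join(token))
```

    greedy longest match is only one possible policy; it is used here
    because it is easy to follow, not because it is the best choice.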
    accordingly, we can have something like this for tagging: <t ..> <w
    ..>... </w> <w..>... </w> ... </t>
    we may need some more sophisticated tagging for discontinuous ones, of
    course.
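
    the nested tagging above can be sketched as follows, using Python's
    standard xml.etree.ElementTree. the function name tag_tokens and the
    wrapping <s> element are assumptions added for illustration, and the
    attributes elided as ".." above are left out rather than guessed. only
    contiguous tokens are covered.

```python
import xml.etree.ElementTree as ET


def tag_tokens(tokens):
    """Serialize tokens as <t> elements, each containing one <w> per word.

    tokens is a list of word tuples, e.g. [("New", "York"), ("is",)].
    """
    # Hypothetical sentence-level wrapper so the output has a single root.
    root = ET.Element("s")
    for words in tokens:
        t = ET.SubElement(root, "t")
        for word in words:
            w = ET.SubElement(t, "w")
            w.text = word
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(tag_tokens([("New", "York"), ("is",), ("big",)]))
```

    a mono-word token simply comes out as a <t> with a single <w> inside,
    so mono-word and multi-word tokens share one representation.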
    just to put in my two cents.
    best,

    Chunyu Kit, PhD
    Assistant Professor in Computational Linguistics

    Dept. of Chinese, Translation & Linguistics
    City University of Hong Kong
    83 Tat Chee Ave., Kowloon

    E-mail:ctckit@cityu.edu.hk
    http://personal.cityu.edu.hk/~ctckit/
    Fax: (+852)2788 8706, 2788 8732
    Tel: (+852)2788 9310 (O), 9380 1738 (M)
         (+86)136 5881 2972 (China Mobile)


