Re: Corpora: non-alphabetic language databases

From: Robert Luk (csrluk@yahoo.com)
Date: Mon Dec 04 2000 - 04:22:08 MET


    Hi all,

    > My understanding is this: order of database entry is
    > not based on any phonetic system, nor on any
    > arrangement of radicals or character components, but
    > on a standard (for Chinese, usually one of Big-5 or
    > GB (Guo-Biao)) which maps each character on to an
    > arbitrary pair of ASCII characters. With the advent
    > of the Unicode standard, a one-to-one mapping is
    > also now possible, but implementations are rare.

    For information, the Big5 internal code is arranged
    into 2 basic subsets. Characters in the first subset
    are arranged according to radicals and number of
    strokes (i.e. the traditional Chinese dictionary
    search technique), and so are those in the second
    subset. However, sorting on the internal code puts
    every first-subset character ahead of every
    second-subset character, whereas the correct
    ordering requires some characters in the first subset
    to be interleaved with characters in the second
    subset.
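
    To make the interleaving problem concrete, here is a minimal
    Python sketch. It assumes Python's built-in "big5" codec and the
    commonly documented two-byte ranges for the two subsets
    (0xA440-0xC67E and 0xC940-0xF9D5); those ranges and the helper
    names big5_code and big5_subset are my assumptions, not part of
    the message above.

```python
# Code ranges commonly documented for the two Big5 hanzi subsets.
# These are assumptions for illustration, not taken from the message.
BIG5_FREQUENT = (0xA440, 0xC67E)
BIG5_LESS_FREQUENT = (0xC940, 0xF9D5)


def big5_code(ch: str) -> int:
    """Return the Big5 internal code of one character as an integer."""
    return int.from_bytes(ch.encode("big5"), "big")


def big5_subset(ch: str) -> int:
    """Return 1 or 2 for the two hanzi subsets, 0 for anything else."""
    code = big5_code(ch)
    if BIG5_FREQUENT[0] <= code <= BIG5_FREQUENT[1]:
        return 1
    if BIG5_LESS_FREQUENT[0] <= code <= BIG5_LESS_FREQUENT[1]:
        return 2
    return 0


common = "中"                      # Big5 code 0xA4A4, first subset
rare = b"\xc9\x40".decode("big5")  # first code point of the second subset

# Sorting on the internal code keeps the two subsets apart: every
# first-subset character comes out before every second-subset one,
# whatever a dictionary ordering would say.
by_code = sorted([rare, common], key=big5_code)
for ch in by_code:
    print(ch, hex(big5_code(ch)), "subset", big5_subset(ch))
```

    A dictionary-style sort therefore needs a separate collation key
    for each character; the raw code alone is not enough.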

    For GB, the first subset is arranged by phonetic
    spelling (pinyin), but the second subset is arranged
    by traditional radicals + stroke numbers.
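
    The same kind of check can be sketched for GB. This assumes
    Python's built-in "gb2312" codec (the EUC-CN byte form) and the
    usual row layout, with lead bytes 0xB0-0xD7 for the first subset
    and 0xD8-0xF7 for the second; the helper name gb_subset is
    hypothetical.

```python
def gb_subset(ch: str) -> int:
    """Return 1 or 2 for the two GB2312 hanzi subsets, 0 otherwise.

    The lead-byte ranges are the usual EUC-CN layout and are stated
    here as an assumption, not taken from the message.
    """
    raw = ch.encode("gb2312")
    if len(raw) != 2:
        return 0
    lead = raw[0]
    if 0xB0 <= lead <= 0xD7:
        return 1
    if 0xD8 <= lead <= 0xF7:
        return 2
    return 0


# Within the first subset, byte order follows the phonetic spelling:
# a (啊) < bei (北) < zhong (中).
first_subset_sorted = sorted(["中", "北", "啊"], key=lambda c: c.encode("gb2312"))
print(first_subset_sorted)

# A second-subset character sorts after all of them, regardless of
# how it is pronounced.
rare_gb = b"\xd8\xa1".decode("gb2312")
print(rare_gb, "subset", gb_subset(rare_gb))
```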

    Typically, the first subset holds the more frequently
    used characters and the second subset the less
    frequently used ones. This was designed in the
    days when PCs were not powerful (late 70s - early 80s).

    There are other subsets of characters in the
    internal code. For example, some internal code space
    is left for user-defined characters, and these
    could be in completely random order.

    There is a paper about ordering and database system
    adaptation for Chinese computing. Please take a look
    at:

    Lu, Q., K. H. Lee and Yen-hui Hung, "DBMS Supporting
    Multiple Codesets and Collations", 1997 International
    Conference on Computer Processing of Oriental
    Languages, Hong Kong, March 30 - April 2, 1997

    There might be more papers in this area that I
    am not aware of.

    As for Unicode, it does solve the character loss
    problem (I think) because there is a code for every
    GB/Big5 character. But the issue that is not addressed
    is how to convert simplified to/from traditional
    Chinese characters. This issue has some significance
    since people who can read simplified Chinese
    characters may not be able to read some traditional
    Chinese characters, and vice versa.
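
    A small Python sketch of both points, assuming the built-in
    "gb2312" and "big5" codecs. The helper name encodable and the
    two-entry mapping table are hypothetical and only illustrative; a
    real converter needs a full table plus context to choose among
    candidates.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if the character exists in the given legacy codeset."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


simplified, traditional = "汉", "漢"

# Unicode has a separate code point for each form, so neither is lost.
print(hex(ord(simplified)), hex(ord(traditional)))

# Each legacy codeset, however, covers only one of the two forms.
print(encodable(simplified, "gb2312"), encodable(simplified, "big5"))
print(encodable(traditional, "gb2312"), encodable(traditional, "big5"))

# Unicode itself says nothing about how the two forms relate, so the
# conversion needs an external table. Illustrative entries only: the
# mapping is not always one-to-one, which is what makes it hard.
SIMPLIFIED_TO_TRADITIONAL = {
    "汉": ["漢"],
    "发": ["發", "髮"],  # one simplified form, two traditional ones
}
print(SIMPLIFIED_TO_TRADITIONAL["发"])
```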

    Best,

    Robert Luk
    Dept. of Computing



