Re: Corpora: non-alphabetic language databases

From: Robert Luk (csrluk@yahoo.com)
Date: Mon Dec 04 2000 - 04:22:08 MET

Maybe in reply to: Avryl2@aol.com: "Corpora: non-alphabetic language databases"

Hi all,

> My understanding is this: order of database entry is
> not based on any phonetic system, nor on any
> arrangement of radicals or character components, but
> on a standard (for Chinese, usually one of Big-5 or
> GB (Guo-Biao)) which maps each character on to an
> arbitrary pair of ASCII characters. With the advent
> of the Unicode standard, a one-to-one mapping is
> also now possible, but implementations are rare.

For information, the Big5 internal code is arranged
into two basic subsets. Characters in the first subset
are arranged according to radicals and number of
strokes (i.e. the traditional Chinese dictionary
search technique), and so are those in the second
subset. However, when sorting is based on the internal
code, the whole first subset comes before the second,
whereas the correct ordering requires some characters
in the first subset to be interleaved with characters
in the second subset.
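
To make the interleaving problem concrete, here is a minimal
sketch in Python. It assumes Python's built-in "big5" codec and
the commonly cited code ranges for the two subsets (0xA440-0xC67E
and 0xC940-0xF9D5); the function names are mine, for illustration
only.

```python
# Sketch only: assumes Python's "big5" codec and the commonly cited
# Big5 code ranges for the two Hanzi subsets.
SUBSET1 = (0xA440, 0xC67E)  # more frequently used characters
SUBSET2 = (0xC940, 0xF9D5)  # less frequently used characters


def big5_code(ch):
    """Return the Big5 internal code of one character as an integer."""
    return int.from_bytes(ch.encode("big5"), "big")


def big5_subset(ch):
    """Return 1 or 2 for the two Hanzi subsets, 0 for anything else."""
    code = big5_code(ch)
    if SUBSET1[0] <= code <= SUBSET1[1]:
        return 1
    if SUBSET2[0] <= code <= SUBSET2[1]:
        return 2
    return 0


# First code of the second subset, obtained by decoding so that no
# particular character has to be assumed here.
rare = bytes([0xC9, 0x40]).decode("big5")

chars = [rare, "中", "一"]
by_code = sorted(chars, key=big5_code)  # plain internal-code order
for ch in by_code:
    print(ch, hex(big5_code(ch)), "subset", big5_subset(ch))
```

Sorting on the internal code puts every second-subset character
after every first-subset one, whatever its radical or stroke
count, so a dictionary-style ordering needs a separate collation
key rather than the raw code.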

For GB, the first subset is arranged by phonetic
spelling (pinyin), but the second subset is arranged by
traditional radicals + stroke numbers.
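
The phonetic ordering of the first GB subset can be seen with a
small sketch. It assumes Python's built-in "gb2312" codec and the
usual GB 2312 row layout (first subset in rows 16-55, second in
rows 56-87); the helper names are hypothetical.

```python
# Sketch only: assumes Python's "gb2312" codec and the usual GB 2312
# row layout (first subset in rows 16-55, second in rows 56-87).


def gb_code(ch):
    """Return the GB internal code of one character as an integer."""
    return int.from_bytes(ch.encode("gb2312"), "big")


def gb_subset(ch):
    """Return 1 or 2 for the two Hanzi subsets, 0 for anything else."""
    raw = ch.encode("gb2312")
    if len(raw) != 2:
        return 0  # single-byte (ASCII) character
    row = raw[0] - 0xA0  # the lead byte carries the row number
    if 16 <= row <= 55:
        return 1
    if 56 <= row <= 87:
        return 2
    return 0


# Pinyin readings: zhong, jing, bei, a
words = ["中", "京", "北", "啊"]
by_code = sorted(words, key=gb_code)
print(by_code)  # ['啊', '北', '京', '中'], i.e. a, bei, jing, zhong
```

All four characters are in the first subset, so the raw code
order happens to agree with the pinyin order. A character from
the second subset would sort after all of them regardless of its
pronunciation.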

Typically, the first subset holds the more frequently
used characters and the second subset the less
frequently used ones. This was designed in the
days when PCs were not powerful (late 70s - early 80s).

There are other subsets of characters in the
internal code. For example, some internal code space
is left for user-defined characters, and these
could be in completely random order.

There is a paper about ordering and database system
adaptation for Chinese computing; please take a look
at:

Lu, Q., K. H. Lee and Yen-hui Hung, "DBMS Supporting
Multiple Codesets and Collations", 1997 International
Conference on Computer Processing of Oriental
Languages, Hong Kong, March 30 - April 2, 1997

There might be more papers in this area that I
am not aware of.

As for Unicode, it does solve the character loss
problem (I think) because there is a code for every
GB/Big5 character. But the issue that is not addressed
is how to convert simplified to/from traditional
Chinese characters. This issue has some significance
since people who can read simplified Chinese
characters may not be able to read some traditional
Chinese characters, and vice versa.
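
A small sketch of both points, assuming Python's built-in "big5"
and "gb2312" codecs. The two-entry conversion table and the
function names are hypothetical and only for illustration; a real
table would be far larger.

```python
# Sketch only: assumes Python's "big5" and "gb2312" codecs.


def can_encode(ch, codec):
    """Return True if the character has a code in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Unicode gives the simplified and the traditional form separate
# codes, so both survive, but it does not say how they are related.
for ch in ("语", "語"):
    print(ch, hex(ord(ch)),
          "GB:", can_encode(ch, "gb2312"),
          "Big5:", can_encode(ch, "big5"))

# Hypothetical two-entry table; the relation has to come from
# outside the character codes themselves.
S2T = {"语": "語", "国": "國"}
T2S = {trad: simp for simp, trad in S2T.items()}


def to_traditional(text):
    """Map each simplified character via the table, else keep it."""
    return "".join(S2T.get(ch, ch) for ch in text)


def to_simplified(text):
    """Map each traditional character via the table, else keep it."""
    return "".join(T2S.get(ch, ch) for ch in text)


print(to_traditional("国语"), to_simplified("國語"))
```

A character-by-character table like this is only a starting
point, since the mapping is not one-to-one: for instance,
simplified 发 corresponds to both 發 and 髮, so the context is
needed to pick the right traditional form.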

Best,

Robert Luk
Dept. of Computing



This archive was generated by hypermail 2b29 : Mon Dec 04 2000 - 09:54:56 MET