Re: Corpora: control chars

From: Tom Emerson (tree@basistech.com)
Date: Fri Jun 07 2002 - 16:00:54 MET DST


    Gil Graf writes:
    > is there any encoding, except utf16, which uses the
    > control range (0-31) in a way different than ASCII ?
    > more specifically, is it safe to cut off text at 10
    > (normally newline) or 32 (normally space) bytes?

    The question presumes you are looking at characters in terms of 8-bit
    bytes instead of abstract character units consisting of one or more
    bytes.

    There are some C0 code points you may want to keep:

    0x09 Horizontal Tab
    0x0A Line Feed
    0x0D Carriage Return
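
    As a rough illustration of keeping those three code points while
    dropping the rest of the C0 range, here is a minimal Python sketch.
    It assumes the text has already been decoded into a string; the
    names KEEP and strip_c0 are made up for this example.

```python
# C0 control range is U+0000..U+001F. KEEP holds the three code points
# listed above (tab, line feed, carriage return). Names are illustrative.
KEEP = {0x09, 0x0A, 0x0D}


def strip_c0(text: str) -> str:
    """Drop C0 control characters except tab, line feed, carriage return."""
    return "".join(ch for ch in text if ord(ch) >= 0x20 or ord(ch) in KEEP)


sample = "col1\tcol2\x00\x07\r\nnext\x1b line"
print(repr(strip_c0(sample)))  # 'col1\tcol2\r\nnext line'
```

    The filter works on code points rather than bytes, so it does not
    depend on how the text was stored on disk.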

    I presume you are using a multibyte character encoding in your data:
    in that case all instances I can think of (including UTF-8) share the
    C0 range. The two- and four-byte encodings of Unicode also have the C0
    code points, but at the byte level these have leading or trailing
    0x00 bytes, depending on the byte order (endianness) in use.
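
    To make the byte-level difference concrete, the following Python
    sketch prints the bytes of a line feed in several Unicode encoding
    forms. The helper name encoded_forms is invented for this example.
    It also shows one case where the byte 0x0A appears inside a
    character that is not a line feed at all: U+010A in UTF-16.

```python
# Encoding names are Python codec names; encoded_forms is a made-up helper.
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def encoded_forms(text: str, encodings=ENCODINGS) -> dict:
    """Return the raw bytes of `text` under each named encoding."""
    return {name: text.encode(name) for name in encodings}


# A line feed (U+000A) picks up leading or trailing 0x00 bytes
# in the two- and four-byte forms.
for name, data in encoded_forms("\n").items():
    print(f"{name:10} {data.hex(' ')}")

# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) contains the byte 0x0A
# in UTF-16, so cutting at byte value 10 would split this character.
print("\u010a".encode("utf-16-le").hex(" "))  # 0a 01
```

    So cutting at byte 10 is only safe once you know the encoding is one
    in which that byte never occurs as part of another character.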

    If you are working in C and using the wchar_t type, then it is
    possible that the system is using UTF-32/UCS-4 as the underlying
    character representation, in which case the encoding is less of an
    issue and you can think purely in terms of code points.
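
    The same idea, sketched in Python rather than C: decode the bytes
    first, then cut at the line feed code point. A Python str is a
    sequence of code points, which makes it a loose stand-in for a
    UTF-32 wchar_t array here. The function name cut_at_newline is
    hypothetical, and the caller is assumed to know the encoding.

```python
def cut_at_newline(data: bytes, encoding: str) -> str:
    """Decode first, then keep everything before the first U+000A."""
    text = data.decode(encoding)
    head, _, _ = text.partition("\n")
    return head


# U+010A contains the byte 0x0A in UTF-16, but after decoding it is a
# single code point and is not mistaken for a line feed.
raw = "\u010a first\nsecond".encode("utf-16-le")
print(cut_at_newline(raw, "utf-16-le"))
```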

    HTH(tm),

        -tree

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Sr. Computational Linguist                         http://www.basistech.com
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
    


