Re: [Corpora-List] Sorting upper-ASCII chars in Unix

From: Serge HEIDEN (slh@ens-lsh.fr)
Date: Mon Nov 24 2003 - 22:27:59 MET

    Dear William,

    | I have been trying to use the Unix sort function to sort files which
    | contain upper-ASCII characters (i.e. ASCII code > 127) on a machine with
    | locale, language and charset set to US English. Lower-ASCII characters
    | and some upper-ASCII characters sort fine, but some upper-ASCII
    | characters (specifically some non-alphanumeric ones) are left in
    | semi-random order.
    |
    | How should the relevant environmental variables be set to permit sorting
    | files in straight ASCII order?

    A typical Unix manual will tell you that "lines are ordered according to the
    collating sequence of the current locale".
    A locale is defined by a language AND a charset. For example, on my Unix box
    I have:
    - en_GB.ISO8859-1
    - en_GB.ISO8859-15
    - en_GB.ISO8859-15@euro
    - ...
    Each locale defines its own collating sequence (and a lot of other things).
    A collating sequence defines how individual character codes, or groups of
    character codes, are ordered.
    Suppose you select a locale which associates en_US (the lexical collating
    sequence of American English) with a charset containing codes above 127.
    The question is then: "What collating order did the person who designed the
    locale give to the character codes above 127?"
    Put differently: "What collating order did that person give to characters
    that are USUALLY not used in that language?"
    There are several possible answers to this question. One is that no specific
    collating order has been defined for codes above 127 in the en_US.* locales,
    which looks like what you have on your Unix box. The result is that the order
    of those characters depends on the initial order of the lines and on the sort
    algorithm used (usually quicksort). You should check the locale definition on
    your Unix box.
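
    To make the point above concrete, here is a minimal Python sketch of a toy
    collating sequence. It is only an illustration under stated assumptions: the
    WEIGHTS table and the collation_key helper are invented for this example and
    do not come from any real locale definition. Characters without a defined
    weight all compare equal, so their final order follows the input order.

```python
# Toy model of a collating sequence; NOT a real locale definition.
# Only the letters a-z get a weight. Every other character is "undefined"
# and shares a single weight, so all such characters compare equal.
WEIGHTS = {ch: rank for rank, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
UNDEFINED = len(WEIGHTS)


def collation_key(line):
    """Map a line to its list of collation weights (hypothetical helper)."""
    return [WEIGHTS.get(ch, UNDEFINED) for ch in line]


# The same four lines, given in two different initial orders.
first = sorted(["§", "b", "¶", "a"], key=collation_key)
second = sorted(["¶", "b", "§", "a"], key=collation_key)

print(first)
print(second)
```

    Python's sorted() is stable, so the two undefined characters simply keep
    their input order and the two results differ. With an unstable algorithm
    their relative order could be arbitrary, which would match the semi-random
    order described in the question.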

    I propose four solutions:
    - buy a Unix where locale design and definition are precisely documented (I don't
    know of any) and pray for a coherent locale definition for codes above 127 in en_US;
    - use a locale for a language other than en_US which USES the character codes
    above 127 that you use AND does not use a collating sequence different from the
    English one. For example, fr_FR uses characters up to 255 in the ISO-Latin1 charset;
    - design your own locale: any Unix should help you to do so;
    - use a sort implementation that does not use any locale library and knows how to
    deal with your charset (8-bit, 16-bit, etc.).
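
    As a rough sketch of the last option, assuming an 8-bit charset such as
    ISO-Latin1 and input small enough to fit in memory, the following Python
    function sorts lines by raw byte value without consulting any locale. The
    name sort_lines_bytewise and the sample data are made up for illustration.

```python
def sort_lines_bytewise(data):
    """Sort the lines of a bytes object by unsigned byte value.

    No locale is involved: Python compares bytes objects
    lexicographically, byte by byte.
    """
    lines = data.splitlines()
    lines.sort()
    # Re-attach a newline to every line, including the last one.
    return b"".join(line + b"\n" for line in lines)


# Sample input encoded as ISO-Latin1: e-acute (0xE9), z (0x7A),
# section sign (0xA7) and a (0x61), one per line.
sample = "\xe9\nz\n\xa7\na\n".encode("latin-1")
result = sort_lines_bytewise(sample)
print(result.decode("latin-1"))
```

    The lines come out in plain code order (0x61, 0x7A, 0xA7, 0xE9), so every
    code above 127 sorts after every code below it. This is code order, not
    dictionary order for any particular language.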

    Cheers,

        [slh]

    _____________________________________________________________________
    Serge Heiden, slh@ens-lsh.fr, https://weblex.ens-lsh.fr
    ENS-LSH/CNRS - ICAR UMR5191, Institut de Linguistique Française
    15, parvis René Descartes 69342 Lyon BP7000 Cedex 07, tél. +33 4 37 37 63 12, fax. +33 4 37 37 62 65


