Re: [Corpora-List] Encoding of apostrophes and quotes

From: Ciarán Ó Duibhín (ciaran@oduibhin.freeserve.co.uk)
Date: Fri Jul 07 2006 - 21:03:44 MET DST

    In reply to my questions, a number of people have said they think it is
    reasonable that Unicode should assign the same codepoint to apostrophe and
    right single quote, on the grounds that many people will be unwilling to
    make the distinction.

    The reason I asked is that Unicode differentiates between characters and
    glyphs, and describes itself as a coded character set, not a coded glyph
    list. But outside of symbols which are clearly alphabetic, Unicode seems
    ready to encode glyphs not characters, on "practical" grounds. In
    particular, where a glyph is ambiguous between a lexical and a non-lexical
    function (apostrophe vs right single quote), Unicode encodes the glyph, not
    the characters. This means that even so basic a processing operation as
    tokenization cannot be carried out reliably on Unicode-encoded text
    without markup.
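
    To make the ambiguity concrete, here is a minimal sketch in Python. The
    tokenize function and its "flanked by letters" heuristic are illustrative
    assumptions, not a standard algorithm. It treats U+2019 as word-internal
    only when it has a letter on both sides, which handles a contraction but
    cannot tell a word-final possessive apostrophe from a closing quote.

```python
RIGHT_SINGLE_QUOTE = "\u2019"  # used for both apostrophe and closing quote


def tokenize(text):
    """Naive tokenizer (illustrative only).

    U+2019 is kept inside a word only when it sits between two letters;
    otherwise it is emitted as a separate punctuation token.
    """
    tokens = []
    current = []
    for i, ch in enumerate(text):
        prev_is_letter = i > 0 and text[i - 1].isalpha()
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        word_internal = (
            ch == RIGHT_SINGLE_QUOTE and prev_is_letter and next_is_letter
        )
        if ch.isalnum() or word_internal:
            current.append(ch)
        else:
            if current:
                tokens.append("".join(current))
                current = []
            if not ch.isspace():
                tokens.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# A contraction is recovered as a single token.
print(tokenize("don\u2019t"))

# Here the first U+2019 is a possessive apostrophe (part of the word) and the
# second closes the quotation, but the two are the same codepoint, so the
# tokenizer gives them the same treatment.
print(tokenize("\u2018the dogs\u2019 bowls\u2019"))
```

    In the second call the heuristic splits the possessive apostrophe from
    its word. No rule that looks only at the codepoints can do better in
    general, since the two uses are encoded identically.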

    An attraction of Unicode for me is the reduction in the need for
    character-level markup, thanks to the greatly-increased character
    repertoire. I'm concerned that Unicode is not living up to its promise for
    text processing here, with its readiness to deviate from the character-glyph
    model at the least difficulty. I just thought I'd see if this view had any
    support among the corpus community, who would be among the most likely (I
    thought) to benefit from a more consistent encoding of characters rather
    than glyphs in Unicode, but it seems not.

    I accept that encoders in general may have limited willingness to
    distinguish characters with similar appearances, even when they have very
    different functions, but I don't see that as an argument for denying the use
    of an encoding distinction to those who are prepared to take the trouble
    over it in preparing their corpus, in return for the processing benefits.
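
    As a sketch of the processing benefit, suppose a corpus builder adopts a
    private convention of encoding the lexical apostrophe as U+02BC MODIFIER
    LETTER APOSTROPHE and keeps U+2019 for the closing quote. That choice of
    codepoint is my assumption for illustration only. Because U+02BC has a
    letter category, Python's Unicode-aware \w then keeps it inside the word
    with no markup at all.

```python
import re
import unicodedata

MODIFIER_APOSTROPHE = "\u02bc"  # assumed convention: lexical apostrophe
RIGHT_SINGLE_QUOTE = "\u2019"   # closing quote (punctuation)


def words(text):
    """Return maximal runs of Unicode word characters."""
    return re.findall(r"\w+", text)


# The two codepoints fall in different general categories:
# U+02BC is a modifier letter (Lm), U+2019 is final punctuation (Pf).
print(unicodedata.category(MODIFIER_APOSTROPHE))
print(unicodedata.category(RIGHT_SINGLE_QUOTE))

# With the distinction encoded, the possessive stays attached to its word.
distinguished = "\u2018the dogs\u02bc bowls\u2019"
print(words(distinguished))

# With U+2019 used for both functions, the apostrophe is lost as punctuation.
conflated = "\u2018the dogs\u2019 bowls\u2019"
print(words(conflated))
```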

    Ciarán Ó Duibhín.


