Re: [Corpora-List] Encoding of apostrophes and quotes

From: Ron Artstein (artstein@essex.ac.uk)
Date: Fri Jun 30 2006 - 18:58:32 MET DST

    > As someone who has always taken the above statements to be true,
    > I have been amazed and disappointed to learn that Unicode advise
    > the encoding of apostrophes and right single quotes as the same
    > character (U+2019).

    My understanding is that Unicode tends to unify characters that
    always look the same. Since an apostrophe and a closing quote use
    identical glyphs whatever the font, they get the same character;
    in contrast, a comma and a baseline quote may have identical glyphs
    in some fonts but distinct glyphs in other fonts, so they get
    separate characters.
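
    The code points involved can be inspected directly. What follows
    is a minimal sketch using Python's standard unicodedata module;
    the helper name describe and the choice of characters are
    assumptions made for illustration, not part of the original
    discussion.

```python
import unicodedata

# Characters touched on above: the ASCII apostrophe, the character shared
# by the apostrophe and the closing single quote, the comma, and the
# baseline (low-9) quote. The selection is illustrative only.
CHARS = ["\u0027", "\u2019", "\u002C", "\u201A"]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in CHARS:
    print(describe(ch))
```

    U+2019 carries a single name (a quotation mark) even though it
    also serves as the apostrophe, while the comma and the baseline
    quote each have their own code point and name.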

    One thing that has always baffled me was why Unicode decided to
    assign the two characters U+05F3 Hebrew punctuation geresh and
    U+05F4 Hebrew punctuation gershayim. Geresh (dual: gershayim) is
    the Hebrew name for a punctuation mark similar to an apostrophe
    which is used for marking abbreviations; in modern usage these have
    identical glyphs to single and double quotes. I haven't found an
    explanation of why U+05F3 and U+05F4 are distinct from the standard
    punctuation marks, or of whether they're intended just for
    abbreviations or also for quotes.
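
    That the two Hebrew marks really are separate characters, and not
    compatibility variants of the ordinary quotes, can be checked with
    a short sketch like the one below. It again relies only on
    Python's standard unicodedata module; the helper name
    folds_to_other is hypothetical. It shows only that normalization
    leaves the characters alone, and says nothing about their intended
    use.

```python
import unicodedata

GERESH = "\u05F3"     # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05F4"  # HEBREW PUNCTUATION GERSHAYIM


def folds_to_other(ch: str) -> bool:
    """True if any standard normalization form rewrites ch as other text."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(unicodedata.normalize(form, ch) != ch for form in forms)


for ch in (GERESH, GERSHAYIM):
    # Print the code point, its name, and whether normalization changes it.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), folds_to_other(ch))
```

    Neither character is rewritten under any of the four
    normalization forms, so text using U+05F3 or U+05F4 will not
    compare equal to text using the ASCII or typographic quotes
    unless it is converted explicitly.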

    My guess is that separate code points were needed because Hebrew
    apostrophes and quotes are quite distinct in shape from Latin ones;
    a mixed font could share code points (and glyphs) for most
    punctuation marks, but using the Latin glyphs for quotes and
    apostrophes in Hebrew would look very odd. If this is indeed the
    rationale behind the code points U+05F3 and U+05F4, then these
    characters should be used for both apostrophes and quotes in
    Hebrew.

    -Ron.


