Re: [Corpora-List] Encoding of apostrophes and quotes

From: Ron Artstein (artstein@essex.ac.uk)
Date: Fri Jun 30 2006 - 18:58:32 MET DST

Next message: Seth Grimes: "[Corpora-List] Text Analytics e-mail list"

Previous message: Markus Saers: "Re: [Corpora-List] Encoding of apostrophes and quotes"
In reply to: Ciarán Ó Duibhín: "[Corpora-List] Encoding of apostrophes and quotes"
Next in thread: Roger Shlomo Harris: "Re: [Corpora-List] Encoding of apostrophes and quotes"
Reply: Roger Shlomo Harris: "Re: [Corpora-List] Encoding of apostrophes and quotes"

> As someone who has always taken the above statements to be true,
> I have been amazed and disappointed to learn that Unicode advise
> the encoding of apostrophes and right single quotes as the same
> character (U+2019).

My understanding is that Unicode tends to unify characters that
always look the same. Since an apostrophe and a closing quote use
identical glyphs whatever the font, they get the same character;
in contrast, a comma and a baseline quote may have identical glyphs
in some fonts but distinct glyphs in other fonts, so they get
separate characters.
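
For anyone who wants to look at the code points involved, here is a minimal sketch using Python's standard unicodedata module. The helper name describe is my own, chosen only for illustration; the script just prints what the Unicode character database records and says nothing about how any particular font draws the glyphs.

```python
import unicodedata


def describe(ch):
    """Return (code point label, Unicode character name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))


# ASCII apostrophe, right single quote (also the recommended apostrophe),
# comma, and the baseline ("low-9") single quote.
for ch in ("\u0027", "\u2019", "\u002C", "\u201A"):
    print(*describe(ch))
```

The point to notice is that U+2019 has a single name and serves as both apostrophe and closing quote, while the comma and the baseline quote are listed as two separate characters.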

One thing that has always baffled me is why Unicode decided to
assign the two characters U+05F3 HEBREW PUNCTUATION GERESH and
U+05F4 HEBREW PUNCTUATION GERSHAYIM. Geresh (dual: gershayim) is
the Hebrew name for a punctuation mark similar to an apostrophe
which is used for marking abbreviations; in modern usage these have
identical glyphs to single and double quotes. I haven't found an
explanation of why U+05F3 and U+05F4 are distinct from the standard
punctuation marks, or of whether they're intended just for
abbreviations or also for quotes.
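
To illustrate that these really are separate characters and not aliases of the ordinary quotes, here is a small sketch, again with Python's unicodedata module. The constant and function names are hypothetical, introduced only for this example. It checks that none of the four standard normalization forms changes geresh or gershayim, so they are never folded into U+0027 or U+0022.

```python
import unicodedata

GERESH = "\u05F3"     # HEBREW PUNCTUATION GERESH
GERSHAYIM = "\u05F4"  # HEBREW PUNCTUATION GERSHAYIM


def survives_normalization(ch):
    """True if no standard normalization form changes the character."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, ch) == ch for form in forms)


for ch in (GERESH, GERSHAYIM):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), survives_normalization(ch))
```

This only shows that the characters are encoded independently; it does not answer the question of what they are intended for.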

My guess is that separate code points were needed because Hebrew
apostrophes and quotes are quite distinct in shape from Latin ones;
a mixed font could share code points (and glyphs) for most
punctuation marks, but using the Latin glyphs for quotes and
apostrophes in Hebrew would look very odd. If this is indeed the
rationale behind the code points U+05F3 and U+05F4, then these
characters should be used for both apostrophes and quotes in
Hebrew.

-Ron.


This archive was generated by hypermail 2b29 : Fri Jun 30 2006 - 23:37:10 MET DST