[Corpora-List] Encoding of apostrophes and quotes

From: Ciarán Ó Duibhín (ciaran@oduibhin.freeserve.co.uk)
Date: Fri Jun 30 2006 - 02:48:49 MET DST


    Would list members agree with the following statements:

    1. Even though they look the same, apostrophe and single right quote behave
    as different characters and require different encoding.

    2. An apostrophe is generally used to indicate elision or (in English)
    possession:
    don't, 'tis, sayin', John's, James', c'est, geht's. In tokenization, the
    apostrophe is not to be dropped, but is retained as part of the token; and a
    token break may be considered somewhere in its vicinity.

    3. A right single quote is used, in conjunction with a left single quote, to
    delimit a stretch of text. In tokenization, such marks (like punctuation
    in general) become separate tokens, and in many applications (such as
    word-lists) they are simply dropped.
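
    A minimal sketch of statements 2 and 3 in Python, assuming (this is not
    stated above) that apostrophes are encoded as U+0027 while quotes use
    U+2018 and U+2019. The names TOKEN_RE, tokenize and word_list are
    illustrative only:

```python
import re

# Assumption: apostrophes are U+0027; single quotes are U+2018 / U+2019.
# A token is either a run of word characters and apostrophes,
# or one single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"[\w']+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping apostrophes inside their tokens."""
    return TOKEN_RE.findall(text)


def word_list(text):
    """Drop tokens with no word character (quotes and other punctuation)."""
    return [tok for tok in tokenize(text) if re.search(r"\w", tok)]


sample = "\u2018Don't go,\u2019 said John's friend."
print(tokenize(sample))
print(word_list(sample))  # ["Don't", 'go', 'said', "John's", 'friend']
```

    Because the two characters differ, one regular expression is enough: the
    apostrophe stays in "Don't" and "John's", while the quotes become separate
    tokens that the word-list simply drops.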

    As someone who has always taken the above statements to be true, I have been
    amazed and disappointed to learn that Unicode advise the encoding of
    apostrophes and right single quotes as the same character (U+2019). Their
    explanation is that people in general will find it too difficult to
    understand the difference.

    If I had followed this advice and used U+2019 for both apostrophe and right
    single quote, all the corpus analysis which I have successfully undertaken
    would have been made impossibly difficult. In fact, even the simplest text
    processing exercise becomes impossible; see
    http://www.smo.uhi.ac.uk/~oduibhin/apostrophe.htm.
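
    As a hedged illustration of the difficulty (my own sketch, not taken from
    the page linked above), suppose every apostrophe and closing quote is
    U+2019. A simple tokenizer then has to pick one of two policies, and each
    goes wrong somewhere. The sample sentence and the names POLICY_A and
    POLICY_B are hypothetical:

```python
import re

# Hypothetical sample in which apostrophe and closing quote are both U+2019.
conflated = "\u2018Stop sayin\u2019 that\u2019 he said."

# Policy A: treat U+2019 as part of a word (i.e. always an apostrophe).
POLICY_A = re.compile("[\\w\u2019]+|[^\\w\\s]")
# Policy B: treat U+2019 as punctuation (i.e. always a closing quote).
POLICY_B = re.compile(r"\w+|[^\w\s]")

tokens_a = POLICY_A.findall(conflated)
tokens_b = POLICY_B.findall(conflated)

print(tokens_a)  # the closing quote is glued onto "that"
print(tokens_b)  # "sayin" has lost its apostrophe
```

    Neither policy recovers both the token "sayin" with its apostrophe and a
    clean "that"; telling the two uses apart would need context beyond the
    character itself.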

    I would be interested to know what people think of Unicode's advice, and how
    they deal with this situation in practice.

    Ciarán Ó Duibhín.

    For completeness, though it doesn't affect the point above, I ought to add
    that Unicode *do* make a distinction between what they call "punctuation
    apostrophes" (the kind I have been talking about), and "letter apostrophes".
    They assign a character (U+02BC) to the latter, to be used in cases where an
    apostrophe look-alike is used to represent a sound (often, the glottal
    stop).
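
    The distinction is visible in the Unicode general categories, which can be
    inspected with Python's standard unicodedata module. This is only a small
    sketch; the string "a\u02bcb" is a made-up example rather than a real word:

```python
import re
import unicodedata

for ch in ("\u0027", "\u2019", "\u02bc"):
    # Print code point, general category and character name.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# U+02BC is a letter (category Lm), so \w matches it and the word stays whole.
letter_case = re.findall(r"\w+", "a\u02bcb")
# U+2019 is punctuation (category Pf), so the same pattern splits the word.
punct_case = re.findall(r"\w+", "a\u2019b")
print(letter_case, punct_case)
```

    So a pattern based on \w keeps a letter apostrophe inside its word without
    special handling, whereas U+2019 and U+0027 both count as punctuation.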


