RE: [Corpora-List] Encoding of apostrophes and quotes

From: Hardie, Andrew (a.hardie@lancaster.ac.uk)
Date: Fri Jun 30 2006 - 14:00:19 MET DST


    Another use of apostrophe to add to the pile: I've often encountered two in a row used instead of a double quote.
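
As a rough illustration of how a corpus cleaner might handle this case, here is a minimal sketch. It assumes that a run of exactly two ASCII apostrophes (U+0027) is always meant as one double quote (U+0022); that is not safe for every corpus (LaTeX source or program code, for example). The function name `normalise_doubled_apostrophes` is hypothetical.

```python
import re

# Assumption: exactly two consecutive ASCII apostrophes stand in for one
# double quote. Runs of three or more are left alone as ambiguous.
DOUBLED_APOSTROPHE = re.compile(r"(?<!')''(?!')")

def normalise_doubled_apostrophes(text):
    """Replace each isolated pair of apostrophes with a double quote."""
    return DOUBLED_APOSTROPHE.sub('"', text)
```

Single apostrophes, as in "don't", are not touched by this pattern.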
     
    Other alphabetic systems provide good examples of what happens when two Unicode characters look identical or very similar but are supposed to be separate things: they get mixed up, both by typists and by software designers. For instance, in Urdu texts alef maksura (U+0649) and farsi yeh (U+06CC) often get confused, as do farsi yeh (U+06CC) and yeh barree (U+06D2) in some positions, and kaf (U+0643) and keheh (U+06A9). In Devanagari and similar alphabets I have likewise encountered confusion between visarga (U+0903, etc.) and colon (U+003A), and between danda (U+0964) and the vertical line (U+007C). Note that these pairs aren't even identical in appearance, just nearly identical, and they still get confused. So I also think the Unicode Standard is right not to demand that a much finer distinction be made between the apostrophe and the single quote.
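
One way to check a text for this kind of mixing is sketched below. It only reports when both members of one of the pairs listed above occur in the same text, with their counts. Co-occurrence alone does not prove an error (a colon in Devanagari text may be perfectly legitimate), so the output is a prompt for manual inspection. The names `CONFUSABLE_PAIRS`, `describe` and `mixed_pairs` are hypothetical; the character names come from Python's standard `unicodedata` module.

```python
import unicodedata
from collections import Counter

# The pairs mentioned in the message, as code points. This list is
# illustrative only, not a complete table of confusable characters.
CONFUSABLE_PAIRS = [
    (0x0649, 0x06CC),  # alef maksura / farsi yeh
    (0x06CC, 0x06D2),  # farsi yeh / yeh barree
    (0x0643, 0x06A9),  # kaf / keheh
    (0x0903, 0x003A),  # visarga / colon
    (0x0964, 0x007C),  # danda / vertical line
]

def describe(cp):
    """Format a code point as 'U+XXXX NAME'."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"

def mixed_pairs(text):
    """List the pairs whose two members both occur in text, with counts."""
    counts = Counter(ord(ch) for ch in text)
    found = []
    for a, b in CONFUSABLE_PAIRS:
        if counts[a] and counts[b]:
            found.append((describe(a), counts[a], describe(b), counts[b]))
    return found
```

For example, `mixed_pairs("\u0643\u06a9\u06a9")` reports one kaf alongside two kehehs, which would suggest checking which of the two the typist intended.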
     
    Andrew.
     
     
    Andrew Hardie
    Department of Linguistics
    Bowland College
    Lancaster University
    Lancaster LA1 4YT
     
    a.hardie@lancaster.ac.uk
     

    ________________________________

    From: owner-corpora@lists.uib.no on behalf of Marco Baroni
    Sent: Fri 30/06/2006 07:55
    To: Ciarán Ó Duibhín; CORPORA@UIB.NO
    Subject: Re: [Corpora-List] Encoding of apostrophes and quotes

    I think that, if the people who produce the texts we parse do not make a
    distinction coherently, we might as well forget about it, as it will just
    create more noise (I myself have just found out now how to produce a single
    quote on my keyboard -- never typed a single quote character before...)


