Re: [Corpora-List] Encoding of apostrophes and quotes

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Fri Jun 30 2006 - 08:55:08 MET DST

    Hi there.

    First of all, I am really glad that for once we are discussing the kind
    of "low-level" processing issues that are so fundamental to getting
    high-quality language data, but that are often not taken seriously as
    dignified research topics...

    > As someone who has always taken the above statements to be true, I have been
    > amazed and disappointed to learn that Unicode advise the encoding of
    > apostrophes and right single quotes as the same character (U+2019). Their
    > explanation is that people in general will find it too difficult to
    > understand the difference.

    I think that, if the people who produce the texts we parse do not make
    the distinction consistently, we might as well forget about it, as it
    will just create more noise (I myself have only just found out how to
    produce a single quote on my keyboard -- I had never typed a single quote
    character before...)

    If I get a text to tokenize, unless I have a lot of reliable information
    about how it was produced (which in my experience is never the case), I
    just merge all single quote/apostrophe-like characters, and then use
    various heuristics to decide which ones are apostrophes, which ones are
    single quotes, and which ones mark an accent on the previous vowel (since
    this is another way in which the apostrophe is used in electronic Italian).
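
    To make the merge-then-guess approach concrete, here is a minimal
    sketch in Python. It is not the code I actually use: the character set
    in APOSTROPHE_LIKE, the function names, and the three rules are all
    assumptions chosen for illustration, and real text needs more careful
    heuristics (a word-final elision such as "po'" is labelled as an accent
    here, for instance).

```python
import re

# Assumed set of apostrophe/single-quote look-alikes (illustrative, not exhaustive):
# left/right single quotes, reversed quote, backtick, acute accent, modifier apostrophe.
APOSTROPHE_LIKE = "\u2018\u2019\u201B\u0060\u00B4\u02BC"
_MERGE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def merge_quotes(text):
    """Map every apostrophe-like character to the plain ASCII apostrophe."""
    return text.translate(_MERGE_TABLE)


def classify_quotes(text):
    """Return (index, label) for each "'" in the merged text.

    Crude, hypothetical rules:
      letter ' letter                 -> apostrophe   (l'uomo, don't)
      non-alnum ' alnum               -> open_quote
      letter ' non-letter, quote open -> close_quote
      vowel ' non-letter, none open   -> accent       (perche', citta')
    """
    text = merge_quotes(text)
    labels = []
    quote_open = False
    for match in re.finditer("'", text):
        i = match.start()
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if prev.isalpha() and nxt.isalpha():
            label = "apostrophe"
        elif not prev.isalnum() and nxt.isalnum():
            label = "open_quote"
            quote_open = True
        elif prev.isalpha() and not nxt.isalpha():
            if quote_open:
                label = "close_quote"
                quote_open = False
            elif prev.lower() in "aeiou":
                label = "accent"
            else:
                label = "apostrophe"
        else:
            label = "unknown"
        labels.append((i, label))
    return labels


if __name__ == "__main__":
    sample = "L\u2019uomo disse \u2018ciao\u2019 perche\u0060 era tardi"
    for index, label in classify_quotes(sample):
        print(index, label)
```

    The point of merging first is that the decision then depends only on
    context, not on whichever code point the author's keyboard or editor
    happened to produce.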

    Add to that the fact that a lot of standard tools for processing Western
    European text (such as the IMS TreeTagger) expect Latin-1 input, and thus
    they will not be able to make the distinction anyway (last time I
    checked, at least...)
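
    The Latin-1 limitation is easy to check: ISO 8859-1 only covers code
    points up to U+00FF, so U+2018 and U+2019 cannot be encoded at all. A
    small sketch follows, in which the helper names fits_latin1 and
    fold_for_latin1 are made up for illustration and the folding step is
    one possible workaround, not something any particular tagger does.

```python
def fits_latin1(text):
    """Return True if every character of text can be encoded as Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def fold_for_latin1(text):
    """Fold curly single quotes to the ASCII apostrophe (hypothetical helper).

    After this step the quote/apostrophe distinction is gone, which is
    exactly the information loss described above.
    """
    return text.replace("\u2018", "'").replace("\u2019", "'")


if __name__ == "__main__":
    word = "l\u2019uomo"
    print(fits_latin1(word))                   # curly apostrophe is outside Latin-1
    print(fits_latin1(fold_for_latin1(word)))  # plain apostrophe is inside it
```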

    My pessimistic 2 cents.

    Regards,

    Marco

    -- 
    Marco Baroni
    SSLMIT, University of Bologna
    http://sslmit.unibo.it/~baroni
    

    Leadership is a form of evil. No one needs to lead you to do something that is obviously good for you.

    (Scott Adams)


