Re: [Corpora-List] European Constitution in parallel

From: Joerg Tiedemann (tiedeman@let.rug.nl)
Date: Mon Apr 25 2005 - 12:25:55 MET DST

    follow-up ....

    I just realized that there are some additional problems with character
    encodings. According to the information I found, ISO-8859-4 should
    support Latvian and Lithuanian. However, I ran into serious trouble
    when converting these languages from UTF-8 to ISO-8859-4. Did the
    alphabet change recently, or is the ISO standard just useless?
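
    One way to narrow this down is to list exactly which characters fail
    in the conversion. The following is only a minimal Python sketch, not
    the conversion actually used for the corpus: the function name
    unencodable_chars and the sample string are made up for illustration,
    and "iso8859_4" is Python's codec name for ISO-8859-4. It may well turn
    out that punctuation such as typographic quotation marks, rather than
    letters of the alphabet, is what has no slot in the 8-bit table.

```python
import unicodedata


def unencodable_chars(text: str, encoding: str) -> list[str]:
    """Return the distinct characters of `text` that `encoding` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Keep first-seen order and report each character only once.
            if ch not in bad:
                bad.append(ch)
    return bad


if __name__ == "__main__":
    # Hypothetical sample: a Latvian word wrapped in typographic quotes.
    sample = "\u201eLatvij\u0101\u201c"
    for ch in unencodable_chars(sample, "iso8859_4"):
        # Print the code point and its Unicode name for each offender.
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
```

    Running the same check over the whole UTF-8 text would show whether the
    problem lies in the alphabet or in other symbols.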

    I have now switched the Latvian and Lithuanian texts of the EUconst
    corpus to UTF-8 in the CWB index. The output looks good, but it is
    difficult to query for words with diacritics. Check:
    http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lt
    http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lv

    Let me know if there is an 8-bit encoding that can be (or is) used for
    these two languages.
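
    Candidate 8-bit encodings can also be tested mechanically. The sketch
    below is an illustration under assumptions, not a recommendation: the
    helper name codecs_that_fit and the candidate list are my own choices.
    ISO-8859-13 and Windows-1257 are included because both are Baltic 8-bit
    code pages, but whether either covers every character in the corpus
    would have to be checked against the actual texts.

```python
def codecs_that_fit(text: str, candidates: list[str]) -> list[str]:
    """Return the candidate encodings that can represent every character of `text`."""
    fitting = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            # At least one character has no code point in this encoding.
            continue
        fitting.append(name)
    return fitting


if __name__ == "__main__":
    # Assumed candidate list, using Python's codec names.
    candidates = ["iso8859_4", "iso8859_13", "cp1257"]
    # Hypothetical sample; in practice, pass the full corpus text.
    sample = "\u201eLietuvos Respublika\u201c"
    print(codecs_that_fit(sample, candidates))
```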

    Jörg

    ***********/\/\/\/\/\/\/\/\/\/\/\************************************
    ** Jörg Tiedemann tiedeman@let.rug.nl **
    ** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
    ** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
    ** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
    ** 9712 EK Groningen fax: +31 (0)50-363 6855 **
    *************************************/\/\/\/\/\/\/\/\/\/\/\**********


