Re: [Corpora-List] European Constitution in parallel

From: Andrius Utka (a.utka@hmf.vdu.lt)
Date: Mon Apr 25 2005 - 13:57:51 MET DST


    Dear Joerg,
    As far as I know, Lithuanian uses ISO 8859-13. I am not sure about Latvian.
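
    One way to narrow down where a UTF-8 to ISO conversion fails is to test
    which characters a given 8-bit codec cannot represent. The following is
    only a minimal sketch using Python's built-in codecs. The helper name
    `unencodable` is made up for illustration, and the punctuation sample is
    an assumption about what such texts may contain, not something checked
    against the EUconst files.

```python
# Non-ASCII letters of the modern Lithuanian and Latvian alphabets.
LITHUANIAN = "ąčęėįšųūžĄČĘĖĮŠŲŪŽ"
LATVIAN = "āčēģīķļņšūžĀČĒĢĪĶĻŅŠŪŽ"
# Typographic quotation marks and an en dash. ASSUMPTION: such characters
# may occur in the texts; this is not verified for any particular corpus.
PUNCTUATION = "\u201e\u201c\u201d\u2013"


def unencodable(text, encoding):
    """Return the sorted distinct characters of `text` that `encoding` cannot represent."""
    missing = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)


# Report, per codec and per sample, which characters would break a conversion.
for codec in ("iso8859_4", "iso8859_13"):
    for label, sample in (("lt", LITHUANIAN), ("lv", LATVIAN), ("punct", PUNCTUATION)):
        print(codec, label, unencodable(sample, codec))
```

    Running the same function over the actual corpus text, rather than over
    these hand-picked samples, would show exactly which characters block
    the conversion for each candidate encoding.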
    Best,
    Andrius

    >
    >follow-up ....
    >
    >I just realized that there are some additional problems with character
    >encodings. According to information I found, Latvian and Lithuanian
    >should be supported by ISO-8859-4. However, I ran into serious trouble
    >when converting from UTF-8 to ISO for these languages. Did the
    >alphabet change recently or is the ISO standard just useless?
    >
    >Now, I changed the Latvian and Lithuanian texts from the EUconst corpus
    >to UTF-8 in the CWB index. It looks good, but it is difficult to query
    >for diacritics. Check:
    >http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lt
    >http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lv
    >
    >Let me know if there is an 8-bit code that can be (or is) used for these
    >two languages.
    >
    >
    >Jörg
    >
    >***********/\/\/\/\/\/\/\/\/\/\/\************************************
    >** Jörg Tiedemann tiedeman@let.rug.nl **
    >** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
    >** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
    >** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
    >** 9712 EK Groningen fax: +31 (0)50-363 6855 **
    >*************************************/\/\/\/\/\/\/\/\/\/\/\**********


