Re: [Corpora-List] European Constitution in parallel

From: Lou Burnard (lou.burnard@computing-services.oxford.ac.uk)
Date: Mon Apr 25 2005 - 13:06:41 MET DST


    Would it not be possible to make the corpus available in Unicode?

    Surely that would be the best solution, especially since you're saving
    it in an XML format.

    But many thanks for this effort -- what a great resource!

    Joerg Tiedemann wrote:

    >follow-up ....
    >
    >I just realized that there are some additional problems with character
    >encodings. Latvian and Lithuanian should be supported by
    >ISO-8859-4 according to information I found. However, I got serious
    >trouble when converting from UTF-8 to ISO for these languages. Did the
    >alphabet change recently or is the ISO standard just useless?
    >
    >Now, I changed the Latvian and Lithuanian texts from the EUconst corpus to
    >UTF-8 in the CWB index. Looks good but is difficult to query for
    >diacritics. Check:
    >http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lt
    >http://logos.uio.no/cgi-bin/opus/opuscqp.pl?corpus=EUconst;lang=lv
    >
    >Let me know if there is an 8-bit code that can be (is) used for these
    >2 languages.
    >
    >
    >Jörg
    >
    >***********/\/\/\/\/\/\/\/\/\/\/\************************************
    >** Jörg Tiedemann tiedeman@let.rug.nl **
    >** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
    >** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
    >** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
    >** 9712 EK Groningen fax: +31 (0)50-363 6855 **
    >*************************************/\/\/\/\/\/\/\/\/\/\/\**********
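
One way to narrow down the conversion trouble described in the quoted message is to list exactly which characters of the UTF-8 text a given 8-bit encoding cannot represent. The following is only a minimal sketch, not the procedure used for the EUconst corpus: the helper names `unencodable` and `report` and the sample string are hypothetical, and the candidate list (ISO-8859-4, ISO-8859-13, Windows-1257) is an assumption about which 8-bit encodings are worth testing. Only the codec names themselves are standard Python codecs.

```python
# Sketch: find the characters that block a UTF-8 -> 8-bit conversion.
# Helper names and the sample text are hypothetical illustrations.

# Assumed candidates; these are codec names from the Python standard library.
CANDIDATE_CODECS = ["iso8859_4", "iso8859_13", "cp1257"]


def unencodable(text, codec):
    """Return a sorted list of the distinct characters in `text`
    that `codec` cannot encode."""
    missing = set()
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)


def report(text, codecs=CANDIDATE_CODECS):
    """Map each candidate codec to the characters it fails on."""
    return {codec: unencodable(text, codec) for codec in codecs}


if __name__ == "__main__":
    # Made-up sample with Lithuanian and Latvian letters plus
    # typographic quotation marks (U+201E, U+201C).
    sample = (
        "ąčęėįšųūž āčēģīķļņšūž "
        "\u201ecitata\u201c"
    )
    for codec, missing in report(sample).items():
        # Show code points so that invisible or look-alike characters stand out.
        print(codec, [f"U+{ord(ch):04X}" for ch in missing])
```

Running `report` over the real corpus files (read with `encoding="utf-8"`) would show whether the failures come from letters of the two alphabets or from other characters such as punctuation, which in turn indicates whether any single 8-bit code is adequate or whether keeping the texts in Unicode is the simpler choice.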



    This archive was generated by hypermail 2b29 : Mon Apr 25 2005 - 13:16:24 MET DST