Re: [Corpora-List] starting a machine translation project

From: Joerg Tiedemann (tiedeman@let.rug.nl)
Date: Wed Sep 13 2006 - 14:08:15 MET DST

    > Based on your experience, is it a minimum number of words or sentences
    > in a corpus to produce a basic translation service? If the purpose is
    > for daily language use, is it enough to use an English-Indonesian
    > Bible as a corpus?

    you could include translated KDE messages from the OPUS corpus to have at
    least some more up-to-date data (http://omilia.uio.no/opus/kde.html)

    download
    http://omilia.uio.no/opus/KDE/id.tar.gz
    http://omilia.uio.no/opus/KDE/en.tar.gz
    and the sentence alignments in
    http://omilia.uio.no/opus/KDE/enid.ces.gz

    (all other languages are aligned to indonesian as well ... just download
    the corresponding files)
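
    if you prefer to script the download, here is a minimal sketch in
    Python (standard library only). the three URLs are the ones listed
    above; the function names (kde_urls, fetch, unpack) and the output
    directory "opus_kde" are made up for illustration, and the rule that
    the alignment file is called <src><tgt>.ces.gz is only an assumption
    generalised from the enid.ces.gz example, so check the actual file
    names for other language pairs.

```python
import gzip
import shutil
import tarfile
import urllib.request
from pathlib import Path

# Base URL taken from the download links in the message.
BASE = "http://omilia.uio.no/opus/KDE"


def kde_urls(src="en", tgt="id"):
    """Return the two text archives and the alignment file for one pair.

    Assumption: the alignment file is named <src><tgt>.ces.gz, as in the
    enid.ces.gz example. Other pairs may be named differently.
    """
    return [
        f"{BASE}/{src}.tar.gz",
        f"{BASE}/{tgt}.tar.gz",
        f"{BASE}/{src}{tgt}.ces.gz",
    ]


def fetch(url, dest_dir):
    """Download url into dest_dir (skipped if the file is already there)."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        with urllib.request.urlopen(url) as resp, open(target, "wb") as out:
            shutil.copyfileobj(resp, out)
    return target


def unpack(archive, dest_dir):
    """Extract a .tar.gz archive, or gunzip a single .gz file."""
    archive = Path(archive)
    dest_dir = Path(dest_dir)
    if archive.name.endswith(".tar.gz"):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest_dir)
        return dest_dir
    # Plain gzip file: strip the ".gz" suffix for the output name.
    out_path = dest_dir / archive.name[:-3]
    with gzip.open(archive, "rb") as src, open(out_path, "wb") as out:
        shutil.copyfileobj(src, out)
    return out_path


if __name__ == "__main__":
    # Fetch and unpack the English-Indonesian KDE data.
    for url in kde_urls("en", "id"):
        unpack(fetch(url, "opus_kde"), "opus_kde")
```

    for another language pair, change the two codes passed to kde_urls
    and verify the alignment file name by hand first.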

    the KDE text is of course not very exciting and maybe not exactly what you
    might need for the SMT training (it's mainly terms and not so many
    complete sentences). but you could try.
    (it's very small as well but at least you have many language pairs)

    good luck!

    Jörg

    ***********/\/\/\/\/\/\/\/\/\/\/\************************************
    ** Jörg Tiedemann tiedeman@let.rug.nl **
    ** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
    ** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
    ** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
    ** 9712 EK Groningen fax: +31 (0)50-363 6855 **
    *************************************/\/\/\/\/\/\/\/\/\/\/\**********