Re: [Corpora-List] BILINGUAL PARALLEL CORPORA

From: Philipp Koehn (pkoehn@inf.ed.ac.uk)
Date: Mon Nov 13 2006 - 15:06:31 MET

    Hi,

    Large corpora available for the languages in question
    are Europarl http://www.statmt.org/europarl/ and
    JRC-Acquis (Acquis Communautaire) http://langtech.jrc.it/JRC-Acquis.html.

    I am not sure what you mean by your second question.
    What is the purpose of such a tool? There are tools out
    there that do word alignment, build statistical machine
    translation models, etc.

    Also, whether a corpus counts as large very much depends on
    what you want to do with it. For statistical machine
    translation, 1 million words goes a long way, although
    recent systems are typically trained on more data.
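
    To check where a given corpus stands relative to the 1 million word
    figure, a rough count is usually enough. The following is a minimal
    sketch, assuming the corpus is sentence-aligned with one sentence per
    line on each side and that whitespace splitting is an acceptable
    approximation of word counting. The function name corpus_size and the
    toy sentences are illustrative only.

```python
from typing import Iterable, Tuple


def corpus_size(src_lines: Iterable[str], tgt_lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (sentence_pairs, source_words, target_words).

    Assumes line i of the source corresponds to line i of the target.
    Words are counted by whitespace splitting, which is only a rough
    approximation of real tokenization.
    """
    pairs = src_words = tgt_words = 0
    # zip stops at the shorter input, so unmatched trailing lines are ignored.
    for src, tgt in zip(src_lines, tgt_lines):
        pairs += 1
        src_words += len(src.split())
        tgt_words += len(tgt.split())
    return pairs, src_words, tgt_words


# Toy data standing in for two aligned files.
src = ["resumption of the session", "thank you very much"]
tgt = ["reanudación del período de sesiones", "muchas gracias"]
print(corpus_size(src, tgt))  # (2, 8, 7)
```

    For real data, two open file handles (one per language) can be passed
    in place of the lists, since the function only iterates over lines.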

    Regards,
    Philipp Koehn

    On 11/12/06, JLDLME <jldlme@yahoo.com> wrote:
    > Dear Corpora-List members,
    >
    > I have three questions...
    >
    > Does anyone know of a freely available, sentence-aligned bilingual
    > corpus covering several languages, in particular Scandinavian
    > (Finnish, Norwegian, etc.) or Latin languages (Spanish, Italian,
    > etc.), for bilingual studies?
    >
    > My second question is: What would be the requirements for creating an
    > online/desktop software tool for the whole process of a parallel corpus?
    >
    > Finally, would you consider one million words (in both languages) a
    > large or a small bilingual corpus?
    >
    > Any help will be appreciated.
    >
    >
    > Regards,
    >
    >
    > J. L. DeLucca (in some place of Spain)


