RE: [Corpora-List] BILINGUAL PARALLEL CORPORA

From: Ralf Steinberger (ralf.steinberger@jrc.it)
Date: Tue Nov 14 2006 - 08:35:15 MET

    Dear J.L., :-)

     

    The JRC-Acquis multilingual parallel corpus is freely available for research
    purposes. You can find information on the corpus and a link to the download
    site at the web page:

     

        http://langtech.jrc.it/JRC-Acquis.html

     

    The JRC-Acquis covers the 20 official EU languages plus Romanian. Norwegian
    is thus not included, but several other Scandinavian languages are. The
    corpus is paragraph-aligned for each of the 210 language pairs. Many of the
    paragraphs are single sentences.

     

    I hope this helps. Greetings from Lago Maggiore in Italy to "some place
    of Spain",

     

    Ralf

     

    PS: JRC's multilingual news aggregation and analysis system NewsExplorer now
    tracks longer news stories over time. Check it out at
    http://press.jrc.it/NewsExplorer/.

     

     

    Ralf Steinberger (Ralf.Steinberger@jrc.it, http://langtech.jrc.it/RS.html)
    European Commission - Joint Research Centre (JRC)
    IPSC - SeS - Language Technology (http://langtech.jrc.it,
    http://press.jrc.it/NewsExplorer)
    21020 Ispra (VA), Italy

     

     

    Here is some more information:

     

    SIZE AND FORMAT

     

    - 21 languages (all 20 official EU languages plus Romanian)

    - Average corpus size: 8.8 million words per language

    - XML Format according to TEI P4, UTF-8-encoded

    - Modular: download the languages you need.
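
    As a rough illustration of the format described above, here is a minimal
    Python sketch that reads paragraphs from a UTF-8-encoded, TEI-style XML
    document. The sample document, its layout (<p> elements carrying an "n"
    attribute) and the function name read_paragraphs are assumptions made for
    this example only; the actual corpus files may be structured differently,
    so inspect a downloaded file before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, NOT taken from the corpus: a tiny TEI-style document.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="sample-doc" lang="en">
  <text>
    <body>
      <p n="1">First paragraph.</p>
      <p n="2">Second paragraph.</p>
    </body>
  </text>
</TEI.2>"""


def read_paragraphs(xml_text):
    """Return (n, text) tuples for every <p> element in the document."""
    # Parse from bytes so the UTF-8 encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        (p.get("n"), "".join(p.itertext()).strip())
        for p in root.iter("p")
    ]


for number, text in read_paragraphs(SAMPLE):
    print(number, text)
```

    Keeping the paragraph number next to the text is convenient because
    paragraph-level alignments need some identifier to refer to.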

     

    LANGUAGES

     

    Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French,

    Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,

    Romanian, Slovak, Slovene, Spanish, Swedish.

     

    TEXT TYPES

     

    - Documents on contents, principles and political objectives of the EU
    Treaties

    - EU legislation

    - Declarations

    - Resolutions

    - Acts

    - International agreements.

     

    PARAGRAPH ALIGNMENT

     

    - Paragraph-aligned for all 210 language pairs

    - Paragraphs are sentence parts, sentences, or groups of sentences

    - 2 alternative alignments: using Vanilla and HunAlign

    - Ca. 270,000 alignments per language pair.
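
    The number of language pairs follows directly from the language list: 21
    languages give 21 x 20 / 2 = 210 unordered pairs (20 languages would give
    190). The short Python sketch below checks this. The function name
    language_pairs is made up for the example, and the total at the end
    assumes that the figure of ca. 270,000 alignments holds for every pair,
    which is only an approximation.

```python
from itertools import combinations

# The 21 corpus languages as listed in this message.
LANGUAGES = [
    "Czech", "Danish", "Dutch", "English", "Estonian", "German", "Greek",
    "Finnish", "French", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Maltese", "Polish", "Portuguese", "Romanian", "Slovak", "Slovene",
    "Spanish", "Swedish",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages."""
    return list(combinations(sorted(languages), 2))


pairs = language_pairs(LANGUAGES)
print(len(LANGUAGES), "languages ->", len(pairs), "pairs")

# Rough total, assuming ca. 270,000 alignments for every single pair.
ALIGNMENTS_PER_PAIR = 270_000
print("approx. total alignments:", len(pairs) * ALIGNMENTS_PER_PAIR)
```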

     

    MANUAL SUBJECT DOMAIN CLASSIFICATION

     

    - Manually classified according to EUROVOC subject domains

    - Selected from 6000 hierarchically organised classes, wide-coverage.

     

    USE / DOWNLOAD

     

    - Download from http://langtech.jrc.it/JRC-Acquis.html

    - Usage free for research purposes.

     

    FOR MORE DETAILS

     

    Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
    Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
    aligned parallel corpus with 20+ languages'. Proceedings of the 5th
    International Conference on Language Resources and Evaluation (LREC'2006).
    Genoa, Italy, 24-26 May 2006. Available at
    http://langtech.jrc.it/#Publications.

     

     

      _____

    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of JLDLME
    Sent: 12 November 2006 18:40
    To: CORPORA@HD.UIB.NO
    Subject: [Corpora-List] BILINGUAL PARALLEL CORPORA

     

    Dear Corpora-List members,

     

    I have three questions...

     

    Does anyone know of a freely available, sentence-aligned bilingual corpus
    involving several languages, in particular Scandinavian (Finnish,
    Norwegian, etc.) or Latin languages (Spanish, Italian, etc.), for
    bilingual studies?

     

    My second question is: What would be the requirements for creating an
    online/desktop software tool that covers the whole process of building a
    parallel corpus?

     

    Finally, would you consider one million words (in both languages) a
    large or a small bilingual corpus?

     

    Any help will be appreciated.

     

     

    Regards,

     

     

    J. L. DeLucca (in some place of Spain)

     


