[Corpora-List] JRC-Acquis: a large aligned parallel corpus in 21 languages, freely available

From: Ralf Steinberger (ralf.steinberger@jrc.it)
Date: Fri May 19 2006 - 14:06:03 MET DST

    JRC-Acquis: a large aligned parallel corpus in 21 languages, freely
    available

    Readers on this list may be interested in the availability of the
    'JRC-Acquis' parallel corpus:

    SIZE AND FORMAT

    - 21 languages (all 20 official EU languages plus Romanian)
    - Average corpus size: 8.8 million words per language
    - XML format conforming to TEI P4, UTF-8 encoded
    - Modular: download the languages you need.

    LANGUAGES

    Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French,
    Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,
    Romanian, Slovak, Slovene, Spanish, Swedish.

    TEXT TYPES

    - Documents on contents, principles and political objectives of the EU
    Treaties
    - EU legislation
    - Declarations
    - Resolutions
    - Acts
    - International agreements.

    PARAGRAPH ALIGNMENT

    - Paragraph-aligned for all 210 language pairs
    - Paragraphs are sentence parts, sentences, or groups of sentences
    - Two alternative alignments, produced with Vanilla and HunAlign
    - Ca. 270,000 alignments per language pair.
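
    The figure of 210 language pairs follows from counting unordered
    pairs of the 21 languages, n * (n - 1) / 2. A minimal sketch of that
    arithmetic is given below; the variable names are illustrative only,
    and the total is a rough estimate derived from the approximate
    per-pair figure above, not a number reported for the corpus.

```python
from itertools import combinations

# The 21 corpus languages, as listed in the announcement.
LANGUAGES = [
    "Czech", "Danish", "Dutch", "English", "Estonian", "German", "Greek",
    "Finnish", "French", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Maltese", "Polish", "Portuguese", "Romanian", "Slovak", "Slovene",
    "Spanish", "Swedish",
]

# Unordered language pairs: n * (n - 1) / 2 for n languages.
PAIRS = list(combinations(LANGUAGES, 2))

# Approximate figure from the announcement ("Ca. 270,000 per pair").
APPROX_ALIGNMENTS_PER_PAIR = 270_000

# Rough estimate only: assumes every pair has about the same number
# of alignments, which the announcement does not guarantee.
APPROX_TOTAL_ALIGNMENTS = len(PAIRS) * APPROX_ALIGNMENTS_PER_PAIR

if __name__ == "__main__":
    print(len(LANGUAGES), "languages,", len(PAIRS), "language pairs")
    print("roughly", APPROX_TOTAL_ALIGNMENTS, "alignments in total")
```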

    MANUAL SUBJECT DOMAIN CLASSIFICATION

    - Documents are manually classified according to EUROVOC subject domains
    - Classes are drawn from a wide-coverage set of 6000 hierarchically
    organised classes.

    USE / DOWNLOAD

    - Download from http://langtech.jrc.it/JRC-Acquis.html
    - Free to use for research purposes.

    FOR MORE DETAILS

    Steinberger, Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
    Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
    aligned parallel corpus with 20+ languages'. Proceedings of the 5th
    International Conference on Language Resources and Evaluation (LREC'2006).
    Genoa, Italy, 24-26 May 2006. Available at
    http://langtech.jrc.it/#Publications.

    CONTACT FOR FURTHER INFORMATION

    Ralf Steinberger (Ralf.Steinberger@jrc.it)
    European Commission - Joint Research Centre (JRC)
    IPSC - SeS - Language Technology
    URL: http://langtech.jrc.it, http://press.jrc.it/NewsExplorer
    T.P. 267, Via Fermi 1
    21020 Ispra (VA), Italy
    Tel: +39 0332 78-6271
    Fax: +39 0332 78-5154


