Re: [Corpora-List] Legal-domain corpora

From: Jernej Vicic (jernej.vicic@pef.upr.si)
Date: Wed Oct 18 2006 - 17:45:44 MET DST


    You can try JRC-Acquis:

    JRC-Acquis: a large aligned parallel corpus in 21 languages, freely
    available

    SIZE AND FORMAT

    - 21 languages (all 20 official EU languages plus Romanian)
    - Average corpus size: 8.8 million words per language
    - XML Format according to TEI P4, UTF-8-encoded
    - Modular: download the languages you need.
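    Because the files are UTF-8-encoded XML, any standard XML parser can
    read them. A minimal sketch in Python follows. The sample document is
    hypothetical: the element names (TEI.2, teiHeader, text, body, p) and
    the n attribute are assumptions in the general style of TEI P4, not
    the confirmed JRC-Acquis layout, so check them against a downloaded
    file before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI P4-style sample; NOT taken from the real corpus.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="doc-0001-en" lang="en">
  <teiHeader><fileDesc><titleStmt><title>Hypothetical sample</title></titleStmt></fileDesc></teiHeader>
  <text>
    <body>
      <p n="1">First paragraph of the sample.</p>
      <p n="2">Second paragraph of the sample.</p>
    </body>
  </text>
</TEI.2>
"""


def read_paragraphs(xml_text):
    """Return (n, text) pairs for every <p> element in the document."""
    # Parse from bytes so the UTF-8 encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [(p.get("n"), "".join(p.itertext()).strip()) for p in root.iter("p")]


paragraphs = read_paragraphs(SAMPLE)
for number, text in paragraphs:
    print(number, text)
```

    For a real file, ET.parse(path).getroot() would replace the
    ET.fromstring call; the path and the element names would need to
    match what the download actually contains.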

    LANGUAGES

    Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French,
    Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,
    Romanian, Slovak, Slovene, Spanish, Swedish.

    TEXT TYPES

    - Documents on contents, principles and political objectives of the EU
    Treaties
    - EU legislation
    - Declarations
    - Resolutions
    - Acts
    - International agreements.

    PARAGRAPH ALIGNMENT

    - Paragraph-aligned for all 210 language pairs
    - Paragraphs are sentence parts, sentences, or groups of sentences
    - Two alternative alignments, produced with Vanilla and HunAlign
    - Ca. 270,000 alignments per language pair.
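    The figure of 210 language pairs follows from the 21 languages listed
    above: the number of unordered pairs is 21 * 20 / 2 = 210. A short
    Python sketch makes this concrete. The two-letter codes are ISO 639-1
    codes chosen here for illustration; they are an assumption and may
    not match the identifiers used in the corpus files.

```python
from itertools import combinations

# ISO 639-1 codes for the 21 languages listed in the announcement
# (assumed labels, not necessarily the corpus's own identifiers).
LANGUAGES = [
    "cs", "da", "nl", "en", "et", "de", "el", "fi", "fr", "hu", "it",
    "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv",
]

# Every unordered pair of distinct languages, in sorted order.
pairs = list(combinations(sorted(LANGUAGES), 2))

# Rough total, using the announcement's approximate per-pair figure.
ALIGNMENTS_PER_PAIR = 270_000
approx_total_alignments = len(pairs) * ALIGNMENTS_PER_PAIR

print(len(pairs))               # 210
print(approx_total_alignments)  # 56700000
```

    The total is only an order-of-magnitude estimate, since "ca. 270,000"
    is itself an approximate per-pair figure.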

    MANUAL SUBJECT DOMAIN CLASSIFICATION

    - Manually classified according to EUROVOC subject domains
    - Classes are selected from 6000 hierarchically organised, wide-coverage
    EUROVOC classes.

    USE / DOWNLOAD

    - Download from http://langtech.jrc.it/JRC-Acquis.html
    - Free to use for research purposes.

    Seth Grimes wrote:

    >Hello all,
    >
    > I'm researching legal-domain application of NLP with machine
    >learning. What annotated corpora are available in this domain, either for
    >free or for a license fee? I'd be interested in --
    >
    >- legislation and statutes
    >- case law
    >- briefs, depositions & testimony, crime reports, and evidentiary
    >materials
    >- court judgments
    >- patent filings
    >
    >-- and also in parallel, multi-lingual corpora, for instance that might
    >have been created in the EU, Switzerland, Canada, and other areas with
    >multiple official languages.
    >
    > I've been told that news-media text can provide good training
    >material for the legal domain. I'd also be interested in hearing
    >reactions to that claim, especially if anyone has formally studied the
    >question.
    >
    > Thanks very much for all help,
    >
    > Seth
    >
    >
    >--
    >Seth Grimes Alta Plana Corp, analytical computing & data management
    > Intelligent Enterprise magazine (CMP), Contributing Editor
    >grimes@altaplana.com http://altaplana.com 301-270-0795
    >
    >
    >


