[Corpora-List] COMPARA version 5.0 - announcement

From: Santos Diana (Diana.Santos@sintef.no)
Date: Mon Nov 10 2003 - 10:46:56 MET

    Dear all,

    We are pleased to announce COMPARA's version 5.0, with over one million
    words of English and Portuguese parallel texts.

    COMPARA is an extensible bidirectional parallel corpus of English and
    Portuguese that is freely accessible at http://www.linguateca.pt/COMPARA/.
    The corpus has been continuously improved since its first version back in
    2000. Version 5.0 is the result of an extensive revision of the corpus and
    its encoding.

    The corpus is encoded in the IMS Corpus Workbench system and is searchable
    via the DISPARA Web interface. Alignment is based on the source-text
    sentence and allows users to search for sentences that have been joined,
    split, added to, deleted from, and reordered in translation. Other
    searchable features are translators' notes, foreign words, titles, emphasis
    and named entities.

    Version 5.0 contains 39 aligned text extracts of published fiction by 27
    different authors from Angola, Brazil, Mozambique, Portugal, South Africa,
    the United Kingdom and the United States, and 25 more texts are in the
    processing queue.

    New features in COMPARA version 5.0 include:
    - all texts have been revised for encoding of single and double quotes
    (and made distinct from apostrophes)
    - a new semantics was given to the structural markup <foreign>,
    <title> and <emph>, and a new category was added, <named> (for named
    entities)
    - a new procedure for sentence definition, regarding the colon
    - a better and more complete display of the results, as well as of the
    corpus overview, was implemented
    - the random selection of hits to be displayed was improved
    - a new search and display feature was added, that of original vs.
    translated text

    Ana Frankenberg-Garcia & Diana Santos

    Diana Santos, Diana.Santos@sintef.no
    Linguateca, http://www.linguateca.pt
    SINTEF Telecom & Informatics
    Pb 124 Blindern, N-0314 Oslo Noruega
