[Corpora-List] COMPARA version 5.0 - announcement

From: Santos Diana (Diana.Santos@sintef.no)
Date: Mon Nov 10 2003 - 10:46:56 MET

Dear all,

We are pleased to announce version 5.0 of COMPARA, with over one million
words of English and Portuguese parallel texts.

COMPARA is an extensible bidirectional parallel corpus of English and
Portuguese that is freely accessible at http://www.linguateca.pt/COMPARA/.
The corpus has been continuously improved since its first version back in
2000. Version 5.0 is the result of an extensive revision of the corpus and
its encoding.

The corpus is encoded in the IMS Corpus Workbench system and is searchable
via the DISPARA Web interface. Alignment is based on the source-text
sentence and allows users to search for sentences that have been joined,
split, added to, deleted from, and reordered in translation. Other
searchable features are translators' notes, foreign words, titles, emphasis
and named entities.
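
As a rough illustration of the alignment categories mentioned above, here is a minimal Python sketch that labels an alignment unit by how many source and translation sentences it contains. The AlignmentUnit class, the classify function and the category labels are hypothetical names chosen for this example; they are assumptions, not COMPARA's or DISPARA's actual data model or query syntax. Reordering is left out because it cannot be detected from sentence counts alone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignmentUnit:
    """Hypothetical alignment unit: sentences of the original text
    paired with the corresponding sentences of the translation."""
    source: List[str]
    target: List[str]


def classify(unit: AlignmentUnit) -> str:
    """Label a unit by its sentence counts (illustrative categories only)."""
    n_src, n_tgt = len(unit.source), len(unit.target)
    if n_src == 0 and n_tgt == 0:
        raise ValueError("an alignment unit needs at least one sentence")
    if n_tgt == 0:
        return "deleted"      # source sentence with no translation
    if n_src == 0:
        return "added"        # translation sentence with no source
    if n_src == 1 and n_tgt == 1:
        return "one-to-one"
    if n_src == 1:
        return "split"        # one source sentence, several in translation
    if n_tgt == 1:
        return "joined"       # several source sentences, one in translation
    return "many-to-many"


if __name__ == "__main__":
    unit = AlignmentUnit(
        source=["She left early, and nobody noticed."],
        target=["Ela saiu cedo.", "Ninguém reparou."],
    )
    print(classify(unit))
```

The counts-based rule is only a simplification for illustration; an actual query over the corpus would go through the DISPARA interface rather than code like this.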

Version 5.0 contains 39 aligned text extracts of published fiction by 27
different authors from Angola, Brazil, Mozambique, Portugal, South Africa,
the United Kingdom and the United States, and 25 more texts are in the
processing queue.

New features in COMPARA version 5.0 include:
- the encoding of single and double quotes has been revised in all texts,
and quotes are now distinct from apostrophes
- the structural markup <foreign>, <title> and <emph> has been given a new
semantics, and a new category, <named> (for named entities), has been
added
- a new procedure for sentence definition, regarding the colon, is now
enforced
- the display of the results, as well as of the corpus overview, is better
and more complete
- the random choice of hits to be displayed has been improved
- original vs. translated text has been added as a new search and display
feature

Ana Frankenberg-Garcia & Diana Santos
compara@linguateca.pt
www.linguateca.pt/COMPARA/

====================================
Diana Santos, Diana.Santos@sintef.no
Linguateca, http://www.linguateca.pt
SINTEF Telecom & Informatics
Pb 124 Blindern, N-0314 Oslo, Norway

