PD: [Corpora-List] Date: Wed, 11 Sep 2002 15:16:20 +0200

From: Rafał Górski (RafalG@ijp-pan.krakow.pl)
Date: Thu Sep 12 2002 - 14:55:27 MET DST

    Dear Maria and Joerg,
    In fact there is a lot of confusion in the terminology. Joerg writes:

    >* a translation corpus should contain the original version and at least
    >one translation (but not necessarily only one)

    On the other hand, McEnery & Wilson, "Corpus Linguistics" (2nd edition,
    2001), p. 70:
    "translation corpora differ from parallel corpora, as they do not represent
    text in translation. Rather they allow one to compare, for example, L1
    French texts in one genre with L1 English texts in the same genre." The
    authors treat "translation" and "comparable" as synonyms (however, they
    prefer the former, using it in the body of the text; the term
    "comparable" is given only in a footnote).

    Sinclair: "A comparable corpus is one which selects similar texts in more
    than one language or variety." EAGLES Preliminary recommendations on Corpus
    Typology. Version of May, 1996
    http://www.ilc.pi.cnr.it/EAGLES96/corpustyp/corpustyp.html
    Note, however, that Sinclair calls the International Corpus of English a
    "comparable corpus". In this case you cannot treat "comparable" and
    "translation" as equivalents!

    > * parallel corpora should be aligned to some extent to make them
    > searchable within linked segments, alignment can be done e.g. on
    > paragraphs or sentences (translation corpora do not have to be aligned I
    > would say)
    John Sinclair in: EAGLES Preliminary recommendations... defines:
    "A parallel corpus is a collection of texts, each of which is translated
    into one or more other languages than the original."
    McEnery & Wilson (2001) and Sinclair suggest that parallel corpora are not
    necessarily aligned, although they admit that a parallel corpus with no
    alignment is a bit strange (see section 2.3.1.)

    I admit that the term "translation corpus" is confusing: you would rather
    understand it as a "corpus of translations" than "corpus for translators" or
    "used mainly by translators" (which is the right interpretation).

    Rafal L. Górski

    ----- Original Message -----
    From: Jörg Tiedemann <joerg@stp.ling.uu.se>
    To: <maria_rzewuska@mail.ukie.gov.pl>
    Cc: <corpora@hd.uib.no>
    Sent: Wednesday, September 11, 2002 6:15 PM
    Subject: Re: [Corpora-List] Date: Wed, 11 Sep 2002 15:16:20 +0200

    >
    >
    > I don't know of any single article which summarises the terminology with
    > regards to parallel corpora but from my experience some of the
    > differences are the following:
    >
    > * bilingual corpora are strictly two languages
    > * a parallel corpus contains translations of a common source but they do
    > not need to include the original version (even if this sounds strange -
    > I know of parallel corpora e.g. from the EU which do not indicate the
    > original version and I used to work with some of them without
    > knowing/using the original or intermediate documents)
    > * parallel corpora should be aligned to some extent to make them
    > searchable within linked segments, alignment can be done e.g. on
    > paragraphs or sentences (translation corpora do not have to be aligned I
    > would say)
    > * comparable corpora are two or more corpora with similar size and from
    > similar domains. usually people assume similar distribution of
    > words/phrases in comparable corpora in order to compare them. They do
    > not have to be parallel (or translations of each other)
    > * comparable and parallel corpora do not have to include multiple
    > languages whereas translation corpora should
    > * sometimes I use another term for bilingual parallel corpora: bitexts -
    > just to make it shorter. in this case, aligned segments within such
    > corpora will be bitext segments
    >
    >
    > I hope this helped a bit and did not create even more confusion,
    >
    >
    > best regards,
    >
    >
    >
    > Jörg
    >
    > ***********/\/\/\/\/\/\/\/\/\/\/\************************************
    > ** Joerg Tiedemann joerg@stp.ling.uu.se **
    > ** Department of Linguistics http://stp.ling.uu.se/~joerg/ **
    > ** Uppsala University tel: (018) 471 7007 **
    > ** S-751 20 Uppsala/SWEDEN fax: (018) 471 1416 **
    > *************************************/\/\/\/\/\/\/\/\/\/\/\**********
    >
    >
    >
    >
    > On Wed, 11 Sep 2002 maria_rzewuska@mail.ukie.gov.pl wrote:
    >
    > > Hi, I have been reading the list for a while and lately I took a closer
    > > look at some bilingual corpus projects and I noticed a relatively flexible
    > > use of terms: translation corpus, parallel corpus, comparable corpus, but
    > > mainly between the two first. Maybe someone could tell me is there any
    > > difference or is it simply mixed up. In the composition of the corpora I
    > > did not find any difference which could explain the terminological
    > > difference. Any book or clever article that I should read?
    > > thanks
    > >
    > > Maria Rzewuska
    > > Adam Mickiewicz University
    > > Poznan
    > > PL
    > >
    > >
    >
    >
    >
    >


