RE: [Corpora-List] Parallel corpora and word alignment, WAS: American and British English spelling converter

From: TadPiotr (tadpiotr@plusnet.pl)
Date: Fri Nov 10 2006 - 15:57:22 MET

    Hello All
    those of us who deal with speech might also be interested to know that there
    are different American and British audio tracks on movies on DVD. (There is
    a version of Zorro with Anthony Hopkins, and I was wondering whether he did
    both versions.) I have no idea whether the differences are only in
    pronunciation or perhaps also lexical and other ones.
    However, that means there is quite a lot of material waiting to be
    described.
    Best wishes
    Tadeusz Piotrowski

      _____

    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of Ramesh Krishnamurthy
    Sent: Friday, November 10, 2006 3:46 PM
    To: Merle Tenney; Mark P. Line; CORPORA@UIB.NO
    Subject: Re: [Corpora-List] Parallel corpora and word alignment, WAS:
    American and British English spelling converter

    Hi Merle
    I must admit I hadn't been thinking of "parallel" corpora along such
    strict-definition lines.

    So who is creating large amounts of 'parallel' data (in the
    technical/translation sense) for British English and American English? I
    wouldn't have thought there was a very large market....?

    Noah Smith mentioned Harry Potter, and I must admit I'm quite surprised to
    discover that publishers are making such changes as

       They had drawn for the house cup
       They had tied for the house cup

    Perhaps because it's "children's" literature? Or at least read by many
    children, who may not be willing/able to cross varietal boundaries with
    total comfort.

    But when I read a novel by an American author, I accept that it's part of my
    role as reader to take on board any varietal differences as part of the
    context. I can't imagine anyone wanting to translate it into British English
    for my benefit, and I suspect I would hate to read the resulting text...

    Best
    Ramesh

    At 18:53 09/11/2006, Merle Tenney wrote:

    Ramesh Krishnamurthy wrote:
    >
    > ...and there is no obvious parallel corpus of Br-Am Eng to consult...
    > Do you know of one by any chance...
    >
    > And Mark P. Line responded:
    >
    >Why would it have to be a *parallel* corpus?
     
    In a word, alignment. The formative work in parallel corpora has come from
    the machine translation crowd, especially the statistical machine
    translation researchers. The primary purpose of having a parallel corpus is
    to align translationally equivalent documents in two languages, first at
    the sentence level, then at the word and phrase level, in order to
    establish word and phrase equivalences. A secondary purpose, deriving from
    the sentence-level alignment, is to produce source and target sentence
    pairs to prime the pump for translation memory systems.
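
    As a rough illustration of what sentence-level alignment buys you, here is
    a minimal sketch that takes an already aligned sentence pair (the Harry
    Potter example quoted earlier in this thread) and pulls out the word-level
    substitutions. It uses Python's difflib as a simple stand-in for a real
    statistical word aligner, which only makes sense because the two varieties
    share almost all of their wording; the function name substitutions is
    purely illustrative.

```python
from difflib import SequenceMatcher


def substitutions(src_sentence, tgt_sentence):
    """Return (source, target) word spans that differ between two aligned sentences."""
    src = src_sentence.split()
    tgt = tgt_sentence.split()
    # Compare token lists rather than characters, so edits are whole words.
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    return [
        (" ".join(src[i1:i2]), " ".join(tgt[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]


# Sentence pair taken from the thread; assumed to be already sentence-aligned.
aligned_pairs = [
    ("They had drawn for the house cup", "They had tied for the house cup"),
]

for british, american in aligned_pairs:
    print(substitutions(british, american))
```

    Each aligned pair is also exactly the kind of source/target record a
    translation memory would store. Between genuinely different languages a
    diff like this is useless, and a trained alignment model would be needed
    instead.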
     
    Like you, I have wondered why you couldn't study two text corpora of similar
    but not equivalent texts and compare them in their totality. Of course you
    can, but is there any way in this scenario to come up with meaningful
    term-level comparisons, as good as you can get with parallel corpora? I can
    see two ways you might proceed:
     
    The first method largely begs the question of term equivalence. You begin
    with a set of known related words and you compare their frequencies and
    distributions. So if you are studying language models, you compare sheer,
    complete, and utter as a group. If you are studying dialect differences,
    you study diaper and nappy or bonnet and hood (clothing and automotive). If
    you are studying translation equivalence in English and Spanish, you study
    flag, banner, standard, pendant alongside bandera, estandarte, pabellón (and
    flag, flagstone vs. losa, lancha; flag, fail, languish, weaken vs. flaquear,
    debilitarse, languidecer; etc.). The point is, you already have your
    comparable sets going in, and you study their usage across a broad corpus.
    One problem here is that you need to have a strong word sense disambiguation
    component or you need to work with a word sense-tagged corpus to deal with
    homonymous and polysemous terms like sheer, bonnet, flat, and flag, so you
    still have some hard work left even if you start with the related word
    groups.
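
    A minimal sketch of this first method, assuming you already have the
    related word pairs and two plain-text corpora to hand. The two one-sentence
    "corpora" below are invented stand-ins, and relative_freqs is a
    hypothetical helper, not part of any corpus toolkit.

```python
from collections import Counter
import re


def relative_freqs(text, terms):
    """Return occurrences per million tokens for each term in `terms`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {t: counts[t] * 1_000_000 / total for t in terms}


# Invented toy texts standing in for real British and American corpora.
british = "She changed the nappy and then lifted the bonnet of the car."
american = "She changed the diaper and then lifted the hood of the car."

# The comparable sets are known going in, as the method requires.
pairs = [("nappy", "diaper"), ("bonnet", "hood")]
terms = [t for pair in pairs for t in pair]

br = relative_freqs(british, terms)
am = relative_freqs(american, terms)

for br_term, am_term in pairs:
    print(f"{br_term}: BrE {br[br_term]:.0f} / AmE {am[br_term]:.0f}   "
          f"{am_term}: BrE {br[am_term]:.0f} / AmE {am[am_term]:.0f}")
```

    Note that the counts are over raw word forms, so bonnet the hat and bonnet
    the car part are lumped together. That is exactly the word sense problem
    described above.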
     
    The second method does not begin, a priori, with sets of related words. In
    fact, generating synonyms, dialectal variants, and translation equivalents
    is one of its more interesting challenges. Detailed lexical, collocational,
    and syntactic characterization is another. Again, this is much easier to
    do if you are working with parallel corpora. If you are dealing with large,
    nonparallel texts, this is a real challenge. Other than inflected and
    lemmatized word forms, there are a few more hooks that can be applied,
    including POS tagging and WSD. Even if both of these technologies perform
    well, however, that is still not enough to get you to the quality of data
    that you get with parallel corpora.
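
    One way to picture the second method on nonparallel texts is to compare
    the contexts in which words occur. The sketch below is only an
    illustration under strong assumptions: the sentences are invented, context
    is taken to be the other words in the same sentence, and cosine similarity
    is one possible measure among many. It is not offered as a method that
    reaches parallel-corpus quality.

```python
from collections import Counter
from math import sqrt


def context_vector(sentences, target):
    """Count the words that co-occur with `target` in the same sentence."""
    ctx = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if target in tokens:
            ctx.update(t for t in tokens if t != target)
    return ctx


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented, nonparallel toy corpora (no sentence-level correspondence assumed).
british = ["the baby needs a clean nappy"]
american = ["the baby needs a clean diaper", "open the hood of the car"]

source_vec = context_vector(british, "nappy")
candidates = ["diaper", "hood"]
ranked = sorted(
    candidates,
    key=lambda w: cosine(source_vec, context_vector(american, w)),
    reverse=True,
)
best = ranked[0]
print(ranked)
```

    On real data the candidate list would be the whole vocabulary and the
    rankings far noisier, which is the quality gap described above.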
     
    Mark, if you can figure out a way to combine the quality and quantity of
    data from a very large corpus with the alignment and equivalence power of a
    parallel corpus without actually having a parallel corpus, I will personally
    nominate you for the Nobel Prize in Corpus Linguistics. :-)
     
    Merle
     
    PS and Shameless Microsoft Plug: In the last paragraph, I accidentally
    typed “figure out a why to combine” and I got the blue squiggle from Word
    2007, which was released to manufacturing on Monday of this week. It
    suggested way, and of course I took the suggestion. I am amazed at the
    number of mistakes that the contextual speller has caught in my writing
    since I started using it. I recommend the new version of Word and Office
    for this feature alone. :-)

    Ramesh Krishnamurthy

    Lecturer in English Studies, School of Languages and Social Sciences, Aston
    University, Birmingham B4 7ET, UK
    [Room NX08, North Wing of Main Building] ; Tel: +44 (0)121-204-3812 ; Fax:
    +44 (0)121-204-3766
    http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp

    Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/


