Re: [Corpora-List] Parallel corpora and word alignment, WAS: American and British English spelling converter

From: Ramesh Krishnamurthy (r.krishnamurthy@aston.ac.uk)
Date: Fri Nov 10 2006 - 15:45:52 MET


    Hi Merle
    I must admit I hadn't been thinking of "parallel"
    corpora along such strictly defined lines.

    So who is creating large amounts of 'parallel'
    data (in the technical/translation sense)
    for British English and American English? I
    wouldn't have thought there was a very large
    market....?

    Noah Smith mentioned Harry Potter, and I must
    admit I'm quite surprised to discover
    that publishers are making such changes as
    > They had drawn for the house cup
    > They had tied for the house cup
    Perhaps because it's "children's" literature? Or
    at least read by many children,
    who may not be willing/able to cross varietal boundaries with total comfort.

    But when I read a novel by an American author, I
    accept that it's part of my role as reader to
    take on board any varietal differences as part of
    the context. I can't imagine anyone wanting
    to translate it into British English for my
    benefit, and I suspect I would hate to read the resulting
    text...

    Best
    Ramesh

    At 18:53 09/11/2006, Merle Tenney wrote:
    >Ramesh Krishnamurthy wrote:
    > >
    > > ...and there is no obvious parallel corpus of Br-Am Eng to consult...
    > > Do you know of one by any chance...
    > >
    > > And Mark P. Line responded:
    > >
    > >Why would it have to be a *parallel* corpus?
    >
    >In a word, alignment. The formative work in
    >parallel corpora has come from the machine
    >translation crowd, especially the statistical
    >machine translation researchers. The primary purpose of
    >having a parallel corpus is to align
    >translationally equivalent documents in two
    >languages, first at the sentence level, then at
    >the word and phrase level, in order to establish
    >word and phrase equivalences. A secondary
    >purpose, deriving from the sentence-level
    >alignment, is to produce source and target
    >sentence pairs to prime the pump for translation memory systems.
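    >
    >For concreteness, here is a minimal Python sketch of the
    >sentence-level step, in the spirit of Gale & Church's
    >length-based method. It is illustrative only: the names
    >are invented, the cost model is crude, and a real aligner
    >also handles 1-2, 2-1, and 0-1 beads rather than just
    >pairing and skipping whole sentences.
    >
    >    def align_sentences(src, tgt, skip_penalty=20):
    >        """Cheapest monotone alignment of two sentence lists."""
    >        def cost(a, b):              # penalise length mismatch
    >            return abs(len(a) - len(b))
    >        n, m = len(src), len(tgt)
    >        INF = float("inf")
    >        # dp[i][j] = cheapest alignment of src[:i] with tgt[:j]
    >        dp = [[INF] * (m + 1) for _ in range(n + 1)]
    >        back = [[None] * (m + 1) for _ in range(n + 1)]
    >        dp[0][0] = 0
    >        for i in range(n + 1):
    >            for j in range(m + 1):
    >                if i and j:
    >                    c = dp[i-1][j-1] + cost(src[i-1], tgt[j-1])
    >                    if c < dp[i][j]:
    >                        dp[i][j], back[i][j] = c, ("pair", i-1, j-1)
    >                if i and dp[i-1][j] + skip_penalty < dp[i][j]:
    >                    dp[i][j] = dp[i-1][j] + skip_penalty
    >                    back[i][j] = ("skip", i-1, j)
    >                if j and dp[i][j-1] + skip_penalty < dp[i][j]:
    >                    dp[i][j] = dp[i][j-1] + skip_penalty
    >                    back[i][j] = ("skip", i, j-1)
    >        pairs, i, j = [], n, m       # read the 1-1 beads back off
    >        while back[i][j] is not None:
    >            move, pi, pj = back[i][j]
    >            if move == "pair":
    >                pairs.append((src[pi], tgt[pj]))
    >            i, j = pi, pj
    >        return list(reversed(pairs))
    >
    >The word- and phrase-level alignment then runs over the
    >sentence pairs this produces.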
    >
    >Like you, I have wondered why you couldn't study
    >two text corpora of similar but not equivalent
    >texts and compare them in their totality. Of
    >course you can, but is there any way in this
    >scenario to come up with meaningful term-level
    >comparisons, as good as you can get with
    >parallel corpora? I can see two ways you might proceed:
    >
    >The first method largely begs the question of
    >term equivalence. You begin with a set of known
    >related words and you compare their frequencies
    >and distributions. So if you are studying
    >language models, you compare sheer, complete,
    >and utter as a group. If you are studying
    >dialect differences, you study diaper and nappy
    >or bonnet and hood (clothing and
    >automotive). If you are studying translation
    >equivalence in English and Spanish, you study
    >flag, banner, standard, pendant alongside
    >bandera, estandarte, pabellón (and flag,
    >flagstone vs. losa, lancha; flag, fail,
    >languish, weaken vs. flaquear, debilitarse,
    >languidecer; etc.). The point is, you already
    >have your comparable sets going in, and you
    >study their usage across a broad corpus. One
    >problem here is that you need to have a strong
    >word sense disambiguation component or you need
    >to work with a word sense-tagged corpus to deal
    >with homophonous and polysemous terms like
    >sheer, bonnet, flat, and flag, so you still have
    >some hard work left even if you start with the related word groups.
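    >
    >As a rough illustration of this first method, and nothing
    >more (plain surface frequencies, invented word pairs, and
    >no WSD, so a polysemous item like bonnet gets over-counted):
    >
    >    from collections import Counter
    >
    >    PAIRS = [("diaper", "nappy"), ("hood", "bonnet"),
    >             ("truck", "lorry")]
    >
    >    def freqs_per_million(tokens):
    >        counts = Counter(tokens)
    >        scale = 1e6 / max(len(tokens), 1)
    >        return {w: counts[w] * scale
    >                for pair in PAIRS for w in pair}
    >
    >    def compare(am_tokens, br_tokens):
    >        am = freqs_per_million(am_tokens)
    >        br = freqs_per_million(br_tokens)
    >        for us_w, uk_w in PAIRS:
    >            print(f"{us_w}: AmE {am[us_w]:.1f} / BrE {br[us_w]:.1f}   "
    >                  f"{uk_w}: AmE {am[uk_w]:.1f} / BrE {br[uk_w]:.1f}")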
    >
    >The second method does not begin, a priori, with
    >sets of related words. In fact, generating
    >synonyms, dialectal variants, and translation
    >equivalents is one of its more interesting
    >challenges. Detailed lexical, collocational,
    >and syntactic characterization is
    >another. Again, this is much easier to do if
    >you are working with parallel corpora. If you
    >are dealing with large, nonparallel texts, this
    >is a real challenge. Other than inflected and
    >lemmatized word forms, there are a few more
    >hooks that can be applied, including POS tagging
    >and WSD. Even if both of these technologies
    >perform well, however, that is still not enough
    >to get you to the quality of data that you get with parallel corpora.
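    >
    >One common move here, sketched below in toy Python, is to
    >compare words by the company they keep: build a context
    >profile for every word in each corpus, then rank the second
    >corpus's words by how closely their profiles match. This is
    >only the skeleton of the idea (real work on comparable
    >corpora adds association measures, seed lexicons, and much
    >larger windows), and all names are invented:
    >
    >    import math
    >    from collections import Counter, defaultdict
    >
    >    def context_vectors(tokens, window=2):
    >        """Bag-of-words context profile for every word."""
    >        vecs = defaultdict(Counter)
    >        for i, w in enumerate(tokens):
    >            lo = max(0, i - window)
    >            hi = min(len(tokens), i + window + 1)
    >            vecs[w].update(tokens[lo:i] + tokens[i+1:hi])
    >        return vecs
    >
    >    def cosine(u, v):
    >        dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    >        norm = (math.sqrt(sum(x*x for x in u.values()))
    >                * math.sqrt(sum(x*x for x in v.values())))
    >        return dot / norm if norm else 0.0
    >
    >    def candidate_equivalents(word, corpus_a, corpus_b, top=5):
    >        va = context_vectors(corpus_a)
    >        vb = context_vectors(corpus_b)
    >        scored = [(cosine(va[word], vec), w)
    >                  for w, vec in vb.items()]
    >        return sorted(scored, reverse=True)[:top]
    >
    >For same-language varieties the two corpora even share most
    >of their vocabulary, which makes the context profiles
    >directly comparable; across languages you need a seed
    >dictionary to map the context dimensions.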
    >
    >Mark, if you can figure out a way to combine the
    >quality and quantity of data from a very large
    >corpus with the alignment and equivalence power
    >of a parallel corpus without actually having a
    >parallel corpus, I will personally nominate you
    >for the Nobel Prize in Corpus Linguistics. :-)
    >
    >Merle
    >
    >PS and Shameless Microsoft Plug: In the last
    >paragraph, I accidentally typed “figure out a
    >why to combine” and I got the blue squiggle from
    >Word 2007, which was released to manufacturing
    >on Monday of this week. It suggested “way”, and
    >of course I took the suggestion. I am amazed at
    >the number of mistakes that the contextual
    >speller has caught in my writing since I started
    >using it. I recommend the new version of Word
    >and Office for this feature alone. :-)
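    >
    >For anyone curious, the textbook version of this kind of
    >contextual checking scores a confusion set against a
    >language model; the toy Python below shows the idea with
    >bigram counts. I make no claim that this is how Word does
    >it internally, and the sets and names are invented:
    >
    >    from collections import Counter
    >
    >    CONFUSION_SETS = [{"way", "why"},
    >                      {"their", "there", "they're"}]
    >
    >    def train_bigrams(tokens):
    >        return Counter(zip(tokens, tokens[1:]))
    >
    >    def best_choice(prev_w, word, next_w, bigrams):
    >        for conf in CONFUSION_SETS:
    >            if word in conf:
    >                # keep whichever member the neighbours favour
    >                return max(conf, key=lambda w:
    >                           bigrams[(prev_w, w)] + bigrams[(w, next_w)])
    >        return word
    >
    >On “figure out a why to combine”, the counts for ("a",
    >"way") and ("way", "to") would outvote ("a", "why") and
    >("why", "to") in almost any training text.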

    Ramesh Krishnamurthy

    Lecturer in English Studies, School of Languages
    and Social Sciences, Aston University, Birmingham B4 7ET, UK
    [Room NX08, North Wing of Main Building] ; Tel:
    +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766
    http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp

    Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/


