RE: [Corpora-List] Parallel corpora and word alignment, WAS: American and British English spelling converter

From: Ramesh Krishnamurthy (r.krishnamurthy@aston.ac.uk)
Date: Sat Nov 11 2006 - 01:35:29 MET


    Hi Merle
    Yes, again, I am reasonably aware of the work going on with learner corpora.

    >Many of the errors are not that much of a reach
    >or subject to researcher interpretation -
    >use of actually instead of currently, use of
    >depend of instead of depend on, use of
    >informations instead of information, etc.

    But, to take just one example you mentioned,
    "accommodations" seems to be fairly common
    in NAm English, according to the Bank of English
    corpus (2002: 448m words), and shows some penetration
    of BR English as well (and not always referring to NAm contexts)...

    Query is "accommodations"
    Term 1 in your query has been selected as the node

    1230 matching lines
    Corpus     Total Number of   Average Number per
               Occurrences       Million Words

    usspok     220               108.7/million
    usephem    145               41.4/million
    usbooks    429               13.2/million
    strathy    166               10.4/million
    usacad     37                5.8/million
    usnews     45                4.5/million
    npr        54                2.4/million
    wbe        11                1.1/million
    brbooks    40                0.9/million
    indy       20                0.7/million
    brephem    3                 0.6/million
    bbc        7                 0.4/million
    brspok     7                 0.3/million
    guard      11                0.3/million
    brmags     15                0.3/million
    oznews     9                 0.3/million
    times      8                 0.2/million
    econ       2                 0.1/million
    sunnow     1                 0.0/million
    newsci 0 0.0/million
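
The per-million figures in the table are raw counts normalised by subcorpus size. The subcorpus sizes are not given in this message, so the sketch below only shows the arithmetic: it back-computes an approximate implied size from a count and its (rounded) rate. The function names are illustrative and are not part of any Bank of English tooling.

```python
def per_million(count, corpus_size):
    """Normalised frequency: occurrences per million words."""
    return count / corpus_size * 1_000_000

def implied_size(count, rate_per_million):
    """Approximate subcorpus size implied by a count and its per-million rate.

    The rates in the table are rounded to one decimal place, so the
    result is only a rough estimate.
    """
    return count / rate_per_million * 1_000_000

# (count, rate per million) copied from three rows of the table
rows = {"usspok": (220, 108.7), "usbooks": (429, 13.2), "brbooks": (40, 0.9)}

for name, (count, rate) in rows.items():
    size = implied_size(count, rate)
    print(f"{name}: roughly {size / 1e6:.1f}m words implied")
```

Comparing normalised rates rather than raw counts is what makes the subcorpora comparable: usbooks has the largest raw count (429) but a far lower rate than usspok.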

    So we need to be careful not to mark this usage
    as wrong in a student's work without first
    examining the specific context, target variety, etc.

    Here are a few selected examples (I presume you
    refer to the 'place to live in' sense of
    accommodation, rather than the 'making a
    compromise' sense, which is more commonly countable):

    White Washingtonians did not lack means to
    discriminate against their black fellow citizens before Wilson came to town in
    1913, but the first southerner to occupy the White House since the Civil War
    did come with something new: the South's system of separate-and-unequal public
    accommodations and services that survived until it was dismantled by protest
    movements and court decisions in the 1950s and 1960s.

    Another irritation to European visitors was
    the absence of special first-class accommodations on steamboats and railroads.

    Congressman Newt Gingrich, a powerful voice against corruption in the
    House, enjoyed 49 days on the road in lush accommodations in 1992 at the
    expense of various interest groups, according to House financial reports.

    The bonus for Easterners: free dormitory accommodations.

    A proforma on the screen then asks them for
    details of their requirements, including neighbourhoods and price range.
    Immediately Gems will display a map of accommodations which fit their
    requirements. These will have been supplied by
    anyone with a room to hire who...

    I have also seen many of the type of corrections
    you mention, and sometimes the trigger for the error is
    signalled elsewhere in the text, so the real cause is obscured.

    >A corpus which took a strict view of learner
    >errors and associated those errors with correct native forms
    I'm not sure what you mean by a "strict view"
    (and surely it would be the observer and not the corpus which
    took it), and in my experience there may be a
    variety of "correct native forms" depending on where you
    perceive the error to be located... e.g. in a
    case of mis-concord, do you correct the number of the noun
    group or the form of the verb group? It's not
    always straightforward. To take one simple example:
    The 10-week course run from the middle of July
    Do you amend this to "courses" or to "runs"? What
    was the intention of the writer?
    To make a generic or specific statement?

    I'm afraid I remain far from convinced that this is an easy task.

    I agree that learner data can be very interesting and rewarding to study,
    but I'm not sure that the inferences to be made are at all obvious.

    Best
    Ramesh

    At 23:51 10/11/2006, you wrote:
    >Ramesh,
    >
    >Actually, there have been a lot of studies of
    >language learner errors. Many of the errors are
    >not that much of a reach or subject to
    >researcher interpretation - use of actually
    >instead of currently, use of depend of instead
    >of depend on, use of informations instead of
    >information, etc. A corpus which took a strict
    >view of learner errors and associated those
    >errors with correct native forms, via a parallel
    >corpus of corrected texts or a rich tagging
    >scheme, would be very useful for studying interference errors.
    >
    >Merle
    >
    >From: Ramesh Krishnamurthy [mailto:r.krishnamurthy@aston.ac.uk]
    >Sent: Friday, November 10, 2006 3:41 PM
    >To: Merle Tenney; CORPORA@UIB.NO
    >Subject: RE: [Corpora-List] Parallel corpora and
    >word alignment, WAS: American and British English spelling converter
    >
    >Hi Merle,
    >
    >Yes, I was aware of parallel corpora in 2 or more languages.
    >In fact, it's part of the corpus development we've initiated at Aston
    >(please see http://corpus.aston.ac.uk).
    >
    >But it intrigued me to think of parallel corpora *within* a language.
    >I suppose dialectal texts rendered into "standard" language or vice versa
    >might come close... I need to muse some more on this.
    >
    >
    >Another variant on the parallel corpus theme is
    >papers written by English language learners and
    >the corrected versions with interference problems removed.
    >I'm not sure how this could be done without
    >making huge intuitive leaps as to what the 'errors' were,
    >and what the 'interference problems' were... I'm
    >afraid a lot of the error analysis I've seen leaves me
    >greatly disturbed....
    >
    >Best
    >Ramesh
    >
    >
    >At 23:09 10/11/2006, Merle Tenney wrote:
    >
    >Ramesh,
    >
    >Lots of people are working with parallel corpora
    >in two or more languages. Honestly, I don’t
    >know of any effort to acquire parallel corpora
    >of two or more varieties of English, French,
    >Portuguese, etc. I should think that sources
    >for such corpora must exist, though not nearly
    >to the extent that they exist for texts in
    >different languages. Another variant on the
    >parallel corpus theme is papers written by
    >English language learners and the corrected
    >versions with interference problems
    >removed. Again, it is not hard to imagine that
    >such sources exist, but I cannot provide a
    >reference for either sort of same-language
    >corpus. Can someone point Ramesh and me in the right direction?
    >
    >Merle
    >
    >From: Ramesh Krishnamurthy [ mailto:r.krishnamurthy@aston.ac.uk]
    >Sent: Friday, November 10, 2006 6:46 AM
    >To: Merle Tenney; Mark P. Line; CORPORA@UIB.NO
    >Subject: Re: [Corpora-List] Parallel corpora and
    >word alignment, WAS: American and British English spelling converter
    >
    >Hi Merle
    >I must admit I hadn't been thinking of
    >"parallel" corpora along such strict-definition lines.
    >
    >So who is creating large amounts of 'parallel'
    >data (in the technical/translation sense)
    >for British English and American English? I
    >wouldn't have thought there was a very large
    >market....?
    >
    >Noah Smith mentioned Harry Potter, and I must
    >admit I'm quite surprised to discover
    >that publishers are making such changes as
    >
    > They had drawn for the house cup
    > They had tied for the house cup
    >Perhaps because it's "children's" literature? Or
    >at least read by many children,
    >who may not be willing/able to cross varietal boundaries with total comfort.
    >
    >But when I read a novel by an American author, I
    >accept that it's part of my role as reader to
    >take on board any varietal differences as part
    >of the context. I can't imagine anyone wanting
    >to translate it into British English for my
    >benefit, and I suspect I would hate to read the resulting
    >text...
    >
    >Best
    >Ramesh
    >
    >
    >At 18:53 09/11/2006, Merle Tenney wrote:
    >
    >Ramesh Krishnamurthy wrote:
    > >
    > > ...and there is no obvious parallel corpus of Br-Am Eng to consult...
    > > Do you know of one by any chance...
    > >
    > > And Mark P. Line responded:
    > >
    > >Why would it have to be a *parallel* corpus?
    >
    >In a word, alignment. The formative work in
    >parallel corpora has come from the machine
    >translation crowd, especially the statistical
    >machine translation researchers. The primary purpose of
    >having a parallel corpus is to align
    >translationally equivalent documents in two
    >languages, first at the sentence level, then at
    >the word and phrase level, in order to establish
    >word and phrase equivalences. A secondary
    >purpose, deriving from the sentence-level
    >alignment, is to produce source and target
    >sentence pairs to prime the pump for translation memory systems.
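
As a toy illustration of the word-level step, the sketch below takes one already-aligned sentence pair (the Harry Potter example quoted elsewhere in this thread) and extracts the tokens that differ. It uses Python's standard `difflib` as a stand-in; statistical MT systems use probabilistic alignment models instead, and the function name `differing_spans` is hypothetical.

```python
from difflib import SequenceMatcher

def differing_spans(source_sentence, target_sentence):
    """Return (source, target) token spans that differ between two
    sentences that are already aligned at the sentence level."""
    src = source_sentence.split()
    tgt = target_sentence.split()
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep replaced, inserted, or deleted token runs as candidate equivalences
            pairs.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return pairs

british = "They had drawn for the house cup"
american = "They had tied for the house cup"
print(differing_spans(british, american))  # [('drawn', 'tied')]
```

This only works because the two sentences are near-identical; for genuinely different languages, or heavily edited text, a simple diff of this kind would not yield usable equivalences.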
    >
    >Like you, I have wondered why you couldn't study
    >two text corpora of similar but not equivalent
    >texts and compare them in their totality. Of
    >course you can, but is there any way in this
    >scenario to come up with meaningful term-level
    >comparisons, as good as you can get with
    >parallel corpora? I can see two ways you might proceed:
    >
    >The first method largely begs the question of
    >term equivalence. You begin with a set of known
    >related words and you compare their frequencies
    >and distributions. So if you are studying
    >language models, you compare sheer, complete,
    >and utter as a group. If you are studying
    >dialect differences, you study diaper and nappy
    >or bonnet and hood (clothing and
    >automotive). If you are studying translation
    >equivalence in English and Spanish, you study
    >flag, banner, standard, pendant alongside
    >bandera, estandarte, pabellón (and flag,
    >flagstone vs. losa, lancha; flag, fail,
    >languish, weaken vs. flaquear, debilitarse,
    >languidecer; etc.). The point is, you already
    >have your comparable sets going in, and you
    >study their usage across a broad corpus. One
    >problem here is that you need to have a strong
    >word sense disambiguation component or you need
    >to work with a word sense-tagged corpus to deal
    >with homophonous and polysemous terms like
    >sheer, bonnet, flat, and flag, so you still have
    >some hard work left even if you start with the related word groups.
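
A minimal sketch of this first method, assuming the related word pairs are known in advance and that each variety is available as plain text. The two one-sentence "corpora" are invented toy data, and `word_rates` is a hypothetical helper, not an existing tool.

```python
from collections import Counter
import re

def word_rates(text, words):
    """Per-million-token rate of each word in `words` within `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: counts[w] / total * 1_000_000 for w in words}

# Invented toy texts standing in for large British and American corpora
british_text = "She changed the nappy and then checked under the bonnet of the car"
american_text = "She changed the diaper and then checked under the hood of the car"

# The comparable sets are supplied up front, as the method requires
pairs = [("nappy", "diaper"), ("bonnet", "hood")]

for br_word, am_word in pairs:
    words = (br_word, am_word)
    print(words, "Br:", word_rates(british_text, words),
          "Am:", word_rates(american_text, words))
```

Note that raw string counts like these cannot tell bonnet (clothing) from bonnet (automotive), which is exactly the word sense disambiguation problem raised above.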
    >
    >The second method does not begin, a priori, with
    >sets of related words. In fact, generating
    >synonyms, dialectal variants, and translation
    >equivalents is one of its more interesting
    >challenges. Detailed lexical, collocational,
    >and syntactic characterization is
    >another. Again, this is much easier to do if
    >you are working with parallel corpora. If you
    >are dealing with large, nonparallel texts, this
    >is a real challenge. Other than inflected and
    >lemmatized word forms, there are a few more
    >hooks that can be applied, including POS tagging
    >and WSD. Even if both of these technologies
    >perform well, however, that is still not enough
    >to get you to the quality of data that you get with parallel corpora.
    >
    >Mark, if you can figure out a way to combine the
    >quality and quantity of data from a very large
    >corpus with the alignment and equivalence power
    >of a parallel corpus without actually having a
    >parallel corpus, I will personally nominate you
    >for the Nobel Prize in Corpus Linguistics. :-)
    >
    >Merle
    >
    >PS and Shameless Microsoft Plug: In the last
    >paragraph, I accidentally typed “figure out a
    >why to combine” and I got the blue squiggle from
    >Word 2007, which was released to manufacturing
    >on Monday of this week. It suggested way, and
    >of course I took the suggestion. I am amazed at
    >the number of mistakes that the contextual
    >speller has caught in my writing since I started
    >using it. I recommend the new version of Word
    >and Office for this feature alone. :-)
    >
    >Ramesh Krishnamurthy
    >
    >Lecturer in English Studies, School of Languages
    >and Social Sciences, Aston University, Birmingham B4 7ET, UK
    >[Room NX08, North Wing of Main Building] ; Tel:
    >+44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766
    >http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp
    >
    >Project Leader, ACORN (Aston Corpus Network):
    >http://corpus.aston.ac.uk/


