Re: [Corpora-List] labels of COLT files in BNC spoken

From: Lou Burnard (lou.burnard@computing-services.oxford.ac.uk)
Date: Thu Nov 13 2003 - 13:42:33 MET

    Apologies for not contributing to this enquiry sooner. A number of
    different issues seem to be confused here:

    1. Which bits of COLT also appear in the BNC?
    2. How do I find out which bits of the BNC contain London teenage
    speech?
    3. Is "ain't" characteristic of spoken London teenage language?

    Here's what *I* think on each of these (see also
    http://www.hf.uib.no/i/Engelsk/colt/COLTinfo.html):

    1. None! COLT is the brainchild of Anna-Brita Stenström and colleagues
    at Bergen. With funding from Longman and others, they collected the
    audio material which is the "fons et origo" of this material. Longman
    made a transcription of (most of) this audio material and contributed it
    to the BNC. Bergen made a *different* transcription of (most of) the
    same audio, using different conventions, and different markup, and also
    substantially revised the part of speech tagging. The result was
    eventually published as COLT. They did not include any way of linking
    their transcription to the older transcription in the BNC, in particular
    they did not specify which files correspond with which. The BNC files of
    course combine all conversations collected by a single respondent into
    one file, whereas COLT has them in separate files.

    2. Easy. Look at the <catRef> element in the header of each text and
    select those which have appropriate values: (sdeage1 sdeage2 sporeg1 to
    be exact). This gives 43 texts thus classified. You could further refine
    this by looking for words like London in the header, of course, but it
    probably isn't worth the effort.

    3. Hmm. The problem is in the transcription. As Ylva Berglund found in
    her study of "innit", any pronouncements about relative rates of these
    quasi-lexicalized words in speech and writing have to be hedged around
    with all sorts of caution. The BNC speech transcriptions went through at
    least two normalization stages -- one using the transcriber's judgment
    as to what was intended, and the other using an automatic spelling
    correction tool. Paradoxically, I would expect "aint" or "ent" or
    "innit" to get tidied up into "isn't" disproportionately more often in
    the spoken transcripts than in the written texts, precisely for that
    reason. You can't argue with "ain't" when it's there in black and white
    on the page. The COLT speech transcription, however, was made by people
    with a different agenda, and so I would expect them to be both more
    sensitive to and more likely to wish to record such variation than the
    BNC speech transcribers.
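
For point 2, the header check could be scripted along these lines. This is only a minimal sketch in Python, not the procedure actually used: it assumes the category codes sit in a `target` (or `targets`) attribute of the <catRef> element, and it reads the three codes as "sporeg1 together with either sdeage1 or sdeage2". The names `catref_codes` and `is_candidate` are made up for illustration.

```python
import re

# Assumption: the codes are stored in a quoted `target` or `targets`
# attribute of <catRef>; adjust the pattern to the BNC edition in use.
CATREF_RE = re.compile(r'<catRef\s+targets?\s*=\s*"([^"]*)"', re.IGNORECASE)

AGE_CODES = {"sdeage1", "sdeage2"}  # age codes named in the message
REGION_CODE = "sporeg1"             # region code named in the message


def catref_codes(header_text):
    """Return the set of lower-cased category codes found in a header."""
    codes = set()
    for match in CATREF_RE.finditer(header_text):
        codes.update(code.lower() for code in match.group(1).split())
    return codes


def is_candidate(header_text):
    """True if the header has the region code and at least one age code.

    Assumption: the codes combine as (sdeage1 OR sdeage2) AND sporeg1.
    """
    codes = catref_codes(header_text)
    return REGION_CODE in codes and bool(codes & AGE_CODES)
```

Running `is_candidate` over the header of each text and counting the hits is one way to compare the result with the figure of 43 for whichever BNC edition is to hand.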

    Lou Burnard

    On Thu, 2003-11-13 at 07:38, Ute Römer wrote:
    > Dear Eric, Bayan, and others,
    >
    >
    > > but as far as I know there isn't anything in BNC documentation equivalent
    > > to a list of filenames of files from COLT
    >
    > That's too bad. I was sure there had to exist such a list somewhere but
    > apparently it doesn't (or nobody knows about it).
    >
    > I'm not 100% sure yet (more concordance checks required), but I think I've
    > found the 377 COLT files. Last night I scrolled through the list of BNC
    > texts (in SARA; unfortunately, it's not possible to copy and paste this list
    > to search it automatically) and checked the bibliographic reference for
    > quite a number of those labelled "n conversations recorded by X" in the
    > list. It looks as if files KNR to KR2 and KSN to KSW (51 files, consisting
    > of 1 to 39 conversations each) are COLT files, or most of them at least. You
    > get information like
    >
    > "<hi>7 conversations recorded by `Robin' (PS58K) [dates unknown] with 6
    > interlocutors, totalling 1126 s-units, 5165 words (duration not
    > recorded).</hi>
    >
    > PS58K `Robin', 14, student, AB, male
    >
    > PS58L `Jones'teacher, male
    >
    > PS58M `Zoe', 13, student, female
    >
    > PS58N `Ben', 14, student, male
    >
    > PS58P `Oliver', 13, student, male
    >
    > PS5AV `Jenny', 13, student, female"
    >
    > -- sounds very COLTish to me.
    >
    > Also, I had a look at some headers of these files (checked the BNC texts in
    > version 1.0 though) and spotted lots of COLT key items like "Hackney" or
    > "Greater London". I then saved these 51 BNC files as a subcorpus and did a
    > concordance check of "ai" in this collection (using SARA2) and of "ain"
    > ("ai" didn't work here) in the real COLT (using WST). I found 307
    > occurrences in my supposed COLT and 293 in the real one - not 100%
    > convincing but not too bad either.
    >
    > However, if these files (my saved "COLT?" BNC subcorpus) really make up
    > COLT, then most of my occurrences of "ain't" are not from teenage language.
    > So, unfortunately, all that searching, browsing, and alerting you hasn't
    > really solved my problem. Anyway, I guess I know a bit more about the BNC
    > and COLT contents now (and about the importance of knowing exactly what's in
    > your corpus - and, ideally, where it is).
    >
    > Thanks to Eric and to Linda Bawcom (who contacted me off the list).
    >
    > Best from Hanover... Ute
    >
    >
    > ************************************************************
    >
    > Ute Römer
    > English Department
    > University of Hanover
    > Königsworther Platz 1
    > 30167 Hannover
    > Germany
    >
    > Phone: +49 (0)511 762 2997
    > Fax: +49 (0)511 762 2996
    > E-mail: ute.roemer@anglistik.uni-hannover.de
    > http://www.fbls.uni-hannover.de/angli/
    >
    >
    > > Bayan ended up searching all
    > > spoken transcript files including teenager speakers (speaker age is in
    > > the header info).
    > >
    > > If you (or someone else) discover a solution, do please let us know...
    > >
    > > and in the meantime, feel free to try out the chatbots we have trained
    > > on various BNC files at http://www.comp.leeds.ac.uk/eric/
    > >
    > > - we have to demo these at the BCS Machine Intelligence contest at
    > > Cambridge Univ, December 16th, as an example of Machine Learning used
    > > to visualise sublanguage ... so feedback to help us carry off the
    > > trophy and GBP1000 cash prize is welcome!!!
    > >
    > > cheers
    > >
    > > eric atwell
    > >
    > >
    > > On Tue, 11 Nov 2003, Ute Römer wrote:
    > >
    > > > Dear all,
    > > >
    > > > I was wondering if anyone of you could tell me which text files in the
    > > > BNC are COLT files. I checked David Lee's Excel spreadsheet and the BNC
    > > > World list of texts (on the SARA2 start page) but didn't find the
    > > > information I was hoping to get (maybe I didn't search long enough though).
    > > > The thing is that I'm trying to nail down repeated occurrences of "ai n't"
    > > > plus progressive form (and missing form of TO BE plus progressive form)
    > > > in BNC (spoken) data which I don't get in my Bank of English (brspok) data.
    > > > I thought that the amount of teenage and adolescent language in the BNC
    > > > might be a possible explanation for fragmentary constructions. It's not a
    > > > big thing, really, and I suppose I could check the headers of all the BNC
    > > > files my concordance examples come from (to see how old the participants
    > > > are), but maybe there is an easier/faster option.
    > > >
    > > > Thanks in advance and best wishes. Ute
    > > >
    > > >
    > >
    > > --
    > > Eric Atwell, Senior Lecturer, Computer Vision and Language research group
    > > Distributed Multimedia Systems MSc Tutor & SOCRATES/JYA Tutor
    > > School of Computing, University of Leeds, LEEDS LS2 9JT
    > > TEL: 0113-3435761 MOBILE: 0775-1039104 FAX: 0113-3435468
    > > WWW: http://www.comp.leeds.ac.uk/eric EMAIL: eric@comp.leeds.ac.uk
    > > Visit http://www.computingLEEDS.ac.uk - our newsletter for industry
    > >
    > >
    > >
    >
    >
    >


