[Corpora-List] Divisions of corpus files in the BNC World

From: Stefan Th. Gries (stgries_lists@arcor.de)
Date: Sat Jun 24 2006 - 21:02:59 MET DST


    Hi all

    I have a question concerning the files from the BNC World from the genre of "W_essay_school" (using David Lee's label). Obviously, the files contain essays from several different students, both adults and teens depending on the exact file. However, I have not been able to find a straightforward way to determine

    - the number of different students whose essays entered into the file (which I would like to have found in the header);
    - the exact locations where the essays of the different students that were lumped together in any one file begin/end.

    My heuristic so far has been to rely on <head> and </head> since these should usually indicate the heading of a different essay, but

    (i) that's just been my heuristic and I am wondering whether there's a more principled way;
    (ii) that of course does not guarantee that the new essay is by a different student.
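
    For concreteness, the <head>-based heuristic can be sketched roughly as follows. This is only a minimal illustration, not a principled solution: the function name split_on_heads, the regular expression, and the sample string are assumptions made up for the example, and real BNC World markup may differ in detail.

```python
import re

# Matches an opening <head> tag, with or without attributes.
# The exact shape of the markup is an assumption for this sketch.
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def split_on_heads(text):
    """Split a corpus file's text at every opening <head> tag.

    Returns (preamble, segments): the text before the first <head>,
    and one segment per <head>, each running up to the next <head>
    or to the end of the text.
    """
    starts = [m.start() for m in HEAD_OPEN.finditer(text)]
    if not starts:
        # No headings found: the heuristic cannot divide the file.
        return text, []
    bounds = starts + [len(text)]
    segments = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    return text[:starts[0]], segments


# Made-up sample, not actual BNC content.
sample = (
    "<bncDoc>header stuff"
    "<head>My holiday</head><p>First essay text.</p>"
    "<head type=MAIN>Pollution</head><p>Second essay text.</p>"
)
preamble, segments = split_on_heads(sample)
print(len(segments), "candidate essays found")
```

    Each segment is only a candidate essay: a heading inside one essay would also trigger a split, and nothing here says whether two segments come from the same student.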

    I apologize if that's a stupid question to which I should know the answer myself but I have not been able to get my head around this. Any pointers either via the list or to me directly would be greatly appreciated ... Thanks a lot for any help you might be able to offer, and I'll post a summary of the responses.
    Best,
    STG

    --
    Stefan Th. Gries
    -----------------------------------------------
    University of California, Santa Barbara
    http://www.linguistics.ucsb.edu/faculty/stgries
    -----------------------------------------------
    




    This archive was generated by hypermail 2b29 : Sun Jun 25 2006 - 07:57:05 MET DST