[Corpora-List] CNN Transcripts

From: Mark Davies (Mark_Davies@byu.edu)
Date: Wed Nov 16 2005 - 18:31:10 MET

    Has anyone here done much with the CNN transcripts
    (http://transcripts.cnn.com/TRANSCRIPTS/)?

    I'm aware of one publication (below), but would be interested in others
    as well:

    Hoffmann, Sebastian. "From Web-Page to Mega-Corpus: The CNN
    Transcripts." In: Marianne Hundt, Nadja Nesselhauf and Carolin Biewer
    (eds.) Corpus Linguistics and the Web. Amsterdam: Rodopi.

    I'm also aware of some LDC Corpora that contain CNN transcripts, but in
    general these appear to be either from the newspaper or from scripted
    news broadcasts, e.g.:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T25
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11

    At any rate, even though the genre/register of these transcripts is
    fairly homogeneous, they do contain more than 170 million words of
    unscripted spoken English, so it seems like it might be a nice resource.

    Thanks in advance for any information that you might have.

    Mark Davies

    =================================================

    Mark Davies
    Assoc. Prof., Linguistics
    Brigham Young University
    (phone) 801-422-9168 / (fax) 801-422-0906

    http://davies-linguistics.byu.edu

    ** Corpus design and use // Linguistic databases **
    ** Historical linguistics // Language variation **
    ** English, Spanish, and Portuguese **

    =================================================


