Re: [Corpora-List] CNN Transcripts

From: Stephanie M. Strassel (strassel@ldc.upenn.edu)
Date: Wed Nov 16 2005 - 20:03:39 MET


    Several new LDC corpora currently under development include transcripts
    harvested from CNN and other broadcaster websites (per the data
    licensing agreements we have negotiated with the copyright holders).
    Previous LDC corpora containing CNN material use transcripts derived
    from closed-captioning or, in some cases, manually created transcripts.

    The CNN transcript archive is particularly nice because in most cases
    its contents are verbatim transcripts with speaker attribution, not
    scripts or summaries. Most data providers that feature "transcripts" on
    their sites actually post scripts rather than full transcripts.

    Stephanie

    Mark Davies wrote:
    > Has anyone here done much with the CNN transcripts:
    > http://transcripts.cnn.com/TRANSCRIPTS/ ?
    >
    > I'm aware of one publication (below), but would be interested in others
    > as well:
    >
    > Hoffmann, Sebastian. "From Web-Page to Mega-Corpus: The CNN
    > Transcripts." In: Marianne Hundt, Nadja Nesselhauf and Carolin Biewer
    > (eds.) Corpus Linguistics and the Web. Amsterdam: Rodopi.
    >
    > I'm also aware of some LDC Corpora that contain CNN transcripts, but in
    > general these appear to be either from the newspaper or from scripted
    > news broadcasts, e.g.:
    >
    > http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T25
    > http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11
    >
    > At any rate, even though the genre/register of these transcripts is
    > fairly homogenous, they do contain more than 170 million words of
    > unscripted spoken English, so it seems like it might be a nice resource.
    >
    > Thanks in advance for any information that you might have.
    >
    > Mark Davies
    >
    > =================================================
    >
    > Mark Davies
    > Assoc. Prof., Linguistics
    > Brigham Young University
    > (phone) 801-422-9168 / (fax) 801-422-0906
    >
    > http://davies-linguistics.byu.edu
    >
    > ** Corpus design and use // Linguistic databases **
    > ** Historical linguistics // Language variation **
    > ** English, Spanish, and Portuguese **
    >
    > =================================================

    -- 
    Stephanie Strassel
    Associate Director, Annotation Research & Program Coordination
    Linguistic Data Consortium
    3600 Market Street, Suite 810  Philadelphia, PA 19104-2653 USA
    phone: 215-898-9681, fax: 215-573-2175
    strassel@ldc.upenn.edu
    http://www.ldc.upenn.edu
    


