Re: [Corpora-List] CNN Transcripts

From: David Graff (graff@ldc.upenn.edu)
Date: Wed Nov 16 2005 - 19:33:55 MET


    To clarify about the LDC's releases of CNN transcripts, there are actually
    several corpora currently available, all of which have distinct,
    non-overlapping content:

    LDC97T22 1996 English Broadcast News Transcripts (Hub-4)
    LDC98T28 1997 English Broadcast News Transcripts (Hub-4)
    LDC98T31 1996 CSR Hub-4 Language Model
    LDC2001T57 TDT2 Multilanguage Text Version 4.0
    LDC2001T58 TDT3 Multilanguage Text Version 2.0
    LDC2005T16 TDT4 Multilingual Text and Annotations

    The two "Broadcast News Transcripts (Hub-4)" corpora were transcribed
    manually from various CNN programs recorded in 1996 and 1997; these corpora
    also include manual transcripts from other network news broadcasts (ABC,
    CSPAN, PBS, etc.), for a total of about 200 hours of audio.

    The "Hub-4 Language Model" comprises a large archive of older transcripts
    (obtained from a commercial archive, "Primary Source Media"), spanning
    Jan. 1992 - April 1996; again, CNN programs are included along with
    transcripts from numerous other broadcast news sources.

    The TDT corpora have data drawn from CNN Headline News (not any other CNN
    programming), in the form of closed-caption texts captured from the
    broadcasts; other network sources are included as well, and the broadcast
    material covers thousands of hours of audio. The TDT corpora also include
    newswire text data.

    Regarding the two corpora cited by Mark Davies:

     - LDC98T25 was actually a "pilot" corpus for the first phase of the TDT
    (Topic Detection and Tracking) project; it contains a subset of CNN data
    from the "Hub-4 Language Model" collection.

     - LDC2003T11 is a corpus annotated specifically for the "ACE"
    (Automatic Content Extraction) project; it contains a subset of the TDT2
    corpus.

    -----------
    David Graff                    Linguistic Data Consortium
    graff@ldc.upenn.edu            3600 Market St., Suite 810
    University of Pennsylvania     Philadelphia, PA 19104
                    http://www.ldc.upenn.edu

    Mark_Davies@byu.edu said:
    > I'm also aware of some LDC Corpora that contain CNN transcripts, but in
    > general these appear to be either from the newspaper or from scripted
    > news broadcasts, e.g.:
    >
    > http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T25
    > http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11
    >
    > At any rate, even though the genre/register of these transcripts is
    > fairly homogenous, they do contain more than 170 million words of
    > unscripted spoken English, so it seems like it might be a nice resource.
    >
    > Thanks in advance for any information that you might have.


