Re: [Corpora-List] CNN Transcripts

From: David Graff (graff@ldc.upenn.edu)
Date: Wed Nov 16 2005 - 19:33:55 MET

In reply to: Mark Davies: "[Corpora-List] CNN Transcripts"

To clarify the LDC's releases of CNN transcripts: there are actually
several corpora currently available, all of which have distinct,
non-overlapping content:

LDC97T22 1996 English Broadcast News Transcripts (Hub-4)
LDC98T28 1997 English Broadcast News Transcripts (Hub-4)
LDC98T31 1996 CSR Hub-4 Language Model
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001T58 TDT3 Multilanguage Text Version 2.0
LDC2005T16 TDT4 Multilingual Text and Annotations

The two "Broadcast News Transcripts (Hub-4)" corpora were transcribed
manually from various CNN programs recorded in 1996 and 1997; these corpora
also include manual transcripts from other network news broadcasts (ABC,
C-SPAN, PBS, etc.), for a total of about 200 hours of audio.

The "Hub-4 Language Model" comprises a large archive of older transcripts
(obtained from a commercial archive, "Primary Source Media"), spanning
Jan. 1992 - April 1996; again, CNN programs are included along with
transcripts from numerous other broadcast news sources.

The TDT corpora have data drawn from CNN Headline News (not any other CNN
programming), in the form of closed-caption texts captured from the
broadcasts; other network sources are included, covering thousands of
hours of audio. The TDT corpora also include newswire text data.

Regarding the two corpora cited by Mark Davies:

- LDC98T25 was actually a "pilot" corpus for the first phase of the TDT
project (Topic Detection and Tracking); it contains a subset of CNN data
from the "Hub-4 Language Model" collection.

- LDC2003T11 is a corpus annotated specifically for the "ACE" project
(Automatic Content Extraction); it contains a subset of the TDT2 corpus.

-----------
David Graff                   Linguistic Data Consortium
graff@ldc.upenn.edu           3600 Market St., Suite 810
University of Pennsylvania    Philadelphia, PA 19104
http://www.ldc.upenn.edu

Mark_Davies@byu.edu said:
> I'm also aware of some LDC Corpora that contain CNN transcripts, but in
> general these appear to be either from the newspaper or from scripted
> news broadcasts, e.g.:
>
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T25
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11
>
> At any rate, even though the genre/register of these transcripts is
> fairly homogenous, they do contain more than 170 million words of
> unscripted spoken English, so it seems like it might be a nice resource.
>
> Thanks in advance for any information that you might have.

