Re: [Corpora-List] CNN Transcripts

From: Stephanie M. Strassel (strassel@ldc.upenn.edu)
Date: Wed Nov 16 2005 - 20:03:39 MET

Next message: Delip Rao: "[Corpora-List] free tagged corpus"

Previous message: David Graff: "Re: [Corpora-List] CNN Transcripts"
In reply to: Mark Davies: "[Corpora-List] CNN Transcripts"

Several new LDC corpora currently under development include transcripts
harvested from CNN and other broadcaster websites (per the data
licensing agreements we have negotiated with the copyright holders).
Previous LDC corpora containing CNN material use transcripts derived
from closed-captioning or, in some cases, manually created transcripts.

The CNN transcript archive is particularly nice because in most cases
it contains verbatim transcripts with speaker attribution, not scripts
or summaries. Most data providers that feature "transcripts" on their
sites offer scripts rather than full transcripts.

Stephanie

Mark Davies wrote:
> Has anyone here done much with the CNN transcripts:
> http://transcripts.cnn.com/TRANSCRIPTS/ ?
>
> I'm aware of one publication (below), but would be interested in others
> as well:
>
> Hoffmann, Sebastian. "From Web-Page to Mega-Corpus: The CNN
> Transcripts." In: Marianne Hundt, Nadja Nesselhauf and Carolin Biewer
> (eds.) Corpus Linguistics and the Web. Amsterdam: Rodopi.
>
> I'm also aware of some LDC Corpora that contain CNN transcripts, but in
> general these appear to be either from the newspaper or from scripted
> news broadcasts, e.g.:
>
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T25
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11
>
> At any rate, even though the genre/register of these transcripts is
> fairly homogenous, they do contain more than 170 million words of
> unscripted spoken English, so it seems like it might be a nice resource.
>
> Thanks in advance for any information that you might have.
>
> Mark Davies
>
> =================================================
>
> Mark Davies
> Assoc. Prof., Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
>
> http://davies-linguistics.byu.edu
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
>
> =================================================

-- 
Stephanie Strassel
Associate Director, Annotation Research & Program Coordination
Linguistic Data Consortium
3600 Market Street, Suite 810  Philadelphia, PA 19104-2653 USA
phone: 215-898-9681, fax: 215-573-2175
strassel@ldc.upenn.edu
http://www.ldc.upenn.edu

This archive was generated by hypermail 2b29 : Wed Nov 16 2005 - 20:37:40 MET