Several new LDC corpora currently under development include transcripts
harvested from CNN and other broadcaster websites (per the data
licensing agreements we have negotiated with the copyright holders).
Previous LDC corpora containing CNN material use transcripts derived
from closed-captioning or, in some cases, manually created transcripts.
The CNN transcript archive is particularly nice because in most cases
it contains verbatim transcripts with speaker attribution, not scripts
or summaries. Most other data providers that feature "transcripts" on
their sites post scripts rather than full transcripts.
Stephanie
Mark Davies wrote:
> Has anyone here done much with the CNN transcripts:
> http://transcripts.cnn.com/TRANSCRIPTS/ ?
>
> I'm aware of one publication (below), but would be interested in others
> as well:
>
> Hoffmann, Sebastian. "From Web-Page to Mega-Corpus: The CNN
> Transcripts." In: Marianne Hundt, Nadja Nesselhauf and Carolin Biewer
> (eds.) Corpus Linguistics and the Web. Amsterdam: Rodopi.
>
> I'm also aware of some LDC Corpora that contain CNN transcripts, but in
> general these appear to be either from the newspaper or from scripted
> news broadcasts, e.g.:
>
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T25
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11
>
> At any rate, even though the genre/register of these transcripts is
> fairly homogenous, they do contain more than 170 million words of
> unscripted spoken English, so it seems like it might be a nice resource.
>
> Thanks in advance for any information that you might have.
>
> Mark Davies
>
> =================================================
>
> Mark Davies
> Assoc. Prof., Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
>
> http://davies-linguistics.byu.edu
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
>
> =================================================
--
Stephanie Strassel
Associate Director, Annotation Research & Program Coordination
Linguistic Data Consortium
3600 Market Street, Suite 810
Philadelphia, PA 19104-2653 USA
phone: 215-898-9681, fax: 215-573-2175
strassel@ldc.upenn.edu
http://www.ldc.upenn.edu