Hi John,
You might wish to consider the following HUB4 and TDT resources
distributed by the LDC. These data sets contain substantial quantities
of recent broadcast news in several languages, segmented into individual
stories and time-aligned with verbatim transcripts.

LDC97S66    1996 English Broadcast News Dev and Eval (Hub-4)
LDC97S44    1996 English Broadcast News Speech (Hub-4)
LDC97T22    1996 English Broadcast News Transcripts (Hub-4)
LDC98S71    1997 English Broadcast News Speech (Hub-4)
LDC98T28    1997 English Broadcast News Transcripts (Hub-4)
LDC2002S11  1997 HUB4 English Evaluation Speech and Transcripts
LDC98S73    1997 Mandarin Broadcast News Speech (Hub-4NE)
LDC98T24    1997 Mandarin Broadcast News Transcripts (Hub-4NE)
LDC98S74    1997 Spanish Broadcast News Speech (Hub-4NE)
LDC98T29    1997 Spanish Broadcast News Transcripts (Hub-4NE)
LDC2000S86  1998 HUB-4 Broadcast News Evaluation English Test Material
LDC2000S92  TDT2 Careful Transcription Audio
LDC2000T44  TDT2 Careful Transcription Text
LDC99S84    TDT2 English Audio
LDC2001S93  TDT2 Mandarin Audio Corpus
LDC2001T57  TDT2 Multilanguage Text Version 4.0
LDC2001S94  TDT3 English Audio
LDC2001S95  TDT3 Mandarin Audio
LDC2001T58  TDT3 Multilanguage Text Version 2.0
LDC2005S11  TDT4 Multilingual Broadcast News Speech Corpus
LDC2005T16  TDT4 Multilingual Text and Annotations

You can view our entire online catalog at:
http://www.ldc.upenn.edu/Catalog/

Kind regards,
Ilya

Bryar Family wrote:
>Hello:
>
>I'm developing a project for rapid identification and categorization of
>audio news clips, with a "target communities" focus. Are there any public
>corpora available that consist of individual audio news stories of recent
>vintage? (last 5-10 years)
>
>I'd also be interested in corresponding with any members of the list who are
>developing content categorization strategies for such audio content. For
>example, if there are any members of the list who are involved with the
>NewsML project, I'd like to hear from them.
>
>John V "Jack" Bryar
>Managing Partner and acting CTO,
>MilkBottleNews Partners
>Direct: 802-843-6033
>jack@milkbottlenews.com

--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium          Phone: (215) 573-1275
University of Pennsylvania          Fax: (215) 573-2175
3600 Market St., Suite 810          ldc@ldc.upenn.edu
Philadelphia, PA 19104              http://www.ldc.upenn.edu