[Corpora-List] New from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Thu Aug 31 2006 - 21:22:53 MET DST

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of three new publications.

------------------------------------------------------------------------

(1) Korean Broadcast News Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S42>
consists of 18 audio files recorded by LDC in January 2000 and February
2000 from Voice of America (VOA) satellite radio news broadcasts in
Korean. The recordings, captured from a dedicated satellite receiver,
are stored as 16-bit PCM, 16-kHz, single-channel, in NIST SPHERE format.
The duration of each recording is either 30 minutes or 60 minutes,
depending on the VOA broadcast schedule; the date (YYYYMMDD), start-time
and end-time (HHMM, Eastern Standard Time) for each recording are
indicated in the file names. The sample data are not compressed.
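
As a rough sanity check on the stated format, the audio payload size of
a recording follows directly from the sample rate, sample width and
channel count given above. The sketch below is illustrative only: the
function name is hypothetical, and the figures exclude the SPHERE header
that precedes the sample data in each file.

```python
# Format parameters as stated in the announcement.
SAMPLE_RATE_HZ = 16_000   # 16-kHz sampling
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single-channel


def pcm_payload_bytes(minutes: int) -> int:
    """Size in bytes of uncompressed sample data for a recording.

    Hypothetical helper; the SPHERE header is not counted.
    """
    return minutes * 60 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


for minutes in (30, 60):
    size = pcm_payload_bytes(minutes)
    print(f"{minutes} min: {size:,} bytes (~{size / 1e6:.1f} MB)")
```

Under these assumptions a 30-minute recording carries 57,600,000 bytes
of sample data and a 60-minute recording twice that.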

Transcripts for these recordings are available as a separate corpus from
the LDC: Korean Broadcast News Transcripts, LDC2006T14.

(2) Korean Broadcast News Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T14>
consists of 18 text files containing transcripts prepared by the LDC for
Voice of America satellite radio news broadcasts in Korean. The
broadcasts were recorded by the LDC at transmission time during a
two-week period between January 21, 2000 and February 7, 2000. Nine of
the broadcasts are 30 minutes long, and the other nine are 60 minutes
long. The file names indicate the date (YYYYMMDD) and the begin and end
times (HHMM EST) of the original transmission.
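
Because the date and times are encoded in each file name, the broadcast
window can be recovered without opening the file. The sketch below is
only an illustration: the announcement does not give the exact file-name
layout, so the separators, the extension and the sample name are
assumptions, and parse_broadcast_name is a hypothetical helper. Times
are treated as naive EST values.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: an 8-digit date, then two 4-digit times, separated by
# any non-digit characters. The real corpus naming may differ.
NAME_RE = re.compile(r"(?P<date>\d{8})\D+(?P<start>\d{4})\D+(?P<end>\d{4})")


def parse_broadcast_name(name: str) -> tuple[datetime, datetime]:
    """Return (start, end) of the broadcast encoded in a file name."""
    match = NAME_RE.search(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    start = datetime.strptime(match["date"] + match["start"], "%Y%m%d%H%M")
    end = datetime.strptime(match["date"] + match["end"], "%Y%m%d%H%M")
    if end <= start:
        # The broadcast ran past midnight, so the end time is next day.
        end += timedelta(days=1)
    return start, end


# Hypothetical file name, used for illustration only.
start, end = parse_broadcast_name("20000121_1900_2000.sgm")
print(start.date(), (end - start).total_seconds() / 60, "minutes")
```

A check of this kind could confirm that each file's computed length is
either 30 or 60 minutes, as described above.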

The character encoding is Unicode UTF-8, and the file contents are
structured using SGML. The markup strategy used here was defined by NIST
specifically for use in transcripts of broadcast news speech. The "docs"
directory provides a working DTD file, a complete description (in the
form of a PostScript file) of the document structure, tags and
attributes, and a simple text file listing the 18 data file names in the
corpus.

The transcripts have been manually time-aligned at the phrasal level and
annotated to identify boundaries between news stories and speaker turns;
speaker names and gender are given where identifiable. These annotations
are all provided via the SGML tags and their attributes. A strong
effort has been made to identify all unique speakers across the
transcripts. However, there may be cases where an individual speaker has
not been recognized and has been given a unique, anonymous identification.

Audio files for these transcripts are available as a separate corpus
from the LDC: Korean Broadcast News Speech, LDC2006S42.

(3) West Point Korean Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S36>
contains digital recordings of spoken Korean. Corpus design and data
collection were carried out by staff and faculty of the Department of
Foreign Languages (DFL) and Center for Technology Enhanced Language
Learning (CTELL), located at the United States Military Academy (USMA),
West Point, New York. The corpus was designed to develop speech
recognition systems that would be used by the US government for
speech-recognition enhanced language learning courseware.

The prompt scripts were created from 20,000 distinct sentences, along
with a subset of prompts designed to elicit free response answers to
questions for use in domain-specific speech-to-speech translation
systems. Each speaker attempted to record 100 utterances.

------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573
1275.

--------------------------------------------------------------------

Linguistic Data Consortium            Phone: (215) 573-1275
University of Pennsylvania            Fax: (215) 573-2175
3600 Market St., Suite 810            ldc@ldc.upenn.edu
Philadelphia, PA 19104 USA            http://www.ldc.upenn.edu

