[Corpora-List] New from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Thu Aug 31 2006 - 21:22:53 MET DST

    LDC2006S42
    Korean Broadcast News Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S42>

    LDC2006T14
    Korean Broadcast News Transcripts
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T14>

    LDC2006S36
    West Point Korean Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S36>

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of three new publications.

    ------------------------------------------------------------------------

    (1) Korean Broadcast News Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S42>
    consists of 18 audio files recorded by LDC in January 2000 and February
    2000 from Voice of America (VOA) satellite radio news broadcasts in
    Korean. The recordings, captured from a dedicated satellite receiver,
    are stored as 16-bit PCM, 16-kHz, single-channel, in NIST SPHERE format.
    The duration of each recording is either 30 minutes or 60 minutes,
    depending on the VOA broadcast schedule; the date (YYYYMMDD), start-time
    and end-time (HHMM, Eastern Standard Time) for each recording are
    indicated in the file names. The sample data are not compressed.
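
    As a rough guide to storage needs, the stated audio parameters
    (16-bit PCM, 16 kHz, single channel, uncompressed) fix the size of the
    sample data in each recording. The sketch below is only that
    arithmetic; the function name is illustrative, and the SPHERE header
    that precedes the samples in each file is not counted.

```python
# Uncompressed sample-data size implied by the stated audio parameters.
SAMPLE_RATE_HZ = 16_000   # 16 kHz
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single channel


def payload_bytes(minutes: int) -> int:
    """Bytes of raw sample data for a recording of the given length."""
    seconds = minutes * 60
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * seconds


if __name__ == "__main__":
    for minutes in (30, 60):
        size = payload_bytes(minutes)
        print(f"{minutes} min: {size:,} bytes (~{size / 1e6:.1f} MB)")
```

    Under these assumptions a 30-minute recording holds 57,600,000 bytes
    of sample data and a 60-minute recording twice that.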

    Transcripts for these recordings are available as a separate corpus from
    the LDC: Korean Broadcast News Transcripts, LDC2006T14.

    (2) Korean Broadcast News Transcripts
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T14>
    consists of 18 text files containing transcripts prepared by the LDC for
    Voice of America satellite radio news broadcasts in Korean. The
    broadcasts were recorded by the LDC at transmission time during a
    two-week period between January 21, 2000 and February 7, 2000. Nine of
    the broadcasts are 30 minutes long, and the other nine broadcasts are 60
    minutes long. The file names indicate the date (YYYYMMDD) and the begin
    and end times (HHMM EST) of the original transmission.
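
    Because the date and the begin and end times are carried in the file
    names, the broadcast length can be recovered without opening a file.
    The sketch below assumes the three fields appear in that order,
    separated by single non-digit characters; the exact separators and
    suffixes are not given here, so the sample name is hypothetical and
    the pattern may need adjusting against the corpus file list.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: YYYYMMDD, then HHMM begin, then HHMM end, each separated
# by one non-digit character. The real corpus naming may differ.
NAME_PATTERN = re.compile(r"(?<!\d)(\d{8})\D(\d{4})\D(\d{4})(?!\d)")


def parse_broadcast_name(name: str):
    """Return (start, stop, duration) parsed from a file name."""
    match = NAME_PATTERN.search(name)
    if match is None:
        raise ValueError(f"no date/time fields found in {name!r}")
    date, begin, end = match.groups()
    start = datetime.strptime(date + begin, "%Y%m%d%H%M")
    stop = datetime.strptime(date + end, "%Y%m%d%H%M")
    if stop <= start:
        # An end time at or before the begin time means the broadcast
        # ran past midnight.
        stop += timedelta(days=1)
    return start, stop, stop - start


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    start, stop, duration = parse_broadcast_name("20000121_1900_2000_sample.sgm")
    print(start, stop, duration)
```

    Times parsed this way are naive local times (EST, per the description
    above); no time-zone conversion is attempted.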

    The character encoding is Unicode UTF-8, and the file contents are
    structured using SGML. The markup strategy used here was defined by NIST
    specifically for use in transcripts of broadcast news speech. The "docs"
    directory provides a working DTD file, a complete description (in the
    form of a PostScript file) of the document structure, tags and
    attributes, and a simple text file listing the 18 data file names in the
    corpus.
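
    For a first look at the markup before consulting the DTD, it can help
    to list which tags a transcript actually uses. The sketch below reads
    a file as UTF-8 and counts opening tags with a regular expression. It
    is not an SGML parser and does not validate against the DTD; the
    function names and the tag names in the sample string are made up for
    illustration and are not taken from the corpus.

```python
import re
from collections import Counter

# Matches the name of an opening tag such as <name ...>. Closing tags
# (</name>) and declarations (<!...>) are deliberately not matched.
TAG_PATTERN = re.compile(r"<\s*([A-Za-z][\w.-]*)")


def tag_inventory(sgml_text: str) -> Counter:
    """Count opening tags by lower-cased name."""
    return Counter(name.lower() for name in TAG_PATTERN.findall(sgml_text))


def tag_inventory_from_file(path: str) -> Counter:
    """Read a transcript as UTF-8 and count its opening tags."""
    with open(path, encoding="utf-8") as handle:
        return tag_inventory(handle.read())


if __name__ == "__main__":
    # Hypothetical markup, for illustration only.
    sample = '<outer><inner id="a">안녕하세요</inner><inner id="b">네</inner></outer>'
    for name, count in sorted(tag_inventory(sample).items()):
        print(name, count)
```

    The DTD and the PostScript description in the "docs" directory remain
    the authoritative account of the tags and attributes.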

    The transcripts have been manually time-aligned at the phrasal level and
    annotated to identify boundaries between news stories and speaker turns;
    speaker names and gender are given where identifiable. These annotations
    are all provided via the SGML tags and their attributes. A strong
    effort has been made to identify all unique speakers across the
    transcripts. However, there may be cases where an individual speaker has
    not been recognized and has been given a unique, anonymous identification.

    Audio files for these transcripts are available as a separate corpus
    from the LDC: Korean Broadcast News Speech, LDC2006S42.

    (3) West Point Korean Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S36>
    contains digital recordings of spoken Korean. Corpus design and data
    collection were carried out by staff and faculty of the Department of
    Foreign Languages (DFL) and Center for Technology Enhanced Language
    Learning (CTELL), located at the United States Military Academy (USMA),
    West Point, New York. The corpus was designed to support the
    development of speech recognition systems for use by the US government
    in speech-recognition-enhanced language learning courseware.

    The prompt scripts were created from 20,000 distinct sentences, along
    with a subset of prompts designed to elicit free response answers to
    questions for use in domain-specific speech-to-speech translation
    systems. Each speaker attempted to record 100 utterances.

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call
    +1 215 573 1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium        Phone: (215) 573-1275
    University of Pennsylvania        Fax: (215) 573-2175
    3600 Market St., Suite 810        ldc@ldc.upenn.edu
    Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu


