[Corpora-List] News from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed Jun 28 2006 - 22:49:11 MET DST

    LDC2006S35
    CSLU: Multilanguage Telephone Speech Version 1.2
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S35>

    LDC2006S31
    NIST 2003 Language Recognition Evaluation
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S31>

    LDC2006T12
    Spanish Gigaword First Edition
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T12>

    The Linguistic Data Consortium (LDC) would like to announce the
    availability of three new publications.

    ------------------------------------------------------------------------

    (1) The CSLU: Multilanguage Telephone Speech Version 1.2
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S35>
    corpus consists of telephone speech from eleven languages: English,
    Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish,
    Tamil, and Vietnamese. The corpus contains fixed vocabulary utterances
    (e.g., days of the week) as well as fluent continuous speech. The current
    release includes recorded utterances from about 2052 speakers, for a
    total of about 38.5 hours of speech. Time-aligned phonetic
    transcriptions for 619 of the utterances are also included. For the
    data collection, the sampling rate was 8 kHz and the files were stored in
    16-bit linear format on a UNIX file system. Each utterance was recorded
    as a separate file.
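
    As a rough sanity check on the figures above, the sketch below
    estimates the uncompressed size of the audio and the average amount
    of speech per speaker. It assumes single-channel audio, ignores any
    file headers, and treats the approximate totals quoted above (38.5
    hours, 2052 speakers) as exact. All names are illustrative only.

```python
# Back-of-the-envelope arithmetic for 8 kHz, 16-bit linear audio.
# Assumptions (not stated in the announcement): one channel per file,
# no header overhead, and the quoted approximate totals taken as exact.

SAMPLE_RATE_HZ = 8_000
BYTES_PER_SAMPLE = 2        # 16-bit linear
TOTAL_HOURS = 38.5
NUM_SPEAKERS = 2052


def pcm_bytes(seconds, sample_rate=SAMPLE_RATE_HZ,
              bytes_per_sample=BYTES_PER_SAMPLE):
    """Return the raw size in bytes of mono linear audio of this duration."""
    return int(seconds * sample_rate * bytes_per_sample)


total_seconds = TOTAL_HOURS * 3600
total_bytes = pcm_bytes(total_seconds)
seconds_per_speaker = total_seconds / NUM_SPEAKERS

print(f"raw audio: {total_bytes / 1e9:.2f} GB")
print(f"average speech per speaker: {seconds_per_speaker:.1f} s")
```

    Under these assumptions the audio comes to roughly 2.2 GB, or a
    little over a minute of speech per speaker on average.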

    (2) The goal of the NIST Language Recognition Evaluation (LRE) is to
    establish the baseline of current performance capability for language
    recognition of conversational telephone speech and to lay the groundwork
    for further research efforts in the field. The series had its first
    evaluation in 1996. The 2003 NIST Language Recognition Evaluation
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S31>
    (LRE-03) was part of this ongoing series of evaluations of language
    recognition technology. The task evaluated was the detection of a given
    target language. Given a test segment of speech, a target language was
    assigned as a test hypothesis, and the task was to determine whether
    this test hypothesis was true or false.
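
    To make the task concrete, here is a minimal sketch of how such
    true/false decisions could be tallied into miss and false-alarm
    rates. The trial tuples and the function name are hypothetical, and
    this is not the official NIST scoring procedure.

```python
def detection_error_rates(trials):
    """Tally miss and false-alarm rates over detection trials.

    Each trial is (hypothesized_language, actual_language, decision),
    where decision is True when the system accepts the hypothesis.
    """
    target = miss = nontarget = false_alarm = 0
    for hypothesized, actual, decision in trials:
        if hypothesized == actual:
            # Target trial: the hypothesis is true, so rejecting it is a miss.
            target += 1
            if not decision:
                miss += 1
        else:
            # Non-target trial: accepting a false hypothesis is a false alarm.
            nontarget += 1
            if decision:
                false_alarm += 1
    p_miss = miss / target if target else 0.0
    p_fa = false_alarm / nontarget if nontarget else 0.0
    return p_miss, p_fa


# Made-up trials, for illustration only.
trials = [
    ("Spanish", "Spanish", True),    # correct acceptance
    ("Spanish", "Tamil", False),     # correct rejection
    ("Hindi", "Hindi", False),       # miss
    ("Korean", "Japanese", True),    # false alarm
]

p_miss, p_fa = detection_error_rates(trials)
print(f"P(miss) = {p_miss:.2f}, P(false alarm) = {p_fa:.2f}")
```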

    Each speech file is one side of a "4-wire" telephone conversation
    represented as 8-bit, 8 kHz mu-law data. There are 7990 speech files in
    NIST SPHERE (.sph) format for a total of around six hours of speech. The
    speech data was compiled from the LDC's CALLFRIEND, CALLHOME, and
    SWITCHBOARD-2 corpora.
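
    For readers unfamiliar with mu-law coding, the sketch below expands
    8-bit mu-law bytes to 16-bit linear samples using the standard G.711
    expansion. It operates on raw sample bytes only and does not parse
    the SPHERE header; the function names are illustrative and are not
    part of any LDC or NIST tool.

```python
def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law code (0-255) to a signed linear sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    # Rebuild the magnitude, then remove the bias (0x84) added at encoding.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(data):
    """Expand a bytes object of mu-law codes to a list of linear samples."""
    return [ulaw_to_linear(b) for b in data]


# 0xFF encodes silence; 0x80 and 0x00 are the positive and negative extremes.
print(decode_ulaw(bytes([0xFF, 0x80, 0x00])))
```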

    (3) The Spanish Gigaword First Edition
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T12>
    is a comprehensive archive of newswire text data that has been acquired
    over several years by the Linguistic Data Consortium; some of the data
    included has been released previously in other LDC corpora.

    The three distinct international sources of Spanish newswire in this
    edition, and the time spans of collection covered for each, are as follows:

        * Agence France-Presse, Spanish Service, May 1994 - Dec 2005
        * Associated Press Worldstream, Spanish, Nov 1993 - Dec 2005
        * Xinhua News Agency, Spanish Service, Sep 2001 - Dec 2005
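
    The collection spans listed above can be compared with a short
    calculation. This sketch assumes each span includes both its first
    and last month in full, which the announcement does not state; the
    helper name is illustrative.

```python
MONTH_NUMBER = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def months_covered(start, end):
    """Count calendar months from start to end inclusive, e.g. 'May 1994'."""
    start_month, start_year = start.split()
    end_month, end_year = end.split()
    return ((int(end_year) - int(start_year)) * 12
            + MONTH_NUMBER[end_month] - MONTH_NUMBER[start_month] + 1)


# Spans copied from the list above.
SPANS = {
    "Agence France-Presse": ("May 1994", "Dec 2005"),
    "Associated Press Worldstream": ("Nov 1993", "Dec 2005"),
    "Xinhua News Agency": ("Sep 2001", "Dec 2005"),
}

for source, (start, end) in SPANS.items():
    print(f"{source}: {months_covered(start, end)} months")
```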

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call
    +1 215 573 1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium          Phone: (215) 573-1275
    University of Pennsylvania          Fax: (215) 573-2175
    3600 Market St., Suite 810          ldc@ldc.upenn.edu
    Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu