[Corpora-List] New Data from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed May 24 2006 - 22:50:20 MET DST

    LDC2006S26
    CSLU: Speaker Recognition Version 1.1
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S26>

    LDC2006T10
    English-Arabic Treebank V1.0
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T10>

    LDC2006S33
    Middle East Technical University Turkish Microphone Speech V 1.0
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S33>

    In this month's newsletter, the Linguistic Data Consortium (LDC) would
    like to announce the availability of three new publications.

    ------------------------------------------------------------------------

    *New Publications*

    (1) CSLU: Speaker Recognition Version 1.1
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S26>
    consists of telephone speech from 91 participants. Each participant
    recorded speech in twelve sessions over a two-year period, answering
    questions like "what is your eye color" or responding to prompts like
    "describe a typical day in your life." Most of the utterances in the
    corpus have corresponding non-time-aligned word-level transcriptions.

    The goal of the Speaker Recognition data collection was to collect
    speech from each participant over a two-year period. Each participant
    called the data collection system twelve times over that period and
    said the same utterances each time.

    (2) English-Arabic Parallel Treebank V1.0
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T10>
    consists of 52,238 words in 224 files of individual Agence France Presse
    (AFP) news stories (corresponding to approximately the first 50K words
    of the Arabic Treebank: Part 1 v 3.0 -- LDC Catalog No.: LDC2005T02).
    The English translation was provided by LDC, and was part-of-speech
    tagged and treebanked for this project.

    The guidelines followed for both part-of-speech and treebank annotation
    are essentially Penn Treebank II style, with two notable differences:

       1. POS: tokenization of hyphenated items ("New York-based" has been
          replaced by "New York - based" for example), and the addition of
          HYPH and AFX tags necessitated by this change in tokenization
       2. TreeBank: the addition of the node label NML for sub-NP nominal
          constituents (replacing NX and most NP-internal NAC)
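
    The tokenization change in item 1 can be illustrated with a minimal
    sketch. It is not the LDC's actual tokenizer: the function names are
    hypothetical, it splits naively at every hyphen, and it assumes the
    separated hyphen is the token that receives the HYPH tag. The real
    guidelines may treat some hyphenated items differently.

```python
import re


def split_hyphens(token):
    """Split one token at internal hyphens, keeping each hyphen as a token."""
    # The capturing group makes re.split return the hyphens themselves too.
    return [part for part in re.split(r"(-)", token) if part]


def tokenize(text):
    """Whitespace-tokenize, then split hyphenated items (naive sketch)."""
    tokens = []
    for tok in text.split():
        tokens.extend(split_hyphens(tok))
    return tokens


def mark_hyphens(tokens):
    """Pair each token with HYPH if it is a bare hyphen, else None.

    Assumption: the hyphen token itself carries HYPH. All other tags are
    left unassigned because this sketch is not a part-of-speech tagger.
    """
    return [(tok, "HYPH" if tok == "-" else None) for tok in tokens]


tokens = tokenize("New York-based")
tagged = mark_hyphens(tokens)
```

    Here tokens holds the four items New, York, -, and based, which
    matches the "New York - based" example above.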

    (3) Middle East Technical University Turkish Microphone Speech V 1.0
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S33>
    was collected at the Middle East Technical University (METU) as part of
    a collaboration between the Department of Electrical and Electronics
    Engineering of the Middle East Technical University in Turkey and the
    Center for Spoken Language Research (CSLR) of the University of
    Colorado at Boulder, USA. The corpus was used to port SONIC, the CSLR
    speech recognition system, to Turkish.

    The corpus contains text, speech, and alignment files. 120 speakers (60
    male and 60 female) spoke 40 sentences each for a total of approximately
    500 minutes of speech. The 40 sentences were selected randomly for each
    speaker from a triphone-balanced set of 2462 Turkish sentences. All
    participants were native speakers of Turkish.
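
    The figures above imply the corpus size and average utterance length
    worked out in the sketch below. The per-speaker draw is only an
    illustration: the announcement does not say how the random selection
    was made, so sampling without replacement within a speaker is an
    assumption, and draw_prompts is a hypothetical name.

```python
import random

NUM_SPEAKERS = 120
SENTENCES_PER_SPEAKER = 40
POOL_SIZE = 2462       # triphone-balanced sentence set
TOTAL_MINUTES = 500    # approximate figure from the announcement

# Total number of recorded utterances in the corpus.
total_utterances = NUM_SPEAKERS * SENTENCES_PER_SPEAKER

# Rough average utterance length in seconds, given ~500 minutes overall.
avg_seconds = TOTAL_MINUTES * 60 / total_utterances


def draw_prompts(rng, pool_size=POOL_SIZE, k=SENTENCES_PER_SPEAKER):
    """Pick k distinct sentence indices for one speaker.

    Assumption: no sentence repeats within a single speaker's session.
    """
    return rng.sample(range(pool_size), k)


prompts = draw_prompts(random.Random(0))
```

    That works out to 4800 utterances averaging roughly 6.25 seconds
    each, with the caveat that the 500-minute total is approximate.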

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call
    +1 215 573 1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium      Phone: (215) 573-1275
    3600 Market Street              Fax: (215) 573-2175
    Suite 810                       ldc@ldc.upenn.edu
    Philadelphia, PA 19104          http://www.ldc.upenn.edu


