[Corpora-List] New LDC Corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Tue May 03 2005 - 17:40:56 MET DST

    LDC2005S13
    *Fisher English Training Part 2 Speech*
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S13>

    LDC2005T19
    *Fisher English Training Part 2 Transcripts*
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T19>

    LDC2005L01
    *Mawukakan Lexicon*
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005L01>

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of three new corpora.

    ------------------------------------------------------------------------

    Fisher English Training Part 2 Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S13>
    represents the second half of a collection of conversational telephone
    speech (CTS) that was collected at the LDC. It contains 5849 audio
    files, each one containing a full conversation of up to 10 minutes.
    Corresponding transcripts are available as Fisher English Training Part 2
    Transcripts (LDC2005T19).

    The individual audio files are presented in NIST SPHERE format, and
    contain two-channel mu-law sample data; "shorten" compression has been
    applied to all files.
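
    As a rough illustration of the file layout described above, the sketch
    below reads the plain-ASCII header that begins a NIST SPHERE file. It
    assumes the usual SPHERE layout (a "NIST_1A" line, a header-size line,
    "name -type value" fields, then "end_head"). The function name and the
    sample field values are made up for illustration and are not taken from
    the corpus. Decoding the shorten-compressed sample data itself needs a
    separate tool and is not attempted here.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file.

    `data` must contain at least the whole header (typically 1024 bytes).
    Returns a dict of field name -> value, plus "header_size".
    """
    first_lines = data.split(b"\n", 2)
    if len(first_lines) < 3 or first_lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")

    # The second line gives the total header length in bytes.
    header_size = int(first_lines[1].strip())
    header_text = data[:header_size].decode("ascii", errors="replace")

    fields = {"header_size": header_size}
    for line in header_text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # blank line or comment
            continue
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        name, type_code, value = parts
        if type_code == "-i":        # integer field
            fields[name] = int(value)
        elif type_code == "-r":      # real-valued field
            fields[name] = float(value)
        else:                        # "-sN": string of N characters
            fields[name] = value
    return fields


# Illustrative header only; the values are assumptions, not corpus data.
sample_header = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 2\n"
    b"sample_coding -s27 ulaw,embedded-shorten-v2.00\n"
    b"end_head\n"
).ljust(1024)

info = parse_sphere_header(sample_header)
print(info["channel_count"], info["sample_coding"])
```

    For a real file, passing the first few kilobytes read in binary mode
    (for example open(path, "rb").read(4096)) should be enough for the
    header, assuming the header size does not exceed that amount.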

    *

    Fisher English Training Part 2 Transcripts
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T19>
    contains the corresponding transcripts for the Fisher English Training
    Part 2 Speech collection. About 12% of the conversations were
    transcribed at the LDC, and the rest were done by BBN and WordWave,
    using a significantly different approach to the task. A central goal in
    both sets was to maximize the speed and economy of the transcription
    process, and this in turn involved omitting certain aspects of mark-up
    detail and quality control that may have been common in previous,
    smaller corpora.

    *

    Mawukakan Lexicon
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005L01>
    is the first publication of an ongoing project aiming to build an
    electronic dictionary of four Mandekan (Eastern Manding) languages of the
    Mande Group of the Niger-Congo family. The lack of a written tradition
    makes such a dictionary project extremely important. Our expectation is
    that once this initial goal is reached, it will become easier to extend
    the dictionary to all the other varieties of Mandekan.

    The lexicon is trilingual, that is, the target language is Mawukakan,
    while English and French are used as glossing languages. Both the
    Toolbox and the XML versions of this dictionary use the Unicode (UTF-8)
    encoding.
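
    For readers unfamiliar with Toolbox files, the sketch below shows one
    way to split Toolbox-style "standard format" text into records. It
    assumes backslash-prefixed field markers with \lx starting each entry.
    The markers \lx, \ge and \gn and the sample entries are hypothetical
    placeholders, and the markers actually used in the Mawukakan Lexicon
    may differ.

```python
def parse_toolbox(text: str, record_marker: str = "lx") -> list:
    """Split Toolbox-style text into records.

    Each record is a list of (marker, value) pairs. A new record starts
    at every line whose marker equals `record_marker` (assumed "lx").
    """
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            if marker == record_marker:
                current = []
                records.append(current)
            # Fields before the first record (file header) are skipped.
            if current is not None:
                current.append((marker, value.strip()))
        elif line.strip() and current:
            # A line without a marker continues the previous field.
            prev_marker, prev_value = current[-1]
            joined = (prev_value + " " + line.strip()).strip()
            current[-1] = (prev_marker, joined)
    return records


# Hypothetical sample; headwords and markers are placeholders.
sample = (
    "\\_sh v3.0  400  Dictionary\n"
    "\n"
    "\\lx headword1\n"
    "\\ge water\n"
    "\\gn eau\n"
    "\n"
    "\\lx headword2\n"
    "\\ge house\n"
    "\\gn maison\n"
)

for record in parse_toolbox(sample):
    print(dict(record))
```

    Since both versions of the dictionary are UTF-8 encoded, a real file
    would be read with open(path, encoding="utf-8") before being passed
    to a function like this one.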

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573
    1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium    Phone: (215) 573-1275
    3600 Market Street            Fax: (215) 573-2175
    Suite 810                     ldc@ldc.upenn.edu
    Philadelphia, PA 19104        http://www.ldc.upenn.edu


