[Corpora-List] New LDC Corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Tue May 03 2005 - 17:40:56 MET DST


LDC2005S13
*Fisher English Training Part 2 Speech*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S13>

LDC2005T19
*Fisher English Training Part 2 Transcripts*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T19>

LDC2005L01
*Mawukakan Lexicon*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005L01>

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of three new corpora.

------------------------------------------------------------------------

Fisher English Training Part 2 Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S13>
represents the second half of a collection of conversational telephone
speech (CTS) that was collected at the LDC. It contains 5849 audio
files, each one containing a full conversation of up to 10 minutes.
Corresponding transcripts are available as Fisher English Training Part 2
Transcripts (LDC2005T19).

The individual audio files are presented in NIST SPHERE format, and
contain two-channel mu-law sample data; "shorten" compression has been
applied to all files.
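
As a rough illustration of the file layout described above, the following Python sketch reads the plain-text header that opens a NIST SPHERE file. It assumes the usual SPHERE conventions (a "NIST_1A" magic line, a header-size line, then "name -type value" fields terminated by "end_head"). The function name parse_sphere_header and the sample header values are hypothetical and are not taken from the corpus itself. The sketch does not attempt to decode the shorten-compressed samples, which would need a separate decompressor.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the text header at the start of a NIST SPHERE file.

    Returns a dict mapping field names to int, float, or str values.
    """
    first_lines = data.split(b"\n", 2)
    if len(first_lines) < 3 or first_lines[0] != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")

    # The second line gives the total header size in bytes (often 1024).
    header_size = int(first_lines[1].decode("ascii").strip())
    header = data[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in header.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore malformed lines in this sketch
        name, type_flag, value = parts
        if type_flag == "-i":
            fields[name] = int(value)
        elif type_flag == "-r":
            fields[name] = float(value)
        else:
            # "-sN" marks a string of N characters; keep it as text.
            fields[name] = value
    return fields


# Hypothetical header for illustration only; real files carry more fields.
sample = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 8000\n"
    b"channel_count -i 2\n"
    b"sample_coding -s27 ulaw,embedded-shorten-v2.00\n"
    b"end_head\n"
).ljust(1024)

fields = parse_sphere_header(sample)
print(fields)
```

In practice one would pass the first bytes of an actual .sph file, for example the result of open(path, "rb").read(1024), and check channel_count and sample_coding before handing the file to a decoder.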

Fisher English Training Part 2 Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T19>
contains the corresponding transcripts for the Fisher English Training
Part 2 Speech collection. About 12% of the conversations were
transcribed at the LDC, and the rest were done by BBN and WordWave,
using a significantly different approach to the task. A central goal in
both sets was to maximize the speed and economy of the transcription
process, and this in turn involved forgoing certain aspects of mark-up
detail and quality control that may have been common in previous,
smaller corpora.

Mawukakan Lexicon
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005L01>
is the first publication of an ongoing project aiming to build an
electronic dictionary of four Mandekan (Eastern Manding) languages of the
Mande group of the Niger-Congo family. The lack of a written tradition
makes such a dictionary project extremely important. Our expectation is
that once this initial goal is reached, it will become easier to extend
the dictionary to all the other varieties of Mandekan.

The lexicon is trilingual, that is, the target language is Mawukakan,
while English and French are used as glossing languages. Both the
Toolbox and the XML versions of this dictionary use the Unicode (UTF-8)
encoding.

------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573
1275.

--------------------------------------------------------------------

Linguistic Data Consortium          Phone: (215) 573-1275
3600 Market Street                  Fax: (215) 573-2175
Suite 810                           ldc@ldc.upenn.edu
Philadelphia, PA 19104              http://www.ldc.upenn.edu


This archive was generated by hypermail 2b29 : Tue May 03 2005 - 17:52:00 MET DST