[Corpora-List] New Corpora from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Fri Apr 28 2006 - 22:56:51 MET DST

LDC2006S16
*CSLU Spoltech Brazilian Portuguese Version 1.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S16>*

LDC2006T09
*Korean Treebank Annotations Version 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T09>*

LDC2006S13
*N4 NATO Native and Non-Native Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S13>*

LDC2006T08
*TimeBank 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08>*

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of four new publications.

------------------------------------------------------------------------

*New LDC Publications*

(1) The CSLU Spoltech Brazilian Portuguese
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S16>
corpus contains microphone speech from a variety of regions in Brazil
with phonetic and orthographic transcriptions. The utterances consist of
both read speech (for phonetic coverage) and responses to questions (for
spontaneous speech). The corpus contains 477 speakers and 8080 separate
utterances. A total of 2540 utterances have been transcribed at the word
level (without time alignments), and 5479 utterances have been
transcribed at the phoneme level (with time alignments).

The data have been recorded at 44.1 kHz (mono, 16 bit) and stored in
RIFF format. The recording was conducted with a direct connection from
the microphone to the sound card. The sound card was
SoundBlaster-compatible. For the prompted sentences, the sentence was
hidden from view when recording began, so that the speaker might utter
the sentence more naturally. Verification of the recording quality was
performed immediately after each utterance recording; the
data-collection software allowed the speaker to re-record utterances in
case the recording was not of sufficient quality. The acoustic
environment was not controlled, in order to allow for background
conditions that would occur in application environments.
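
As a quick sanity check on the recording format described above, the
following minimal sketch tests whether a file is 44.1 kHz, mono, 16 bit.
It assumes the RIFF files are plain PCM WAVE files that Python's standard
wave module can open; the announcement does not state this. The function
name is a hypothetical one chosen for illustration.

```python
import wave

# Expected properties, taken from the announcement text.
EXPECTED_RATE_HZ = 44100
EXPECTED_CHANNELS = 1            # mono
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit


def matches_announced_format(path):
    """Return True if the WAVE file at `path` is 44.1 kHz, mono, 16 bit.

    Assumption: the file is uncompressed PCM in a RIFF/WAVE container.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )
```

Calling matches_announced_format on each file after download would flag
any file that was resampled or converted along the way.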

(2) The Korean Treebank Annotations Version 2.0
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T09>
is an extension of the Korean English Treebank Annotations corpus,
LDC2002T26 (2002). It is essentially an electronic corpus of Korean
texts annotated with morphological and syntactic information. The
original texts for the Korean Treebank 2.0 were selected from The Korean
Newswire corpus published by LDC, catalog number LDC2000T45, which is a
collection of Korean Press Agency news articles from June 2, 1994 to
March 20, 2000. Korean Treebank 2.0 is based on the March 2000 portion
of the corpus and includes 647 articles. The annotated corpus has many
uses, including the training of morphological analyzers, part-of-speech
taggers and syntactic parsers.

(3) The N4 NATO Native and Non-Native Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S13>
corpus was developed by the NATO research group on Speech and Language
Technology in order to provide a military-oriented database for
multilingual and non-native speech processing studies. The NATO Speech
and Language Technology group decided to create a corpus geared towards
the study of non-native accents. The group chose naval communications as
the common task because it naturally includes a great deal of non-native
speech and because there were training facilities where data could be
collected in several countries.

Speech data were recorded in the naval transmission training centers of
four countries (Germany, the Netherlands, the United Kingdom, and Canada).
The material consists of native and non-native speakers using NATO
English procedure between ships and reading a text, "The North Wind and
the Sun", in both English and the speaker's native language.

(4) The TimeBank 1.2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08>
corpus contains 183 news articles that have been annotated with temporal
information, adding events, times and temporal links between events and
times. The annotation follows the TimeML 1.2.1 specification. The most
recent information on TimeML is always available at www.timeml.org
<http://www.timeml.org>.

TimeML aims to capture and represent temporal information. This is
accomplished using four primary tag types: TIMEX3 for temporal
expressions, EVENT for temporal events, SIGNAL for temporal signals, and
LINK for representing relationships. TimeBank 1.2 is distributed via
web download.

Nonmembers may also license this data at *no cost* - please note that a
signed copy of our generic nonmember user agreement
<http://www.ldc.upenn.edu/Catalog/nonmem_agree/generic.license.html> is
required.
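
To illustrate the four TimeML tag types named above, here is a minimal
sketch that counts them in a TimeML-style XML string. The sample text is
hand-made and simplified for illustration; it is not taken from TimeBank
and is not guaranteed to be valid against the TimeML 1.2.1 specification.
The sketch assumes that the LINK category is realized in files as the
TLINK, SLINK and ALINK elements, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-made, simplified sample (an assumption, not corpus data).
SAMPLE = """<TimeML>
The company <EVENT eid="e1">announced</EVENT> the deal
<SIGNAL sid="s1">on</SIGNAL> <TIMEX3 tid="t1" type="DATE">Friday</TIMEX3>.
<TLINK lid="l1" relType="IS_INCLUDED"/>
</TimeML>"""

# Assumption: LINK relationships appear as these three element names.
LINK_TAGS = {"TLINK", "SLINK", "ALINK"}
BASIC_TAGS = {"TIMEX3", "EVENT", "SIGNAL"}


def count_timeml_tags(xml_text):
    """Count TIMEX3, EVENT, SIGNAL and (grouped) LINK elements."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        if element.tag in BASIC_TAGS:
            counts[element.tag] += 1
        elif element.tag in LINK_TAGS:
            counts["LINK"] += 1
    return counts


if __name__ == "__main__":
    for tag, n in sorted(count_timeml_tags(SAMPLE).items()):
        print(tag, n)
```

Grouping the three link elements under one LINK count mirrors the
four-way summary in the announcement; a real analysis would likely keep
them separate.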

------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc@ldc.upenn.edu or call
+1 215 573 1275.

--------------------------------------------------------------------

Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc@ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu
