[Corpora-List] New Corpora from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Fri Apr 28 2006 - 22:56:51 MET DST

    LDC2006S16
    *CSLU Spoltech Brazilian Portuguese Version 1.0
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S16>*

    LDC2006T09
    *Korean Treebank Annotations Version 2.0
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T09>*

    LDC2006S13
    *N4 NATO Native and Non-Native Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S13>*

    LDC2006T08
    *TimeBank 1.2
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08>*

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of four new publications.

    ------------------------------------------------------------------------

    *New LDC Publications*

    (1) The CSLU Spoltech Brazilian Portuguese
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S16>
    corpus contains microphone speech from a variety of regions in Brazil
    with phonetic and orthographic transcriptions. The utterances consist of
    both read speech (for phonetic coverage) and responses to questions (for
    spontaneous speech). The corpus contains 477 speakers and 8080 separate
    utterances. A total of 2540 utterances have been transcribed at the word
    level (without time alignments), and 5479 utterances have been
    transcribed at the phoneme level (with time alignments).

    The data were recorded at 44.1 kHz (mono, 16 bit) and stored in RIFF
    format. Recordings were made with the microphone connected directly to
    a SoundBlaster-compatible sound card. For the prompted sentences, the
    sentence was hidden from view when recording began, so that the speaker
    might utter it more naturally. Recording quality was verified
    immediately after each utterance; the data-collection software allowed
    the speaker to re-record an utterance if the recording was not of
    sufficient quality. The acoustic environment was not controlled, in
    order to allow for background conditions that would occur in
    application environments.
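
    As a rough illustration of the recording format described above, the
    following Python sketch reads the header of an audio file and compares
    it with the announced parameters (44.1 kHz, mono, 16 bit). It assumes
    the RIFF files are standard PCM WAVE files readable by Python's
    built-in wave module; the function names and any file path passed in
    are hypothetical and not part of the corpus documentation.

```python
import wave

# Recording parameters as stated in the announcement.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 44100}


def describe_wav(path):
    """Return the basic format parameters of a RIFF/WAVE file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_announced_format(info):
    """True if the file is mono, 16 bit, 44.1 kHz (assumed PCM WAVE)."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

    Calling describe_wav() on one utterance file and passing the result to
    matches_announced_format() gives a quick sanity check before further
    processing.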

    (2) The Korean Treebank Annotations Version 2.0
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T09>
    is an extension of the Korean English Treebank Annotations corpus,
    LDC2002T26 (2002). It is an electronic corpus of Korean texts annotated
    with morphological and syntactic information. The original texts for
    Korean Treebank 2.0 were selected from the Korean Newswire corpus
    published by LDC (catalog number LDC2000T45), a collection of Korean
    Press Agency news articles from June 2, 1994 to March 20, 2000. Korean
    Treebank 2.0 is based on the March 2000 portion of that corpus and
    includes 647 articles. The annotated corpus has many uses, including
    training morphological analyzers, part-of-speech taggers, and syntactic
    parsers.

    (3) The N4 NATO Native and Non-Native Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S13>
    corpus was developed by the NATO research group on Speech and Language
    Technology to provide a military-oriented database for multilingual and
    non-native speech processing studies, with a focus on non-native
    accents. The group chose naval communications as the common task
    because it naturally includes a great deal of non-native speech and
    because training facilities where data could be collected were
    available in several countries.

    Speech data was recorded in the naval transmission training centers of
    four countries (Germany, the Netherlands, the United Kingdom, and
    Canada). The material consists of native and non-native speakers using
    NATO English procedure between ships and reading a text, "The North
    Wind and the Sun," in both English and the speaker's native language.

    (4) The TimeBank 1.2
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08>
    corpus contains 183 news articles that have been annotated with temporal
    information, adding events, times and temporal links between events and
    times. The annotation follows the TimeML 1.2.1 specification. The most
    recent information on TimeML is always available at www.timeml.org
    <http://www.timeml.org>.

    TimeML aims to capture and represent temporal information. This is
    accomplished using four primary tag types: TIMEX3 for temporal
    expressions, EVENT for temporal events, SIGNAL for temporal signals, and
    LINK for representing relationships. TimeBank 1.2 is distributed via
    web download.

    Nonmembers may also license this data at *no cost*; please note that a
    signed copy of our generic nonmember user agreement
    <http://www.ldc.upenn.edu/Catalog/nonmem_agree/generic.license.html> is
    required.
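
    To make the four TimeML tag types mentioned above more concrete, the
    following Python sketch counts them in a small annotated snippet. The
    sample text is hand-written for illustration and heavily simplified; it
    is not taken from TimeBank and is not a valid TimeML 1.2.1 document. In
    TimeML the LINK type is realized as TLINK, SLINK, and ALINK elements,
    which the sketch groups under a single "LINK" count. The function name
    and the sample are assumptions introduced here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written, simplified sample; NOT taken from TimeBank.
SAMPLE = """<TimeML>
On <TIMEX3 tid="t1">Friday</TIMEX3> the consortium
<EVENT eid="e1">announced</EVENT> four corpora
<SIGNAL sid="s1">after</SIGNAL> the <EVENT eid="e2">review</EVENT>.
<TLINK lid="l1"/>
</TimeML>"""

LINK_TAGS = {"TLINK", "SLINK", "ALINK"}
OTHER_TAGS = {"TIMEX3", "EVENT", "SIGNAL"}


def count_timeml_tags(xml_text):
    """Count TIMEX3, EVENT, SIGNAL and (grouped) LINK elements."""
    counts = Counter()
    for element in ET.fromstring(xml_text).iter():
        if element.tag in LINK_TAGS:
            counts["LINK"] += 1
        elif element.tag in OTHER_TAGS:
            counts[element.tag] += 1
    return counts
```

    Running count_timeml_tags(SAMPLE) tallies each tag type in the sample;
    the same function could be applied to the text of an annotated article,
    assuming it parses as well-formed XML.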

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call
    +1 215 573 1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium        Phone: (215) 573-1275
    University of Pennsylvania        Fax: (215) 573-2175
    3600 Market St., Suite 810        ldc@ldc.upenn.edu
    Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu