[Corpora-List] News from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Mon Dec 12 2005 - 19:18:46 MET

    ** New LDC Online Membership! **

    LDC2005S26
    ** CSLU: 22 Languages Corpus
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S26> **

    LDC2005T34
    ** Chinese <-> English Name Entity Lists (v1.0)
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T34> **

    LDC2005S30
    ** The West Point Company G3 American English Speech Data Corpus
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S30> **

    The Linguistic Data Consortium (LDC) would like to announce a new
    membership option, the LDC Online Membership, and provide information
    regarding our new publications.

    ------------------------------------------------------------------------

    *LDC Online Membership*

    The Linguistic Data Consortium is pleased to announce the LDC Online
    Membership, which is now available for the 2006 Membership year. LDC
    Online contains a continuously growing, indexed collection of Arabic,
    Chinese and English newswire text, millions of words of English
    telephone speech from the Switchboard and Fisher collections and the
    American English Spoken Lexicon, as well as the full text of the Brown
    corpus. With LDC Online, users can search textual data and play audio
    extracts for transcribed utterances on standard web browsers. LDC will
    continue to add new material to LDC Online, including Spanish, Arabic,
    and Chinese conversational telephone data in 2006.
     
    The LDC Online Membership is a reduced-cost alternative that provides
    interactive access to a growing subset of LDC data for users who do not
    need linguistic data on media. Current LDC members already
    have access to all LDC Online resources. The LDC Online Membership is
    available to Non-Profit and U.S. government organizations for $1,000
    (USD) per calendar year (January to December). The obligations and data
    usage restrictions of the LDC Online Membership are contained in the LDC
    Online Membership Agreement
    <http://www.ldc.upenn.edu/Membership/Agreements/LDCOnline.Agrmnt.new.htm>.

    We invite you to try LDC Online if you have not already done so. Please
    go to http://online.ldc.upenn.edu for a free, limited demonstration and
    to sign up for a non-member LDC Online account. To become an LDC Online
    member or to request additional information, contact the LDC Membership
    Department at ldc@ldc.upenn.edu.

    We hope that the LDC Online Membership will enhance your linguistic
    research and your association with the LDC.

    *New Publications*

    (1) The CSLU: 22 Languages Corpus
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S26>
    was produced by the Center for Spoken Language Understanding at Oregon
    Health & Science University. The corpus consists of telephone speech
    from the following languages: Arabic, Cantonese, Czech, Farsi, German,
    Hindi, Hungarian, Japanese, Korean, Malay, Mandarin, Italian, Polish,
    Portuguese, Russian, Spanish, Swedish, Swahili, Tamil, Vietnamese, and
    English. The corpus contains fixed vocabulary utterances (e.g. days of
    the week) as well as fluent continuous speech. Each of the 50,191
    utterances was verified by a native speaker to determine whether the
    caller followed instructions when answering the prompts. For this release,
    approximately 19,758 utterances have corresponding orthographic
    transcriptions.

    (2) Chinese <-> English Name Entity Lists (v1.0)
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T34>
    are compiled from Xinhua News Agency articles. This release consists of
    9 pairs of bi-directional lists in the following categories: Person
    Names, Place Names, Organization Names, Industry Names, Press Names,
    Other Names, and Who is Who Names. The English->Chinese version of each
    pair was created by reversing the Chinese->English list; both lists were
    sorted with the Unix sort utility.
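
    The reverse-and-sort step described above can be sketched as follows.
    This is only an illustration, not the procedure LDC actually used: the
    tab-separated line format, the function name reverse_pairs, and the
    sample entries are all assumptions, since the announcement does not
    describe the file layout.

```python
def reverse_pairs(lines):
    """Swap the two tab-separated fields on each line and sort the result.

    Assumes one "source<TAB>target" pair per line (a hypothetical layout).
    """
    reversed_lines = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        source, target = line.split("\t", 1)
        reversed_lines.append(f"{target}\t{source}")
    # sorted() compares strings by code point, which is comparable to
    # running `sort` under LC_ALL=C on UTF-8 text.
    return sorted(reversed_lines)


# Made-up sample entries, for illustration only.
zh_to_en = ["北京\tBeijing", "上海\tShanghai", "广州\tGuangzhou"]
en_to_zh = reverse_pairs(zh_to_en)
print("\n".join(en_to_zh))
```

    Note that the Unix sort utility orders lines according to the active
    locale, so a sketch like this reproduces its ordering only under the
    stated assumption.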

    (3) The West Point Company G3 American English Speech Data Corpus
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S30>
    was produced by the Center for Technology Enhanced Language Learning, part
    of the U.S. Military Academy's Department of Foreign Languages. During
    the 2000-2001 academic year, cadets, staff and faculty members at the
    United States Military Academy volunteered to participate in a speech
    data collection project for American English. The goal of the project
    was to amass recordings from no fewer than one hundred adult speakers,
    fifty males and fifty females, to form a substantial corpus of
    high-quality read speech.

    The 185 sentences comprising the data collection script were written to
    elicit examples of all or nearly all of the possible syllables used in
    spoken American English. The G3 Corpus audio data comes from 53 female
    and 56 male volunteers, each of whom recorded approximately 104
    utterances. The recordings are sampled at 16-bit resolution and 22,050
    samples per second. Recordings were made using headset microphones
    (Shure M10) with preamplifiers attached to the line input jack of
    desktop computers. The total amount of speech is about 15 hours.
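
    As a rough consistency check on the figures above, the arithmetic can
    be sketched as below. The speaker counts, utterances per speaker,
    sampling parameters and total duration come from the announcement;
    the single-channel (mono) setting is an assumption, and because the
    inputs are approximate the results are only estimates.

```python
SAMPLE_RATE_HZ = 22_050        # samples per second (from the announcement)
BYTES_PER_SAMPLE = 2           # 16-bit resolution
CHANNELS = 1                   # assumption: mono recordings
TOTAL_HOURS = 15               # "about 15 hours"
SPEAKERS = 53 + 56             # female + male volunteers
UTTERANCES_PER_SPEAKER = 104   # "approximately 104"

total_seconds = TOTAL_HOURS * 3600
total_utterances = SPEAKERS * UTTERANCES_PER_SPEAKER
mean_utterance_seconds = total_seconds / total_utterances
raw_bytes = total_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS

print(f"utterances:        {total_utterances}")
print(f"mean length (s):   {mean_utterance_seconds:.1f}")
print(f"raw audio (GB):    {raw_bytes / 1e9:.2f}")
```

    Under these assumptions the corpus holds roughly 11,000 utterances
    averaging just under five seconds each, or about 2.4 GB of
    uncompressed audio.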

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call
    +1 215 573 1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium      Phone: (215) 573-1275
    3600 Market Street              Fax: (215) 573-2175
    Suite 810                       ldc@ldc.upenn.edu
    Philadelphia, PA 19104          http://www.ldc.upenn.edu


    This archive was generated by hypermail 2b29 : Mon Dec 12 2005 - 20:00:50 MET