[Corpora-List] News from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed Apr 05 2006 - 22:29:30 MET DST

    *Agreement between AsiaNet and LDC*

    LDC2006S15
    *CSLU: Spelled and Spoken Words*
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S15>

    LDC2006T03
    *Korean Propbank*
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T03>

    LDC2006S30
    *Speech Controlled Computing*
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S30>

    The Linguistic Data Consortium (LDC) would like to highlight recent
    developments and announce the availability of three new publications.

    ------------------------------------------------------------------------

    *Agreement between AsiaNet and LDC*

    LDC has recently entered into a data license agreement with AsiaNet, a
    consortium of Asia Pacific news agencies headquartered in Australia.
    AsiaNet translates and distributes full text (unedited) press releases
    to all forms of media worldwide through its Asia Pacific agencies and
    affiliates in the US, Canada and Europe. AsiaNet also has the capacity
    to deliver images, audio and video releases.

    The LDC/AsiaNet agreement gives LDC access to AsiaNet's multilingual
    texts. LDC is already utilizing AsiaNet's Urdu and Thai texts in the
    Less Commonly Taught Languages (LCTL) project.

    LDC and AsiaNet look forward to a long and fruitful association,
    mutually supporting language-related education, research and technology
    development. As it strengthens its ties with the LDC and becomes more
    widely known, AsiaNet hopes to attract interest in its services through
    its news agency contacts at <http://www.asianetnews.net/home.asp>.

    *New Publications from the LDC*

    The CSLU: Spelled and Spoken Words
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S15>
    corpus consists of spelled and spoken words. 3647 callers were prompted
    to say and spell their first and last names, to say what city they
    grew up in and what city they were calling from, and to answer two
    yes/no questions. In order to collect sufficient instances of each
    letter, 1371 callers also recited the English alphabet with pauses
    between the letters. Each call was transcribed by two people, and all
    differences were resolved. In addition, a subset of 2648 calls has been
    phonetically labeled.

    Korean Propbank
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T03>
    is a semantic annotation of the Korean English Treebank Annotations and
    Korean Treebank version 2.0. Each verb and adjective occurring in the
    Treebank has been treated as a semantic predicate and the surrounding
    text has been annotated for arguments and adjuncts of the predicate. The
    verbs and adjectives have also been tagged with coarse-grained senses.

    There are two basic components to Korean Propbank:

        * The Verb Lexicon. A frames file, consisting of one or more frame
          sets, has been created for each predicate occurring in the
          Treebank. These files serve as a reference for the annotators and
          for users of the data. 2,749 such files have been created.
        * The Annotation. There are two annotation files. The
          virginia-verbs.pb file has 9,588 annotated predicate tokens. These
          predicate tokens include all those occurring in over 54 thousand
          words of the Korean English Treebank Annotations, totaling ~791 KB
          of uncompressed data. The newswire-verbs.pb file has 23,707
          annotated predicate tokens. These predicate tokens include all
          those occurring in over 131 thousand words of the Korean Treebank
          version 2.0.
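
    As a quick sanity check on the figures above, the following minimal
    sketch totals the annotated predicate tokens across the two annotation
    files. Only the file names and counts come from this announcement; the
    dictionary layout and variable names are illustrative assumptions, and
    the word counts are lower bounds ("over 54 thousand", "over 131
    thousand"), not exact sizes.

```python
# Figures quoted in the announcement. The word counts are lower bounds
# ("over N thousand words"), so they are stored as minimums.
# The dictionary layout is an illustrative assumption, not an LDC format.
annotation_files = {
    "virginia-verbs.pb": {"predicate_tokens": 9_588, "min_words": 54_000},
    "newswire-verbs.pb": {"predicate_tokens": 23_707, "min_words": 131_000},
}

total_tokens = sum(f["predicate_tokens"] for f in annotation_files.values())
total_min_words = sum(f["min_words"] for f in annotation_files.values())

print(total_tokens)     # 33295
print(total_min_words)  # 185000

# Share of all annotated predicate tokens contributed by each file.
for name, info in annotation_files.items():
    share = info["predicate_tokens"] / total_tokens
    print(f"{name}: {share:.1%} of annotated predicate tokens")
```

    Together the two files hold 33,295 annotated predicate tokens drawn
    from more than 185 thousand words of Treebank text.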

    The Speech Controlled Computing
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S30>
    corpus was designed to support the development of small footprint,
    embedded ASR applications in the domain of voice control for the home.
    It consists of the recordings of 125 speakers of American English from
    four regions, three age groups and two gender groups, pronouncing
    isolated words. The recordings were conducted in a sound-attenuated
    room, and a high-quality microphone was used. Each speaker read a
    randomized word list consisting of 2100 words (100 distinct words
    appearing 21 times each).
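
    To make the word-list structure concrete, here is a minimal sketch of
    how a randomized list of 100 distinct words repeated 21 times each
    could be built. The placeholder vocabulary, the function name
    build_word_list, and the use of a plain shuffle are all assumptions
    for illustration; the announcement does not describe the actual words
    or the randomization procedure used for the corpus.

```python
import random
from collections import Counter

DISTINCT_WORDS = 100  # from the corpus description
REPETITIONS = 21      # from the corpus description


def build_word_list(vocabulary, repetitions, seed=None):
    """Return a shuffled list in which each word occurs `repetitions` times.

    Hypothetical helper: a plain shuffle is assumed here, which may differ
    from the procedure actually used to prepare the corpus prompts.
    """
    rng = random.Random(seed)
    items = [word for word in vocabulary for _ in range(repetitions)]
    rng.shuffle(items)
    return items


# Placeholder vocabulary; the real corpus uses voice-control words
# that are not listed in this announcement.
vocabulary = [f"word_{i:03d}" for i in range(DISTINCT_WORDS)]

word_list = build_word_list(vocabulary, REPETITIONS, seed=0)
counts = Counter(word_list)

print(len(word_list))  # 2100
print(len(counts))     # 100
```

    Passing a different seed per speaker would give each speaker a
    different ordering of the same 2100 prompts.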

    **NOTE: Nonmembers may obtain a commercial rights license to Speech
    Controlled Computing for US$7000 by signing the LDC User License
    Agreement for Speech Controlled Computing
    <http://www.ldc.upenn.edu/Catalog/mem_agree/SCC_User_Agreement.htm>.
    For-Profit Membership to the LDC is not required.**

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership to the LDC, please email ldc@ldc.upenn.edu or call +1 215 573
    1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium          Phone: (215) 573-1275
    University of Pennsylvania          Fax: (215) 573-2175
    3600 Market St., Suite 810          ldc@ldc.upenn.edu
    Philadelphia, PA 19104 USA          http://www.ldc.upenn.edu


