[Corpora-List] LDC News

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed Oct 05 2005 - 22:21:30 MET DST


    *Free TalkBank Corpora Still Available!*

    LDC2005T33
    *BBN Pronoun Coreference and Entity Type Corpus
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T33>*

    LDC2005T23
    *Chinese Proposition Bank 1.0
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T23>*

    LDC2005S25
    *Santa Barbara Corpus of Spoken American English Part-IV
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25>*

    The Linguistic Data Consortium would like to announce the availability
    of free TalkBank data and of three new corpora.

    ------------------------------------------------------------------------

    TalkBank <http://www.talkbank.org/> is an interdisciplinary research
    project funded by a five-year NSF grant to foster research and
    development in communicative behavior by providing tools and standards
    for analysis and distribution of language data. The LDC distributes the
    following TalkBank corpora:

      LDC2003V01 FORM2 Kinematic Gesture
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003V01>
    - gesture annotation scheme designed to capture the kinematic
    information in gesture from videos of speakers
     
      LDC2003L01 Grassfields Bantu Fieldwork: Dschang Lexicon
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003L01>
    - spoken lexicon with 5000+ sound files
     
      LDC2003S02 Grassfields Bantu Fieldwork: Dschang Tone Paradigms
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S02>
    - tone paradigms along with phonetic and tonological transcriptions
     
      LDC2001S16 Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001S16>
    - tone paradigms along with phonetic and tonological transcriptions
     
      LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L01>
    - for morphological analysis and generation applications
     
      LDC2004T03 Morphologically Annotated Korean Text
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T03>
    - annotated morphological analysis and part-of-speech tags
     
      LDC2003T15 SLX Corpus of Classic Sociolinguistic Interviews
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T15>
    - 8 interviews conducted by William Labov, plus transcripts, variable
    survey and annotation tools
     
      LDC2003S06 Santa Barbara Corpus of Spoken American English Part-II
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003S06>
    - recordings of natural speech from all over the U.S.

      LDC2004S10 Santa Barbara Corpus of Spoken American English Part-III
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S10>
    - recordings of natural speech from all over the U.S.

      LDC2005S25 Santa Barbara Corpus of Spoken American English Part-IV
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25>
    - over 5 hours of recordings of natural speech from all over the U.S.
     
      LDC2004S12 Talkbank Ethology Data: Field Recordings of Vervet Monkey
    Calls
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004S12>
    - 60 recordings with corresponding annotations

    Grant-sponsored copies for all of the above corpora are still
    available. Shipping and handling charges apply. Please contact the LDC
    to learn whether your organization is eligible to receive a free copy.

    * BBN Pronoun Coreference and Entity Type Corpus
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T33>
    supplements the 1 million word Penn Treebank corpus of Wall Street
    Journal texts (LDC95T7). The corpus contains stand-off annotation of
    pronoun coreference, indicated by sentence and token numbers, as well as
    annotation of a variety of entity and numeric types. All annotation was
    done by hand at BBN using proprietary annotation tools. This corpus was
    developed by BBN to support the ACE and AQUAINT programs.

    The corpus contains two components:

        * Pronoun coreference. Stand-off annotation of pronoun coreference
          of the WSJ corpus is provided in a single file. Pronouns and
          antecedents are indexed by sentence and token numbers.

        * Entity types. The corpus includes annotation of 12 named entity
          types (Person, Facility, Organization, GPE, Location, Nationality,
          Product, Event, Work of Art, Law, Language, and Contact-Info),
          nine nominal entity types (Person, Facility, Organization, GPE,
          Product, Plant, Animal, Substance, Disease and Game), and seven
          numeric types (Date, Time, Percent, Money, Quantity, Ordinal and
          Cardinal). Several of these types are further divided into
          subtypes. Annotation for a total of 64 subtypes is provided.
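
    As a rough illustration of what stand-off indexing by sentence and
    token numbers means, here is a minimal Python sketch. The span layout
    (sentence index, first token, last token, zero-based and inclusive),
    the function name and the toy sentences are all assumptions made for
    this example; the corpus's actual file format is not described in this
    announcement and may differ.

```python
from typing import List, Tuple

# Hypothetical span layout: (sentence_index, first_token, last_token),
# zero-based and inclusive. The real corpus file may use another layout.
Span = Tuple[int, int, int]


def resolve(sentences: List[List[str]], span: Span) -> str:
    """Return the text that a stand-off span points at."""
    sent, start, end = span
    return " ".join(sentences[sent][start:end + 1])


# Toy tokenized text standing in for Treebank sentences.
sentences = [
    ["The", "company", "reported", "higher", "earnings", "."],
    ["It", "also", "raised", "its", "dividend", "."],
]

# One hypothetical coreference link: the pronoun span and its antecedent span.
antecedent: Span = (0, 0, 1)
pronoun: Span = (1, 0, 0)

# The annotation stores only indices; the words come from the source text.
print(resolve(sentences, pronoun), "->", resolve(sentences, antecedent))
# prints: It -> The company
```

    Because the annotation holds only indices, the underlying Treebank
    text stays untouched and is looked up when needed.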

    * Chinese Proposition Bank 1.0
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T23>
    is the first public release of the Penn Chinese Proposition Bank
    project, which aims to create a corpus of text annotated with
    information about basic semantic propositions. Specifically,
    predicate-argument relations have been added to the syntactic trees of
    Chinese Treebank 5.1 as an additional layer of annotation.

    Chinese Proposition Bank 1.0 includes annotations of the first 250K
    words of the Chinese TreeBank 5.1. There are a total of 37,183
    propositions. Auxiliary verbs are not annotated. Some verbs have both
    light-verb and non-light-verb uses; in these cases only the
    non-light-verb uses are annotated. All the annotations in this release
    are the result of double-blind annotation followed by adjudication of
    differences.

    * Santa Barbara Corpus of Spoken American English Part-IV
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S25>
    is based on hundreds of recordings of natural speech from all over the
    United States, representing a wide variety of people of different
    regional origins, ages, occupations, and ethnic and social backgrounds.
    It reflects many ways that people use language in their lives:
    conversation, gossip, arguments, on-the-job talk, card games, city
    council meetings, sales pitches, classroom lectures, political speeches,
    bedtime stories, sermons, weddings, and more. The corpus was collected
    by the University of California, Santa Barbara Center for the Study of
    Discourse.

    The audio data consists of 14 WAV-format speech files, recorded in
    two-channel PCM at 22,050 Hz. The speech files total 5.75 hours of audio
    (1.5 GB), representing over 58,000 words and over 6,000 unique words in
    the transcribed text.
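
    For readers who want to check these properties on a file of their own,
    the following is a small sketch using Python's standard wave module.
    The function name describe_wav and the returned field names are
    invented for this example, and nothing here is specific to the LDC
    distribution.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic format fields from a WAV file header."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),          # 2 for two-channel audio
            "sample_rate_hz": rate,                 # e.g. 22050
            "bytes_per_sample": wf.getsampwidth(),  # sample width per channel
            "duration_s": frames / rate,            # length in seconds
        }
```

    Calling describe_wav on a local file path (for instance a hypothetical
    "sample.wav") returns the channel count, sampling rate, sample width
    and duration, which can be compared against the figures stated above.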

    The cost of the first 100 copies of this publication (not counting the
    copies distributed to LDC members) is covered by NSF Grant Number
    BCS-998009, so these copies are free of charge to qualified
    researchers; a $30 shipping and handling fee applies. After these first
    100 copies are distributed, additional copies will be available for the
    production cost of $200 per DVD-ROM.

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573
    1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium        Phone: (215) 573-1275
    3600 Market Street                Fax: (215) 573-2175
    Suite 810                         ldc@ldc.upenn.edu
    Philadelphia, PA 19104            http://www.ldc.upenn.edu


