[Corpora-List] New LDC Corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Thu Jul 07 2005 - 22:18:23 MET DST

    LDC2005T20
    Arabic Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis)
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20>

    LDC2005T10
    Chinese English News Magazine Parallel Text
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T10>

    LDC2005S14
    Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S14>

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of three new corpora.

    ------------------------------------------------------------------------

    Arabic Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis)
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20>
    supports the development of data-driven approaches to natural language
    processing (NLP), human language technologies, automatic content
    extraction (topic extraction and/or grammar extraction), cross-lingual
    information retrieval, information detection, and other forms of
    linguistic research on Modern Standard Arabic in general. The LDC was
    sponsored to develop an Arabic POS and Treebank of 1,000,000 words, and
    this corpus is part three of that project. In this release, both
    syntactic (treebank) annotation and annotation on part of speech (POS),
    gloss, and word segmentation are provided.

    The current Arabic Treebank: Part 3 corpus consists of 600 stories from
    the An Nahar News Agency. The new features include complete vocalization
    of all Imperfect Verb mood endings: Indicative, Subjunctive, and Jussive.

    *

    Chinese English News Magazine Parallel Text
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T10>
    contains Chinese news stories and their English translations drawn from
    Sinorama Magazine, Taiwan, from 1976 to 2004. The corpus totals 6,366
    story pairs, 365,568 sentence pairs, 20M Chinese characters and 9M
    English words. The source data obtained from Sinorama Magazine was
    aligned at the story level; the LDC then aligned it at the sentence
    level using champollion v1.1. The Sinorama Chinese text is encoded in
    Big5.
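
    Because the Chinese side is distributed in Big5, users working in a
    UTF-8 toolchain will probably want to re-encode it first. The sketch
    below is only an illustration: the function name big5_to_utf8 is
    hypothetical, and it assumes the text decodes with Python's standard
    "big5" codec. If a file turns out to use vendor extensions, the "cp950"
    or "big5hkscs" codecs may be a better fit.

```python
def big5_to_utf8(data: bytes, codec: str = "big5") -> bytes:
    """Re-encode Big5 bytes as UTF-8 (illustrative helper, not part of the corpus)."""
    # Strict decoding raises UnicodeDecodeError on bytes that are not valid
    # in the chosen codec, so encoding problems surface instead of being hidden.
    text = data.decode(codec)
    return text.encode("utf-8")


if __name__ == "__main__":
    # Round-trip a small sample; real input would be read from a corpus file.
    sample = "中文新聞".encode("big5")
    print(big5_to_utf8(sample).decode("utf-8"))
```

    Decoding strictly rather than with errors="replace" is a deliberate
    choice here, since silently substituted characters would be hard to
    spot in parallel text later.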

    *

    Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S14>
    contains 901 calls, totaling 133.6 hours of telephone conversation
    speech in Levantine Arabic. The majority of speakers in this corpus are
    Lebanese. The corpus also includes 901 transcript files in UTF-8 format.
    Speaker information files are provided.

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call
    +1 215 573 1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium      Phone: (215) 573-1275
    3600 Market Street              Fax:   (215) 573-2175
    Suite 810                       ldc@ldc.upenn.edu
    Philadelphia, PA 19104          http://www.ldc.upenn.edu


