[Corpora-List] New Releases from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Mon Mar 07 2005 - 17:12:13 MET

    The Linguistic Data Consortium (LDC) would like to announce the
    availability of three new corpora.

    ------------------------------------------------------------------------

    (1) ACE Time Normalization (TERN) 2004 English Training Data
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T07>
    contains the English training data prepared for the 2004 Time Expression
    Recognition and Normalization (TERN) Evaluation. The purpose of this
    corpus and the TERN evaluation is to advance the state of the art in the
    automatic recognition and normalization of natural language temporal
    expressions. In most language contexts such expressions are indexical.
    For example, with "Monday", "last week", or "three months starting
    October 1", one must know the narrative reference time in order to
    pinpoint the time interval being conveyed by the expression.

    In addition, for data exchange purposes, it is essential that the
    identified interval be rendered according to an established standard,
    i.e., normalized. Accurate identification and normalization of temporal
    expressions is in turn essential for the temporal reasoning being
    demanded by advanced NLP applications such as question answering,
    information extraction, and summarization.
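
    As a rough illustration of why the reference time matters, the sketch
    below resolves two indexical expressions against a reference date and
    renders the results as ISO 8601 strings. It is not part of the corpus
    or of the TERN guidelines: the function names, the reading of a bare
    weekday as the most recent such day, the Monday-to-Sunday week, and the
    choice of ISO 8601 as the output format are all assumptions made for
    this example.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_weekday(name, reference):
    """Resolve a bare weekday name relative to a reference date.

    Assumption: the expression denotes the most recent such day on or
    before the reference date. Real text may just as well mean the
    upcoming one; only context can decide.
    """
    target = WEEKDAYS.index(name.lower())
    days_back = (reference.weekday() - target) % 7
    return reference - timedelta(days=days_back)


def resolve_last_week(reference):
    """Return (start, end) of the week before the one containing reference.

    Assumption: weeks run Monday to Sunday, as in ISO 8601.
    """
    this_monday = reference - timedelta(days=reference.weekday())
    start = this_monday - timedelta(days=7)
    return start, start + timedelta(days=6)


def iso_week(d):
    """Render the ISO 8601 week containing d, e.g. '2005-W09'."""
    year, week, _ = d.isocalendar()
    return f"{year}-W{week:02d}"


if __name__ == "__main__":
    reference = date(2005, 3, 9)  # hypothetical narrative reference time
    print(resolve_weekday("Monday", reference).isoformat())  # 2005-03-07
    start, end = resolve_last_week(reference)
    print(start.isoformat(), end.isoformat())  # 2005-02-28 2005-03-06
    print(iso_week(start))  # 2005-W09
```

    The same strings resolved against a different reference date yield
    different intervals, which is why a document's reference time has to
    be fixed before its temporal expressions can be normalized.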

    (2) Arabic Treebank: Part 1 v 3.0 (POS with full vocalization and
    syntactic analysis)
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T02>
    is a re-release of the LDC corpus Arabic Treebank: Part 1 v 2.0, with the
    addition of improved morphological/part-of-speech annotation including
    full vocalization and case endings. The corpus supports the development
    of data-driven approaches to natural language processing (NLP), human
    language technologies, automatic content extraction, cross-lingual
    information retrieval, information detection, and other forms of
    linguistic research on Modern Standard Arabic.

    The project targets the description of a written Modern Standard Arabic
    corpus from the Agence France Presse (AFP) newswire archives for
    July-November 2000. This corpus includes 734 stories representing 145K
    words.

    (3) Multiple Translation Arabic (MTA) Part 2
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T05>
    supports the development of automatic means for evaluating translation
    quality. The corpus contains 4 sets of human translations and 2 sets of
    commercial off-the-shelf (COTS) system outputs for a single set of
    Arabic source materials. Additionally, there is one output set from a
    TIDES 2003 MT Evaluation participant, which is representative of
    state-of-the-art research systems.

    To see whether automatic evaluation metrics such as BLEU track human
    assessment, the LDC performed human assessment on the two COTS outputs
    and the TIDES research system output. The corpus includes the
    assessment results for one of the two COTS systems, the assessment
    results for the TIDES research system, and the specifications used for
    conducting the assessments.

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573
    1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium      Phone: (215) 573-1275
    3600 Market Street              Fax: (215) 573-2175
    Suite 810                       ldc@ldc.upenn.edu
    Philadelphia, PA 19104          http://www.ldc.upenn.edu


