[Corpora-List] New LDC Corpora

From: ldc@ldc.upenn.edu
Date: Tue Sep 16 2003 - 17:31:49 MET DST

                               LDC2003T11
                        * ACE-2 Version 1.0 *

                               LDC2003T13
                * Message Understanding Conference (MUC) 6 *

    The Linguistic Data Consortium (LDC) is pleased to announce the
    availability of two new corpora.

                                   *

    ACE-2 Version 1.0 supports the Automatic Content Extraction (ACE)
    program, whose objective is to develop extraction technology to support
    automatic processing of source language data. This includes
    classification, filtering, and selection based on the language content
    of the source data, i.e., based on the meaning conveyed by the data.
    Thus, the ACE program requires the development of technologies that
    automatically detect and characterize this meaning. The ACE research
    objectives are viewed as the detection and characterization of Entities,
    Relations, and Events.

    Annotations for the ACE-2 corpus concern two research tasks: Entity
    Detection and Tracking (EDT) and Relation Detection and Characterization
    (RDC). ACE-2 contains two sets of data: training and devtest. Each of
    these sets is further divided by source: broadcast news, newspaper, and
    newswire. There are 179,007 words of source data in 519 files.

    For further information about this corpus, including a link to online
    documentation and the NIST ACE program site, please visit:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T11

    Institutions that have membership in the LDC during the 2003
    Membership Year will be able to receive this corpus free of charge.
    Nonmembers may license this publication for US$500.

                                  *

    In the 1990s, the MUC evaluations funded the development of metrics and
    statistical algorithms to support government evaluations of emerging
    information extraction technologies. The Message Understanding
    Conference (MUC) 6 corpus contains 318 annotated Wall Street Journal
    articles, scoring software, and corresponding documentation used in the
    MUC 6 evaluation. Both the MUC 6 Additional News Text (LDC96T10) corpus
    and the MUC 6 corpus are necessary in order to replicate the evaluation.

    All the materials have been published as received from the corpus
    authors. No quality control has been conducted at the LDC; however, the
    text files have been uncompressed.

    For further information, including online documentation and a link to
    NIST's MUC pages, please visit:

    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T13

    Institutions that have membership in the LDC during the 2003
    Membership Year will be able to receive this corpus free of charge.
    Nonmembers may license this publication for US$100.

                                  *

    MUC VI Text Collection (LDC96T10) has been renamed MUC 6 Additional News
    Text. The new title more accurately reflects the corpus data, which
    consists only of additional training materials for the MUC 6 evaluation.

    If you need additional information before placing your order, or
    would like to inquire about membership in the LDC, please send email to
    ldc@ldc.upenn.edu or call (215) 573-1275.

    ---------------------------------------------------------------------
    Linguistic Data Consortium      Phone: (215) 573-1275
    3600 Market Street              Fax:   (215) 573-2175
    Suite 810                       email: ldc@ldc.upenn.edu
    Philadelphia, PA 19104-2653     www:   http://www.ldc.upenn.edu


