[Corpora-List] LDC Online and New Corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed Mar 30 2005 - 18:53:34 MET DST

    New LDC Online Services!
    <https://online.ldc.upenn.edu/login.html>

    LDC2005T09
    ACE 2004 Multilingual Training Corpus
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T09>

    LDC2005T06
    Chinese News Translation Text Part 1
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T06>

    LDC2005T08
    Discourse Graphbank
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T08>

    The LDC would like to announce the availability of a new LDC Online
    service and the release of three new corpora.

    ------------------------------------------------------------------------

    The LDC is pleased to announce that an improved LDC Online service is
    now available. LDC Online can be accessed at the following URL:

    https://online.ldc.upenn.edu/login.html

    Organizations that hold 2005 Membership in the LDC will be able to
    perform text searches on our entire English Gigaword corpus. This
    corpus is a comprehensive archive of newswire text data that has been
    acquired over several years by the LDC. Current members will also be
    able to access the American English Spoken Lexicon (AESL). AESL
    contains pronunciations in individual audio files for more than 50,000
    of the most common words in English.

    Even if your organization is not a current member, you can access LDC
    Online through a guest account. As a guest, you will be able to access
    the American English Spoken Lexicon.

    We will offer periodic updates to LDC Online to include new corpora and
    search functions. Please check in with us often as we anticipate this
    will be an exciting offering.

    ------------------------------------------------------------------------

    ACE 2004 Multilingual Training Corpus
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T09>
    contains the complete set of English, Arabic and Chinese training data
    for the 2004 Automatic Content Extraction (ACE) technology evaluation.
    The objective of the ACE program is to develop automatic content
    extraction technology to support automatic processing of human language
    in text form.

    Sites were evaluated on system performance in six areas: Entity
    Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR
    Co-reference, Relation Detection and Recognition (RDR), Relation Mention
    Detection (RMD), and RDR given reference entities. All tasks were
    evaluated in three languages: English, Chinese and Arabic.

    Chinese News Translation Text Part 1
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T06>
    was created to support the development of automatic machine translation
    systems. The LDC was sponsored to solicit English translations for a
    single set of Chinese source materials.

    The source Chinese text and its English translations were selected and
    translated in different LDC projects. A total of about 474K Chinese
    characters were selected from two sources, namely Xinhua and AFP, and
    translation services were provided by seven translation agencies. Each
    Chinese news story was translated once.

    Discourse Graphbank
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T08>
    aims to define a descriptively adequate data structure for representing
    discourse coherence structures. This project also investigates the
    impact of discourse coherence structures on other linguistic processes
    and natural language applications (e.g. anaphor resolution,
    summarization, information retrieval), and seeks to develop and test
    discourse parsing algorithms. The data consists of 135 texts from AP
    Newswire and Wall Street Journal, annotated with coherence relations.
    The source for the data is TIPSTER Complete (LDC93T3A).

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573
    1275.

                            Linguistic Data Consortium
                            University of Pennsylvania
                            3600 Market St., Suite 810
                            Philadelphia, PA 19104

                            Phone: (215) 573-1275
                            Fax: (215) 573-2175
                            ldc@ldc.upenn.edu
                            http://www.ldc.upenn.edu


