[Corpora-List] New LDC Corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed Aug 24 2005 - 23:31:43 MET DST

      LDC2005T14
    Chinese Gigaword Release Second Edition
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>

    LDC2005S16
    MDE RT-04 Training Data Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S16>

    LDC2005T24
    MDE RT-04 Training Data Text/Annotations
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T24>

    The Linguistic Data Consortium (LDC) would like to announce the
    availability of three new corpora.

    ------------------------------------------------------------------------

    (1) Chinese Gigaword Release Second Edition
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>
    is a comprehensive archive of newswire text data in Chinese that has
    been acquired over several years by the LDC.
    This release includes all of the contents of the first release of the
    Chinese Gigaword corpus (LDC2003T09), new material from the two
    existing sources, and material from one new source. The corpus thus
    contains three distinct international sources of Chinese newswire:
    Central News Agency (Taiwan), Xinhua News Agency, and Zaobao.

    Some minor updates to the documents from the first release have been
    made; namely, the text portions of "story" type documents have been
    line-wrapped such that each line does not exceed 40 characters.
    Documents of the other types have not been modified.
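
    As an illustration of the 40-character limit, the sketch below wraps
    text so that no line exceeds a given width. It is not the LDC's actual
    procedure, which is not described here. The function name
    wrap_story_text is hypothetical, and the sketch assumes that a
    "character" is one Unicode code point and that, since Chinese text has
    no spaces between words, breaking at a fixed character count is
    acceptable.

```python
def wrap_story_text(text: str, width: int = 40) -> list[str]:
    """Split text into lines of at most `width` characters.

    Hypothetical helper: existing line breaks are kept, and each
    existing line is cut into chunks of `width` code points.
    """
    lines = []
    for paragraph in text.splitlines():
        if not paragraph:
            # Preserve blank lines between paragraphs.
            lines.append("")
            continue
        for start in range(0, len(paragraph), width):
            lines.append(paragraph[start:start + width])
    return lines


# Example: a 95-character paragraph becomes lines of 40, 40 and 15.
wrapped = wrap_story_text("中" * 95)
line_lengths = [len(line) for line in wrapped]
```

    Joining the wrapped lines back together reproduces the original
    paragraph, so no text is lost by the wrapping.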

    (2) MDE RT-04 Training Data Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S16>
    was created to provide training data for the RT-04 Fall Metadata
    Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient,
    Affordable, Reusable Speech-to-Text) Program. The goal of MDE is to
    enable technology that can take raw Speech-to-Text output and refine it
    into forms that are of more use to humans and to downstream automatic
    processes. In simple terms, this means the creation of automatic
    transcripts that are maximally readable. This readability might be
    achieved in a number of ways: flagging non-content words like filled
    pauses and discourse markers for optional removal; marking sections of
    disfluent speech; and creating boundaries between natural breakpoints in
    the flow of speech so that each sentence or other meaningful unit of
    speech might be presented on a separate line within the resulting
    transcript. Natural capitalization, punctuation and standardized
    spelling, plus sensible conventions for representing speaker turns and
    identity, are further elements of a readable transcript. LDC has
    defined a SimpleMDE annotation task specification and has annotated
    English telephone and broadcast news data to provide training data for
    MDE.
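
    The readability steps described above can be sketched roughly as
    follows. This is only an illustration under stated assumptions, not
    the SimpleMDE specification or the corpus file format: the Token
    class, the tag names ("filler", "dm", "edit") and the ends_unit flag
    are all hypothetical, and the capitalization and final period are a
    simplification of natural punctuation.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    # Hypothetical labels: "content", "filler" (filled pause),
    # "dm" (discourse marker), "edit" (disfluent speech).
    tag: str = "content"
    # True if this token closes a sentence-like unit.
    ends_unit: bool = False


def render_readable(tokens, drop=("filler", "dm", "edit")):
    """Return one cleaned-up line per unit, omitting flagged tokens."""
    lines, current = [], []

    def flush():
        # Emit the collected words as one capitalized line.
        if current:
            sentence = " ".join(current)
            lines.append(sentence[0].upper() + sentence[1:] + ".")
            current.clear()

    for tok in tokens:
        if tok.tag not in drop:
            current.append(tok.word)
        if tok.ends_unit:
            flush()
    flush()  # handle a trailing unit with no explicit boundary
    return lines


raw = [
    Token("um", "filler"), Token("you know", "dm"), Token("i", "edit"),
    Token("i"), Token("went"), Token("home", ends_unit=True),
    Token("uh", "filler"), Token("it"), Token("was"),
    Token("late", ends_unit=True),
]
readable = render_readable(raw)
```

    Passing drop=() keeps every token, which reflects that removal of
    the flagged words is optional.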

    (3) MDE RT-04 Training Data Text/Annotations
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T24>
    was created to provide training data for the RT-04 Fall Metadata
    Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient,
    Affordable, Reusable Speech-to-Text) Program. In this release, some
    original annotations have been re-mapped to new MDE elements to support
    better annotation consistency. In particular, the mapping affects
    Discourse Responses (DR), Discourse Markers (DM) and Backchannel SUs (BC).

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call
    +1 215 573 1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium     Phone: (215) 573-1275
    3600 Market Street             Fax: (215) 573-2175
    Suite 810                      ldc@ldc.upenn.edu
    Philadelphia, PA 19104         http://www.ldc.upenn.edu


