[Corpora-List] New LDC Corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed Aug 24 2005 - 23:31:43 MET DST

LDC2005T14
Chinese Gigaword Release Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>

LDC2005S16
MDE RT-04 Training Data Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S16>

LDC2005T24
MDE RT-04 Training Data Text/Annotations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T24>

The Linguistic Data Consortium (LDC) would like to announce the
availability of three new corpora.

------------------------------------------------------------------------

(1) Chinese Gigaword Release Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T14>
is a comprehensive archive of newswire text data in Chinese that has
been acquired over several years by the LDC.
This release includes all of the contents of the first release of the
Chinese Gigaword corpus (LDC2003T09), material from one new source, and
new material from the other two sources. The corpus thus contains three
distinct international sources of Chinese newswire: Central News Agency
(Taiwan), Xinhua News Agency, and Zaobao.

Some minor updates to the documents from the first release have been
made; namely, the text portions of "story" type documents have been
line-wrapped such that each line does not exceed 40 characters.
Documents of the other types have not been modified.
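
As a rough illustration of the line-wrapping described above, the
following Python sketch breaks a run of text into lines of at most 40
characters. It is not the script used to prepare the release: the
function name wrap_fixed is hypothetical, and it assumes that
"characters" means Unicode code points and that lines may be broken at
any position, since Chinese text has no spaces between words.

```python
from typing import List


def wrap_fixed(text: str, width: int = 40) -> List[str]:
    """Split text into consecutive lines of at most `width` characters.

    Assumption: one Python str element counts as one character, and a
    break is allowed anywhere in the text.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    return [text[i:i + width] for i in range(0, len(text), width)]


# A made-up 95-character "story" body used only to exercise the function.
story = "新" * 95
lines = wrap_fixed(story)
line_lengths = [len(line) for line in lines]
```

Joining the resulting lines back together reproduces the original text,
so wrapping of this kind changes only the layout, not the content.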

(2) MDE RT-04 Training Data Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S16>
was created to provide training data for the RT-04 Fall Metadata
Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient,
Affordable, Reusable Speech-to-Text) Program. The goal of MDE is to
enable technology that can take raw Speech-to-Text output and refine it
into forms that are of more use to humans and to downstream automatic
processes. In simple terms, this means the creation of automatic
transcripts that are maximally readable. This readability might be
achieved in a number of ways: flagging non-content words like filled
pauses and discourse markers for optional removal; marking sections of
disfluent speech; and creating boundaries between natural breakpoints in
the flow of speech so that each sentence or other meaningful unit of
speech might be presented on a separate line within the resulting
transcript. Natural capitalization, punctuation, and standardized
spelling, plus sensible conventions for representing speaker turns and
identity, are further elements of a readable transcript. LDC has
defined a SimpleMDE annotation task specification and has annotated
English telephone and broadcast news data to provide training data for
MDE.
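
To make the readability goal above concrete, here is a small Python
sketch of how flagged tokens might be turned into a cleaned transcript.
It is only an illustration: the label names, the token tuples, and the
render function are hypothetical and do not reflect the SimpleMDE
specification or the file formats in these releases.

```python
from typing import List, Tuple

# Hypothetical labels for illustration only; not the SimpleMDE tag set.
FILLER_LABELS = {"filled_pause", "discourse_marker"}

# Each token is (text, label, ends_unit); ends_unit marks a boundary
# after which the next unit starts on a new line.
Token = Tuple[str, str, bool]


def render(tokens: List[Token], drop_fillers: bool = True) -> List[str]:
    """Return one output line per unit, optionally omitting fillers."""
    lines: List[str] = []
    current: List[str] = []
    for text, label, ends_unit in tokens:
        if not (drop_fillers and label in FILLER_LABELS):
            current.append(text)
        if ends_unit:
            if current:
                lines.append(" ".join(current))
            current = []
    if current:  # flush a trailing unit that has no explicit boundary
        lines.append(" ".join(current))
    return lines


# Made-up example input.
tokens = [
    ("um", "filled_pause", False),
    ("we", "word", False),
    ("met", "word", False),
    ("yesterday", "word", True),
    ("you know", "discourse_marker", False),
    ("it", "word", False),
    ("went", "word", True),
]
cleaned = render(tokens)
verbatim = render(tokens, drop_fillers=False)
```

Because the fillers are flagged rather than deleted, the same annotated
input can produce either a cleaned or a verbatim rendering.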

(3) MDE RT-04 Training Data Text/Annotations
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T24>
was created to provide training data for the RT-04 Fall Metadata
Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient,
Affordable, Reusable Speech-to-Text) Program. In this release, some
original annotations have been re-mapped to new MDE elements to support
better annotation consistency. In particular, the mapping affects
Discourse Responses (DR), Discourse Markers (DM) and Backchannel SUs (BC).

------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc@ldc.upenn.edu or call
+1 215 573 1275.

--------------------------------------------------------------------

Linguistic Data Consortium        Phone: (215) 573-1275
3600 Market Street                Fax: (215) 573-2175
Suite 810                         ldc@ldc.upenn.edu
Philadelphia, PA 19104            http://www.ldc.upenn.edu

