[Corpora-List] New LDC Corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed Aug 03 2005 - 22:57:18 MET DST

    LDC2005T12
    *English Gigaword Second Edition*
    <http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T12>

    LDC2005S15
    *HKUST Mandarin Telephone Speech, Part 1*
    <http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005S15>

    LDC2005T32
    *HKUST Mandarin Telephone Transcript Data, Part 1*
    <http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T32>

    The Linguistic Data Consortium (LDC) would like to announce the
    availability of three new corpora.

    ------------------------------------------------------------------------

    English Gigaword Second Edition
    <http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T12>
    is a comprehensive archive of newswire text data in English that has
    been acquired over several years by the LDC. This release includes all
    of the contents in the first release of the English Gigaword corpus
    (LDC2003T05) as well as new data from July 2002 through December 2004.
    One minor update has been made to these documents: the text portions of
    "story" type documents have been line-wrapped so that no line exceeds
    80 characters. Documents of the other types have not been modified.
    The corpus contains five distinct international sources of English
    newswire:

    Agence France Presse English Service (afe)
    Associated Press Worldstream English Service (apw)
    Central News Agency of Taiwan English Service (cne)
    The New York Times Newswire Service (nyt)
    The Xinhua News Agency English Service (xie)
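
    The 80-character line limit on "story" text can be reproduced or
    checked with a short script. The following is a minimal sketch in
    Python, not the tool LDC used: the function names are hypothetical,
    and it assumes one paragraph of plain text is passed in at a time.

```python
import textwrap

MAX_WIDTH = 80  # line-length limit stated in the announcement


def wrap_story_text(paragraph: str, width: int = MAX_WIDTH) -> str:
    """Re-wrap one paragraph so that no output line is longer than `width`."""
    return textwrap.fill(paragraph, width=width)


def longest_line(text: str) -> int:
    """Return the length of the longest line in `text` (0 if it is empty)."""
    return max((len(line) for line in text.splitlines()), default=0)


# Illustrative input only; this is not text from the corpus.
sample = "word " * 60
wrapped = wrap_story_text(sample)
```

    Calling longest_line() on a document's text is a quick way to confirm
    that it respects the limit before further processing.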

    *

    The Hong Kong University of Science and Technology (HKUST) collected and
    transcribed 200 hours of Mandarin Chinese conversational telephone
    speech from Mandarin speakers in mainland China. HKUST Mandarin
    Telephone Speech, Part 1
    <http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005S15>
    contains the training and development sets with 873 and 24 calls,
    respectively.

    All calls were operator-assisted: an operator called the two
    participants at the scheduled time to initiate the call. Subjects
    answered demographic questions before they were bridged for normal
    conversation, and their answers were recorded in separate files.
    Subjects were allowed to talk for up to 10 minutes, and with a few
    exceptions the calls run to the maximum length. Each side of a call
    was recorded in a separate wav file, sampled at 8 kHz with 8 bits per
    sample (A-law encoded).
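
    For readers unfamiliar with the audio format, the sketch below expands
    8-bit A-law sample bytes into linear PCM values using the standard
    G.711 decoding rule, and converts a payload size into a duration at
    8 kHz. It is an illustration under assumptions: it works on raw sample
    bytes only and does not parse any file header, since the header layout
    is not described here. All function names are hypothetical.

```python
from typing import List

SAMPLE_RATE_HZ = 8000   # 8 kHz, as stated in the announcement
BYTES_PER_SAMPLE = 1    # 8-bit A-law: one byte per sample


def alaw_to_linear(code: int) -> int:
    """Decode one G.711 A-law byte into a signed linear PCM value."""
    code ^= 0x55                    # undo the even-bit inversion
    sign = code & 0x80              # in A-law, a set sign bit means positive
    exponent = (code & 0x70) >> 4   # 3-bit segment number
    mantissa = code & 0x0F          # 4-bit step within the segment
    if exponent == 0:
        magnitude = (mantissa << 4) + 8
    else:
        magnitude = ((mantissa << 4) + 0x108) << (exponent - 1)
    return magnitude if sign else -magnitude


def decode_alaw(payload: bytes) -> List[int]:
    """Decode a run of A-law bytes into a list of linear PCM values."""
    return [alaw_to_linear(b) for b in payload]


def duration_seconds(num_payload_bytes: int) -> float:
    """Duration of one call side, given the size of its sample data."""
    return num_payload_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
```

    At this rate one side of a full 10-minute call holds 600 * 8000 =
    4,800,000 sample bytes, excluding any header.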

    *

    HKUST Mandarin Telephone Transcript Data, Part 1
    <http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2005T32>
    is the corresponding transcription for HKUST Mandarin Telephone Speech,
    Part 1. The transcripts use standard simplified Chinese characters
    encoded in GBK (CP-936). The transcribed speech was segmented at natural
    boundaries wherever possible, and each segment is no more than 10
    seconds long. The Chinese text is not segmented into words, though
    there are occasional white spaces within some turns. HKUST Mandarin
    Telephone Transcript Data, Part 1 is distributed via web download.
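
    Because the transcripts are GBK-encoded, they need to be decoded
    explicitly on systems that default to another encoding. The following
    is a minimal sketch in Python, with hypothetical function names; it
    only converts bytes to text and makes no assumption about the layout
    of the transcript files themselves.

```python
def gbk_to_text(raw: bytes) -> str:
    """Decode GBK (CP-936) bytes into a Python string."""
    return raw.decode("gbk")


def gbk_to_utf8(raw: bytes) -> bytes:
    """Re-encode GBK bytes as UTF-8 for tools that expect Unicode input."""
    return gbk_to_text(raw).encode("utf-8")


# Illustrative bytes only; this is not text from the corpus.
sample_bytes = "你好 世界".encode("gbk")
sample_text = gbk_to_text(sample_bytes)

# Splitting on white space yields chunks, not words: the announcement
# notes that the text is not word-segmented.
chunks = sample_text.split()
```

    Any word-level processing would still require a separate Chinese word
    segmenter, which is outside the scope of this sketch.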

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573
    1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium      Phone: (215) 573-1275
    3600 Market Street              Fax: (215) 573-2175
    Suite 810                       ldc@ldc.upenn.edu
    Philadelphia, PA 19104          http://www.ldc.upenn.edu


