[Corpora-List] New Releases from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Tue Feb 08 2005 - 21:37:55 MET

    LDC2005S08
    *BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts*

    LDC2005T01
    *Chinese Treebank 5.0*

    LDC2005S07
    *Levantine Arabic QT Training Data Set 3 Speech*

    LDC2005T03
    *Levantine Arabic QT Training Data Set 3 Transcripts*

    The Linguistic Data Consortium (LDC) would like to announce the
    availability of four new corpora.

    ------------------------------------------------------------------------

    (1) BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S08>
    consists of transcribed, spontaneous speech, recorded from subjects
    speaking in Levantine colloquial Arabic. Levantine Arabic is the dialect
    of Arabic spoken by ordinary people in Lebanon, Jordan, Syria, and
    Palestine. It differs significantly from Modern Standard Arabic (MSA)
    in that it is a spoken rather than a written language, with different
    word pronunciations and even different words.

    The corpus would be useful for anyone attempting to do speech
    recognition in Levantine colloquial Arabic, including for speech
    translation and spoken dialog systems. BBN/AUB DARPA Babylon Levantine
    Arabic Speech and Transcripts is distributed on two DVD-ROMs.

    (2) Chinese Treebank 5.0
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T01>
    is a 500K word corpus of Chinese text with syntactic bracketing. The
    corpus contains 824K Hanzi, 18K sentences, and 890 data files. The data
    is drawn from three sources: Xinhua (1994-1998), Information Services
    Department of HKSAR (1997), and Sinorama magazine, Taiwan (1996-1998 &
    2000-2001).

    All files are GB encoded. Chinese Treebank 5.0 provides four versions of
    files: bracketed, raw, segmented and POS tagged. The raw, segmented and
    POS tagged versions are generated from the bracketed version and so do
    not reflect the previous annotation stages. Chinese Treebank 5.0 is
    distributed on one CD-ROM.
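
    As an illustration of how the raw, segmented and POS tagged versions
    relate to the bracketed one, here is a minimal Python sketch. It is
    not the LDC's own tooling. It assumes Penn-style "(TAG word)" leaves,
    empty categories tagged -NONE-, a word_TAG output convention, and that
    the gb18030 codec (a superset of GB2312 and GBK) can read the GB
    encoded files. All function names are hypothetical.

```python
import re

# Matches a leaf "(TAG word)": a tag followed by a token with no nested
# parentheses. Tokens that themselves contain ASCII parentheses are not
# handled by this simple pattern.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def decode_gb(raw: bytes) -> str:
    """Decode GB encoded bytes (assumption: gb18030 covers the files)."""
    return raw.decode("gb18030")


def leaves(bracketed: str):
    """Return (word, tag) pairs, skipping empty categories (-NONE-)."""
    return [(word, tag) for tag, word in LEAF.findall(bracketed)
            if tag != "-NONE-"]


def segmented(bracketed: str) -> str:
    """Words separated by spaces."""
    return " ".join(word for word, _ in leaves(bracketed))


def pos_tagged(bracketed: str) -> str:
    """Words joined to their tags as word_TAG (assumed convention)."""
    return " ".join(f"{word}_{tag}" for word, tag in leaves(bracketed))


def raw_text(bracketed: str) -> str:
    """Unsegmented text: the words concatenated without spaces."""
    return "".join(word for word, _ in leaves(bracketed))
```

    Because all three derived views are computed from the tree leaves
    alone, they necessarily agree with the final bracketing rather than
    with any earlier segmentation or tagging pass.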

    (3) Levantine Arabic QT Training Data Set 3 Speech
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005S07>
    contains 322 telephone conversations and totals about 50 hours of
    Levantine Arabic speech. Participants were instructed to speak on set
    topics. Unlike the previous training data corpora (Sets 1 and 2), whose
    speakers are nearly 100% Jordanian, this corpus is mostly Lebanese (72%)
    plus a combination of other Levantine speakers. Levantine Arabic QT
    Training Data Set 3 Speech is distributed on one DVD-ROM.

    (4) Levantine Arabic QT Training Data Set 3 Transcripts
    <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T03>
    contains the transcription for the Levantine Arabic QT Training Data Set
    3. There are 322 files in UTF-8 format. The corpus also contains a word
    list and speaker information files. Levantine Arabic QT Training Data
    Set 3 Transcripts is distributed on one CD-ROM.

    ------------------------------------------------------------------------

    If you need further information, or would like to inquire about
    membership in the LDC, please email ldc@ldc.upenn.edu or call
    +1 215 573 1275.

    --------------------------------------------------------------------

    Linguistic Data Consortium      Phone: (215) 573-1275
    3600 Market Street              Fax: (215) 573-2175
    Suite 810                       ldc@ldc.upenn.edu
    Philadelphia, PA 19104          http://www.ldc.upenn.edu


