[Corpora-List] New from the LDC

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Thu Sep 28 2006 - 19:10:47 MET DST


LDC2006S43
*Gulf Arabic Conversational Telephone Speech*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S43>

LDC2006T15
*Gulf Arabic Conversational Telephone Speech, Transcripts*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T15>

LDC2006T13
*Web 1T 5-gram Version 1*
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13>

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of three new publications.

------------------------------------------------------------------------

*New Publications*

(1) Gulf Arabic Conversational Telephone Speech
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S43>
contains recordings of 975 Gulf Arabic speakers taking part in spontaneous telephone
conversations in Colloquial Gulf Arabic. A total of 976 conversation
sides are provided (one speaker appears on two distinct calls). The
average duration per side is about 5.7 minutes. This corpus was
collected and transcribed in 2004 by Appen Pty Ltd. (Appen), Sydney,
Australia, working under a U.S. Government contract.

Each single-channel file represents just one side of a normal
conversation. The "devtest" set is a relatively balanced
(representative) sample drawn from the total pool of collected calls.
It was chosen through a test-set selection process applied by the
National Institute of Standards and Technology (NIST), using
demographic, phone and audit information provided by Appen.

(2) Gulf Arabic Conversational Telephone Speech, Transcripts
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T15>
contains transcripts of 975 Gulf Arabic speakers taking part in
spontaneous telephone conversations in Colloquial Gulf Arabic. A total
of 976 conversation sides are provided (one speaker appears on two
distinct calls). The data was collected and transcribed in 2004 by
Appen Pty Ltd., Sydney, Australia, working under a U.S. Government contract.

Each transcript file is a tab-delimited flat table, where each line
contains information and text for a single contiguous utterance,
presented via the following fields:

   1. beginning time stamp in seconds, in square brackets ("[5.7189]")
   2. ending time stamp in seconds, in square brackets
   3. channel/speaker-ID ("A:" or "B:")
   4. "consonant skeleton" orthography for the utterance, in UTF-8
   5. "diacritized" orthography for the utterance, in ASCII
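
The five fields above can be read with a few lines of Python. This is
only a minimal sketch based on the field list in this announcement: the
Utterance class, the parse_line function and the sample line are
hypothetical illustrations, not part of the corpus or its tools, and the
sample text is an invented placeholder rather than real corpus data.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    start: float       # beginning time stamp, in seconds
    end: float         # ending time stamp, in seconds
    speaker: str       # channel/speaker ID, "A" or "B"
    skeleton: str      # "consonant skeleton" orthography (UTF-8)
    diacritized: str   # "diacritized" orthography (ASCII)


def parse_line(line: str) -> Utterance:
    """Parse one tab-delimited transcript line into an Utterance.

    Assumes exactly five tab-separated fields, as listed in the
    announcement; anything else is reported as an error.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    start, end, channel, skeleton, diacritized = fields
    return Utterance(
        start=float(start.strip("[]")),   # "[5.7189]" -> 5.7189
        end=float(end.strip("[]")),
        speaker=channel.rstrip(":"),      # "A:" -> "A"
        skeleton=skeleton,
        diacritized=diacritized,
    )


# Invented example line; the text fields are placeholders only.
sample = "[5.7189]\t[7.0312]\tA:\t\u0645\u0631\u062d\u0628\u0627\tmarHabA\n"
utt = parse_line(sample)
print(utt.speaker, utt.start, utt.end)
```

When reading whole files, opening them with encoding="utf-8" should keep
the fourth field intact.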

(3) Web 1T 5-gram Version 1
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13>
contains English word n-grams and their observed frequency counts. The
length of the n-grams ranges from unigrams (single words) to five-grams.
This data will be useful for statistical language modeling, e.g., for
machine translation or speech recognition, as well as for other uses.
The n-gram counts were generated from approximately 1 trillion word
tokens of text from publicly accessible web pages.
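
As a rough illustration of how such counts feed a language model, the
sketch below computes an unsmoothed (maximum-likelihood) conditional
probability as count(w1..wn) / count(w1..wn-1). The function name
mle_prob, the in-memory dictionary and the toy counts are assumptions
made for this example; they are not taken from the data set, and
practical systems would normally add smoothing or backoff.

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]


def mle_prob(counts: Dict[NGram, int], ngram: NGram) -> float:
    """Unsmoothed estimate of P(last word | preceding words).

    Returns 0.0 when the history was never observed, because the
    ratio is undefined in that case.
    """
    if len(ngram) < 2:
        raise ValueError("need at least a bigram")
    history_count = counts.get(ngram[:-1], 0)
    if history_count == 0:
        return 0.0
    return counts.get(ngram, 0) / history_count


# Toy counts, invented for illustration only.
toy_counts: Dict[NGram, int] = {
    ("new",): 1000,
    ("new", "york"): 250,
}

p = mle_prob(toy_counts, ("new", "york"))
print(p)  # 250 / 1000
```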

The input encoding of documents was automatically detected, and all text
was converted to UTF-8. The data was tokenized in a manner similar to
the tokenization of the Wall Street Journal portion of the Penn
Treebank. Notable exceptions include the following:

    * Hyphenated words are usually separated, and hyphenated numbers
      usually form one token.
    * Sequences of numbers separated by slashes (e.g. in dates) form one
      token.
    * Sequences that look like URLs or email addresses form one token.
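
The exceptions listed above can be approximated with a small regular
expression tokenizer. This is a hedged sketch, not the tokenizer used to
build the corpus: the patterns, the name tokenize and the choice to keep
the hyphen as its own token when a hyphenated word is split are all
assumptions made for illustration.

```python
import re
from typing import List

# Alternatives are tried left to right, so the "one token" cases
# come before the generic word and punctuation patterns.
TOKEN_RE = re.compile(
    r"""
      (?:https?://|www\.)\S+          # looks like a URL
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # looks like an email address
    | \d+(?:/\d+)+                    # numbers separated by slashes
    | \d+(?:-\d+)+                    # hyphenated numbers
    | \w+                             # ordinary word or number
    | [^\w\s]                         # any other single symbol
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> List[str]:
    """Split text into tokens following the assumed rules above."""
    return TOKEN_RE.findall(text)


tokens = tokenize(
    "Call 555-1234 on 9/28/2006 about well-known data "
    "at http://www.ldc.upenn.edu or ldc@ldc.upenn.edu"
)
print(tokens)
```

In this sketch "555-1234" and "9/28/2006" each stay whole, while
"well-known" is split into "well", "-" and "known".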

------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc@ldc.upenn.edu or call +1 215 573
1275.

--------------------------------------------------------------------

Linguistic Data Consortium              Phone: (215) 573-1275
University of Pennsylvania              Fax: (215) 573-2175
3600 Market St., Suite 810              ldc@ldc.upenn.edu
Philadelphia, PA 19104 USA              http://www.ldc.upenn.edu

