[Corpora-List] New LDC Publications

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed Feb 01 2006 - 22:48:01 MET


LDC2006T02
Arabic Gigaword Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T02>

LDC2006S01
CSLU: Voices
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S01>

LDC2006T04
Multiple Translation Chinese (MTC) Part 4
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T04>

The Linguistic Data Consortium (LDC) is pleased to announce the
availability of three new publications.

------------------------------------------------------------------------

New LDC Publications

(1) Arabic Gigaword Second Edition
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T02>
is a comprehensive archive of newswire text data that has been
acquired from Arabic news sources by the Linguistic Data Consortium
(LDC). Arabic Gigaword Second Edition includes all of the content of
the first edition of Arabic Gigaword (LDC2003T12) as well as new data.

Arabic Gigaword contains five distinct sources of Arabic newswire:

Agence France Presse (afp_arb; formerly afa)

Al Hayat News Agency (hyt_arb; formerly alh)

An Nahar News Agency (nhr_arb; formerly ann)

Ummah Press (umh_arb)

Xinhua News Agency (xin_arb; formerly xia)

The seven-character codes in the parentheses above consist of a
three-character source ID and the three-character language code
("arb"), separated by an underscore ("_"). The language code "arb"
denotes Standard Arabic in the ISO 639-3 standard.
In the first edition of the Arabic Gigaword corpus, a simpler
three-character-code scheme was used to identify both the source and the
language. The new convention allows us to distinguish data sets by
source and language more naturally when a single newswire provider
distributes data in multiple languages.
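
As a rough illustration of the naming convention described above, the
following Python sketch splits a second-edition code into its source
and language parts and maps the first-edition codes listed above to
their new forms. The names parse_source_code, upgrade_legacy_code and
LEGACY_TO_NEW are hypothetical and are not part of the corpus or of any
LDC tool.

```python
import re
from typing import NamedTuple

# First-edition codes and their second-edition equivalents, as listed in
# this announcement. Ummah Press (umh_arb) is new and has no legacy code.
LEGACY_TO_NEW = {
    "afa": "afp_arb",
    "alh": "hyt_arb",
    "ann": "nhr_arb",
    "xia": "xin_arb",
}


class SourceCode(NamedTuple):
    source: str    # three-character source ID, e.g. "afp"
    language: str  # three-character language code, e.g. "arb"


# Assumption: both parts are exactly three lowercase ASCII letters.
_CODE_RE = re.compile(r"([a-z]{3})_([a-z]{3})")


def parse_source_code(code: str) -> SourceCode:
    """Split a code such as 'afp_arb' into its source and language parts."""
    match = _CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a source_language code: {code!r}")
    return SourceCode(match.group(1), match.group(2))


def upgrade_legacy_code(code: str) -> str:
    """Return the second-edition form of a first-edition code, if known."""
    return LEGACY_TO_NEW.get(code, code)


if __name__ == "__main__":
    for legacy in ("afa", "alh", "ann", "xia"):
        print(legacy, "->", parse_source_code(upgrade_legacy_code(legacy)))
```

Keeping the language code as a separate field is what lets the same
source ID be paired with a different language code when a provider
distributes data in more than one language.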

Ummah Press is a new source added to the Second Edition. The following
table shows the new data that appear for the first time in the Second
Edition.

Source                  Period             Documents
Agence France Presse    2003.01-2004.12       143766
Al Hayat News Agency    2002.01-2003.12        64308
An Nahar News Agency    2003.01-2004.01        16316
Ummah Press             2003.01-2004.12         4641
Xinhua News Agency      2003.06-2004.12       106236

There are 423 files, totaling approximately 1.4GB in compressed form
(5359 MB uncompressed and 1591983 K-words).

(2) The CSLU: Voices
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S01>
corpus contains 12 speakers reading 50 phonetically rich sentences. The
recording procedure involved a "mimicking" approach which resulted in a
high degree of natural time-alignment between different speakers. The
acoustic wave and the concurrent laryngograph signal were recorded for 1
"free" and 2 "mimicked" renditions of each sentence. Pitch marks,
calculated from the laryngograph signal, and time marks, the output of a
forced-alignment algorithm, have been added to the corpus.

(3) Multiple-Translation Chinese (MTC) Part 4
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T04>
supports the development of automatic means for evaluating translation
quality. The LDC was sponsored to solicit four sets of human
translations for a single set of Chinese source materials. The LDC was
also asked to produce translations from various commercial
off-the-shelf (COTS) systems, including commercial Machine Translation
(MT) systems as well as MT systems available on the Internet. There are
a total of five sets of COTS outputs and six output sets from TIDES
2003 MT Evaluation participants.

To see whether automatic evaluation metrics, such as BLEU, track human
assessment, the LDC has also performed human assessment on one COTS
output and the six TIDES research systems. The corpus includes the
assessment results for one of the five COTS systems, the assessment
results for the six TIDES research systems, and the specifications used
for conducting the assessments.
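
For readers unfamiliar with BLEU, the following Python sketch shows the
general idea of scoring one system output against several human
reference translations: clipped n-gram precision combined with a
brevity penalty. It is a simplified sentence-level version with no
smoothing, written for illustration only; it is not the scoring setup
used for this corpus, and the function names and toy sentences are
made up for the example. Tokenization is assumed to be whitespace
splitting.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU (uniform weights, no smoothing)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties: the shorter).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_avg)


if __name__ == "__main__":
    # Toy sentences, not drawn from the corpus.
    refs = ["the cat is on the mat".split(),
            "there is a cat on the mat".split()]
    print(bleu("the cat is on the mat".split(), refs))
    print(modified_precision("the the the the the the the".split(), refs, 1))
```

Standard BLEU is computed over a whole test set by pooling n-gram
counts across sentences, so per-sentence scores from a sketch like this
are only a rough guide.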

Multiple-Translation Chinese (MTC) Part 4 contains two sources of
journalistic Chinese text:

- Xinhua News Agency: 50 news stories
- AFP News Service: 50 news stories

There are 100 source files and 1,100 translation files. All source data
were drawn from LDC's January and February 2003 collection of Xinhua
News Chinese data and AFP Chinese data. The Chinese data contain
approximately 21K words, while the English translations contain
396K words in total and 16K unique words.

------------------------------------------------------------------------

If you need further information, or would like to inquire about
membership in the LDC, please email ldc@ldc.upenn.edu or call
+1 215 573 1275.

--------------------------------------------------------------------

Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc@ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu

