Re: [Corpora-List] multilingual comparable corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed Feb 02 2005 - 18:17:49 MET

Next message: Valia Kordoni: "[Corpora-List] [sem] Last CfP: Int. Conf. on Head-Driven Phrase Structure Grammar"

Previous message: Sylviane Granger: "[Corpora-List] Phraseology conference: second call for papers and registration"
In reply to: pascale@cs.ust.hk: "Re: [Corpora-List] multilingual comparable corpora"
Next in thread: Grazia Russo-Lassner: "[Corpora-List] English parser in java"

I would just like to make a correction to the earlier post. You do not
need to be a member of the LDC to license the TDT and Broadcast News data.

A few LDC corpora that fit the bill include:

LDC94T5 ECI Multilingual Text
LDC94T4A UN Parallel Text (Complete)
LDC95T20 Hansard French/English
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001T58 TDT3 Multilanguage Text Version 2.0
LDC2004T08 Hong Kong Parallel Text (note: this one does require membership)
LDC2004T18 Arabic English Parallel News Part 1

Information on the above is available at:

http://www.ldc.upenn.edu/Catalog/ByYear.jsp

Best,

Ilya

pascale@cs.ust.hk wrote:

>Try TDT data and Broadcast News from the LDC. You must be an LDC member to
>license the corpora.
>
>However, be reminded that these "comparable" corpora still need to be
>topic aligned to make them really comparable as they contain both on-topic
>and off-topic documents (i.e. documents not on the same topic and
>therefore not comparable).
>
>Our paper on "Mining very non parallel corpora: Parallel sentence and
>lexicon extraction via bootstrapping and EM" (Fung & Cheung 2004) in EMNLP
>2004 describes our methodology and contains some useful references.
>
>Regards,
>Pascale
>
>
>>Hi all,
>>
>>Are there multilingual comparable corpora suitable for research on
>>paraphrases? For instance, two collections of articles from different
>>sources describing the same events *and* in different languages.
>>
>>Any suggestions on how to build this kind of resource would be helpful
>>too.
>>
>>Thank you,
>>Grazia
>>

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium       Phone: (215) 573-1275
3600 Market Street               Fax: (215) 573-2175
Suite 810                        email: ldc@ldc.upenn.edu
Philadelphia, PA 19104           www: http://www.ldc.upenn.edu


This archive was generated by hypermail 2b29 : Wed Feb 02 2005 - 18:35:43 MET