I would just like to make a correction to the earlier post. You do not
need to be a member of the LDC to license the TDT and Broadcast News data.
A few LDC corpora that fit the bill include:
LDC94T5 ECI Multilingual Text
LDC94T4A UN Parallel Text (Complete)
LDC95T20 Hansard French/English
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001T58 TDT3 Multilanguage Text Version 2.0
LDC2004T08 Hong Kong Parallel Text (note: this one does require membership)
LDC2004T18 Arabic English Parallel News Part 1
Information on the above is available at:
http://www.ldc.upenn.edu/Catalog/ByYear.jsp
Best,
Ilya
pascale@cs.ust.hk wrote:
>Try TDT data and Broadcast News from the LDC. You must be an LDC member to
>license the corpora.
>
>However, be reminded that these "comparable" corpora still need to be
>topic aligned to make them really comparable as they contain both on-topic
>and off-topic documents (i.e. documents not on the same topic and
>therefore not comparable).
>
>Our paper on "Mining very non parallel corpora: Parallel sentence and
>lexicon extraction by bootstrapping and EM" (Fung & Cheung 2004) in EMNLP
>2004 describes our methodology and contains some useful references.
>
>Regards,
>Pascale
>
>
>>hi all,
>>
>>Are there multilingual comparable corpora suitable for research on
>>paraphrases? For instance, two collections of articles from different
>>sources describing the same events *and* in different languages.
>>
>>Any suggestions on how to build this kind of resource would be helpful
>>too.
>>
>>thank you,
>>Grazia
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium      Phone: (215) 573-1275
3600 Market Street              Fax: (215) 573-2175
Suite 810                       email: ldc@ldc.upenn.edu
Philadelphia, PA 19104          www: http://www.ldc.upenn.edu
This archive was generated by hypermail 2b29 : Wed Feb 02 2005 - 18:35:43 MET