Re: [Corpora-List] multilingual comparable corpora

From: Linguistic Data Consortium (ldc@ldc.upenn.edu)
Date: Wed Feb 02 2005 - 18:17:49 MET

    I would just like to make a correction to the earlier post. You do not
    need to be a member of the LDC to license the TDT and Broadcast News data.

    A few LDC corpora that fit the bill include:

    LDC94T5 ECI Multilingual Text
    LDC94T4A UN Parallel Text (Complete)
    LDC95T20 Hansard French/English
    LDC2001T57 TDT2 Multilanguage Text Version 4.0
    LDC2001T58 TDT3 Multilanguage Text Version 2.0
    LDC2004T08 Hong Kong Parallel Text (note: this one does require membership)
    LDC2004T18 Arabic English Parallel News Part 1

    Information on the above is available at:

    http://www.ldc.upenn.edu/Catalog/ByYear.jsp

    Best,

    Ilya

    pascale@cs.ust.hk wrote:

    >Try TDT data and Broadcast News from the LDC. You must be an LDC member to
    >license the corpora.
    >
    >However, be reminded that these "comparable" corpora still need to be
    >topic aligned to make them really comparable as they contain both on-topic
    >and off-topic documents (i.e. documents not on the same topic and
    >therefore not comparable).
    >
    >Our paper on "Mining very non parallel corpora: Parallel sentence and
    >lexicon extraction by bootstrapping and EM" (Fung & Cheung 2004) in EMNLP
    >2004 describes our methodology and contains some useful references.
    >
    >Regards,
    >Pascale
    >
    >
    >>Hi all,
    >>
    >>Are there multilingual comparable corpora suitable for research on
    >>paraphrases? For instance, two collections of articles from different
    >>sources describing the same events *and* in different languages.
    >>
    >>Any suggestions on how to build this kind of resource would be helpful
    >>too.
    >>
    >>Thank you,
    >>Grazia

    -- 
    

    Ilya Ahtaridis
    Membership Coordinator
    --------------------------------------------------------------------
    Linguistic Data Consortium     Phone: (215) 573-1275
    3600 Market Street             Fax:   (215) 573-2175
    Suite 810                      email: ldc@ldc.upenn.edu
    Philadelphia, PA 19104         www:   http://www.ldc.upenn.edu


