Re: [Corpora-List] multilingual comparable corpora

From: pascale@cs.ust.hk
Date: Wed Feb 02 2005 - 17:27:46 MET


    Try TDT data and Broadcast News from the LDC. You must be an LDC member to
    license the corpora.

    However, bear in mind that these "comparable" corpora still need to be
    topic-aligned before they are really comparable, since they contain both
    on-topic and off-topic documents (i.e. documents that are not on the same
    topic and are therefore not comparable).
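
    As a rough illustration of what topic alignment can look like, here is a
    minimal sketch that maps source-language documents into the target
    language through a small seed lexicon and keeps only the document pairs
    whose bag-of-words cosine similarity passes a threshold. The lexicon
    entries, the function names and the 0.3 threshold are all made up for
    the example; this is not the procedure from any particular paper.

```python
import math
from collections import Counter

# Hypothetical toy seed lexicon (source word -> target word), for illustration only.
SEED_LEXICON = {
    "terremoto": "earthquake",
    "vittime": "victims",
    "governo": "government",
    "elezioni": "election",
}


def bag_of_words(text):
    """Lowercase, whitespace-tokenised term counts."""
    return Counter(text.lower().split())


def translate_bag(bag, lexicon):
    """Map source-language counts to target-language words; drop unknown words."""
    translated = Counter()
    for word, count in bag.items():
        if word in lexicon:
            translated[lexicon[word]] += count
    return translated


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_align(src_docs, tgt_docs, lexicon, threshold=0.3):
    """Return (src_index, tgt_index, score) for pairs that look on-topic."""
    pairs = []
    for i, src in enumerate(src_docs):
        src_vec = translate_bag(bag_of_words(src), lexicon)
        for j, tgt in enumerate(tgt_docs):
            score = cosine(src_vec, bag_of_words(tgt))
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


src_docs = ["terremoto vittime", "elezioni governo"]
tgt_docs = ["earthquake victims rescued", "government wins election"]
aligned = topic_align(src_docs, tgt_docs, SEED_LEXICON)
```

    With real data you would want proper tokenisation, term weighting such
    as TF-IDF, and a much larger lexicon; the threshold has to be tuned on
    the corpus at hand.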

    Our paper "Mining very non parallel corpora: Parallel sentence and
    lexicon extraction by bootstrapping and EM" (Fung & Cheung 2004) in EMNLP
    2004 describes our methodology and contains some useful references.

    Regards,
    Pascale
    >
    >
    > hi all,
    >
    > are there multilingual comparable corpora suitable for research on
    > paraphrases?
    > for instance, two collections of articles from different sources
    > describing the same events *and* in different languages.
    >
    > Any suggestions on how to build this kind of resource would be helpful
    > too.
    >
    > thank you,
    > Grazia
    >
    >


