[Corpora-List] Answers to domain corpora request

From: Carlos Rodriguez (crodriguezp@gmail.com)
Date: Fri Apr 01 2005 - 18:12:19 MET DST


    Thanks to everyone who answered my request for open-source domain corpora.
    Leonel Ruiz and Stella Tagnin pointed me to corpora in Spanish and
    Brazilian Portuguese. For English, Ylva Berglund mentioned OPUS (an open
    source parallel corpus). On the text mining front, large textual
    collections of biomedical full-text articles are now available, as
    pointed out by Paul Buitelaar (http://muchmore.dfki.de/resources1.htm)
    and Kevin Cohen (http://www.biomedcentral.com/info/about/datamining/
    [8,000-plus articles in XML]), among other data collections. Also, the
    Linux Documentation Project provides a fairly large, typologically
    homogeneous collection.
    Unfortunately, large textual collections from other disciplines are more
    difficult to obtain in downloadable form. I am now compiling a
    300-article collection from Sociology journals, in case anyone is also
    interested in cross-genre comparisons and lexical acquisition.

    Carlos Rodríguez
    National Autonomous University, Mexico


