[Corpora-List] Compiling an engineering (paper mill) corpus

From: Jaakko Nyrölä (jnyrola@cc.hut.fi)
Date: Tue Apr 04 2006 - 09:30:24 MET DST


    We'd like to compile a medium-sized corpus of texts related to engineering
    and paper mills: their design, maintenance, installation, etc., and would
    like to do so (mostly) automatically, by collecting relevant documents
    from the Web.

    The corpus will then be used for automatic terminology mining.

    The document format does not matter: HTML, PDF, and DOC are all fine,
    as long as text can be extracted from them.

    Are there established methods for gathering such document collections
    reasonably quickly and without too much manual effort?

    Thanks,

    Jaakko

    --
    

    Jaakko Nyrölä
    Student at the Helsinki University of Technology
    jnyrola@cc.hut.fi


