Re: [Corpora-List] Compiling an engineering (paper mill) corpus

From: Eric Atwell (eric@comp.leeds.ac.uk)
Date: Tue Apr 04 2006 - 11:27:38 MET DST


    Jaakko,

    You could use Google to track down websites of Web-as-Corpus practitioners,
    e.g. Marco Baroni, Delphine Bernhard, Adam Kilgarriff, Jan Pomikalek,
    Antoinette Renouf, Serge Sharoff etc ...

    ... and then use the methods and tools they advocate for this task.
    BootCaT etc. can cope with PDF, doc etc., as it uses Google (or Yahoo) to
    trawl the web, and these convert such files to text automatically.

    I have just set a student coursework exercise to collect a web-corpus on
    a specific domain, using their tools; so you could just use the
    Coursework Instructions as a "crib-sheet" telling you what to do:

    http://www.comp.leeds.ac.uk/eric/db32cw.doc

    The big challenge is to identify the websites which represent your
    domain. You could "manually" (e.g. using Google) identify some likely
    websites which you think relate to paper mill engineering, and then
    "mine" these. Or you could try to identify some key terminology
    specific to paper mill engineering, and then use BootCaT or similar
    to find other websites with these terms.
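
As a rough illustration of the second option, here is a minimal Python sketch of
how seed terms might be combined into search queries, in the general spirit of
seed-based tools such as BootCaT. It is not BootCaT's own code: the function
name build_queries, its parameters and the seed terms are all assumptions made
up for this example, and actually sending the queries to a search engine is
left out.

```python
import itertools
import random


def build_queries(seed_terms, tuple_size=3, n_queries=10, rng_seed=0):
    """Return up to n_queries distinct queries, each joining tuple_size seed terms.

    Hypothetical helper for illustration only; not part of any existing tool.
    """
    # De-duplicate and sort so the result is reproducible for a given rng_seed.
    terms = sorted(set(seed_terms))
    # Every unordered combination of tuple_size distinct terms.
    combos = list(itertools.combinations(terms, tuple_size))
    rng = random.Random(rng_seed)
    # Sample without replacement, so no query is repeated.
    chosen = rng.sample(combos, min(n_queries, len(combos)))
    # Quote each term so multi-word terms are searched as phrases.
    return [" ".join(f'"{term}"' for term in combo) for combo in chosen]


# Assumed seed terms; a real list would come from someone who knows the domain.
seeds = ["paper machine", "headbox", "calender", "pulp",
         "dryer section", "press felt"]
queries = build_queries(seeds, tuple_size=3, n_queries=5)
for query in queries:
    print(query)
```

Each query string could then be submitted to a search engine by hand or by
script, and the returned pages collected as candidate corpus documents.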

    Have fun! (my students did!)

    Eric Atwell, Leeds University

    On Tue, 4 Apr 2006, Jaakko Nyrölä wrote:

    > We'd like to compile a medium-sized corpus of texts related to engineering
    > and paper mills: their design, maintenance, installation, etc., and would
    > like to do so (mostly) automatically, by collecting relevant documents from
    > the Web.
    >
    > The corpus will then be used for the purpose of automatic mining of
    > terminology.
    >
    > We don't care in which format the documents are; html, pdf, doc, all should
    > be ok, as long as text can be extracted from them.
    >
    > Are there established methods for gathering such collections of documents
    > reasonably quickly and with not too much manual effort?
    >
    > Thanks,
    >
    > Jaakko
    >
    > --
    >
    > Jaakko Nyrölä
    > Student at the Helsinki University of Technology
    > jnyrola@cc.hut.fi

    -- 
    Eric Atwell, Senior Lecturer, Language research group, School of Computing,
    Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
    TEL: +44-113-3435430  FAX: +44-113-3435468  http://www.comp.leeds.ac.uk/eric
    


