Re: [Corpora-List] Corpora for EAP: Architecture...?

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Mon Jan 16 2006 - 14:18:40 MET

    Hi Eric.

    For smallish specialized corpora, I suppose the following Python-based
    solution would work, and it probably would not take more than one day to
    implement...

    - Write a script that generates random combinations of potentially
    relevant terms from a seed list

    - Use a Python module to retrieve web pages from Google via the API, e.g.:
    http://pygoogle.sourceforge.net/, using each of the random combinations as
    a query string

    - Use the Python BTE module (http://www.smi.ucd.ie/hyppia/) to clean the
    pages you retrieve (it's slower than our Perl implementation, but for small
    corpora that should not be a problem).

    - Use NLTK or other Python/Java tools to process the corpus constructed
    in this way
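
    To make the first step concrete, here is a minimal sketch of the
    query-generation script, using only the standard library. The function
    name random_queries, its parameters, and the architecture seed terms are
    all made up for illustration; sending the queries to a search API and
    cleaning the pages is left out.

```python
import math
import random


def random_queries(terms, n_queries=10, terms_per_query=3, seed=None):
    """Return n_queries distinct query strings.

    Each query joins terms_per_query terms drawn without replacement
    from the seed list. All names here are hypothetical.
    """
    rng = random.Random(seed)
    unique_terms = sorted(set(terms))  # drop duplicate seed terms
    if terms_per_query > len(unique_terms):
        raise ValueError("not enough distinct terms for one query")
    # There are only C(n, k) distinct combinations; refuse impossible requests
    # so the sampling loop below is guaranteed to terminate.
    if n_queries > math.comb(len(unique_terms), terms_per_query):
        raise ValueError("more queries requested than distinct combinations")

    seen = set()
    queries = []
    while len(queries) < n_queries:
        # Sort each sample so the same set of terms is recognised as a
        # duplicate regardless of the order in which it was drawn.
        combo = tuple(sorted(rng.sample(unique_terms, terms_per_query)))
        if combo not in seen:
            seen.add(combo)
            queries.append(" ".join(combo))
    return queries


if __name__ == "__main__":
    # Illustrative seed terms only, not a vetted domain word list.
    seed_terms = ["facade", "cantilever", "atrium", "masonry",
                  "vault", "cladding", "truss", "buttress"]
    for query in random_queries(seed_terms, n_queries=5, seed=42):
        print(query)
```

    Each string returned can then be passed as a query string to whatever
    search module you use in the second step.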

    Regards,

    Marco


