Re: [Corpora-List] language sort

From: Eric Atwell (eric@comp.leeds.ac.uk)
Date: Wed Jan 10 2007 - 22:45:26 MET


    Maria,

    this is probably only a last resort if no one else comes up with a better
    solution: why not follow the Web-as-Corpus trend and use Google?

    Specifically: copy all your files to a website, say http://mysite.atu.edu
    ... then use Google Advanced search,
         Domain set to return results from your website http://mysite.atu.edu
         Language set to return results in Spanish
          ... (and then in English, then in French, then in Portuguese)...

    This should return the URLs of the Spanish texts the first time, then the
    English texts, French texts, Portuguese texts; then you need to download
    and collate the files from each Google search.
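
    As a rough sketch of the searches described above, the four queries
    could be generated as plain URLs rather than typed into the Advanced
    search form. This assumes Google's `site:` query operator and its `lr`
    (language restrict) URL parameter with values such as `lang_es`; the
    function names are hypothetical, and http://mysite.atu.edu is only the
    placeholder site from this message.

```python
from urllib.parse import urlencode

# Placeholder host taken from the message; replace with your own site.
SITE = "mysite.atu.edu"

# Assumed values for Google's `lr` (language restrict) URL parameter.
LANGUAGES = {
    "Spanish": "lang_es",
    "English": "lang_en",
    "French": "lang_fr",
    "Portuguese": "lang_pt",
}


def search_url(site, lang_code):
    """Build a Google search URL limited to one site and one language."""
    params = {"q": f"site:{site}", "lr": lang_code}
    return "https://www.google.com/search?" + urlencode(params)


def all_search_urls(site=SITE):
    """Return one search URL per language, keyed by language name."""
    return {name: search_url(site, code) for name, code in LANGUAGES.items()}


if __name__ == "__main__":
    for name, url in all_search_urls().items():
        print(f"{name}: {url}")
```

    Each URL can be opened in a browser. The results only cover pages that
    Google has already indexed, so files uploaded very recently may not
    appear yet.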

    Of course, it would be nice not to have to do all this using the Google
    interface, but instead using a web-as-corpus tool such as BootCat...

    Eric Atwell, Leeds University

    On Wed, 10 Jan 2007, Maria Esteva wrote:

    > Dear all,
    >
    > I am wondering if somebody knows of a program that will recognize and sort
    > large sets of files according to language. For my text mining project, I need
    > to sort sets of files that contain electronic texts mostly in Spanish and
    > English (although there is some French and some Portuguese as well). There are
    > many free language recognition programmes out there but they work on a
    > file-by-file basis. Let me know if you have some advice.
    >
    > Thanks,
    >
    > Maria Esteva
    > PhD Candidate
    > School of Information
    > University of Texas at Austin
    >

    Eric Atwell,
    Senior Lecturer, Language research group leader, School of Computing,
    Faculty of Engineering, University of Leeds, LEEDS LS2 9JT, England
    TEL: +44-113-3435430 FAX: +44-113-3435468 http://www.comp.leeds.ac.uk/eric


