Re: [Corpora-List] language sort

From: Daniel Zeman (zeman@ufal.mff.cuni.cz)
Date: Wed Jan 10 2007 - 23:17:24 MET


    Oh, I see. I was thinking about a language recognizer that would not
    require you to open each file manually but would instead read the files
    specified on the command line (and then do something reasonable with
    them, like putting the language ID into their names, moving them to a
    directory, etc.). I do not know whether any off-the-shelf recognizers
    behave that way; however, some time ago I tried to write such a thing
    myself, and I have been assigning language recognition as a student
    exercise, too. I just have to check whether I have something that other
    people could use without my spending hours on adjusting and documenting
    it first. Stay tuned,

    Dan
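
For illustration only, a minimal sketch of the kind of command-line tool described above. It is not Dan's program: the stopword-counting method, the tiny word lists, the function names `guess_language` and `sort_files`, and the output directory `sorted` are all assumptions made up for this example. A serious recognizer would more likely use larger word lists or character n-gram profiles.

```python
import re
import shutil
import sys
from pathlib import Path

# Tiny hand-picked function-word lists (an assumption for illustration;
# real recognizers use much larger lists or character n-gram statistics).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for", "with", "this"},
    "es": {"el", "la", "los", "las", "y", "en", "que", "del", "por", "con", "una"},
    "fr": {"le", "les", "et", "des", "est", "dans", "pour", "une", "avec", "qui"},
    "pt": {"o", "os", "e", "do", "da", "em", "não", "uma", "com", "para", "que"},
}


def guess_language(text):
    """Return the language code whose stopwords occur most often in text."""
    # Runs of letters only (Unicode-aware), lowercased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def sort_files(paths, out_dir="sorted"):
    """Move each file into out_dir/<language>/ and return {path: language}."""
    placed = {}
    for name in paths:
        src = Path(name)
        # Assumes UTF-8; undecodable bytes are replaced rather than fatal.
        text = src.read_text(encoding="utf-8", errors="replace")
        lang = guess_language(text)
        target_dir = Path(out_dir) / lang
        target_dir.mkdir(parents=True, exist_ok=True)
        # Note: a file with the same name already in target_dir may be overwritten.
        shutil.move(str(src), str(target_dir / src.name))
        placed[name] = lang
    return placed


if __name__ == "__main__":
    # Every command-line argument is treated as a file to classify and move.
    for name, lang in sort_files(sys.argv[1:]).items():
        print(f"{name}\t{lang}")
```

Saved under a hypothetical name such as `langsort.py`, it would be run as `python langsort.py corpus/*.txt`. Because the guess is a majority count over the whole file, a Spanish text containing a few foreign names, titles, or addresses should still land in the Spanish directory.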

    Maria Esteva wrote:
    > Daniel
    >
    > I have tons and tons of files, so it would be very time-consuming for me
    > to load each file into the programme. I might just as well open each file
    > and read the content to recognize the language.
    >
    > I do have more than one language within one file, but I will deal with
    > that. Many files are in Spanish but have names, titles, addresses,
    > etc. in other languages. I guess that will not bother me as much.
    >
    > Any ideas?
    >
    > Maria
    >
    > At 03:07 PM 1/10/2007, you wrote:
    >> Maria,
    >>
    >> Why does a file-by-file approach not work for you? Does that mean that
    >> you potentially have more than one language within one file?
    >>
    >> Dan
    >>
    >> Maria Esteva wrote:
    >>> Dear all,
    >>>
    >>> I am wondering if somebody knows of a program that will recognize
    >>> and sort large sets of files according to language. For my text
    >>> mining project, I need to sort sets of files that contain electronic
    >>> texts mostly in Spanish and English (although there is some French
    >>> and some Portuguese as well). There are many free language
    >>> recognition programmes out there, but they work on a file-by-file
    >>> basis. Let me know if you have some advice.
    >>>
    >>> Thanks,
    >>>
    >>> Maria Esteva
    >>> PhD Candidate
    >>> School of Information
    >>> University of Texas at Austin


