Re: [Corpora-List] determining the correct character encoding

From: David Evans (devans@cs.columbia.edu)
Date: Mon Oct 10 2005 - 15:35:23 MET DST


    I've had fairly good success with jchardet, the Java port of Mozilla's
    chardet code: http://jchardet.sourceforge.net/
    See http://www.mozilla.org/projects/intl/chardet.html for the original
    C++ source code.
    You could also look at TextCat, a Perl implementation of an n-gram-based
    language guesser. You could train it on encodings, but it probably
    isn't nearly as effective for charset detection as the code above:
    http://odur.let.rug.nl/~vannoord/TextCat/
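
    For the narrow ISO-8859-1 vs. UTF-8 case raised in the question, a
    JDK-only heuristic may already be enough. The sketch below is not
    jchardet and not a general detector: it assumes the input is one of
    exactly those two encodings, tries a strict UTF-8 decode, and falls
    back to ISO-8859-1 if that fails. The class and method names
    (CharsetGuess, guess) are made up for illustration, and
    StandardCharsets needs Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: assumes the data is either UTF-8 or ISO-8859-1.
final class CharsetGuess {

    /** Returns UTF-8 if the bytes decode strictly as UTF-8, else ISO-8859-1. */
    static Charset guess(byte[] data) {
        try {
            // REPORT makes the decoder throw on invalid byte sequences
            // instead of silently substituting replacement characters.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8; under the stated assumption it is ISO-8859-1.
            return StandardCharsets.ISO_8859_1;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(utf8));   // prints "UTF-8"
        System.out.println(guess(latin1)); // prints "ISO-8859-1"
    }
}
```

    Pure ASCII input is reported as UTF-8, which is harmless since the
    bytes are identical in both encodings. Text in any other encoding
    (for example Windows-1252 or a multi-byte Asian charset) will be
    mislabelled, which is where a statistical detector is the better
    choice.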

    Hope that helps,

    dave

    Alexander Schutz wrote:

    > Dear List,
    >
    > I was wondering whether there exists a Java class that deals
    > adequately with determining the correct character encoding for a
    > given text.
    > Formerly I was using the shell tool "file" via a Perl system call in
    > order to identify the source encoding, which was the input for "iconv",
    > but ever since I switched to Java, character encodings have really been
    > bugging me. For instance, when I extract the body text of some websites
    > from the web, their character encoding may differ
    > (mainly between ISO-8859-1 and UTF-8). However, internally, I'd like
    > to deal with UTF-8 only, so I need a convenient way to transform from
    > ISO-8859-1 to UTF-8. The InputStreamReader class provides the means
    > for that undertaking, but I still need to specify the original charset.
    > For one thing, I could try to get the information from the HTML source
    > code, but this is not specified all the time. Now, in Java terms, is
    > there a way to know which charset a text uses by looking at the text only?
    > Has anybody encountered this kind of problem before? (anyone? maybe
    > the web-as-corpus guys?)
    > Anyway, your help would be very much appreciated,
    > thanks a million in advance,
    > Alex
    > --
    > Alexander Schutz
    > Student of Computational Linguistics
    > University of Saarland, Germany
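
    Once the source charset is known, the InputStreamReader route mentioned
    in the question could look roughly like the sketch below. It uses only
    JDK classes; the names Transcode and toUtf8 are hypothetical, and the
    byte[] overload is just a convenience wrapper for small inputs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for re-encoding text whose charset is already known.
final class Transcode {

    /** Reads 'in' as 'source' and writes the same characters to 'out' as UTF-8. */
    static void toUtf8(InputStream in, Charset source, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, source);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        // Flush so buffered characters reach 'out'; the caller owns the streams.
        writer.flush();
    }

    /** In-memory variant of the method above. */
    static byte[] toUtf8(byte[] data, Charset source) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            toUtf8(new ByteArrayInputStream(data), source, out);
        } catch (IOException e) {
            // Not expected for in-memory streams.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};  // "é" in ISO-8859-1
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // prints 2 (bytes 0xC3 0xA9)
    }
}
```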


