[Corpora-List] determining the correct character encoding

From: Alexander Schutz (goalscoringsuperstarhero@gmail.com)
Date: Mon Oct 10 2005 - 14:08:12 MET DST


    Dear List,

    I was wondering whether there is a Java class that can reliably
    determine the character encoding of a given text.
    Previously I called the shell tool "file" from Perl via a system call to
    identify the source encoding, which I then passed to "iconv", but
    ever since I switched to Java, character encodings have really been
    bugging me. For instance, when I extract the body text of websites,
    their character encodings may differ (mainly between ISO-8859-1 and
    UTF-8). Internally, however, I'd like to deal with UTF-8 only, so I need
    a convenient way to convert from ISO-8859-1 to UTF-8. The
    InputStreamReader class provides the means for that, but I still need
    to specify the original charset. For one thing, I could try to get that
    information from the HTML source code, but it is not always specified
    there. So, in Java terms, is there a way to tell which charset a text
    uses by looking at the text alone?
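
To make the question concrete, here is a minimal sketch of one heuristic that looks at the bytes only. It assumes the input is either UTF-8 or ISO-8859-1 and nothing else: it tries a strict UTF-8 decode and falls back to ISO-8859-1 when the bytes are not well-formed UTF-8. The class name CharsetGuess and its methods are hypothetical helpers made up for illustration, not part of any library, and the approach is a guess rather than a reliable detector (a short ISO-8859-1 text can happen to be valid UTF-8, and pure ASCII is reported as UTF-8, which is harmless since the bytes are identical).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not an existing library class.
class CharsetGuess {

    // Returns true if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Assumption: the text is either UTF-8 or ISO-8859-1.
    // Every byte sequence is valid ISO-8859-1, so the fallback never fails.
    static Charset guess(byte[] data) {
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }

    // Decodes the bytes to a Java String using the guessed charset.
    static String decode(byte[] data) {
        return new String(data, guess(data));
    }

    public static void main(String[] args) {
        String sample = "Saarbr\u00fccken";
        byte[] latin1 = sample.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = sample.getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(latin1) + " -> " + decode(latin1));
        System.out.println(guess(utf8) + " -> " + decode(utf8));
    }
}
```

The guessed Charset could also be passed to the InputStreamReader(InputStream, Charset) constructor, and the decoded String written back out with getBytes(StandardCharsets.UTF_8) to obtain UTF-8 throughout. A charset declared in the HTTP headers or the HTML source, when present, would presumably be a better first choice than this guess.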
    Has anybody encountered this kind of problem before? (Anyone? Maybe
    the web-as-corpus guys?)
    Anyways, your help would be very much appreciated,
    thanks a million in advance,
    Alex

    --
    Alexander Schutz
    Student of Computational Linguistics
    University of Saarland, Germany
    


