[Corpora-List] determining the correct character encoding

From: Alexander Schutz (goalscoringsuperstarhero@gmail.com)
Date: Mon Oct 10 2005 - 14:08:12 MET DST

Next message: Michael Hess: "[Corpora-List] PhD studentship in Computational Linguistics"

Previous message: uclegan@ucl.ac.uk: "[Corpora-List] Corpus of Advertising?"
Next in thread: David Evans: "Re: [Corpora-List] determining the correct character encoding"
Reply: David Evans: "Re: [Corpora-List] determining the correct character encoding"

Dear List,

I was wondering whether there is a Java class that deals adequately
with determining the correct character encoding of a given text.
Formerly I used the shell tool "file" via a Perl system call to
identify the source encoding, which I then passed to "iconv", but
ever since I switched to Java, character encodings have really been
bugging me. For instance, when I extract the body text of websites,
their character encodings may differ (mainly between ISO-8859-1 and
UTF-8). Internally, however, I'd like to deal with UTF-8 only, so I
need a convenient way to convert from ISO-8859-1 to UTF-8. The
InputStreamReader class provides the means for that, but I still
need to specify the original charset. For one thing, I could try to
get that information from the HTML source code, but it is not always
specified there. So, in Java terms, is there a way to know which
charset a text uses by looking at the text only?
Did anybody encounter this kind of problem before? (Anyone? Maybe
the web-as-corpus guys?)
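
To make the question concrete, here is a rough sketch of the kind of
heuristic I have in mind, not a tested solution. It assumes the input
can only be UTF-8 or ISO-8859-1, tries a strict UTF-8 decode, and falls
back to ISO-8859-1 if that fails. The class and method names
(CharsetGuesser, guess, toUtf8) are made up for illustration, and
StandardCharsets assumes Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: guesses between UTF-8 and ISO-8859-1 only.
public class CharsetGuesser {

    /** Returns UTF-8 if the bytes are well-formed UTF-8, else ISO-8859-1. */
    public static Charset guess(byte[] raw) {
        // REPORT makes the decoder throw on invalid byte sequences
        // instead of silently substituting replacement characters.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(raw));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
            return StandardCharsets.ISO_8859_1;
        }
    }

    /** Decodes with the guessed charset and re-encodes as UTF-8. */
    public static byte[] toUtf8(byte[] raw) {
        String text = new String(raw, guess(raw));
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

The obvious weakness is that every byte sequence is valid ISO-8859-1,
so the fallback can never fail, and pure ASCII input is reported as
UTF-8 (harmless, since the bytes are identical). It cannot tell
ISO-8859-1 apart from, say, windows-1252 or ISO-8859-15.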
Anyways, your help would be very much appreciated,
thanks a million in advance,
Alex

--
Alexander Schutz
Student of Computational Linguistics
University of Saarland, Germany

