> One situation where your approach may not work so well, is when a
> language's websites use multiple character encodings. Unfortunately,
> this is quite common in languages that have non-Roman writing systems,
At least for Japanese, our way around this problem in our web-mining
scripts was to look for the charset declaration in the HTML code of
each page, and then to convert the page (inside the script) from that
charset to UTF-8.
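
A minimal Python sketch of that approach, not the original script: the function names, the 4096-byte search window and the fallback to UTF-8 are assumptions made for illustration. It only looks at the charset declared in a <meta> tag and trusts it, so a page with a missing or wrong declaration will still come out garbled.

```python
import re

# Matches a charset declaration inside a <meta> tag, e.g.
# <meta http-equiv="Content-Type" content="text/html; charset=euc-jp">
# or <meta charset="utf-8">. Works on raw bytes, since the encoding
# is not yet known at this point.
CHARSET_RE = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def declared_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in the page's <meta> tag, or `default`."""
    # Assumption: the declaration sits near the top of the page.
    match = CHARSET_RE.search(raw[:4096])
    return match.group(1).decode("ascii") if match else default


def to_utf8(raw: bytes) -> bytes:
    """Decode the page from its declared charset and re-encode as UTF-8."""
    charset = declared_charset(raw)
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # The declared name is not a codec Python knows; fall back to UTF-8.
        text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")
```

Calling to_utf8() on the raw bytes of each downloaded page before any further processing gives the rest of the script a single encoding to deal with.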
I would be interested in hearing about other ways to deal with multiple
encodings.
Btw: I thought Japanese was tough (you can find EUC-JP, Shift_JIS, UTF-8
and ISO-2022-JP), but the situation you describe for Hindi sounds like a
true encoding nightmare!
> I gave a talk at the ALLC/ACH meeting in June on our search technique,
> including its pros and cons. The abstract was published, but not the
> full paper. I suppose I should post it somewhere...
Please do!
Regards,
Marco
--
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni
This archive was generated by hypermail 2b29 : Tue Aug 31 2004 - 19:06:21 MET DST