Re: [Corpora-List] language-specific harvesting of texts from the Web

From: Marco Baroni (baroni@einstein.sslmit.unibo.it)
Date: Tue Aug 31 2004 - 18:51:46 MET DST

Next message: Kevin Patrick Scannell: "Re: [Corpora-List] language-specific harvesting of texts from the Web"

Previous message: Gloria : "[Corpora-List] Searching BNC for adverbs followed by verb"
In reply to: Mike Maxwell: "Re: [Corpora-List] language-specific harvesting of texts from the Web"
Next in thread: Mike Maxwell: "Re: [Corpora-List] language-specific harvesting of texts from the Web"
Next in thread: Kevin Patrick Scannell: "Re: [Corpora-List] language-specific harvesting of texts from the Web"
Reply: Mike Maxwell: "Re: [Corpora-List] language-specific harvesting of texts from the Web"
Reply: Stuart A Yeates: "Re: [Corpora-List] language-specific harvesting of texts from the Web"
Reply: Mª Belén Díez Bedmar: "[Corpora-List] looking for first names..."

> One situation where your approach may not work so well, is when a
> language's websites use multiple character encodings. Unfortunately,
> this is quite common in languages that have non-Roman writing systems,

At least for Japanese, we got around this problem in our web-mining
scripts by looking for the charset declaration in the HTML code of each
page and then converting the page (inside the script) from that charset
to UTF-8.
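
A minimal sketch of this approach in Python, for illustration only: it is
not the original script, and the names CHARSET_RE, declared_charset and
to_utf8 are hypothetical. It assumes the page is already in memory as raw
bytes and that the declared encoding is ASCII-compatible for the markup
itself (true of EUC-JP, Shift_JIS, ISO-2022-JP and UTF-8, but not UTF-16).

```python
import re

# Matches the charset label in either <meta charset="..."> or
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
# The pattern runs on raw bytes because the encoding is not yet known.
CHARSET_RE = re.compile(
    rb"""charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def declared_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared near the top of the page, else `default`."""
    # Only the first 4096 bytes are searched (an arbitrary cutoff).
    match = CHARSET_RE.search(raw[:4096])
    return match.group(1).decode("ascii") if match else default


def to_utf8(raw: bytes) -> bytes:
    """Decode the page using its declared charset and re-encode it as UTF-8."""
    charset = declared_charset(raw)
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # The page declared a charset name that Python does not know.
        text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")
```

Pages that declare the wrong charset, or none at all, would still need a
fallback such as statistical encoding detection; the sketch simply assumes
UTF-8 in that case.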

I would be interested in hearing about other ways to deal with multiple
encodings.

Btw: I thought Japanese was tough (as you can find EUC-JP, Shift_JIS,
UTF-8 and ISO-2022-JP), but the situation you describe for Hindi sounds
like a true encoding nightmare!

> I gave a talk at the ALLC/ACH meeting in June on our search technique,
> including its pros and cons. The abstract was published, but not the
> full paper. I suppose I should post it somewhere...

Please do!

Regards,

Marco

-- 
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni


This archive was generated by hypermail 2b29 : Tue Aug 31 2004 - 19:06:21 MET DST