[Corpora-List] Grep for Unicode (was: Grep for Windows)

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Sun Dec 17 2006 - 04:26:15 MET

Next message: Mvogo Kuna: "[Corpora-List] corpus d'ancien français / galloroman corpus"

Previous message: Andy Roberts: "Re: [Corpora-List] Grep for Windows"
In reply to: Rob Malouf: "Re: [Corpora-List] Grep for Windows"
Next in thread: Tony Abou-Assaleh: "Re: [Corpora-List] Grep for Unicode (was: Grep for Windows)"
Next in thread: Florian Leitner: "Re: [Corpora-List] Grep for Windows"
Reply: Tony Abou-Assaleh: "Re: [Corpora-List] Grep for Unicode (was: Grep for Windows)"
Reply: Rob Malouf: "Re: [Corpora-List] Grep for Unicode (was: Grep for Windows)"

Rob Malouf wrote:
> On Dec 15, 2006, at 8:36 AM, maxwell@ldc.upenn.edu wrote:
>> Besides, none of the standard grep implementations that I know of
>> handle Unicode (at least not in any useful way).
>
> Gnu grep 2.5.1 supports Unicode, though I guess it's debatable just how
> useful it is. The next version is supposed to be much better on that
> front.

I suspect this has been hashed over somewhere, and if so just point me
in the right direction. But I don't see the string 'unicode' (upper or
lower case) anywhere in the GNU grep 2.5.1 that I just downloaded, save
in the .po files (which are messages, and haven't been updated in a long
time anyway). I did google some Red Hat info on updates to grep, which
do speak about a Unicode issue (apparently an earlier version had an
extreme inefficiency in the way it searched UTF-8 streams). Since I
thought Linux distros usually came with the GNU tools, I'm a little puzzled.

Stepping back a bit: I can think of two ways one might want to use grep
with Unicode files.

One is to search for a particular byte sequence, and I presume grep can
do that.
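
To make the byte-sequence case concrete, here is a minimal Python sketch (not grep itself; the function name byte_grep and the sample strings are made up for illustration). It shows that a byte-level search only matches when the pattern and the file happen to share an encoding:

```python
def byte_grep(data: bytes, pattern: bytes) -> list[int]:
    """Return the byte offset of every occurrence of pattern in data."""
    offsets = []
    start = data.find(pattern)
    while start != -1:
        offsets.append(start)
        start = data.find(pattern, start + 1)
    return offsets


# The same characters ("un caf\u00e9 noir", with U+00E9 for e-acute)
# stored in two different Unicode encodings.
utf8_text = "un caf\u00e9 noir".encode("utf-8")
utf16_text = "un caf\u00e9 noir".encode("utf-16-le")

# The search pattern, encoded as UTF-8 bytes.
needle = "caf\u00e9".encode("utf-8")

print(byte_grep(utf8_text, needle))   # match at byte offset 3
print(byte_grep(utf16_text, needle))  # no match: same text, different bytes
```

The second search finds nothing even though the file contains the word, which is why the byte-level approach is of limited use unless the encoding is already known.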

The other is to search for a particular character sequence. For that,
two things seem to be necessary: grep needs to know the encoding of the
incoming stream (UTF-8, UTF-16 big-endian or little-endian, ...), and it
needs to handle normalization. (It also needs to know which encoding and
normalization form to use in the output.) I think the normalization
issue is doable, provided the encoding issue is correctly handled. But
there are numerous issues with determining the encoding of an input
stream, and I'm not knowledgeable enough to know whether, given a stream
of bytes that is known to be Unicode, it is always possible to tell
reliably which encoding it uses.
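
As a rough illustration of the character-sequence case, here is a hedged Python sketch (again not grep; decode_unicode and char_grep are hypothetical names). It assumes the encoding can be read from a byte-order mark when one is present, and otherwise has to be supplied by the caller; it does not attempt to guess the encoding of BOM-less data. Both the text and the pattern are normalized to NFC before comparison, using the standard unicodedata module:

```python
import codecs
import unicodedata

# Byte-order marks, with UTF-32 first: the UTF-32 LE mark begins with the
# same two bytes as the UTF-16 LE mark, so it has to be tested earlier.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_unicode(data: bytes, default: str = "utf-8") -> str:
    """Decode using the BOM if there is one, else the caller's default."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # Assumption: without a BOM we simply trust the caller's default.
    return data.decode(default)


def char_grep(data: bytes, pattern: str, default: str = "utf-8") -> list[str]:
    """Return the lines of data that contain pattern, compared in NFC."""
    needle = unicodedata.normalize("NFC", pattern)
    text = unicodedata.normalize("NFC", decode_unicode(data, default))
    return [line for line in text.splitlines() if needle in line]


# A UTF-16 LE file that spells e-acute as "e" + combining acute (U+0301).
decomposed = "un cafe\u0301 noir\nun the\u0301\n"
utf16_file = codecs.BOM_UTF16_LE + decomposed.encode("utf-16-le")

# The pattern uses the precomposed character U+00E9 instead.
matches = char_grep(utf16_file, "caf\u00e9")
print(matches)  # the first line matches once both sides are in NFC
```

Note that the matching lines come back in NFC rather than in the file's original form, so the question of what to do on output is still open in this sketch.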

At any rate, I don't see anything that tells me how GNU grep deals with
Unicode encodings and normalization. Am I missing something?

-- 
	Mike Maxwell
	maxwell@ldc.upenn.edu

