Re: [Corpora-List] Grep for Unicode (was: Grep for Windows)

From: Rob Malouf (rmalouf@mail.sdsu.edu)
Date: Sun Dec 17 2006 - 16:42:53 MET

Next message: Ajith Abraham: "[Corpora-List] IAS'07 - the First Call for Papers"

Previous message: Tony Abou-Assaleh: "Re: [Corpora-List] Grep for Unicode (was: Grep for Windows)"
In reply to: Mike Maxwell: "[Corpora-List] Grep for Unicode (was: Grep for Windows)"
Next in thread: Brett Powley: "Re: [Corpora-List] Grep for Unicode (was: Grep for Windows)"
Next in thread: Trond Trosterud: "Re: [Corpora-List] Grep for Windows"
Next in thread: Florian Leitner: "Re: [Corpora-List] Grep for Windows"
Reply: Brett Powley: "Re: [Corpora-List] Grep for Unicode (was: Grep for Windows)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

On Dec 16, 2006, at 7:26 PM, Mike Maxwell wrote:
>> Gnu grep 2.5.1 supports Unicode, though I guess it's debatable
>> just how useful it is. The next version is supposed to be much
>> better on that front.
>
> I suspect this has been hashed over somewhere, and if so just point
> me in the right direction. But I don't see the string
> 'unicode' (upper or lower case) anywhere in the Gnu grep 2.5.1 that
> I just downloaded, save in the .po files (which are messages, and
> haven't been updated in a long time anyway).

It doesn't do anything special with Unicode itself, but if the locale
is set to a multibyte encoding it uses the wide character support
routines in libc. So, for example, if the LANG environment variable
is set to en_US.utf8, it treats the input as UTF-8. It works, in the
sense that "." matches a single character rather than a single byte,
the character classes like "[:alpha:]" and "[:lower:]" are handled
correctly, and so on, but it's not as flexible as one might like.
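
To make the byte-versus-character distinction concrete, here is a
minimal sketch in Python. It uses Python's re module rather than grep,
so it only illustrates the idea and says nothing about how grep is
implemented; the sample string and the variable names are my own.

```python
import re

text = "aéb"                  # 'é' is one character but two bytes in UTF-8
raw = text.encode("utf-8")    # b'a\xc3\xa9b'

# Byte-oriented matching: "." consumes exactly one byte, so "a.b" fails,
# because two bytes sit between 'a' and 'b'.
byte_match = re.search(rb"a.b", raw)

# Character-oriented matching: "." consumes one character, so "a.b" matches.
char_match = re.search(r"a.b", text)

# A rough analogue of [[:alpha:]]: str.isalpha() knows that 'é' is a
# letter, while bytes.isalpha() only recognises ASCII letters.
alpha_chars = [c for c in text if c.isalpha()]
alpha_bytes = [b for b in raw if bytes([b]).isalpha()]

print(byte_match)             # no match on the raw bytes
print(char_match.group())     # the whole three-character string
print(alpha_chars)            # all three characters count as letters
print(len(alpha_bytes))       # only the two ASCII bytes count as letters
```

The same pattern gives different answers depending on whether the
input is treated as bytes or as decoded characters, which is the
difference the locale setting makes for grep.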

> I did google some Red Hat info on updates to grep, which do speak
> about a Unicode issue (apparently an earlier version had an extreme
> inefficiency in the way it searched UTF-8 streams).

Using mbstowcs and co. is much, much slower than grep's internal byte
matching, which makes grep something like 100 times slower if the
locale is set to use wide characters. I just tried this on a machine
running Fedora Core 5:

bulba% egrep -V
egrep (GNU grep) 2.5.1

Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

bulba% export LANG=en_US.iso8859-1
bulba% time egrep "a[^a]+b" nyt9601.txt >/dev/null

real 0m0.068s
user 0m0.062s
sys 0m0.006s
bulba% export LANG=en_US.utf8
bulba% time egrep "a[^a]+b" nyt9601.txt >/dev/null

real 0m2.695s
user 0m2.688s
sys 0m0.007s

On this file that is roughly a 40-fold slowdown (2.695s versus 0.068s).
This is supposed to be fixed in the next version.

> The other is to search for a particular character sequence. For
> that, two things seem to be necessary: it needs to know the
> encoding of the incoming stream (UTF-8, UTF-16 big-end/little-
> end,...), and it needs to handle normalization.

It doesn't really do either of these, unfortunately. It gets the
encoding from the locale, not the input file, and as far as I know it
doesn't do any normalization at all. As I say, it's debatable just
how useful it is.
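
For what it's worth, both steps are easy to sketch outside grep. The
Python below is only an illustration under my own assumptions: the
helper names decode_with_bom and normalized_find are hypothetical, the
encoding is guessed from a leading byte-order mark only (a file without
one falls back to a default), and the search is a literal substring
test rather than a regular expression.

```python
import codecs
import unicodedata

# Byte-order marks, with UTF-32 checked before UTF-16 because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(data, default="utf-8"):
    """Hypothetical helper: pick the encoding from a leading BOM, else use default."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name)
    return data.decode(default)

def normalized_find(pattern, text, form="NFC"):
    """Hypothetical helper: literal search after normalizing both sides."""
    return unicodedata.normalize(form, pattern) in unicodedata.normalize(form, text)

composed = "caf\u00e9"       # 'é' as a single code point
decomposed = "cafe\u0301"    # 'e' followed by a combining acute accent

# Simulated UTF-16 little-endian file contents, starting with a BOM.
utf16_file = codecs.BOM_UTF16_LE + decomposed.encode("utf-16-le")
text = decode_with_bom(utf16_file)

print(composed in text)                 # plain search, no normalization
print(normalized_find(composed, text))  # search after NFC normalization
```

The plain search misses the word because the two spellings of "é" are
different code point sequences; normalizing both the pattern and the
text to the same form first makes them comparable.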

---
Rob Malouf <rmalouf@mail.sdsu.edu>
Department of Linguistics and Asian/Middle Eastern Languages
San Diego State University


This archive was generated by hypermail 2b29 : Sun Dec 17 2006 - 16:41:01 MET