Re: [Corpora-List] fast string replacement

From: Anssi Yli-Jyra (aylijyra@ling.Helsinki.FI)
Date: Fri Mar 11 2005 - 20:35:48 MET

Next message: Andrew Kehoe: "RE: [Corpora-List] Query about nomenclature"

Previous message: Normunds Gruzitis: "RE: [Corpora-List] Query about nomenclature"
In reply to: js@cis.uni-muenchen.de: "[Corpora-List] fast string replacement"
Next in thread: Jörg Schuster: "Re: [Corpora-List] fast string replacement"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, 11 Mar 2005 js@cis.uni-muenchen.de wrote:
> I am looking for a program that
> - takes as input a string (!) rewriting dictionary and a corpus
> - applies all rewriting rules to all lines of the corpus
> - is fast, stable and free
> - works under Linux

The fastest tool around is Lex, or its newer version Flex, which is
available on all Linux distributions. It takes a list of patterns with
their associated print statements and compiles them into a C/C++ program
that does the replacement between standard input and standard output.
When used carefully it can be almost as fast as the Unix word count
program (wc), so it is very fast.

Lex looks for the longest leftmost match and then runs the action of the
matching rule, where you can print a replacement string. All the rules
are matched in parallel, but you can also define several "states" that
select which subset of the rules is active.
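
The matching discipline described above can be sketched in Python. This
is not Lex, only an illustration of leftmost-longest rewriting. The
dictionary RULES and the function names are hypothetical, made up for
this example. Python's re alternation takes the first alternative that
matches, so the keys are sorted longest-first to imitate the
longest-match rule.

```python
import re

# Hypothetical rewriting dictionary (not from the original post).
RULES = {
    "colour": "color",
    "col": "column",
    "grey": "gray",
}

def compile_rules(rules):
    # Longest keys first, so that at any position the longest literal wins.
    keys = sorted(rules, key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in keys))

def rewrite(line, rules, pattern=None):
    if not rules:
        return line  # nothing to rewrite
    pattern = pattern or compile_rules(rules)
    # re.sub scans left to right and never rescans text it has replaced.
    return pattern.sub(lambda m: rules[m.group(0)], line)
```

For a whole corpus one would compile the pattern once and call rewrite()
on each line read from standard input.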

I would say that the best tool for matching many (>500) strings and long
(xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....) strings is
the Beta program, but I do not know how free it is. Lingsoft sells
commercial licenses. It is quite an old program, but it uses state
machines and packed transitions very efficiently and should be kept in
mind when considering such tools. I used it when Lex (or Flex) could not
compile its rules into automata. Typically the limit of Flex is somewhere
around 500 rules, after which the machine grows too big.
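
When the rule set is a plain list of literal strings, a simple state
machine can be built directly from the dictionary. The sketch below is
an assumption-laden illustration in Python, not the algorithm used by
Beta or Flex: it stores the keys in a trie (without packed transitions)
and rewrites with the leftmost-longest key at each position. All names
are hypothetical.

```python
def build_trie(rules):
    # Each node is a dict from character to child node.
    # The key None marks the end of a dictionary entry and holds its replacement.
    root = {}
    for source, target in rules.items():
        node = root
        for ch in source:
            node = node.setdefault(ch, {})
        node[None] = target
    return root

def rewrite_with_trie(text, root):
    out = []
    i, n = 0, len(text)
    while i < n:
        node = root
        best_end, best_target = -1, None
        j = i
        # Walk the trie as far as the text allows, remembering the longest hit.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                best_end, best_target = j, node[None]
        if best_end > i:
            out.append(best_target)  # replace the longest matching key
            i = best_end
        else:
            out.append(text[i])      # no key starts here, copy one character
            i += 1
    return "".join(out)
```

The trie grows with the total length of the keys, so it does not hit a
limit on the number of rules, but it handles only literal strings, not
patterns.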

If you want full transducers, try the RWTH FSA utilities. They are free
and very efficient.

-- A Yli-Jyrä

This archive was generated by hypermail 2b29 : Fri Mar 11 2005 - 21:19:17 MET