Re: [Corpora-List] fast string replacement

From: Anssi Yli-Jyra (aylijyra@ling.Helsinki.FI)
Date: Fri Mar 11 2005 - 20:35:48 MET

    On Fri, 11 Mar 2005 js@cis.uni-muenchen.de wrote:
    > I am looking for a program that
    > - takes as input a string (!) rewriting dictionary and a corpus
    > - applies all rewriting rules to all lines of the corpus
    > - is fast, stable and free
    > - works under Linux

    The fastest tool around is Lex, or its newer version Flex, which is
    available on all Linux distributions. It takes a list of patterns with
    their associated print statements and compiles them into a C/C++
    program that does the rewriting between standard input and standard
    output. When used carefully, it can be almost as fast as the Unix word
    count program (wc), so it is very fast.

    Lex looks for the leftmost longest match and then runs the action of
    the matching rule, where you can print a replacement string. All the
    rules are matched in parallel, but you can also define several "states"
    that indicate which subsets of the rules are active.
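
    The matching behaviour described above can be sketched in a few lines
    of Python. This is only an illustration of leftmost-longest dictionary
    rewriting, not Flex itself and nowhere near as fast: the function name
    rewrite and the sample RULES table are hypothetical, and the rules are
    plain strings rather than patterns.

```python
def rewrite(line, rules):
    """Rewrite one line using leftmost-longest matching against a string dictionary."""
    if not rules:
        return line
    max_len = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(line):
        # Try the longest candidate first, so the longest rule wins at this position.
        for n in range(min(max_len, len(line) - i), 0, -1):
            chunk = line[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n  # resume after the match; replaced text is not rescanned
                break
        else:
            # No rule matches here: copy one character unchanged,
            # which mirrors Lex's default rule of echoing unmatched input.
            out.append(line[i])
            i += 1
    return "".join(out)


# Hypothetical sample dictionary, for illustration only.
RULES = {"colour": "color", "col": "column", "grey": "gray"}

if __name__ == "__main__":
    import sys
    for input_line in sys.stdin:
        sys.stdout.write(rewrite(input_line, RULES))
```

    Note that "colour" wins over its prefix "col" because the longer match
    is preferred, and that replaced text is never scanned again. A compiled
    Flex scanner reaches the same result with a single automaton pass
    instead of repeated dictionary lookups.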

    I would say that the best tool for matching many (>500) strings and long
    (xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....) strings is
    the Beta program, but I do not know how free it is. Lingsoft sells
    commercial licenses. It is quite an old program, but it uses state
    machines and packed transitions very efficiently and should be kept in
    mind when considering such tools. I used it when Lex (or Flex) could not
    compile its rules into an automaton. Typically the limit of Flex is
    somewhere around 500 rules, after which the machine grows too big.

    If you want full transducers, try the RWTH FSA utilities. They are free
    and very efficient.

    -- A Yli-Jyrä


