Re: [Corpora-List] fast string replacement

From: Stefan Evert (evert@IMS.Uni-Stuttgart.DE)
Date: Fri Mar 11 2005 - 16:28:50 MET


    > I am looking for a program that
    >
    > - takes as input a string (!) rewriting dictionary and a corpus
    > - applies all rewriting rules to all lines of the corpus
    > - is fast, stable and free
    > - works under Linux
    >

    Two further questions:

    - What exactly do you mean by "fast"?

    Perl is very good at doing that sort of thing and it is usually quite
    fast. However, whether Perl is a feasible option or not depends on
    your answer to my second question (Perl is good at word replacement
    but fairly slow for string replacement).

    - Do you mean string replacement (arbitrary substrings in a line of
    text) or word replacement?

    If you do string replacement then

      Eunice from the bookstand.

    would become

      Eunice/adj from the books/v:3:pres;n:plurtand.

    after transduction. If you work on white-space delimited words, on the
    other hand, you can split lines in Perl, look up each word in a hash
    that stores rewriting rules, and insert the replacement if applicable.
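
    To make the difference concrete, here is a minimal sketch in Python
    (not Perl, so a dict stands in for the Perl hash). The rule table is
    copied from the quoted example; the names RULES, substring_rewrite
    and word_rewrite are made up for illustration and do not come from
    any existing tool.

```python
# Rule table taken from the example in the quoted message.
RULES = {
    "nice": "nice/adj",
    "books": "books/v:3:pres;n:plur",
}


def substring_rewrite(line, rules):
    """String replacement: rewrite every occurrence of a key,
    including occurrences inside longer words."""
    for source, target in rules.items():
        line = line.replace(source, target)
    return line


def word_rewrite(line, rules):
    """Word replacement: split on white space, look each token up in
    the table, and substitute only exact matches.
    Note: runs of white space are collapsed to a single blank."""
    return " ".join(rules.get(token, token) for token in line.split())


if __name__ == "__main__":
    print(substring_rewrite("Eunice from the bookstand.", RULES))
    print(word_rewrite("Eunice from the bookstand.", RULES))
    print(word_rewrite("John reads nice books.", RULES))
```

    Note that the plain white-space split leaves "books." untouched in
    the last line, because the full stop is still attached to the token.
    Some tokenisation step that separates punctuation would probably be
    needed before the lookup.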

    If you're really interested in string replacement (probably with some
    additional code to identify word boundaries), you should be looking at
    finite-state transducers. Two open-source solutions I know are Helmut
    Schmid's FST toolkit (see http://www.ims.uni-stuttgart.de/~schmid) and
    Steve Abney's cascaded parser CASS (you'll have to search Google for
    the source code).
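
    For a small rule set, string replacement with word-boundary checks
    can also be approximated without a transducer toolkit. The sketch
    below is Python and uses a single compiled regular expression, which
    is a stand-in for a real finite-state transducer and not how the
    toolkits above work. It assumes every key begins and ends with a
    word character, so that \b anchors behave as expected; the names
    compile_rules and rewrite are hypothetical.

```python
import re

# Rule table taken from the example in the quoted message.
RULES = {
    "nice": "nice/adj",
    "books": "books/v:3:pres;n:plur",
}


def compile_rules(rules):
    """Build one alternation over all keys, anchored at word boundaries.
    Longer keys come first so that a key is preferred over its own prefix."""
    keys = sorted(rules, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    return re.compile(r"\b(?:" + alternation + r")\b")


def rewrite(line, rules, compiled):
    """Rewrite all matches in a single left-to-right pass over the line."""
    return compiled.sub(lambda match: rules[match.group(0)], line)


if __name__ == "__main__":
    compiled = compile_rules(RULES)
    print(rewrite("John reads nice books.", RULES, compiled))
    print(rewrite("Eunice from the bookstand.", RULES, compiled))
```

    Because the line is scanned only once, replacement text is never
    rewritten again by another rule. How well this scales to a very
    large dictionary depends on the regex engine, which is where a
    proper transducer is likely to do better.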

    Cheers,
    Stefan.

    > Example:
    >
    > Some rewriting rules:
    >
    > books, books/v:3:pres;n:plur
    > nice, nice/adj
    >
    > A "corpus" before transduction:
    >
    > John reads nice books.
    >
    > The same corpus after transduction:
    >
    > John reads nice/adj books/v:3:pres;n:plur.
    >
    > Does anyone know such a program?
    >
    > Jörg Schuster
    >


