Re: [Corpora-List] fast string replacement

From: Stefan Evert (evert@IMS.Uni-Stuttgart.DE)
Date: Fri Mar 11 2005 - 16:28:50 MET

Next message: Leif Arda Nielsen: "Re: [Corpora-List] fast string replacement"

Previous message: js@cis.uni-muenchen.de: "[Corpora-List] fast string replacement"
In reply to: js@cis.uni-muenchen.de: "[Corpora-List] fast string replacement"
Next in thread: Rob Malouf: "Re: [Corpora-List] fast string replacement"
Next in thread: Leif Arda Nielsen: "Re: [Corpora-List] fast string replacement"
Reply: Rob Malouf: "Re: [Corpora-List] fast string replacement"
Reply: Jörg Schuster: "Re: [Corpora-List] fast string replacement"

> I am looking for a program that
>
> - takes as input a string (!) rewriting dictionary and a corpus
> - applies all rewriting rules to all lines of the corpus
> - is fast, stable and free
> - works under Linux
>

Two further questions:

- What exactly do you mean by "fast"?

Perl is very good at doing that sort of thing and it is usually quite
fast. However, whether Perl is a feasible option or not depends on
your answer to my second question (Perl is good at word replacement
but fairly slow for string replacement).

- Do you mean string replacement (arbitrary substrings in a line of
text) or word replacement?

If you do string replacement then

Eunice from the bookstand.

would become

Eunice/adj from the books/v:3:pres;n:plurtand.

after transduction. If you work on white-space delimited words, on the
other hand, you can split lines in Perl, look up each word in a hash
that stores rewriting rules, and insert the replacement if applicable.
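
The difference between the two modes can be sketched in a few lines. This is only an illustration, written in Python rather than Perl; the function names `replace_substrings` and `replace_words` are made up for this example, and the rules are the two from the example in this thread.

```python
# Rewriting rules taken from the example in this thread.
RULES = {
    "books": "books/v:3:pres;n:plur",
    "nice": "nice/adj",
}

def replace_substrings(line, rules):
    """Naive string replacement: rewrites every occurrence,
    including occurrences inside longer words."""
    # Rules are applied one after another, so a later rule could in
    # principle also match text inserted by an earlier one.
    for source, target in rules.items():
        line = line.replace(source, target)
    return line

def replace_words(line, rules):
    """Word replacement: split on white space and look each token
    up in the dictionary (a hash in Perl terms)."""
    return " ".join(rules.get(token, token) for token in line.split())

print(replace_substrings("Eunice from the bookstand.", RULES))
print(replace_words("Eunice from the bookstand.", RULES))
print(replace_words("John reads nice books.", RULES))
```

Note that plain white-space splitting leaves punctuation attached to the word, so "books." in the last line is not found in the dictionary; some tokenisation would be needed before the lookup.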

If you're really interested in string replacement (probably with some
additional code to identify word boundaries), you should be looking at
finite-state transducers. Two open-source solutions I know of are Helmut
Schmid's FST toolkit (see http://www.ims.uni-stuttgart.de/~schmid) and
Steve Abney's cascaded parser CASS (you'll have to search Google for
the source code).
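
For a small dictionary, string replacement with word-boundary checks can be approximated without a transducer toolkit by compiling all rules into one regular expression. The sketch below is a stand-in for that idea, not an example of the FST toolkit or CASS; it is in Python, the names `compile_rules` and `rewrite` are hypothetical, and it assumes every dictionary key starts and ends with a letter or digit so that `\b` behaves as a word boundary.

```python
import re

# Rewriting rules taken from the example in this thread.
RULES = {
    "books": "books/v:3:pres;n:plur",
    "nice": "nice/adj",
}

def compile_rules(rules):
    """Build one alternation over all dictionary keys, anchored at word boundaries."""
    # Longest keys first, so a longer entry wins over an entry that is its prefix.
    keys = sorted(rules, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    return re.compile(r"\b(?:" + alternation + r")\b")

def rewrite(line, rules, pattern):
    """Replace every match in a single left-to-right pass over the line."""
    # Inserted replacement text is not scanned again.
    return pattern.sub(lambda match: rules[match.group(0)], line)

PATTERN = compile_rules(RULES)
print(rewrite("John reads nice books.", RULES, PATTERN))
print(rewrite("Eunice from the bookstand.", RULES, PATTERN))
```

With a very large dictionary the alternation may become slow or unwieldy, which is where a real finite-state transducer should pay off.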

Cheers,
Stefan.

> Example:
>
> Some rewriting rules:
>
> books, books/v:3:pres;n:plur
> nice, nice/adj
>
> A "corpus" before transduction:
>
> John reads nice books.
>
> The same corpus after transduction:
>
> John reads nice/adj books/v:3:pres;n:plur
>
> Does anyone know such a program?
>
> Jörg Schuster
>


This archive was generated by hypermail 2b29 : Fri Mar 11 2005 - 16:44:42 MET