[Corpora-List] Re: fast string replacement

From: stahl@germanistik.uni-wuerzburg.de
Date: Tue Mar 15 2005 - 13:10:44 MET


    Jörg Schuster wrote:
     
    > I mean really REALLY fast. The size of my rewriting dictionary is 1
    > million lines at the moment. (But it will grow larger). The size of my
    > corpus is 80GB. And I would like to be able to tag often.

    To manipulate really large files I use the "TUebingen System of
    TExt processing Programs" (Tustep), which contains a module that
    can be used, among many other things, to replace many source strings
    with new target strings. You can find information about Tustep at this URL:
       http://www.uni-tuebingen.de/zdv/tustep

    To answer your question I created a test file holding 1048576 lines and
    a script file with two strings to replace.

    The test file contains 1 million lines, each with the text:
       Dies ist eine Datei mit 1 Million Zeilen.

    A script with the lines
       #create,test2,confirm=-
       #copy,test,test2,-,+,*
       xx .datei.file file file file.
       xx .zeilen.lines.
       *eof
    creates a new target file (test2), copies test into test2 and
    replaces the string "datei" with "file file file file" and
    the string "zeilen" with "lines". Please excuse the simplicity
    of my text. Executing the script took 2 seconds.

    Each line in the target file test2 then looks like this:
       Dies ist eine file file file file mit 1 Million lines.
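
    For readers without Tustep, the effect of the two "xx" lines can be
    approximated in Python. This is only a rough sketch, not Tustep's
    own method: the names REPLACEMENTS, rewrite_line and rewrite_file
    are made up for illustration, and case-insensitive matching is an
    assumption made so that "datei" also matches "Datei" as in the
    output line above.

```python
import re

# Hypothetical replacement table mirroring the two "xx" lines of the script.
REPLACEMENTS = {
    "datei": "file file file file",
    "zeilen": "lines",
}

# One alternation, compiled once. IGNORECASE is an assumption (see above).
PATTERN = re.compile(
    "|".join(re.escape(key) for key in REPLACEMENTS),
    re.IGNORECASE,
)


def rewrite_line(line: str) -> str:
    """Replace every table key found in the line with its target string."""
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0).lower()], line)


def rewrite_file(source_path: str, target_path: str) -> None:
    """Copy source to target line by line, applying the replacements."""
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(rewrite_line(line))
```

    Calling rewrite_file("test", "test2") would then play the role of
    the #copy step; the file is streamed line by line, so it never has
    to fit into memory.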

    Copying and manipulating 2 million lines took 4 seconds.

    Tustep's replacement strings look much like regular expressions,
    and you can extend them with exceptions and abstract patterns.
    A single script can replace thousands of strings.
    Maybe this can give you some ideas.
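
    One such idea, sketched in Python under stated assumptions: when
    the dictionary holds very many entries and every entry is a whole
    word, each word of the text can be looked up in a hash table rather
    than matched against one huge pattern. The "source<TAB>target" file
    format and the names load_table and rewrite_words are hypothetical.
    Unlike the Tustep rules above, this sketch does not replace
    substrings inside longer words.

```python
import re

# Runs of word characters are candidates; everything else is copied unchanged.
WORD = re.compile(r"\w+")


def load_table(lines):
    """Build the lookup table from 'source<TAB>target' lines (assumed format)."""
    table = {}
    for raw in lines:
        raw = raw.rstrip("\n")
        if not raw:
            continue  # skip empty lines
        source, target = raw.split("\t", 1)
        table[source.lower()] = target
    return table


def rewrite_words(line, table):
    """Replace each whole word present in the table (case-insensitive lookup)."""
    return WORD.sub(lambda m: table.get(m.group(0).lower(), m.group(0)), line)
```

    Each word costs one dictionary lookup, so the time per line depends
    on the length of the line and not on the size of the table.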

    Best regards
    Peter Stahl
    University of Wuerzburg


