Re: [Corpora-List] fast string replacement

From: Paul Bijnens (paul.bijnens@xplanation.com)
Date: Mon Mar 14 2005 - 14:00:37 MET


    Jörg Schuster wrote:

    > I mean really REALLY fast. The size of my rewriting dictionary is 1
    > million lines at the moment. (But it will grow larger). The size of my
    > corpus is 80GB. And I would like to be able to tag often.

    Attached you'll find a little C program that replaces fixed strings.
    I wrote it about 15 years ago and am still using it.

    [ attachment: http://torvald.aksis.uib.no/corpora/repl.zip ]

    I've never tried it on a replacement set of 1 million lines,
    but I'm very interested to see how it behaves on such large input. :-)

    There is no man page, but the source contains some more information.

    Quick start:

    Make a file with the following syntax:

    ========== cut here ===========
    # This is a comment
    /search/replace/

    # when several search strings match, the longest one is replaced
    /searchsomethingelse/replace this too/

    # blank lines are ignored

    # The first non-alphabetic char is the separator:
    !/this/contains/slashes!/THIS/CONTAINS/SLASHES/!

    # A search or replacement string can contain newlines
    # or any bytes (including null, though that is better encoded as \000)
    /some
    line/some line/

    /need to split/need
    to split/

    # You can encode bytes with backslash notation like
    # \n, \t, etc., \007 (octal) or \xC4 (hexadecimal)
    /élève/\xe9l\xe8ve/
    ========== cut here ===========
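
    To make the table syntax above concrete, here is a minimal Python
    sketch of how such a file could be read. It is not taken from the
    repl source; the function names parse_table and decode_escapes are
    hypothetical. It assumes that every entry starts with its separator
    character, that the separator cannot be escaped inside a string, and
    that only the escapes listed above need decoding.

```python
import re

# Single-character escapes handled by this sketch (an assumption; the
# real program may support more).
_SIMPLE = {b"n": b"\n", b"t": b"\t", b"r": b"\r", b"\\": b"\\"}

# A backslash followed by xHH, by 1-3 octal digits, or by any one byte.
_ESCAPE = re.compile(rb"\\(x[0-9A-Fa-f]{2}|[0-7]{1,3}|.)", re.DOTALL)


def decode_escapes(raw):
    """Turn backslash notation (\\n, \\007, \\xC4, ...) into raw bytes."""
    def sub(match):
        body = match.group(1)
        if body[:1] == b"x" and len(body) == 3:
            return bytes([int(body[1:], 16)])
        if body[0] in b"01234567":
            return bytes([int(body, 8) & 0xFF])
        # Unknown escapes fall back to the escaped byte itself.
        return _SIMPLE.get(body, body)
    return _ESCAPE.sub(sub, raw)


def parse_table(data):
    """Return a list of (search, replacement) byte-string pairs."""
    rules = []
    pos, n = 0, len(data)
    while pos < n:
        ch = data[pos:pos + 1]
        if ch in b" \t\r\n":          # whitespace and blank lines between entries
            pos += 1
            continue
        if ch == b"#":                # comment: skip to the end of the line
            end = data.find(b"\n", pos)
            pos = n if end == -1 else end + 1
            continue
        sep = ch                      # first byte of the entry is the separator
        mid = data.find(sep, pos + 1)
        end = data.find(sep, mid + 1) if mid != -1 else -1
        if end == -1:
            raise ValueError("unterminated entry at byte offset %d" % pos)
        search = decode_escapes(data[pos + 1:mid])
        replacement = decode_escapes(data[mid + 1:end])
        rules.append((search, replacement))
        pos = end + 1
    return rules
```

    The sketch works on bytes rather than text, so a table can map
    arbitrary byte sequences, and it scans for separators instead of
    splitting on lines, so entries may span several lines.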

    Execute with:

    $ repl /name/of/repl/table infile > outfile

    You can also specify replacements on the command line:

    $ repl -e '/\r\n/\n/' infile > outfile

    At least the program is very simple... (and fast for me!)

    If really needed, the tree implementation could be replaced
    by a trie implementation to make it even faster, at the expense of
    being more complicated (that's probably what the commercial programs do).
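
    For readers who have not met a trie before, the idea can be sketched
    in a few lines of Python. This is an illustration only, not the code
    in repl and not a tuned implementation; build_trie and replace_all
    are hypothetical names. It assumes the longest match wins at each
    position, that replaced text is not scanned again, and that the
    input fits in memory.

```python
def build_trie(rules):
    """Build a byte trie from (search, replacement) pairs.

    Each node is a dict mapping a byte value (int) to a child node.
    The key None marks the end of a search string and holds its
    replacement; it cannot collide with the integer byte keys.
    """
    root = {}
    for search, replacement in rules:
        if not search:
            continue                  # an empty search string would match everywhere
        node = root
        for byte in search:
            node = node.setdefault(byte, {})
        node[None] = replacement
    return root


def replace_all(data, root):
    """Replace the longest matching search string at each position."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        node = root
        best_end, best_repl = -1, None
        j = i
        # Walk the trie as far as the input allows, remembering the
        # last complete search string seen on the way.
        while j < n and data[j] in node:
            node = node[data[j]]
            j += 1
            if None in node:
                best_end, best_repl = j, node[None]
        if best_repl is not None:
            out += best_repl
            i = best_end              # continue after the matched text
        else:
            out.append(data[i])       # no match here: copy one byte
            i += 1
    return bytes(out)
```

    The walk restarts at every input position, so the cost per position
    grows with the length of the longest search string but not with the
    number of entries in the table. Adding failure links in the style of
    Aho-Corasick would avoid the restarts, at the price of the extra
    complexity mentioned above.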

    -- 
    Paul Bijnens, Xplanation                            Tel  +32 16 397.511
    Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
    http://www.xplanation.com/          email:  Paul.Bijnens@xplanation.com
    ***********************************************************************
    * I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, F6, *
    * quit,  ZZ, :q, :q!,  M-Z, ^X^C,  logoff, logout, close, bye,  /bye, *
    * stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
    * PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
    * kill -9 1,  Alt-F4,  Ctrl-Alt-Del,  AltGr-NumLock,  Stop-A,  ...    *
    * ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
    ***********************************************************************
    


