RE: Corpora: a program needed

From: Walker, Daniel (Daniel.Walker@bowneglobal.com)
Date: Fri May 31 2002 - 00:05:08 MET DST

    Actually, I believe the numbers are supposed to be incremented when a new
    type is encountered and otherwise stay the same: the numbers change less
    frequently towards the end of the file, and the last one printed is the
    number of different types. So, an even terser one-liner (got to love Perl):

    $ cat file
    this
    is
    a
    test
    this
    really
    is
    a
    test

    $ cat file | perl -pe 's/.+/$t{$_}?$i:($t{$_}=++$i)/e'
    1
    2
    3
    4
    4
    5
    5
    5
    5

    Cordially,
    Daniel Walker

    -----Original Message-----
    From: David Graff [mailto:graff@unagi.cis.upenn.edu]
    Sent: Thursday, May 30, 2002 7:35 AM
    To: Sampo Nevalainen
    Cc: corpora@hd.uib.no
    Subject: Re: Corpora: a program needed

    Sampo,

    The command-line Perl script I sent you earlier (which I failed to copy
    to the list) could actually be expressed more briefly. Again, granting
    that the data is already tokenized to one word token per line:

    cat token.stream | \
     perl -pe 's/(\S+)/exists($t{$1}) ? $t{$1}:($t{$1}=++$tc)/e'

        Best regards,

            Dave Graff


