Corpora: PC-based programs to create lists of n-grams

From: Mark Davies (mdavies@ilstu.edu)
Date: Mon Oct 15 2001 - 15:57:22 MET DST

    As I mentioned in a related message last week, I'm in the process of
    creating a list of 1-, 2-, and 3-grams (maybe 4- and 5-grams too) in a
    100-million-word corpus of Spanish.

    What I'm looking for is a program that will allow me to create these lists
    of n-grams more efficiently than my current setup does. I need a solution
    that has the following features:

    ** PC-based (DOS or Windows)
    ** Output in non-proprietary ASCII format
    ** Can easily handle input files as large as 1,000,000 words (hopefully,
    much larger)
    ** Can be run in "batch file" mode, i.e. without human intervention,
    process a list of 40 different 1,000,000 word input files, and return 40
    output files with the lists of n-grams.
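
    The batch requirement above can be illustrated with a minimal sketch,
    assuming a Python interpreter is available on the PC. Everything in it is
    an assumption made for illustration: the function names (tokenize,
    count_ngrams, write_counts, process_files), the Latin-1 file encoding, the
    simple \w+ tokenizer, and the "frequency<TAB>n-gram" output layout. It is
    not an existing tool.

```python
import collections
import re
from pathlib import Path

# Assumption: a "word" is any run of letters/digits; real corpora may need more care.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    counts = collections.Counter()
    for i in range(len(tokens) - n + 1):
        counts[" ".join(tokens[i:i + n])] += 1
    return counts


def write_counts(counts, out_path, encoding="latin-1"):
    """Write 'frequency<TAB>n-gram' lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    with open(out_path, "w", encoding=encoding, errors="replace") as out:
        for gram, freq in ordered:
            out.write(f"{freq}\t{gram}\n")


def process_files(in_paths, out_dir, sizes=(1, 2, 3), encoding="latin-1"):
    """Process each input file in turn; one plain-text output file per n."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for in_path in map(Path, in_paths):
        tokens = tokenize(in_path.read_text(encoding=encoding))
        for n in sizes:
            out_path = out_dir / f"{in_path.stem}.{n}grams.txt"
            write_counts(count_ngrams(tokens, n), out_path, encoding)
            written.append(out_path)
    return written
```

    Called on a list of 40 input paths, process_files would run without
    human intervention and leave plain-text lists behind. Note that this
    version reads each input file into memory in one piece.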

    I've been using WordSmith, which can be run in "batch file" mode and
    has been quite useful. The problem with WordSmith, however, is that it
    exports the lists of n-grams in a proprietary format, which then have to
    be converted manually -- one by one -- to standard ASCII files. In
    addition, it doesn't cope well with input files much larger than about
    one million words.
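
    One way a program can stay usable on inputs well beyond a million words
    is to read the text line by line and keep only the last n tokens in a
    sliding window. The sketch below, in Python, illustrates that idea under
    stated assumptions: stream_ngrams and count_file are hypothetical names,
    the encoding is assumed to be Latin-1, and n-grams are allowed to span
    line breaks.

```python
import collections
import re

# Assumption: a "word" is any run of letters/digits.
TOKEN_RE = re.compile(r"\w+")


def stream_ngrams(lines, n):
    """Yield n-grams from an iterable of lines without storing all tokens."""
    window = collections.deque(maxlen=n)  # holds only the most recent n tokens
    for line in lines:
        for token in TOKEN_RE.findall(line.lower()):
            window.append(token)
            if len(window) == n:
                yield " ".join(window)


def count_file(path, n, encoding="latin-1"):
    """Count n-grams in one file, reading it a line at a time."""
    with open(path, encoding=encoding) as handle:
        return collections.Counter(stream_ngrams(handle, n))
```

    The token stream itself never sits in memory, but the table of counts
    still grows with the number of distinct n-grams, so very large corpora
    may still need to be processed in pieces and merged afterwards.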

    I already know that there are some very nice Unix/Linux-based solutions,
    but I'm really looking for something that is PC-based, since my students
    will also be using something like this in the near future, and all we have
    here are PCs :-(.

    In addition, I've seen reference to Perl scripts that can be run on a PC,
    such as the <bigram-generate.prl> script that comes with the Brill tagger,
    and which can be run with Windows ActivePerl. While I may very well end up
    using this or a similar Perl script, I'm also very interested in
    "stand-alone" solutions.

    Thanks in advance for your help. I'll post a summary if there is interest.

    Mark Davies

    ====================================================
    Mark Davies, Associate Professor, Spanish Linguistics
    4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
    309-438-7975 (voice) / 309-438-8083 (fax)
    http://mdavies.for.ilstu.edu/

    ** Corpus design and use / Web-database programming and optimization **
    ** Historical and dialectal Spanish and Portuguese syntax / Distance
    education **
    =====================================================


