Re: Corpora: PC-based programs to create lists of n-grams

From: Dragomir Radev (radev@si.umich.edu)
Date: Mon Oct 15 2001 - 22:08:06 MET DST

    Check the CMU-Cambridge Language Modeling toolkit:

    http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html

    Drago

    Mark Davies wrote:
    >
    > As I mentioned in a related message last week, I'm in the process of
    > creating a list of 1, 2, and 3-grams (maybe 4 and 5-grams too) in a 100
    > million word corpus of Spanish.
    >
    > What I'm looking for is a program that will allow me to create these lists
    > of n-grams more efficiently than what I have presently. I need a solution
    > that has the following features:
    >
    > ** PC-based (DOS or Windows)
    > ** Output in non-proprietary ASCII format
    > ** Can easily handle input files as large as 1,000,000 words (hopefully,
    > much larger)
    > ** Can be run in "batch file" mode, i.e. without human intervention,
    > process a list of 40 different 1,000,000 word input files, and return 40
    > output files with the lists of n-grams.
    >
    > I've been using WordSmith, which can be run in "batch file" mode, and which
    > has been quite useful. The problem with WordSmith, however, is that it
    > exports the lists of n-grams in a proprietary format, which then have to
    > be converted manually, one by one, to standard ASCII files. In
    > addition, it doesn't cope well with input files much larger than about
    > one million words.
    >
    > I already know that there are some very nice Unix/Linux-based solutions,
    > but I'm really looking for something that is PC-based, since my students
    > will also be using something like this in the near future, and all we have
    > here are PCs :-(.
    >
    > In addition, I've seen reference to Perl scripts that can be run on a PC,
    > such as the <bigram-generate.prl> script that comes with the Brill tagger,
    > and which can be run with Windows ActivePerl. While I may very well end up
    > using this or a similar Perl script, I'm also very interested in
    > "stand-alone" solutions.
    >
    > Thanks in advance for your help. I'll post a summary if there is interest.
    >
    > Mark Davies
    >
    >
    > ====================================================
    > Mark Davies, Associate Professor, Spanish Linguistics
    > 4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
    > 309-438-7975 (voice) / 309-438-8083 (fax)
    > http://mdavies.for.ilstu.edu/
    >
    > ** Corpus design and use / Web-database programming and optimization **
    > ** Historical and dialectal Spanish and Portuguese syntax / Distance
    > education **
    > =====================================================
    >
    >
    >
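
For the batch requirement in the quoted message (many input files in, one plain-text n-gram list out per file), a minimal sketch in Python might look like the following. It is not part of the CMU-Cambridge toolkit or of WordSmith; the function names, the tab-separated output layout, the `.Ngrams.txt` file naming, and the Latin-1 encoding are all assumptions made for illustration.

```python
import re
from collections import Counter
from pathlib import Path


def ngrams(tokens, n):
    """Yield successive n-token tuples from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(text, n):
    """Count n-grams in text, using a simple lowercase word tokenizer (an assumption)."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(" ".join(gram) for gram in ngrams(tokens, n))


def process_files(input_paths, out_dir, n, encoding="latin-1"):
    """Write one plain-text n-gram list per input file; no user interaction needed.

    The encoding default is an assumption; adjust it to match the corpus.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in map(Path, input_paths):
        counts = count_ngrams(path.read_text(encoding=encoding), n)
        out_path = out_dir / f"{path.stem}.{n}grams.txt"
        with out_path.open("w", encoding=encoding) as out:
            # Most frequent first; ties broken alphabetically.
            for gram, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
                out.write(f"{gram}\t{freq}\n")
        written.append(out_path)
    return written
```

Calling `process_files(sorted(Path("corpus").glob("*.txt")), "ngrams", 3)` would, under these assumptions, write one trigram list per file in a hypothetical `corpus` directory. Each file is counted in memory, so very large inputs may need a streaming or disk-based approach instead.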

    -- 
    Dragomir R. Radev                                         radev@umich.edu
    Assistant Professor of Information, Electrical Engineering and
    Computer Science, and Linguistics, the University of Michigan, Ann Arbor
    Phone: 734-615-5225   Fax: 734-764-2475    http://www.si.umich.edu/~radev
    


