Re: [Corpora-List] producing n-gram lists in java

From: Stefan Evert (evert@IMS.Uni-Stuttgart.DE)
Date: Tue Oct 11 2005 - 12:34:48 MET DST

    > Is Java a requirement? There are some good utilities for this in Perl
    > such as:
    > http://search.cpan.org/~vlado/Text-Ngrams-1.7/Ngrams.pm
    > (shameless plug for one of my profs :P)
    > Seriously though, it is a good utility and if you are just doing text
    > processing it shouldn't really matter what language you are doing it in.
    >

    To put in another plug, if you aren't tied to Java and Windows, and if
    you're looking for a quick solution, you might try the IMS Corpus
    Workbench (http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/).
    The cwb-scan-corpus program included in the Workbench can handle
    shorter n-grams from corpora of BNC dimensions, especially when the
    parts of speech of the component words are restricted (the Corpus
    Encoding Tutorial on the "Users' Corner" page gives some examples of
    how the program is used).
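
    As a rough illustration of what a POS-restricted bigram count
    produces, here is a minimal in-memory Java sketch. It is not how
    cwb-scan-corpus works and does not use the Workbench at all; the
    class name, the parallel word/tag arrays and the regular-expression
    tag patterns are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the IMS Corpus Workbench.
final class RestrictedBigrams {

    /**
     * Counts adjacent word pairs whose POS tags match pos1 and pos2
     * (regular expressions that must match the whole tag).
     * words[i] and tags[i] describe the same token.
     */
    static Map<String, Integer> count(String[] words, String[] tags,
                                      String pos1, String pos2) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags differ in length");
        }
        Pattern first = Pattern.compile(pos1);
        Pattern second = Pattern.compile(pos2);
        Map<String, Integer> freq = new HashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            // Keep the pair only if both tags satisfy their restriction.
            if (first.matcher(tags[i]).matches()
                    && second.matcher(tags[i + 1]).matches()) {
                freq.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        String[] words = {"the", "big", "dog", "saw", "the", "big", "dog"};
        String[] tags  = {"DT", "JJ", "NN", "VBD", "DT", "JJ", "NN"};
        // Adjective followed by noun.
        System.out.println(count(words, tags, "JJ.*", "NN.*"));
    }
}
```

    Restricting the tags keeps the hash map small, which is one reason
    such restricted counts stay manageable in memory.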

    If you need to handle very long n-grams (or very large corpora), you
    should go for suffix trees or suffix arrays. You should be aware,
    though, that the sorting step in Yamamoto & Church's suffix-array
    implementation is a very expensive
    operation and will take its time. There are other implementations of
    suffix trees that build frequency lists in memory (you have to limit
    the maximal size of the n-grams, though), but I don't know how well
    they handle very large data sets.
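
    To make the sorting step concrete, here is a minimal Java sketch of
    the general idea: sort all n-gram start positions, then read the
    frequencies off runs of equal neighbours. It is a simplified
    word-level variant with a fixed maximal n, not Yamamoto & Church's
    algorithm, and the class and method names are made up for the
    example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: n-gram counts from sorted start positions.
final class SuffixArrayNgrams {

    /** Returns n-gram frequencies in lexicographic order of the n-grams. */
    static Map<String, Integer> count(String[] tokens, int n) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        int m = tokens.length - n + 1; // positions where a full n-gram starts
        if (n <= 0 || m <= 0) {
            return freq;
        }
        Integer[] starts = new Integer[m];
        for (int i = 0; i < m; i++) {
            starts[i] = i;
        }
        // The expensive part: sort positions by the n tokens that follow them.
        Arrays.sort(starts, (a, b) -> compare(tokens, a, b, n));

        // Equal n-grams are now adjacent, so each run is one list entry.
        int runStart = 0;
        for (int i = 1; i <= m; i++) {
            if (i == m || compare(tokens, starts[runStart], starts[i], n) != 0) {
                int from = starts[runStart];
                String key = String.join(" ", Arrays.copyOfRange(tokens, from, from + n));
                freq.put(key, i - runStart);
                runStart = i;
            }
        }
        return freq;
    }

    /** Compares the n-grams starting at positions a and b token by token. */
    private static int compare(String[] tokens, int a, int b, int n) {
        for (int k = 0; k < n; k++) {
            int c = tokens[a + k].compareTo(tokens[b + k]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String[] tokens = "to be or not to be".split(" ");
        System.out.println(count(tokens, 2));
    }
}
```

    The sketch keeps only one integer per corpus position besides the
    tokens themselves, but every comparison may look at up to n tokens,
    which is where the sorting time goes on large corpora.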

    Hope this helps,
    Stefan.


