Re: [Corpora-List] producing n-gram lists in java

From: Stefan Evert (evert@IMS.Uni-Stuttgart.DE)
Date: Tue Oct 11 2005 - 12:34:48 MET DST

Next message: Timad Kahena: "[Corpora-List] comparing two IR systems using statistical tests?"

Previous message: Pincemin: "[Corpora-List] XML/TEI Human Rights Corpus"
In reply to: Chris Jordan: "Re: [Corpora-List] producing n-gram lists in java"
Next in thread: Constantin Orasan: "Re: [Corpora-List] producing n-gram lists in java"

> Is Java a requirement? There are some good utilities for this in Perl
> such as:
> http://search.cpan.org/~vlado/Text-Ngrams-1.7/Ngrams.pm
> (shameless plug for one of my profs :P)
> Seriously though, it is a good utility and if you are just doing text
> processing it shouldn't really matter what language you are doing it in.
>

To put in another plug, if you aren't tied to Java and Windows, and if
you're looking for a quick solution, you might try the IMS Corpus
Workbench (http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/).
The cwb-scan-corpus program included in the Workbench can handle
shorter n-grams from corpora of BNC dimensions, especially when the
parts of speech of the component words are restricted (the Corpus
Encoding Tutorial on the "Users' Corner" page gives some examples of
how the program is used).
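
For those who do have to stay in Java, the idea of restricting the parts of speech can be sketched in a few lines. This is only an illustration under my own assumptions, not how cwb-scan-corpus works internally: the class name PosFilteredBigrams, the parallel word/tag arrays and the example tag patterns (BNC C5 tags) are all made up for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: counts only those bigrams whose POS tags match
// the two given patterns, so the frequency list stays small.
class PosFilteredBigrams {

    // words[i] and tags[i] are assumed to describe the same token.
    static Map<String, Integer> count(String[] words, String[] tags,
                                      Pattern first, Pattern second) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            boolean keep = first.matcher(tags[i]).matches()
                    && second.matcher(tags[i + 1]).matches();
            if (keep) {
                // merge() adds 1 to an existing count or starts a new one at 1
                counts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "the old house near the old house".split(" ");
        String[] tags = "AT0 AJ0 NN1 PRP AT0 AJ0 NN1".split(" ");
        // adjective followed by any noun tag
        Map<String, Integer> counts =
                count(words, tags, Pattern.compile("AJ0"), Pattern.compile("NN.*"));
        counts.forEach((bigram, freq) -> System.out.println(freq + "\t" + bigram));
    }
}
```

Because only matching bigrams ever enter the map, memory use depends on the number of distinct adjective-noun pairs rather than on all bigram types in the corpus.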

If you need to handle very long n-grams (or very large corpora), you
should go for suffix trees. You should be aware, though, that the
sorting step in Yamamoto & Church's implementation is a very expensive
operation and will take its time. There are other implementations of
suffix trees that build frequency lists in memory (you have to limit
the maximal size of the n-grams, though), but I don't know how well
they handle very large data sets.
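
To show where the sorting cost comes from, here is a minimal Java sketch of the general idea: sort all token positions by the text that follows them, then read n-gram frequencies off runs of equal neighbours. It is not Yamamoto & Church's implementation, and it is a plain sorted position array rather than a real suffix tree. The class name SuffixArrayNGrams and the choice to compare at most n tokens are my own assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: n-gram counts from a sorted array of token positions.
class SuffixArrayNGrams {

    // Compare the token sequences starting at a and b, looking at most
    // n tokens ahead. A sequence that runs out of tokens sorts first.
    static int compare(String[] tokens, int a, int b, int n) {
        for (int k = 0; k < n; k++) {
            boolean endA = a + k >= tokens.length;
            boolean endB = b + k >= tokens.length;
            if (endA || endB) {
                return endA == endB ? 0 : (endA ? -1 : 1);
            }
            int c = tokens[a + k].compareTo(tokens[b + k]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    static Map<String, Integer> count(String[] tokens, int n) {
        Integer[] positions = new Integer[tokens.length];
        for (int i = 0; i < positions.length; i++) {
            positions[i] = i;
        }
        // The expensive step: O(N log N) comparisons, each up to n tokens long.
        Arrays.sort(positions, (a, b) -> compare(tokens, a, b, n));

        // After sorting, identical n-grams sit next to each other,
        // so each run of equal neighbours yields one frequency.
        Map<String, Integer> counts = new LinkedHashMap<>();
        int i = 0;
        while (i < positions.length) {
            int start = positions[i];
            int j = i + 1;
            while (j < positions.length
                    && compare(tokens, start, positions[j], n) == 0) {
                j++;
            }
            if (start + n <= tokens.length) { // skip truncated n-grams at the end
                String ngram = String.join(" ",
                        Arrays.copyOfRange(tokens, start, start + n));
                counts.put(ngram, j - i);
            }
            i = j;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = "a b a b c".split(" ");
        count(tokens, 2).forEach((ngram, freq) ->
                System.out.println(freq + "\t" + ngram));
    }
}
```

The sketch keeps the whole token array and the position array in memory, so it only illustrates the principle; for corpora that do not fit in RAM the sort would have to be done on disk, which is where most of the time goes.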

Hope this helps,
Stefan.


This archive was generated by hypermail 2b29 : Tue Oct 11 2005 - 13:03:42 MET DST