[Corpora-List] producing n-gram lists in java

From: martincd@aston.ac.uk
Date: Mon Oct 10 2005 - 18:15:08 MET DST

Next message: peetm: "RE: [Corpora-List] producing n-gram lists in java"

Previous message: Dominic Widdows: "[Corpora-List] Research Engineer position at Pittsburgh-based MAYA Design"
Next in thread: peetm: "RE: [Corpora-List] producing n-gram lists in java"
Reply: peetm: "RE: [Corpora-List] producing n-gram lists in java"
Reply: Chris Callison-Burch: "Re: [Corpora-List] producing n-gram lists in java"
Reply: Chris Jordan: "Re: [Corpora-List] producing n-gram lists in java"
Reply: Constantin Orasan: "Re: [Corpora-List] producing n-gram lists in java"

Dear Corpora List,

I am currently trying to develop a Java programme to produce a list of the
most frequently occurring n-grams.
The problem I have is that the amount of data that needs to be stored in
memory (currently in a HashMap) becomes unmanageably large for any
corpus greater than about 5 million words.
I have attempted to overcome this problem by splitting the corpus into
batches of 1 million tokens and then merging all of the smaller n-gram
list files into the final list, but this process was far too slow and
would have taken many hours (if not days) to complete.
I have also created an index of the corpus in the form of a MySQL
database that stores token positions, but I'm unsure how I could query
it to produce n-grams (since querying the index for each individual n-gram
would only lead to the same problems).
Does anyone know how I might go about creating the n-gram list Java programme?
Thank you for your help,
Chris

--------------------------------------------------------------------------
Christopher Martin
Computer Science student
Aston University, Birmingham, UK
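
For readers following the thread, the batch approach described above can be sketched roughly as follows. This is not the poster's code, only a minimal illustration under stated assumptions: each batch is counted in memory and written out as a file sorted by n-gram, and the sorted files are then merged in a single streaming pass, so that only one line per file plus the current top-k candidates are held in memory. The class name NGramBatchCounter, the method names, the "run-N.tsv" file naming and the tab-separated line format are all hypothetical, and tokens are assumed not to contain tab characters.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical helper class: names and file format are assumptions for illustration.
class NGramBatchCounter {

    // Counts the n-grams of one batch in memory. A TreeMap keeps the keys sorted,
    // which the merge step relies on. N-grams that span two batches are only
    // counted if the caller repeats the last n-1 tokens at the start of the next batch.
    static TreeMap<String, Integer> countBatch(List<String> tokens, int n) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String gram = String.join(" ", tokens.subList(i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    // Writes one sorted batch as "ngram<TAB>count" lines and returns the file path.
    static Path writeRun(TreeMap<String, Integer> sorted, Path dir, int id) throws IOException {
        Path file = dir.resolve("run-" + id + ".tsv");
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Integer> e : sorted.entrySet()) {
                w.write(e.getKey() + "\t" + e.getValue());
                w.newLine();
            }
        }
        return file;
    }

    // Read position in one sorted run file; holds only the current line.
    private static final class Cursor {
        final BufferedReader reader;
        String gram;
        int count;

        Cursor(Path p) throws IOException {
            reader = Files.newBufferedReader(p, StandardCharsets.UTF_8);
        }

        // Moves to the next line; closes the reader and returns false at end of file.
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) {
                reader.close();
                return false;
            }
            int tab = line.lastIndexOf('\t');
            gram = line.substring(0, tab);
            count = Integer.parseInt(line.substring(tab + 1));
            return true;
        }
    }

    // Merges all sorted runs in one pass, summing counts of equal n-grams,
    // and returns the k most frequent n-grams in descending order of count.
    static List<Map.Entry<String, Long>> mergeTopK(List<Path> runs, int k) throws IOException {
        PriorityQueue<Cursor> heads = new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.gram));
        for (Path p : runs) {
            Cursor c = new Cursor(p);
            if (c.advance()) heads.add(c);
        }
        Comparator<Map.Entry<String, Long>> byCount = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Long>> best = new PriorityQueue<>(byCount); // min-heap on count
        while (!heads.isEmpty()) {
            String gram = heads.peek().gram;
            long total = 0;
            // All cursors currently positioned on this n-gram contribute to its total.
            while (!heads.isEmpty() && heads.peek().gram.equals(gram)) {
                Cursor c = heads.poll();
                total += c.count;
                if (c.advance()) heads.add(c);
            }
            best.add(new AbstractMap.SimpleEntry<>(gram, total));
            if (best.size() > k) best.poll(); // drop the least frequent candidate
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(best);
        result.sort(byCount.reversed());
        return result;
    }
}
```

A caller would invoke countBatch and writeRun once per batch of tokens, then pass the resulting file paths to mergeTopK. Because every run is sorted by n-gram, the merge reads each file once from start to finish, which is usually much cheaper than repeatedly loading and combining unsorted lists; whether it is fast enough for a given corpus would still need to be measured.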

This archive was generated by hypermail 2b29 : Mon Oct 10 2005 - 18:48:31 MET DST