RE: [Corpora-List] producing n-gram lists in java

From: peetm (peet.morris@comlab.ox.ac.uk)
Date: Mon Oct 10 2005 - 18:57:05 MET DST


    Do you know where the bottleneck(s) actually are, i.e., have you profiled
    the code?

    Could simply doubling the memory in the machine help (it's pretty cheap
    these days)? Or is the problem perhaps the hashing algorithm, or the
    implementation of it?
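
A rough first check, short of a real profiler, is to time the counting step and compare heap use before and after it. The following is only a minimal sketch under stated assumptions: the class and method names (NgramProfileSketch, countNgrams, usedHeapBytes) are hypothetical and not taken from Chris's code, the sample tokens are made up, and it relies on Map.merge and String.join, which need Java 8 or later. The heap figures from Runtime are approximate, since the garbage collector may run at any point.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class NgramProfileSketch {

    /** Counts every n-gram of length n in tokens, keyed by the space-joined n-gram. */
    static Map<String, Integer> countNgrams(String[] tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    /** Approximate heap currently in use by this JVM, in bytes. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        // Hypothetical sample input; a real run would read the corpus instead.
        String[] tokens = "the cat sat on the mat and the cat sat down".split(" ");

        long heapBefore = usedHeapBytes();
        long start = System.nanoTime();
        Map<String, Integer> counts = countNgrams(tokens, 2);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long heapAfter = usedHeapBytes();

        System.out.println("distinct bigrams: " + counts.size());
        System.out.println("elapsed ms:       " + elapsedMs);
        System.out.println("heap delta bytes: " + (heapAfter - heapBefore));
    }
}
```

If the heap delta per distinct n-gram is large, memory is the likelier culprit; if elapsed time grows much faster than the corpus does, the hashing or the string building deserves a closer look.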

    I'd be very interested in the results - as I'm planning on building
    something similar soon!

    peetm

    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of martincd@aston.ac.uk
    Sent: 10 October 2005 17:15
    To: CORPORA@uib.no
    Subject: [Corpora-List] producing n-gram lists in java

    Dear Corpora List,

    I am currently trying to develop a Java program to produce a list of the
    most frequently occurring n-grams.
    The problem I have is that the amount of data that needs to be stored in
    memory (currently in a HashMap) becomes unmanageably large for any
    corpus greater than about 5 million words.
    I have attempted to overcome this problem by splitting the corpus into
    batches of 1 million tokens and then merging all of the smaller n-gram
    list files into the final list, but this process was far too slow and
    would have taken many hours (if not days) to complete.
    I have also created an index of the corpus in the form of a MySQL
    database that stores token positions, but I'm unsure how I could query
    it to produce n-grams (since issuing a separate query for each individual
    n-gram will only lead to the same problems).
    Does anyone know how I might go about writing the n-gram list Java
    program?
    Thank you for your help,
    Chris

    --------------------------------------------------------------------------
    Christopher Martin
    Computer Science student
    Aston University, Birmingham, UK
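
The batch approach described in the message above need not be slow if each batch file is written in sorted order, because the files can then be combined in one sequential pass. The following is a sketch of that idea, not Chris's actual code: the class and method names (NgramBatchMerge, writeSortedBatch, merge) and the "ngram<TAB>count" file layout are assumptions made for illustration, tokens are assumed to contain no tab characters, and the code uses Java 8 or later.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

class NgramBatchMerge {

    /** Counts one batch's n-grams and writes them sorted by n-gram as "ngram<TAB>count" lines. */
    static void writeSortedBatch(List<String> tokens, int n, Path out) throws IOException {
        Map<String, Long> counts = new TreeMap<>(); // TreeMap keeps keys in sorted order
        for (int i = 0; i + n <= tokens.size(); i++) {
            counts.merge(String.join(" ", tokens.subList(i, i + n)), 1L, Long::sum);
        }
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                w.write(e.getKey() + "\t" + e.getValue());
                w.newLine();
            }
        }
    }

    /** One open batch file together with the entry it is currently positioned on. */
    private static final class Cursor {
        final BufferedReader reader;
        String ngram;
        long count;

        Cursor(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
        }

        /** Loads the next entry; closes the file and returns false at end of input. */
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) {
                reader.close();
                return false;
            }
            int tab = line.lastIndexOf('\t');
            ngram = line.substring(0, tab);
            count = Long.parseLong(line.substring(tab + 1));
            return true;
        }
    }

    /** Merges sorted batch files in a single pass, summing the counts of equal n-grams. */
    static void merge(List<Path> batches, Path out) throws IOException {
        // The heap always exposes the cursor holding the smallest pending n-gram.
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> a.ngram.compareTo(b.ngram));
        for (Path p : batches) {
            Cursor c = new Cursor(p);
            if (c.advance()) heap.add(c);
        }
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                String current = c.ngram;
                long total = c.count;
                if (c.advance()) heap.add(c);
                // Collect the same n-gram from every other batch that contains it.
                while (!heap.isEmpty() && heap.peek().ngram.equals(current)) {
                    Cursor same = heap.poll();
                    total += same.count;
                    if (same.advance()) heap.add(same);
                }
                w.write(current + "\t" + total);
                w.newLine();
            }
        }
    }
}
```

Memory use during the merge is one line per batch file, regardless of corpus size. Two caveats apply to this sketch: n-grams that straddle a batch boundary are not counted unless the batches overlap by n-1 tokens, and picking out the most frequent n-grams still needs one more pass over the merged file.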


