Re: [Corpora-List] producing n-gram lists in java

From: Constantin Orasan (C.Orasan@wlv.ac.uk)
Date: Mon Oct 10 2005 - 20:18:30 MET DST

Next message: Eva Forsbom: "Re: [Corpora-List] Swedish world list"

Previous message: Chris Jordan: "Re: [Corpora-List] producing n-gram lists in java"
In reply to: martincd@aston.ac.uk: "[Corpora-List] producing n-gram lists in java"
Next in thread: Allauzen Alexandre: "Re: [Corpora-List] producing n-gram lists in java"
Reply: Allauzen Alexandre: "Re: [Corpora-List] producing n-gram lists in java"

Hi,

Do you have any particular reason why you want to implement this in
java? If you work in Unix or you have cygwin installed, you can very
efficiently produce lists of ngrams sorted by frequency by piping the
output of a program which prints n words on each line to:
| sort | uniq -c | sort -nr
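
As a rough illustration of the counting step only, here is a small sketch. The function name count_ngrams and the sample bigrams are made up for the example; the pipeline itself is the one given above.

```shell
#!/bin/sh
# count_ngrams (hypothetical helper name): read one n-gram per line on
# stdin and print "count n-gram" lines, most frequent first.
count_ngrams() {
    # sort groups identical lines, uniq -c counts each group,
    # sort -nr orders the groups by descending count
    sort | uniq -c | sort -nr
}

# Made-up sample input: one bigram per line.
printf '%s\n' 'the cat' 'cat sat' 'the cat' 'the cat' 'cat sat' 'sat down' \
    | count_ngrams
```

Since sort works through temporary files rather than holding everything in memory, this step should cope with inputs much larger than RAM, though it will be limited by disk space and speed.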

All you need to do is to produce a program which prints groups of n words
on every line. This can be easily achieved by moving a window of n words
across the corpus.
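
To make the moving-window idea concrete, here is a minimal sketch in awk, wrapped in a shell function. The name ngram_lines is an assumption for the example, the input is assumed to be one word per line, and punctuation is not treated specially here.

```shell
#!/bin/sh
# ngram_lines N (hypothetical helper name): read one word per line on
# stdin and print every run of N consecutive words on its own line.
ngram_lines() {
    awk -v n="$1" '
        NF == 0 { next }                  # skip blank lines
        {
            window[count % n] = $1        # circular buffer of the last n words
            count++
            if (count >= n) {             # the window is full: print it
                line = ""
                sep = ""
                for (i = count - n; i < count; i++) {
                    line = line sep window[i % n]
                    sep = " "
                }
                print line
            }
        }'
}

# Made-up sample: four words, window of two.
printf '%s\n' a b c d | ngram_lines 2
```

Only the last n words are ever kept, so the memory needed does not grow with the size of the corpus.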

A perl program which produces these lines is the following (the program
assumes that there is one word on each line):

#!/usr/bin/perl
use strict;
use warnings;

my $n = $ARGV[0];    # the length of the ngrams
die "usage: $0 n < file-with-one-word-per-line\n"
    unless defined $n && $n =~ /^\d+$/ && $n > 0;

# tokens treated as punctuation marks
my %punct = map { $_ => 1 }
    (".", ",", ";", "_", "/", "gt", "!", "?", "//", "=", "-", "*",
     "\$", "#", ":", "\"", "'");

my @list = ();       # the current window: the last (at most n) words

while (my $line = <STDIN>) {
    chomp($line);

    # is it a punctuation mark?
    if ($punct{$line}) {
        # do not include punctuation in ngrams: start a new window
        @list = ();
        next;
    }

    # skip empty and whitespace-only lines
    next if $line =~ /^\s*$/;

    # add the word to the window; once the window holds n words,
    # print it and drop the oldest word
    push @list, $line;
    if (@list == $n) {
        print "@list\n";
        shift @list;
    }
}
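
Since the program expects one word per line, the corpus may need to be split into words first. The following is only a rough sketch: the function name one_word_per_line is made up, and it assumes plain running text in which it is acceptable to split off every punctuation character as its own token (so "don't" would become three tokens).

```shell
#!/bin/sh
# one_word_per_line (hypothetical helper name): put each word and each
# punctuation character of the text on stdin on a line of its own.
one_word_per_line() {
    # surround every punctuation character with spaces, then turn each
    # run of whitespace into a single newline
    sed 's/[[:punct:]]/ & /g' | tr -s '[:space:]' '\n'
}

# Made-up sample sentence.
printf 'The cat sat. The cat ran!\n' | one_word_per_line

# Possible end-to-end use, assuming the Perl program is saved as ngrams.pl
# and the corpus is in corpus.txt (both file names are hypothetical):
#   one_word_per_line < corpus.txt | perl ngrams.pl 3 | sort | uniq -c | sort -nr
```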

Regards,

Constantin

> Dear Corpora List,
>
> I am currently trying to develop a Java programme to produce a list of the
> most frequently occurring ngrams.
> The problem I have is that the amount of data that needs to be stored in
> memory (currently stored in a HashMap) becomes unmanageably large for any
> corpus greater than about 5 million words.
> I have attempted to overcome this problem by splitting the corpus into
> batches of 1 million tokens and then collecting all of the smaller ngram
> list files into the final list, but this process was far too slow and
> would have taken many many hours (if not days) to complete.
> I have also created an index of the corpus in the form of a MySQL
> database that stores token positions, but I'm unsure of how I could query
> it to produce n-grams (since querying to list for each individual n-gram
> will only lead to the same problems).
> Does anyone know how I might go about creating the ngram-list java programme?
> Thank you for your help,
> Chris
>
> --------------------------------------------------------------------------
> Christopher Martin
> Computer Science student
> Aston University, Birmingham, UK
>
>

-- 
Constantin Orasan
Lecturer in Computational Linguistics
University of Wolverhampton
http://www.wlv.ac.uk/~in6093/

