Re: [Corpora-List] producing n-gram lists in java

From: Constantin Orasan (C.Orasan@wlv.ac.uk)
Date: Mon Oct 10 2005 - 20:18:30 MET DST


    Hi,

    Do you have any particular reason why you want to implement this in
    Java? If you work in Unix or have Cygwin installed, you can very
    efficiently produce lists of n-grams sorted by frequency by piping the
    output of a program which prints n words on each line to:
     | sort | uniq -c | sort -nr

    All you need to do is write a program which prints groups of n words
    on every line. This can easily be achieved by moving a window of n words
    across the corpus.

    A Perl program which produces these lines is the following (the program
    assumes that there is one word on each line of its input):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # The length of the n-grams is the first argument; the corpus is read
    # from STDIN, one word per line.
    my $n = $ARGV[0];
    die "usage: $0 n < one-word-per-line input\n"
        unless defined $n && $n =~ /^[1-9]\d*$/;

    # tokens treated as punctuation marks
    my %punct = map { $_ => 1 }
        (".", ",", ";", "_", "/", "gt", "!", "?", "//", "=", "-", "*",
         "\$", "#", ":", "\"", "'");

    my @list = ();    # the current window of at most $n words

    while (my $line = <STDIN>) {
        chomp($line);

        # do not include punctuation in n-grams: start a new window
        if ($punct{$line}) {
            @list = ();
            next;
        }

        # skip empty or whitespace-only lines
        next if $line =~ /^\s*$/;

        # add the word; once the window holds n words, print the n-gram
        # and slide the window forward by one word
        push @list, $line;
        if (@list == $n) {
            print join(" ", @list), "\n";
            shift @list;
        }
    }
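
    To try the whole pipeline end to end, here is a minimal shell sketch.
    It assumes a POSIX shell with awk, sort and uniq available. The
    function names ngram_lines and ngram_counts are made up for this
    illustration, and the awk script is only a stand-in for the Perl
    program above: it skips blank lines but does not break the window at
    punctuation marks.

```shell
#!/bin/sh

# ngram_lines N: read one token per line on stdin and print every run of
# N consecutive tokens on a line of its own (a sliding window of N words).
ngram_lines() {
    awk -v n="$1" '
        NF == 0 { next }                 # skip blank lines
        {
            count++
            win[count] = $1              # newest token enters the window
            if (count >= n) {
                line = win[count - n + 1]
                for (k = count - n + 2; k <= count; k++)
                    line = line " " win[k]
                print line
                delete win[count - n + 1]    # oldest token leaves the window
            }
        }'
}

# ngram_counts N: count identical n-grams and list the most frequent first.
ngram_counts() {
    ngram_lines "$1" | sort | uniq -c | sort -nr
}

# Example: bigram counts for a five-token toy corpus.
printf '%s\n' a b a b c | ngram_counts 2
```

    For the toy input the most frequent bigram, "a b" with a count of 2,
    should come first. Because sort works on disk-backed temporary files
    rather than holding every n-gram in memory, this approach should cope
    with corpora that are too large for an in-memory hash table.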

    Regards,

    Constantin

    > Dear Corpora List,
    >
    > I am currently trying to develop a Java programme to produce a list of the
    > most frequently occurring n-grams.
    > The problem I have is that the amount of data that needs to be stored in
    > memory (currently stored in a HashMap) becomes unmanageably large for any
    > corpus greater than about 5 million words.
    > I have attempted to overcome this problem by splitting the corpus into
    > batches of 1 million tokens and then collecting all of the smaller n-gram
    > list files into the final list, but this process was far too slow and
    > would have taken many, many hours (if not days) to complete.
    > I have also created an index of the corpus in the form of a MySQL
    > database that stores token positions, but I'm unsure of how I could query
    > it to produce n-grams (since querying it for each individual n-gram
    > would only lead to the same problems).
    > Does anyone know how I might go about creating the n-gram list Java programme?
    > Thank you for your help,
    > Chris
    >
    > --------------------------------------------------------------------------
    > Christopher Martin
    > Computer Science student
    > Aston University, Birmingham, UK
    >
    >

    -- 
    Constantin Orasan
    Lecturer in Computational Linguistics
    University of Wolverhampton
    http://www.wlv.ac.uk/~in6093/
    


