Re: [Corpora-List] Query about nomenclature

From: Damon Allen Davison (allolex@gmail.com)
Date: Wed Mar 09 2005 - 22:46:30 MET

    Dear John,

    Here are some rather unscientific results. My corpus was a page of
    Google results limited to 100 for the search term "n gram". Doing both
    "ngram" and "n gram" was slightly problematic because there is a Perl
    CPAN module called Text::Ngram, so that weights the results for
    "ngram" quite a bit.

    n-gram : 128 times
    N-gram : 126 times
    ngram : 57 times
    N-Gram : 34 times
    Ngram : 10 times
    N-GRAM : 9 times
    NGRAM : 8 times
    n-Gram : 7 times
    NGram : 5 times

    I got these counts with the following Perl script, after running
    "links --dump results.html > results.txt" on the results page I had
    saved.

    #!/usr/bin/perl
    # syntax: findword <filename>
    use warnings;
    use strict;
    my %total;
    my @matches;
    while ( <> ) {
            # case-insensitive, case-preserving matching, dash optional
            # (without /g, only the first match on each line is counted)
            @matches = /(n-?gram)/i;
            $total{$_}++ foreach @matches;
    }
    # print the variants, most frequent first
    print map { "$_ : $total{$_} times\n" }
          reverse sort { $total{$a} <=> $total{$b} } keys %total;

    Anyway, I hope that helps a little. You can use the same script to do
    searches on other files. :)
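
    For what it's worth, roughly the same tally can be sketched as a
    shell pipeline. This is only a sketch: it assumes a grep that
    supports -o (GNU and BSD grep do), and the function name
    count_ngram_variants is made up for illustration. Unlike the Perl
    script, it counts every match on a line rather than only the first
    one, so the numbers may come out somewhat higher.

```shell
# Hypothetical helper: tally the case-preserving spellings of "n-gram"
# found in one or more text files.
# Assumes a grep that supports -o (print only the matched text).
count_ngram_variants() {
    # grep -o : print each match on its own line
    #      -h : never prefix matches with the file name
    #      -i : ignore case when matching (output keeps original case)
    #      -E : extended regex, so "-?" makes the dash optional
    # sort    : put identical spellings next to each other
    # uniq -c : count each distinct spelling
    # sort -rn: most frequent spelling first
    grep -ohiE 'n-?gram' "$@" | LC_ALL=C sort | uniq -c | sort -rn
}
```

    It would be called as "count_ngram_variants results.txt", and
    prints one line per spelling with the count in front.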

    I like to use "n-gram".

    Warm regards,

    Damon

    -- 
    

    Damon Allen Davison http://allolex.net


