[Corpora-List] The story so far on Ngram

From: John Mckenny (john.mckenny@unn.ac.uk)
Date: Sun Mar 13 2005 - 14:28:54 MET


    Dear Corpusians

    I summarize because some opinions may not have been posted to the list.
    Apologies for not having time to make this shorter and to Pascal for
    stealing such a good line.
    A combination of empirical and rationalist approaches points to
    N-gram/n-gram (with a hyphen) being more acceptable and more used than the
    hyphenless ngram/Ngram (personally ngram looks neater to me but I won't even
    think about it in the future).
    John Sowa writes
    <I treat a variable such as N or a number such as 4 in the same way I would
    treat a word. Therefore, I would apply the same rules for inserting a hyphen
    between two words. If the variable is N, I would write N-gram. But if the
    variable were x, y, or n, I would write x-gram, y-gram, or n-gram. And by
    the same rule, I would write 4-gram.>

    Harold Somers writes: <As editor of a journal which often has articles that
    mention n-grams, my house style is to have n-gram with a hyphen, and the n
    in italics. Although I feel it is not quite right, I guess I would
    capitalize the n if it starts a sentence. As for nomenclature, it seems to
    me that we hear about unigrams, bigrams and trigrams, but after that use
    numbers: 4-grams, 5-grams etc., with a numeral and a hyphen. That's my
    preference, based on what I have seen or heard>.

    Noah Smith writes: <Not sure on hyphenation, but in my view the "N" or "n"
    is an algebraic variable and should be in italics/math typeface. There's a
    paper by Kneser and Ney in which they actually call them "m-grams"! "N" or
    "n" is arguably just the conventional choice of the variable's name, like
    lambda and mu for Lagrangean multipliers, alpha for interpolation
    coefficients, etc. As for higher-order -grams, some tend to avoid the
    vocabulary question by referring to (for example) 4-gram models as
    third-order Markov models (generally a p-gram model is a (p-1)th-order
    Markov model). If you get an empirical result that supports a consensus,
    maybe we won't have to resort to this workaround!>
    Chris Brew writes:
    <The sequence could have been monogram, digram, trigram, tetragram,
    pentagram, hexagram, ... with fairly uniform (Greek) etymology, but someone
    chose unigram, bigram, trigram, ... these look like Latin numerical prefixes,
    so my guess is that the intended extrapolation is
    quadrigram, quintagram, ... which replicates the mixed Latin/Greek etymology
    of bigram through the series. Pretty yukky...>
    Geoffrey Sampson writes
    <Well if it's pentagrams and hexagrams it surely should be tetragrams rather
    than "quadrigrams", in order to avoid mixing Latin and Greek.
    But then if you want to avoid pentagram because of Satanism, you might
    equally want to avoid tetragram because it might be taken to refer to the
    unspeakable Hebrew four-letter name of God. You can't win!
    I think most people would write 4-gram, 5-gram etc after "trigram", and
    whether you capitalize the N of N-gram must surely be a matter of taste
    only. (Though missing out the hyphen would be confusing, I'd have
    thought.)>

    On the more empirical side, Damon Allen Davison writes <My corpus was a page
    of Google results limited to 100 for the search term "n gram". Doing both
    "ngram" and "n gram" was slightly problematic because there is a Perl CPAN
    module called Text::Ngram, so that weights the results for "ngram" quite a
    bit.

    n-gram : 128 times
    N-gram : 126 times
    ngram : 57 times
    N-Gram : 34 times
    Ngram : 10 times
    N-GRAM : 9 times
    NGRAM : 8 times
    n-Gram : 7 times
    NGram : 5 times
    I did this using this Perl script after doing "links --dump
    results.html > results.txt" to the results file I had saved.
    #!/usr/bin/perl
    # syntax: findword <filename>
    use warnings;
    use strict;
    my %total;
    my @matches;
    while ( <> ) {
        # case-insensitive, case-preserving matching, dash optional;
        # /g collects every occurrence on the line, not only the first
        @matches = /(n-?gram)/ig;
        $total{$_}++ foreach @matches;
    }
    # print each spelling with its count, most frequent first
    print map { "$_ : $total{$_} times\n" }
        reverse sort { $total{$a} <=> $total{$b} } keys %total;
    Anyway, I hope that helps a little. You can use the same script to do
    searches on other files. :)
    I like to use "n-gram">.
    John F. Sowa replies:
    <Damon Davison's use of Google inspired me to try
    a variation. I just typed three queries and
    got the following number of hits:

    Search string     Hits
    -------------     ------
    ngram             21,100
    ngram not perl       540
    n-gram            85,700

    This seems to provide overwhelming evidence for
    a hyphen between "n" and "gram". Since Google
    doesn't distinguish capitals, that leaves the
    capitalization question unresolved.>

    But Stefan Evert then counsels caution: <you do not realise that "ngram not
    perl" found approx. 540 pages that contain all three words ("ngram", "not"
    and "perl"), don't you?
    You can see this quite clearly when you look at the result page where
    the matching keywords are highlighted.>

    Yannick Versley elaborated: <Asking google for n-gram may not do what you
    intended, since your query will match all of ngram, n-gram and n gram. Even
    then, looking for "n gram" (which will match n-gram and n gram) returns
    68.900 hits, so n-gram is probably still the right one.
    What I got from google:
    search str.        hits
    --------------     ---------
    ngram              20 400
    ngram -perl        16 100
    "n gram"           68 500
    "n gram" -perl     63 100>

    Andrew Kehoe advises:
    <You need to use the search term "ngram -perl" rather than "ngram not perl"
    because, as Stefan Evert pointed out, "ngram not perl" just returns pages
    containing all 3 of those words.

    Another problem with your method is that Google ignores hyphens in search
    terms. One of the pages returned for the term "n-gram" is
    http://cpan.dei.uc.pt/authors/id/J/JH/JHI/ngram.pl-1.48&e=8092 but this
    page does not contain the word "n-gram" at all, only "ngram" without the
    hyphen.>

    It looks like the searchers will come up with, or tell us how to come up
    with, reliable frequency counts. If so, and always bearing in mind the GIGO
    principle, I wonder whether Noah Smith is right when he surmises above: <If
    you get an empirical result that supports a consensus, maybe we won't have
    to resort to this workaround>. The issues of capitalization and
    italicization might
    be measurable. Nonetheless I suspect that editors and writers in the cluster
    of discourse communities on CORPORA (see Harold Somers above) will continue
    with their current usage unless shown overwhelming counterevidence.
    Best wishes
    John McKenny
    PS FINAL APPEAL: could you please send me your MWU/formulaic sequence/
    chunking answer by 17 March, St. Patrick's Day. Thanks



