Re: [Corpora-List] problems with Google counts

From: Ring Low (mlow@acsu.buffalo.edu)
Date: Wed Mar 16 2005 - 19:26:48 MET


    A few years ago I did a study of the uses of the definite article THE in
    English using Google search (the data were collected in 2003). I used an
    Internet search engine partly because I wanted page counts, which
    exclude repeated instances within the same text (i.e., rather than
    absolute frequencies).
    I gathered about 1500 nouns and put them into the search engine using two
    strings, "the * N" and "the N". I did the same for other pre-nominal
    elements such as "a", "this", "that", "my", "his", and "her". The other
    search criteria I used at that time were "in text only" and "English only".
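
    As a rough sketch of how the query strings described above could be
    generated (the function name, the tuple layout, and the two example
    nouns are made up for illustration; this is not the script used in the
    2003 study):

```python
from itertools import product

# Pre-nominal elements named in the post.
DETERMINERS = ["the", "a", "this", "that", "my", "his", "her"]

def build_queries(nouns, determiners=DETERMINERS):
    """Return (determiner, noun, pattern, query) tuples for both search patterns."""
    queries = []
    for det, noun in product(determiners, nouns):
        # Exact phrase: determiner immediately followed by the noun ("the N").
        queries.append((det, noun, "adjacent", f'"{det} {noun}"'))
        # Wildcard phrase: a wildcard slot between determiner and noun ("the * N").
        queries.append((det, noun, "wildcard", f'"{det} * {noun}"'))
    return queries

# Two placeholder nouns stand in for the roughly 1500 nouns of the study.
queries = build_queries(["book", "idea"])
```

    Each query string would then be submitted by hand or through whatever
    search interface is available, with the language and "in text only"
    restrictions set separately.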

    The inconsistency I found at that time was that the sum of the
    frequencies I obtained for all the nouns with one element was always much
    larger than the frequency reported for a single search on that element
    alone; i.e., the sum of all "the N" counts was much larger than the count
    for the word "the" by itself in the Google database, which puzzled me.

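    One thing to keep in mind when comparing such figures is that page
    counts are not additive: a page that contains several different "the N"
    strings is counted once for each of them in the sum, but only once in
    the search for "the" alone. The toy example below (three invented
    "pages" and a hypothetical page_count helper) only illustrates that
    arithmetic; it does not claim to explain what Google actually did.

```python
# Toy "pages", invented for illustration only.
pages = [
    "the book and the idea",
    "the book",
    "an idea",
]

def page_count(phrase, docs):
    """Number of documents containing the phrase at least once (a page count)."""
    padded = f" {phrase} "
    # Pad with spaces so that only whole-word matches are counted.
    return sum(1 for doc in docs if padded in f" {doc} ")

nouns = ["book", "idea"]
# Summing per-noun page counts counts the first page twice.
sum_of_the_n = sum(page_count(f"the {n}", pages) for n in nouns)
# A single search for "the" counts each page at most once.
the_alone = page_count("the", pages)
```

    Here the sum over nouns comes out larger than the count for "the"
    alone, even though every count is individually correct.
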
    On the other hand, I did find some consistencies in the data. First,
    the ratios among the frequencies from the different searches were always
    about the same, even though I repeated all the searches a couple of times
    over several months. In addition, the relative frequencies of the nouns
    at that time, for the ones that I could check, were consistent with the
    data I found in some other corpora (e.g., if one finds that a word has a
    relatively high frequency in Google, one would also find that word
    having a relatively high frequency in other texts).
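
    One way to make this kind of comparison concrete is a rank correlation
    between the web counts and the counts from a reference corpus. The
    sketch below uses Spearman's coefficient, which is my choice for
    illustration and not necessarily the comparison made in the study; the
    numbers are invented.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Invented counts for four nouns, for illustration only.
web_counts = [9_000_000, 4_500_000, 120_000, 800_000]
corpus_counts = [5200, 3100, 40, 610]
rho = spearman(web_counts, corpus_counts)
```

    A value of rho close to 1 would mean that the nouns are ordered in
    nearly the same way by both sources, whatever the absolute counts are.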

    I agree that using Google to conduct linguistic studies has become more
    and more difficult since then, as the design of the search engine has
    been changing for commercial reasons. We do need a search engine
    designed specifically for linguistic studies. Until such a search engine
    is available, other ways to avoid problematic results might be to adjust
    the design of the study according to known weaknesses of the engine and
    to cross-check the results manually against traditional corpora and
    other search engines.

    -- 
    ==============================
    Ring Low
    mlow@acsu.buffalo.edu
    http://www.acsu.buffalo.edu/~mlow/
    ==============================
    

    Lillian Lee wrote:

    >Dear list members,
    >
    >You might be interested to know that until approximately March 8th,
    >Google counts appear to have been quite off (inflation rates of a
    >factor of 66%?), according to Jean Veronis.
    >
    >In a blog post of February 8th
    >( http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html ),
    >Veronis summarized his earlier findings:
    >
    >  # If you type Chirac OR Sarkozy, you get half the number results of
    >    Chirac alone, which may have a political explanation... but is a
    >    weird approach to boolean logic.
    >
    >  # If you search the in the English pages, you get 1% of the number
    >    you get for the all languages together. Does this mean that the is
    >    99 times more frequent in languages other than English? Of course
    >    not.
    >
    >He gave a possible explanation and noted that "if you want to know the
    >real index count for any word, simply type it twice".
    >
    >On March 13th, he noted that the counts seem to have been adjusted,
    >that is "changed in a major way":
    >http://aixtal.blogspot.com/2005/03/web-google-adjusts-its-counts.html
    >
    >Related posts indicate problems with MSN, the possibility that Yahoo
    >indexes more pages than Google, and more details on calculations.
    >
    >________________________________________________________________
    >Lillian Lee, Assoc. Prof.   tel: 607-255-8119
    >Dept of Computer Science    fax: 607-255-4428
    >Cornell University          llee@cs.cornell.edu
    >Ithaca, NY 14853-7501 USA   www.cs.cornell.edu/home/llee
    >________________________________________________________________

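    The "Chirac OR Sarkozy" observation quoted above can be turned into a
    simple plausibility check on reported counts: by set logic, the count
    for "A OR B" cannot be smaller than the larger of the two individual
    counts, nor larger than their sum. A minimal sketch, with a hypothetical
    function name and invented numbers:

```python
def or_count_is_plausible(count_a, count_b, count_a_or_b):
    """Set-logic bounds: max(A, B) <= |A OR B| <= A + B."""
    return max(count_a, count_b) <= count_a_or_b <= count_a + count_b

# Invented numbers. The second case mimics the pattern described in the
# quoted message, where the OR query returns half the count of one term alone.
ok = or_count_is_plausible(1000, 600, 1300)
suspicious = or_count_is_plausible(1000, 600, 500)
```

    A failed check does not say which of the three counts is wrong, only
    that they cannot all be exact page counts over the same index.
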

