[Corpora-List] problems with Google counts

From: Lillian Lee (llee@cs.cornell.edu)
Date: Mon Mar 14 2005 - 16:46:57 MET

    Dear list members,

    You might be interested to know that, according to Jean Veronis,
    Google counts appear to have been quite off (inflated by something
    like 66%?) until approximately March 8th.

    In a blog post of February 8th
    ( http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html ),
    Veronis summarized his earlier findings:

      # If you type Chirac OR Sarkozy, you get half the number of results of
        Chirac alone, which may have a political explanation... but is a
        weird approach to boolean logic.

      # If you search "the" in the English pages, you get 1% of the number
        you get for all languages together. Does this mean that "the" is
        99 times more frequent in languages other than English? Of course
        not.
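
    Both observations amount to simple consistency checks on reported
    hit counts: a query for A OR B should return at least as many
    results as A or B alone, and a language-restricted query should
    never return more than the unrestricted one. A minimal sketch of
    such checks follows. The function names and the numbers are
    hypothetical illustrations, not Veronis's figures and not any
    search engine's API.

```python
def or_count_problems(count_a, count_b, count_a_or_b):
    """Return the set-theoretic constraints that an OR count violates.

    For any two result sets A and B:
        max(|A|, |B|) <= |A OR B| <= |A| + |B|
    """
    problems = []
    if count_a_or_b < max(count_a, count_b):
        problems.append("OR count is below the larger single-term count")
    if count_a_or_b > count_a + count_b:
        problems.append("OR count exceeds the sum of the single-term counts")
    return problems


def restricted_count_ok(count_restricted, count_all):
    """A restricted query (e.g. English pages only) cannot exceed the total."""
    return count_restricted <= count_all


if __name__ == "__main__":
    # Made-up counts shaped like the first anomaly: the OR query
    # reports half as many results as one of its terms alone.
    print(or_count_problems(count_a=1000, count_b=800, count_a_or_b=500))
    # Made-up counts that satisfy both bounds.
    print(or_count_problems(count_a=1000, count_b=800, count_a_or_b=1500))
```

    Note that the second anomaly passes the subset check (1% of the
    total is still less than the total); what makes it implausible is
    the size of the ratio for a word like "the", which this kind of
    check cannot detect on its own.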

    He gave a possible explanation and noted that "if you want to know the
    real index count for any word, simply type it twice".

    On March 13th, he noted that the counts seem to have been adjusted,
    that is "changed in a major way":
    http://aixtal.blogspot.com/2005/03/web-google-adjusts-its-counts.html

    Related posts indicate problems with MSN, the possibility that Yahoo
    indexes more pages than Google, and more details on calculations.

    ________________________________________________________________
    Lillian Lee, Assoc. Prof. tel: 607-255-8119
    Dept of Computer Science fax: 607-255-4428
    Cornell University llee@cs.cornell.edu
    Ithaca, NY 14853-7501 USA www.cs.cornell.edu/home/llee
    ________________________________________________________________
