Re: [Corpora-List] problems with Google counts

From: Matthew Hurst (mhurst@intelliseek.com)
Date: Mon Mar 14 2005 - 17:41:59 MET


    You may want to look at g-metrics.com, which provides counts over time for
    Google searches. Its counts are quite different from those on the Google search
    page because it uses the Google API (the site also has some discussion of why
    the counts differ).

    As for Lillian's original post, I notice that Google's language classifier,
    at least for Japanese, is not very good...

    Matt Hurst

    Adam Kilgarriff wrote:
    > Both the problem and the solution are simple (intellectually, if not
    > technically):
    >
    > Problem:
    > Google's goals are keeping its customers happy, and we (NLP/web
    > research community) are not a significant proportion of its customers,
    > and we are the only people who care about the accuracy of counts.
    >
    > Solution:
    > don't use Google to get web counts: set up and use a search
    > engine with a scientific, not a commercial, mission instead.
    >
    > This is my current research agenda (see eg
    > http://www.lexmasterclass.com/people/Publications/2003-K-LSEsprolac.pdf
    > ) see also http://wacky.sslmit.unibo.it/
    >
    > Adam Kilgarriff
    >
    > -----Original Message-----
    > From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    > Behalf Of Lillian Lee
    > Sent: 14 March 2005 15:47
    > To: CORPORA@uib.no
    > Subject: [Corpora-List] problems with Google counts
    >
    >
    > Dear list members,
    >
    > You might be interested to know that until approximately March 8th,
    > Google counts appear to have been quite off (inflation rates of a
    > factor of 66%?), according to Jean Veronis.
    >
    > In a blog post of February 8th
    > (http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html),
    > Veronis summarized his earlier findings:
    >
    > # If you type Chirac OR Sarkozy, you get half the number of results of
    > Chirac alone, which may have a political explanation... but is a
    > weird approach to boolean logic.
    >
    > # If you search for "the" in the English pages, you get 1% of the number
    > you get for all languages together. Does this mean that "the" is
    > 99 times more frequent in languages other than English? Of course
    > not.
    >
    > He gave a possible explanation and noted that "if you want to know the
    > real index count for any word, simply type it twice".
    >
    > On March 13th, he noted that the counts seem to have been adjusted,
    > that is "changed in a major way":
    > http://aixtal.blogspot.com/2005/03/web-google-adjusts-its-counts.html
    >
    > Related posts indicate problems with MSN, the possibility that Yahoo
    > indexes more pages than Google, and more details on calculations.
    >
    > ________________________________________________________________
    > Lillian Lee, Assoc. Prof. tel: 607-255-8119
    > Dept of Computer Science fax: 607-255-4428
    > Cornell University llee@cs.cornell.edu
    > Ithaca, NY 14853-7501 USA www.cs.cornell.edu/home/llee
    > ________________________________________________________________
    >
    >
    >
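
The two inconsistencies quoted above from Veronis amount to simple sanity checks on reported hit counts: a count for A OR B should never be smaller than the count for either term alone, and a language-restricted count can be compared against the all-languages count as a share. A minimal sketch follows. The function names are hypothetical and the numbers are placeholders chosen to mimic the pattern described, not real search-engine counts.

```python
def or_count_is_consistent(count_a, count_b, count_a_or_b):
    """Check a reported count for the query "A OR B".

    A OR B matches every page that A matches and every page that B matches,
    so its count must lie between max(A, B) and A + B.
    """
    return max(count_a, count_b) <= count_a_or_b <= count_a + count_b


def subset_share(count_subset, count_all):
    """Fraction of the overall count that falls in a restricted subset,
    for example pages in a single language."""
    if count_all <= 0:
        raise ValueError("count_all must be positive")
    return count_subset / count_all


# Placeholder numbers only (assumed for illustration).
print(or_count_is_consistent(1000, 800, 500))   # False: OR count below a single term
print(or_count_is_consistent(1000, 800, 1500))  # True
print(subset_share(10, 1000))                   # 0.01
```

A share of 0.01 for a word such as "the" in English pages would be a strong hint that the reported counts are estimates rather than index counts, which is the point of the quoted example.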



    This archive was generated by hypermail 2b29 : Mon Mar 14 2005 - 17:35:03 MET