Re: [Corpora-List] problems with Google counts

From: Nancy Ide (ide@cs.vassar.edu)
Date: Wed Mar 16 2005 - 19:48:38 MET


    Are people aware of the Linguist's Search Engine, developed at the
    University of Maryland for doing linguistic searches on internet data?
    The URL is http://lse.umiacs.umd.edu

    On Mar 16, 2005, at 1:26 PM, Ring Low wrote:

    > A few years ago I did a study of the uses of the definite article THE
    > in English using Google search (the data was collected in 2003). I
    > used an Internet search engine to conduct the study partly because I
    > wanted to get page counts, which exclude repeat instances in the same
    > text (i.e., rather than absolute frequencies).
    > I gathered about 1500 nouns and put them into the search engine using
    > two strings, "the * N" and "the N". I also did the same for other
    > pre-nominal elements such as "a", "this", "that", "my", "his", "her".
    > Other criteria I used at that time were "in text only" and "English
    > only".
    >
    > The inconsistency I found, at that time, was that the sum of the
    > frequencies I obtained for all the nouns with one element was always
    > much larger than the frequency reported in a single search for that
    > element, i.e., the sum of all "the N" counts was much larger than the
    > count for the word "the" alone in the Google database, which did
    > puzzle me.
    >
    > On the other hand, I did find some consistencies in the data. First,
    > the ratios of the frequencies across searches were always about the
    > same, even though I repeated all the searches a couple of times over
    > several months. In addition, the relative frequencies among the nouns
    > at that time, as far as I could check, were consistent with the data
    > I found in some other corpora (e.g., if one finds that a word has a
    > relatively high frequency in Google, one would also find that word
    > having a relatively high frequency in other texts).
    > I agree that using Google to conduct linguistic studies has gotten
    > more and more difficult since then, as the design of the search engine
    > has been changing for commercial reasons. We do need a search engine
    > designed specifically for linguistic studies. On the other hand,
    > until such a search engine is available, some other ways to avoid
    > problematic results might be to adjust the design of the study
    > according to known weaknesses of the engine and to cross-check
    > the results manually against traditional corpora and other search
    > engines.
    >
    >
    >
    > --
    > ==============================
    > Ring Low
    > mlow@acsu.buffalo.edu
    > http://www.acsu.buffalo.edu/~mlow/
    > ==============================
    >
    >
    >
    > Lillian Lee wrote:
    >
    >> Dear list members,
    >>
    >> You might be interested to know that until approximately March 8th,
    >> Google counts appear to have been quite off (inflation rates of a
    >> factor of 66%?), according to Jean Veronis.
    >>
    >> In a blog post of February 8th
    >> (http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html),
    >> Veronis summarized his earlier findings:
    >>
    >> # If you type Chirac OR Sarkozy, you get half the number of results
    >> of Chirac alone, which may have a political explanation... but is a
    >> weird approach to boolean logic.
    >>
    >> # If you search "the" in the English pages, you get 1% of the number
    >> you get for all the languages together. Does this mean that "the" is
    >> 99 times more frequent in languages other than English? Of course
    >> not.
    >>
    >> He gave a possible explanation and noted that "if you want to know the
    >> real index count for any word, simply type it twice".
    >>
    >> On March 13th, he noted that the counts seem to have been adjusted,
    >> that is "changed in a major way":
    >> http://aixtal.blogspot.com/2005/03/web-google-adjusts-its-counts.html
    >>
    >> Related posts indicate problems with MSN, the possibility that Yahoo
    >> indexes more pages than Google, and more details on calculations.
    >> ________________________________________________________________
    >> Lillian Lee, Assoc. Prof. tel: 607-255-8119
    >> Dept of Computer Science fax: 607-255-4428 Cornell University
    >> llee@cs.cornell.edu Ithaca, NY 14853-7501 USA
    >> www.cs.cornell.edu/home/llee
    >> ________________________________________________________________
    >>
    >
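
    One possible reason the summed "the N" page counts can exceed the
    page count for "the" alone, independent of any errors in Google's
    figures: a page containing both "the dog" and "the cat" is counted
    once for each phrase but only once for "the". A minimal sketch in
    Python, using three invented "pages" rather than real search results
    (the corpus, the noun list, and the helper page_count are all made up
    for illustration):

```python
# Toy illustration only: these three "pages" are invented, not Google data.
pages = [
    "the dog chased the cat",
    "the cat sat on the mat",
    "a dog and a cat",
]
nouns = ["dog", "cat", "mat"]


def page_count(phrase, pages):
    """Return the number of pages containing the phrase at least once.

    This is a page count (document frequency), not a token count:
    repeated occurrences within one page are counted only once.
    """
    target = phrase.split()
    width = len(target)
    count = 0
    for page in pages:
        tokens = page.split()
        # Slide a window over the tokens and look for an exact phrase match.
        if any(tokens[i:i + width] == target
               for i in range(len(tokens) - width + 1)):
            count += 1
    return count


# Page count for the bare word.
the_pages = page_count("the", pages)

# Page counts for each "the N" phrase, and their sum.
per_noun = {noun: page_count(f"the {noun}", pages) for noun in nouns}
summed = sum(per_noun.values())

print("pages containing 'the':", the_pages)
print("page counts per 'the N':", per_noun)
print("sum over nouns:", summed)
```

    In this toy corpus the per-noun page counts sum to 4 while only 2
    pages contain "the", so an excess of this kind does not by itself
    show that the index counts are wrong.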
    =======================================================

    Nancy Ide

    Professor of Computer Science
    Vassar College
    Poughkeepsie, NY 12604-0520 USA
    Tel: +1 845 437-5988 Fax: +1 845 437-7498
    ide@cs.vassar.edu

    Chercheur Associe
    Equipe Langue et Dialogue, LORIA/CNRS
    Campus Scientifique - BP 239
    54506 Vandoeuvre-les-Nancy FRANCE
    Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
    ide@loria.fr

    =======================================================


