Re: [Corpora-List] Re: problems with Google counts

From: Jean Veronis (Jean.Veronis@up.univ-mrs.fr)
Date: Thu Mar 17 2005 - 08:31:43 MET


    FIDELHOLTZ_DOOCHIN_JAMES_LAWRENCE wrote:

    > Hi, Corpora Guys,
    > Sorry I don't remember who wrote suggesting simply repeating the word
    > in Google to get a supposedly more realistic count of pages with the
    > word in it

    Me ;-)

    http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html

    > (I had deleted all those messages after reading them). I tried this
    > yesterday on a couple of Spanish words (eficaz, eficiente). (By the
    > way, the results were apparently consonant with a student's search of
    > the 100,000,000 word corpusdelespañol site.) Anyway, what repeating
    > the word apparently does is limit the results to those sites which
    > have the word at least two times, in this case cutting down on the
    > numbers by roughly 10%.

    Actually, that's not the case. When you repeat the word, Google ranks
    first the pages that contain the multiword expression you typed. For
    example, if you type A B C, you'll first see pages that contain
    "A B C" exactly, if any. In the case of A A, you will see pages that
    contain exactly "A A" first, but pages where A appears only once still
    appear later on.

    > If that is what is happening, this implies serious problems for
    > relatively rare words, which may not occur twice in very many pages at
    > all. At any rate, the decrease in pages encountered seemed to be
    > about the same proportionately in both cases. (We're talking here
    > about roughly 1.5M original hits.) If I'm missing the point of the
    > suggestion, please straighten me out.
    >
    I think you'll find the whole logic explained in my post cited above.
    Google counts were artificially inflated by 66%. Therefore, proportions
    stay identical.
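
As a small sanity check on that argument, here is a minimal Python sketch. It assumes that "inflated by 66%" means every count is multiplied by the same factor of 1.66; the two raw counts are made-up illustrative numbers, not real Google figures.

```python
from fractions import Fraction

# Assumption: "inflated by 66%" is read as one uniform multiplicative factor.
INFLATION = Fraction(166, 100)


def inflate(true_count):
    """Return the count a uniformly inflating engine would report."""
    return true_count * INFLATION


# Hypothetical true page counts for two words (illustrative only).
true_a = 1_500_000
true_b = 900_000

ratio_true = Fraction(true_a, true_b)
ratio_reported = inflate(true_a) / inflate(true_b)

# The common factor cancels, so the proportion between the words is unchanged.
print(ratio_true == ratio_reported)
```

The cancellation only holds if the factor really is the same for every query, which is exactly what cannot be verified from the outside.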

    However, if you test Google again these days, you will see MAJOR changes
    in the counts. My post made a lot of noise (it was written in early
    February). It has been relayed on many forums, etc., and I know that the
    Googlers have read it with great care (and other search engine makers as
    well ;-). In February they started making major changes to the
    counts in order to reduce the inconsistencies I had spotted -- and to
    close the backdoors they had inadvertently left open.

    Just to give an example: when you typed "the" previously, you used to get

    * 8 billion for "all the web"
    * 80 million for "the" restricted to English pages

    i.e., 1%, which doesn't make sense.
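
The arithmetic behind that 1% can be checked in a couple of lines of Python. The two counts are the rounded figures quoted above, and the variable names are only illustrative.

```python
# Rounded counts for the query "the", as quoted in this message.
all_web = 8_000_000_000      # "all the web"
english_only = 80_000_000    # restricted to English pages

# Share of the "all the web" count that the English-only count represents.
share = english_only / all_web
print(f"{share:.0%}")  # prints "1%"
```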

    This morning, I tried again, and I get 3.6 billion in both cases, which
    does make sense. (This can change again if you try: for the past week or
    so, Google has been totally unstable, due to the major update process.)

    I explained these recent changes last week at:

    http://aixtal.blogspot.com/2005/03/web-google-adjusts-its-counts.html

    Since then, more changes have occurred. Google is trying to get close to
    credible figures. I am afraid that it's not the index that's been fixed,
    but just the extrapolation formulas. In any case, we will never know, and
    that's the problem. You can't do science with instruments you don't
    understand and can't trust.

    By the way, Yahoo gives very reliable and consistent results (including
    for booleans), which I have cross-checked with English and French
    corpora. The only problem was its lack of the wildcard operator, but
    Google has since dropped it as well.

    I personally use Yahoo quite satisfactorily -- so far:

    http://aixtal.blogspot.com/2005/02/lexique-yahoo-et-les-yahoourts.html
    http://aixtal.blogspot.com/2005/03/lexique-glissance-et-pntrance.html
    (in French and about French, sorry)

    And they released a very nice API which lets you get 25 times more
    results than Google (5000 queries a day x 50 results a page, instead of
    1000 x 10 for Google). However, I hope Yahoo won't start playing weird
    marketing games too:
    http://aixtal.blogspot.com/2005/03/web-yahoo-doubles-its-counts.html
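
For what it's worth, the "25 times" figure follows directly from the quotas quoted above. A minimal Python sketch, where daily_results is a hypothetical helper and the quotas are the ones stated in this message (as of March 2005):

```python
def daily_results(queries_per_day, results_per_query):
    """Maximum number of results retrievable per day under a given quota."""
    return queries_per_day * results_per_query


# Quotas as stated in this message; they may have changed since.
yahoo = daily_results(5000, 50)
google = daily_results(1000, 10)

ratio = yahoo // google
print(yahoo, google, ratio)  # prints "250000 10000 25"
```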

    --j
      http://aixtal.blogspot.com

     


