Re: [Corpora-List] Re: problems with Google counts

From: Stefan Evert (evert@IMS.Uni-Stuttgart.DE)
Date: Thu Mar 17 2005 - 10:01:02 MET

    > > (I had deleted all those messages after reading them). I tried this
    > > yesterday on a couple of Spanish words (eficaz, eficiente). (By the
    > > way, the results were apparently consonant with a student's search of
    > > the 100,000,000 word corpusdelespañol site.) Anyway, what repeating
    > > the word apparently does is limit the results to those sites which
    > > have the word at least two times, in this case cutting down on the
    > > numbers by roughly 10%.
    >
    > Actually that's not the case. When you repeat the word, Google ranks
    > first pages that contain the multiword expression you type. For example,
    > if you type A B C, you'll see first pages that contain "A B C" exactly,
    > if any. In the case of A A, you will see pages that contain exactly "A
    > A" first, but pages where A appears only once appear later on.

    Well, that can't quite be the case either, at least not today. Things
    get really funny (in the "weird" sense, I'm afraid) when you start
    looking for more than two repetitions. These are the numbers I got
    from Google five minutes ago:

    3,560,000,000 the
    3,600,000,000 the the
    2,800,000,000 the the the
    2,830,000,000 the the the the
    2,820,000,000 the the the the the
    etc.

    When you look for non-stop-words, Google seems to make a distinction
    between one occurrence and two or more occurrences:

    3,110,000 fink
    1,970,000 fink fink
    1,970,000 fink fink fink
    etc.

    It would seem that in response to Jean's post, Google has changed
    something to enforce consistent results (unless this is just a
    side-effect of a new search engine that doesn't support wildcards).

    If you go to the German Google site (www.google.de), for instance, you
    will still find the old search engine in place (funny that google.de
    seems to find more English pages than google.com ...):

    8,000,000,000 the
       88,100,000 the the
       87,500,000 the the the
       86,700,000 the the the the
    etc.

    At least we still have the wildcard "*" for an arbitrary word. For
    non-stop-words, the results are consistently inconsistent:

    3,460,000 fink
    1,900,000 fink fink
    1,920,000 fink fink fink
    1,870,000 fink fink fink fink
    1,910,000 fink fink fink fink fink
     
    I am quite convinced that there is no sensible interpretation of these
    queries for which the Google numbers are even remotely plausible.
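
    One way to make the implausibility concrete is the rough sketch below. It
    assumes (my assumption, not anything Google documents) that a query
    with n copies of a word should match either pages with at least n
    occurrences or pages containing the n-word phrase. Under both readings
    the count can never go up when n grows, so any increase is a
    violation. The helper name `violations` is made up for illustration;
    the numbers are the ones quoted above. Passing this check is only a
    necessary condition and does not make a series plausible.

```python
def violations(counts):
    """Return (n, count_n, count_n_plus_1) wherever the count rises
    when going from n to n + 1 repetitions of the query word."""
    return [
        (n, a, b)
        for n, (a, b) in enumerate(zip(counts, counts[1:]), start=1)
        if b > a
    ]


# Hit counts quoted in this message; position i holds the count for
# i + 1 repetitions of the word.
REPORTED = {
    "google.com the": [3_560_000_000, 3_600_000_000, 2_800_000_000,
                       2_830_000_000, 2_820_000_000],
    "google.com fink": [3_110_000, 1_970_000, 1_970_000],
    "google.de the": [8_000_000_000, 88_100_000, 87_500_000, 86_700_000],
    "google.de fink": [3_460_000, 1_900_000, 1_920_000, 1_870_000,
                       1_910_000],
}

for label, counts in REPORTED.items():
    # An empty list means the series never increases with more repetitions.
    print(label, violations(counts))
```

    The "the" series on google.com and the "fink" series on google.de
    both rise at some step, which neither reading allows. The other two
    series pass this particular check, but the drop from 8,000,000,000 to
    88,100,000 on google.de is hard to square with either reading for a
    word as frequent as "the".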

    Stefan.
    http://wacky.sslmit.unibo.it/

    -- 
    I'm not a nerd. I'm a specialist.
                                       -- from Full Metal Panic, Episode 8
    ______________________________________________________________________
    Stefan Evert                                     purl.org/stefan.evert
    http://www.collocations.de/                        stefan.evert@uos.de
    


