Re: [Corpora-List] problems with Google counts

From: Jean Veronis (Jean.Veronis@up.univ-mrs.fr)
Date: Mon Mar 14 2005 - 20:24:23 MET


    Thanks, Lillian, for citing this study (in fact a series of studies,
    since the saga continues).

    I think it is very important that we, as linguists, analyse very
    closely what search engines offer us if we are going to do "Google
    linguistics" (as more and more of us are tempted to). My conclusion,
    unfortunately, is that Google's counts are totally unreliable. By
    unreliable, I do not mean an uncertainty of just a few percent, as you
    can see in my posts. MSN seems to cheat us as well:

    http://aixtal.blogspot.com/2005/02/web-msn-cheating-too.html

    Yahoo delivers more credible results, and so far I have been able to
    use it satisfactorily. Unfortunately, last week I found that, all of a
    sudden, they had exactly doubled their index size (without announcing
    it officially). That would be fine in itself, but if you look at the
    figures, you'll see that the correlation between the previous counts
    and the new ones is so high (R^2 > 0.99) that it is very difficult to
    accept that the doubling is due to natural growth:

    http://aixtal.blogspot.com/2005/03/web-yahoo-double-ses-comptes.html
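
    For anyone who wants to run the same kind of check on their own
    queries, here is a minimal sketch of the comparison. The counts in it
    are invented for illustration (they are not the figures from the blog
    post), and the function names are hypothetical. The idea, as I read the
    argument above, is that genuine index growth should change different
    queries by different amounts, whereas a near-constant ratio together
    with an R^2 close to 1 looks more like a uniform rescaling of the
    reported numbers.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


def count_ratios(old, new):
    """Per-query ratio of new count to old count."""
    return [n / o for o, n in zip(old, new)]


# Hypothetical hit counts for the same five queries, before and after.
# These numbers are made up for the example.
old_counts = [1200, 45000, 310, 9800, 150000]
new_counts = [2400, 90100, 615, 19500, 301000]

if __name__ == "__main__":
    print("R^2:", r_squared(old_counts, new_counts))
    print("ratios:", count_ratios(old_counts, new_counts))
```

    A wide spread in the ratios would be what one expects from ordinary
    growth; ratios that all sit near the same value are the suspicious
    pattern. A real test would of course need many more queries than five.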

    The solution is, as Adam says, to build our own open engine, and I am
    deeply convinced that such a project is one of the highest priorities
    for our community.

    --j
    http://aixtal.blogspot.com

    ps: It's probably off-topic on this list, but I find it extremely scary
    that our access to the world's information goes through the bottleneck
    of not even a handful of extremely opaque search engines. Beyond counts,
    they can simply decide what we see and what we don't. Big Brother feeling.


