Re: [Corpora-List] WebCorp counts

From: Jean Veronis (Jean.Veronis@up.univ-mrs.fr)
Date: Wed Apr 27 2005 - 14:04:43 MET DST

    Antoinette Renouf wrote:

    >Problems with Google counts were discussed recently on this list: http://torvald.aksis.uib.no/corpora/2005-1/0191.html
    >
    >
    Right, and unfortunately, despite major turbulence since February
    (indicating major software and database changes), Google's counts are
    still completely messed up.

    Just an example from a few minutes ago:

    *94,200,000* for bush
    *85,600,000* for bush OR corpora
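
    The inconsistency in these two counts can be checked mechanically. Here
    is a minimal sketch in Python, assuming only that an OR query must match
    every page that either operand matches, so its count can never be lower
    than the count for one operand alone. The function name is hypothetical;
    the two numbers are the ones quoted above.

```python
def violates_or_lower_bound(count_a, count_a_or_b):
    """Return True when the count for "A OR B" is smaller than the count for A.

    Every page matching A also matches "A OR B", so a smaller count for the
    OR query contradicts Boolean logic.
    """
    return count_a_or_b < count_a


# Counts reported by Google in this message.
bush = 94_200_000
bush_or_corpora = 85_600_000

print(violates_or_lower_bound(bush, bush_or_corpora))
```

    The check flags the pair above, since 85,600,000 is smaller than
    94,200,000.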

    George Boole "doit se retourner dans sa tombe", as we say. I don't know
    how "turning in his grave" translates in English, but you get the
    picture. I have no financial links with Yahoo, but I would like to point
    out that I've switched to Yahoo Search for all my linguistic work, and
    their hit counts seem quite reliable (I don't mean true or honest,
    simply that they seem correlated with some kind of corpus reality).

    I agree with Antoinette that hit counts are not the same as word counts,
    but they are still usable in many studies, for instance when you compare
    term frequency between subsets of the Web, in which you can assume (more
    or less safely) that the average document length is comparable. If you
    read French, you can find an example of this in my morning post about
    Yes or No in pages related to the European Constitution:

    http://aixtal.blogspot.com/2005/04/web-cest-plutt-non.html
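
    As a rough illustration of that kind of comparison, here is a sketch in
    Python that turns raw hit counts into each term's share within a subset,
    so that subsets of different sizes can be compared. It assumes, as
    above, that average document length is comparable across subsets. The
    function name, the subset names and all the numbers are invented
    placeholders, not measurements.

```python
def relative_share(hits_by_term):
    """Convert raw hit counts for several terms into shares that sum to 1."""
    total = sum(hits_by_term.values())
    if total == 0:
        raise ValueError("no hits in this subset")
    return {term: hits / total for term, hits in hits_by_term.items()}


# Placeholder hit counts, invented for illustration only.
subsets = {
    "subset A": {"yes": 300, "no": 700},
    "subset B": {"yes": 450, "no": 550},
}

for name, hits in subsets.items():
    shares = relative_share(hits)
    print(name, {term: round(share, 2) for term, share in shares.items()})
```

    Comparing shares rather than raw counts keeps a large subset from
    dominating the comparison simply because it contains more pages.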

    However, the real solution for us would be our own crawler and search
    engine as discussed before.

    --j
      http://aixtal.blogspot.com

     


