Re: [Corpora-List] Re: problems with Google

From: Tom Emerson (tree@basistech.com)
Date: Sat Mar 19 2005 - 21:14:38 MET

    Pascal Soucy writes:
    > Google does that with all stopwords. If you search for:
    > what does "the" "the" mean, you'll get the same behavior. Google ignores
    > stopwords (and * seems to be handled as a stopword).

    Not really. Two identical stopwords in succession are kept. Try a
    search for "The The" (a band from the late '80s) and you will get hits
    on the determiner usage in isolation. You also get different hits for
    a search of simply "the".

        -tree

    > Both the queries:
    >
    > what does "*" mean
    >
    > and
    >
    > what does "*" "*" mean
    >
    > result in about the same list of documents. The difference between the two
    > arises in the ranking process. The ranking algorithm likely uses term
    > proximity to better match the query as it is written, and it keeps the
    > positions of stopwords in the query to do that.
    >
    > Pascal Soucy
    > Coveo
    >
    > John Milton <lcjohn@ust.hk> wrote, 17.03.2005:
    >
    > > I just discovered that Google seems to have retained some use of the
    > > wildcard for words if you use double quotes with the asterisk. Searches
    > > for "what does "*" mean" and "what does "*" "*" mean" return MAINLY
    > > results in which the asterisks stand for any one word and any two words,
    > > respectively. If anyone else is using web searches as language
    > > learning/teaching resources, this also looks promising:
    > > http://www.findforward.com/
    > >
    > > John Milton
    > > Hong Kong University of Science & Technology

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
      "Beware the lollipop of mediocrity: lick it once and you suck forever"
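
The behavior conjectured in the quoted message (stopwords and "*" dropped as search terms but kept as positions for proximity ranking) can be illustrated with a minimal sketch. This is not Google's actual algorithm, which is not described anywhere in this thread. The stopword list and the names `tokenize`, `query_slots`, and `proximity_score` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical stopword list for illustration; "*" is treated like a stopword.
STOPWORDS = {"the", "a", "an", "of", "*"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping '*' as a token."""
    return re.findall(r"[a-z0-9']+|\*", text.lower())


def query_slots(query):
    """Return (offset, term) pairs for the non-stopword terms of the query.

    Stopwords are not searched for, but each one still occupies a position,
    so the offsets of the remaining terms reflect the query as written.
    """
    return [(i, t) for i, t in enumerate(tokenize(query)) if t not in STOPWORDS]


def proximity_score(query, document):
    """Best fraction of query terms found at their expected relative offsets.

    Every document position is tried as the start of the query; the score is
    1.0 when all non-stopword terms line up exactly as in the query.
    """
    slots = query_slots(query)
    if not slots:
        return 0.0
    doc = tokenize(document)
    best = 0
    for start in range(len(doc)):
        hits = sum(
            1
            for offset, term in slots
            if start + offset < len(doc) and doc[start + offset] == term
        )
        best = max(best, hits)
    return best / len(slots)


one_gap = 'what does "*" mean'
two_gaps = 'what does "*" "*" mean'
print(proximity_score(one_gap, "What does serendipity mean?"))
print(proximity_score(two_gaps, "What does serendipity mean?"))
```

Under these assumptions both queries would retrieve the same documents, since they share the same non-stopword terms, but a document with one intervening word scores higher for the one-asterisk query and a document with two intervening words scores higher for the two-asterisk query.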
    


