Re: [Corpora-List] Query on the use of Google for corpus research

From: Dominic Widdows (widdows@maya.com)
Date: Fri May 27 2005 - 15:46:28 MET DST


    >> Does anyone have any
    >> experience/insight on this?
    >>
    >
    > Well... yes! I made a series of in-depth analyses of Google counts.
    > They are totally bogus, and unusable for any kind of serious research.
    > There is a summary here :
    > http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html

    Dear All,

    While I agree with the points made in Jean's excellent summary, I think
    it's fair to point out that this was partly motivated by the way
    researchers had been using "Google counts" more and more, and running
    into more and more problems. As a community of researchers and
    peer-reviewers, I still don't think that we've been able to agree on
    best practices. I have come across reviews on both sides of the fence,
    saying on the one hand:

    1. Your method didn't get a very big yield on your fixed corpus, why
    didn't you use the Web?

    or on the other:

    2. Your use of web search engines to get results is unreliable, you
    should have used a fixed corpus.

    The main problem is that "using the Web" on a large scale puts you at
    the mercy of the commercial search engines, which leads to the grim
    mess that Jean documents, especially with Google. This situation will
    hopefully change as WebCorp (http://www.webcorp.org.uk/) teams up with
    a dedicated search engine. In the meantime, it's clearly true that you
    can get more results from the web, but you can't vouch for them
    properly, and so a community that values both recall and precision is
    left reeling.

    At the same time, the fact that you can use search engines to get a
    rough count of language use in many cases has thrown the door open to a
    lot of researchers who have every reason to be interested in language
    as a form of data, but have never tried doing much language processing
    before. Over the decades, linguists have often been very sniffy about
    researchers from other disciplines muscling in on their turf, but this
    often results in articles that talk about language just getting
    published elsewhere (e.g. in more mainstream media), where the
    reviewers are perhaps more favourable. A recent and typical example may
    be the "Google Distance" hype
    (http://www.newscientist.com/article.ns?id=dn6924) - we've had
    conceptual distance, latent semantic analysis, mutual information, etc.
    for decades, a couple of mathematicians come along and call something
    the "Google distance", and the New Scientist magazine concludes that
    the magic of Google has made machines more intelligent.
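    For what it's worth, the measure behind the hype is the "Normalized
    Google Distance" of Cilibrasi and Vitanyi, which is computed purely
    from search-engine hit counts. A rough sketch (the counts below are
    invented for illustration, and the total index size is itself only a
    search engine's own rough estimate):

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance (Cilibrasi & Vitanyi) from hit counts:

        NGD(x, y) = (max(log fx, log fy) - log fxy)
                    / (log n - min(log fx, log fy))

    fx, fy -- page counts for each term alone
    fxy    -- page count for pages containing both terms
    n      -- total number of pages indexed (a rough estimate at best)
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Invented counts: terms that nearly always co-occur score near 0,
# terms that rarely co-occur score near 1 or above.
print(ngd(10_000_000, 8_000_000, 5_000_000, 10_000_000_000))
```

    In other words, it is a co-occurrence statistic in the same family as
    the mutual-information measures the field has used for decades - the
    novelty is chiefly in where the counts come from.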

    All right, there's a trace of bitterness here, I wouldn't mind being in
    New Scientist for computing semantic distances, but there's a more
    serious danger as well - we've been doing a lot of pretty good work for
    a long while in different areas of corpus and computational
    linguistics, and it would be a shame if other folks went off and
    reinvented everything, just because there are more widely available
    tools that enable a wider community to "give it a go" and come up with
    something that may do pretty well, especially if you're going for
    recall. It breaks some fundamental principles such as "do your
    experiments in a way that others can replicate them", but this is
    naturally on the increase as big-dataset empiricism comes to the
    forefront of many scientific problems. For example, there's the recent
    research in ocean temperatures that collected 7 million temperature readings
    at different stations, and none of us can go and replicate that data
    collection, but it doesn't invalidate the research.

    If we just tell people that search-engine based research is bogus,
    people will just keep doing it and publishing it elsewhere, and who
    knows, in 10 years' time someone using Google or Yahoo counts may invent
    part-of-speech tagging, and that will be another amazing thing that
    makes computers more intelligent.

    Sorry, I haven't got any answers, but I'm writing this in the hope that
    someone else on the list has!
    Best wishes,
    Dominic



    This archive was generated by hypermail 2b29 : Fri May 27 2005 - 16:21:04 MET DST