RE: [Corpora-List] WebCorp counts

From: j_kurjian@hotmail.com
Date: Thu Apr 28 2005 - 22:50:24 MET DST


    Thanks; yes, that helps. I now know what the upper cutoff is, and that's
    fine. As I said, the limit makes things more manageable.

    Regards,
    Jerry
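
A rough sketch of the behaviour described in the reply quoted below: a fixed number of page hits is taken, and concordance lines are extracted from whichever of those pages can still be read and still contain the phrase. This is not WebCorp's code. The function name `kwic_lines`, the `width` parameter, and the toy pages are assumptions made for illustration; only the 200-page cutoff and the 'one concordance line per site' option come from the message, and the option is simplified here to one line per page.

```python
import re

MAX_PAGES = 200  # cutoff stated in the quoted reply


def kwic_lines(pages, phrase, one_per_site=False, width=30):
    """Collect keyword-in-context lines from at most MAX_PAGES page texts.

    `pages` is a list of page texts; None stands for a page that could
    not be fetched (hypothetical convention for this sketch).
    """
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    lines = []
    for text in pages[:MAX_PAGES]:
        if text is None:  # inaccessible page: contributes nothing
            continue
        matches = list(pattern.finditer(text))
        if one_per_site:  # simplified: at most one line per page
            matches = matches[:1]
        for m in matches:
            start = max(0, m.start() - width)
            end = min(len(text), m.end() + width)
            lines.append(text[start:end])
    return lines


# Toy data, invented for the example.
pages = [
    "I suggest you don't go. Really, I suggest you don't.",  # two occurrences
    None,                                                     # inaccessible
    "This page changed and no longer has the phrase.",        # no occurrence
    "We suggest you don't wait.",                             # one occurrence
]
lines = kwic_lines(pages, "suggest you don't")
print(len(pages), "page hits ->", len(lines), "concordance lines")
```

Under these assumptions the number of concordance lines can come out above or below the number of page hits, which is why it need not track the search engine's figure.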

    >
    >Dear Jerry Kurjian
    >Apologies for the difficulties you are having with WebCorp-generated
    >counts, but they are only temporary, we promise. A new version of WebCorp,
    >to be released soon, will incorporate our own purpose-built search engine,
    >and thus be able to offer accurate frequency counts, type/token ratios,
    >collocational profiles and other statistics.
    >
    >To explain the problem you have had:
    >At the moment WebCorp takes the first 200 hits for your search term from
    >your chosen search engine (Google by default) and extracts concordances
    >from those pages. Unless you choose the 'one concordance line per site'
    >option, there is no limit on the number of concordance lines extracted from
    >each of these 200 pages.
    >
    >However, you will sometimes get fewer than 200 concordance lines in the
    >WebCorp output for your search term. This happens if you have chosen
    >additional filtering options (which will filter out some of the 200 hits
    >from Google), or if certain pages are not accessible when WebCorp tries to
    >access them or have changed since they were indexed by Google and no longer
    >contain your search term.
    >
    >Statistics extracted from the Web are inherently unreliable. AltaVista no
    >longer returns word counts, and the number of 'hits' returned by Google is
    >the number of pages containing your search term, not the number of
    >occurrences of your search term on the Web. Problems with Google counts
    >were discussed recently on this list:
    >http://torvald.aksis.uib.no/corpora/2005-1/0191.html
    >
    >Hope this helps.
    >Andrew Kehoe and Antoinette Renouf
    >
    >-----------------------------------------
    >Research and Development Unit for English Studies
    >School of English
    >University of Central England, Birmingham
    >http://rdues.uce.ac.uk/
    >http://www.webcorp.org.uk/
    >
    >-----Original Message-----
    >From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    >Behalf Of j_kurjian@hotmail.com
    >Sent: 23 April 2005 17:02
    >To: corpora@uib.no
    >Subject: [Corpora-List] WebCorp counts
    >
    >Hi all,
    >I have a question about the concordance counts produced by the WebCorp
    >site:
    >
    >http://www.webcorp.org.uk/wcadvanced.html
    >
    >For example, if I search "suggest you don't" vs. "suggest that you
    >don't" using WebCorp (via Google) I get, at the bottom of the page, a
    >concordance count of 187 vs. 96 kwics respectively. However, if I search
    >the same two terms, in quotes, on Google, I get 34,200 vs. 16,200 hits.
    >The ratios are similar though not the same.
    >
    >Does anyone have insight into how WebCorp calculates/filters its
    >concordances or why these two engines are so different in the number of
    >hits they return?
    >
    >In fact, it is nice to have the more manageable number produced by
    >WebCorp, and the external collocate counts it creates. But, for example,
    >if I am interested in the frequency of "I" collocating with the two
    >search terms based on WebCorp, I'd like to be clearer how those two
    >counts are derived.
    >
    >Jerry
    >
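
For reference, the two ratios mentioned in the original question can be worked out directly from the four figures quoted there. This is only arithmetic on those numbers; the variable and function names are made up for the example.

```python
# Figures as quoted in the original question (23 April 2005).
webcorp_kwics = {"suggest you don't": 187, "suggest that you don't": 96}
google_hits = {"suggest you don't": 34200, "suggest that you don't": 16200}


def ratio(counts):
    """Ratio of the shorter phrase's count to the longer phrase's count."""
    return counts["suggest you don't"] / counts["suggest that you don't"]


webcorp_ratio = ratio(webcorp_kwics)
google_ratio = ratio(google_hits)

print(f"WebCorp kwic ratio: {webcorp_ratio:.2f}")  # 1.95
print(f"Google hit ratio:   {google_ratio:.2f}")   # 2.11
```

The two ratios are close but measure different things: WebCorp's figure counts concordance lines drawn from a capped set of pages, while Google's counts pages containing the phrase, so exact agreement should not be expected.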




    This archive was generated by hypermail 2b29 : Thu Apr 28 2005 - 23:08:24 MET DST