[Corpora-List] WebCorp counts

From: Antoinette Renouf (Antoinette.Renouf@uce.ac.uk)
Date: Wed Apr 27 2005 - 13:21:38 MET DST

    Dear Jerry Kurjian
    Apologies for the difficulties you are having with WebCorp-generated counts, but they are only temporary, we promise. A new version of WebCorp, to be released soon, will incorporate our own purpose-built search engine, and thus be able to offer accurate frequency counts, type/token ratios, collocational profiles and other statistics.
     
    To explain the problem you have had:
    At the moment, WebCorp takes the first 200 hits for your search term from your chosen search engine (Google by default) and extracts concordances from those pages. Unless you choose the 'one concordance line per site' option, there is no limit on the number of concordance lines extracted from each of these 200 pages.

    However, you will sometimes get fewer than 200 concordance lines in the WebCorp output for your search term. This happens if you have chosen additional filtering options (which will filter out some of the 200 hits from Google), or if certain pages are not accessible when WebCorp tries to access them, or have changed since they were indexed by Google and no longer contain your search term.
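
The behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not WebCorp's actual code: the names `kwic_lines`, `concordance` and `MAX_HITS` are hypothetical, pages are passed in as already-fetched text (with `None` standing for an inaccessible page), and each URL is treated as its own "site".

```python
import re
from typing import Dict, List, Optional

MAX_HITS = 200  # per the message: only the first 200 search-engine hits are used


def kwic_lines(text: str, term: str, width: int = 30) -> List[str]:
    """Return one keyword-in-context line per occurrence of `term` in `text`."""
    lines = []
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left}[{m.group(0)}]{right}")
    return lines


def concordance(hits: List[str], pages: Dict[str, Optional[str]], term: str,
                one_per_site: bool = False) -> List[str]:
    """hits: ranked URLs; pages: URL -> fetched text, or None if inaccessible.

    Assumption: each URL counts as one "site" for the one-per-site option.
    """
    output = []
    for url in hits[:MAX_HITS]:
        text = pages.get(url)
        if text is None:
            continue  # page could not be fetched, so it contributes nothing
        lines = kwic_lines(text, term)
        # A page that changed since it was indexed may yield no lines at all.
        if one_per_site:
            lines = lines[:1]
        output.extend(lines)
    return output
```

Because the number of lines per page is unlimited by default, the total can exceed the number of pages; because inaccessible or changed pages contribute nothing, it can also fall below 200.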

    Statistics extracted from the Web are inherently unreliable. AltaVista no longer returns word counts, and the number of 'hits' returned by Google is the number of pages containing your search term, not the number of occurrences of your search term on the Web. Problems with Google counts were discussed recently on this list: http://torvald.aksis.uib.no/corpora/2005-1/0191.html
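
To make the page-count versus occurrence-count distinction concrete, here is a small sketch on made-up data. The function names `page_hits` and `occurrences` and the sample texts are hypothetical; this does not reproduce how any search engine computes its figures.

```python
import re
from typing import List


def page_hits(pages: List[str], term: str) -> int:
    """Number of pages that contain the term at least once."""
    return sum(1 for text in pages if term.lower() in text.lower())


def occurrences(pages: List[str], term: str) -> int:
    """Total number of times the term occurs across all pages."""
    pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in pages)


# Invented sample pages, for illustration only.
sample = [
    "I suggest you don't wait. Really, I suggest you don't.",
    "We suggest you don't reply.",
    "An unrelated page.",
]
print(page_hits(sample, "suggest you don't"))    # pages containing the term
print(occurrences(sample, "suggest you don't"))  # total occurrences
```

A page with several occurrences adds one to the first figure but several to the second, which is why a 'hits' figure cannot be read as a token frequency.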

    Hope this helps.
    Andrew Kehoe and Antoinette Renouf
     
    -----------------------------------------
    Research and Development Unit for English Studies
    School of English
    University of Central England, Birmingham
    http://rdues.uce.ac.uk/



    http://www.webcorp.org.uk/
    -----Original Message-----
    From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no] On
    Behalf Of j_kurjian@hotmail.com
    Sent: 23 April 2005 17:02
    To: corpora@uib.no
    Subject: [Corpora-List] WebCorp counts

    Hi all,
    I have a question about the concordance counts produced by the WebCorp
    site:

    http://www.webcorp.org.uk/wcadvanced.html

    For example, if I search "suggest you don't" vs. "suggest that you
    don't" using WebCorp (via Google) I get, at the bottom of the page, a
    concordance count of 187 vs. 96 kwics respectively. However, if I search
    the same two terms, in quotes, on Google, I get 34,200 vs. 16,200 hits.
    The ratios are similar though not the same.

    Does anyone have insight into how WebCorp calculates/filters its
    concordances or why these two engines are so different in the number of
    hits they return?

    In fact, it is nice to have the more manageable number produced by
    WebCorp, and the external collocate counts it creates. But, for example,
    if I am interested in the frequency of "I" collocating with the two
    search terms based on WebCorp, I'd like to be clearer how those two
    counts are derived.

    Jerry


    This archive was generated by hypermail 2b29 : Wed Apr 27 2005 - 13:44:29 MET DST