Re: [Corpora-List] Web/Corpora Questions

From: William Fletcher
Date: Mon Oct 20 2003 - 20:04:18 MET DST

    Peet, you'll find several of these questions addressed (not necessarily answered satisfactorily) in papers on my website.
    Some of the papers I cite in the references will be useful as well.
    (see esp. i.a. Cavaglià and Kilgarriff, and Ide, Reppen and Suderman)

    I haven't seen any recent estimates of the total number of pages on the Web or of the distribution of text types and languages -- follow up the stale bibliography in .
    (I intend to search more assiduously for recent estimates for an update of that paper later this year, and have concrete plans to proceed with the linguistic search engine / web archive outlined in the TaLC paper during a sabbatical in 2004-05.)

    Personally, I believe that for the major languages the Web is most useful for compiling ad-hoc corpora of texts dealing with specific domains or emerging usage, or else for answering specific questions such as the ongoing discussion about "personal price", where even large reference corpora such as the BNC have too few citations to give the whole picture. De Schryver makes a useful distinction between "Web as corpus" and "Web for (compiling a) corpus", in his case as a source of data for African languages with little if any machine-readable data.

    ( De Schryver, Gilles-Maurice, 2002. Web for / as Corpus: a Perspective for the African Languages. Nordic Journal of African Studies 11(2): 266-282. )

    I'm looking forward to other responses to this posting!

    Best regards,
    Bill Fletcher

    >>> "peetm" <> 10/20/03 10:37 AM >>>
    I, like a lot of people, am interested in the idea of using the web as a
    data source for corpus construction.


    Saying that, I have some basic questions that I'd really appreciate hearing
    views on.


    1. What do (various groups of) users of corpora actually want, need or
    wish for from a corpus: and, would 'web-text' meet these requirements?
    2. What are users' selection criteria in choosing a corpus?
    3. Does anyone know: what kinds of texts are available on the web, of
    what quality, and in what quantities (is there any data on this)?
    4. How would one estimate the necessary size of a corpus (to be useful
    for some purpose) built from web-texts using sampling theory etc?


    If anyone knows of any papers on any/all of this - please do tell!


    I'd also be interested in opinions on the statement (in answer to '3'), 'who
    can tell?', i.e. it's nonsensical to even ask '3', because, as the web is
    constantly changing, what can really be said about quantity, quality and the
    text-types available etc?? Does this also invalidate the second part of '1'
    - if one cannot tell what one might find, how could one judge ahead of time
    whether or not it'd meet 'any' requirement?


    Lastly, I think that the web contains some text-types that are unique to it,
    e.g., chat-room and blog texts. However, I'm on a sticky wicket as I have
    no proof that such text-types actually differ from texts found in
    conventional corpora. Does anyone know if there has been any examination of
    this type of prose at all? OR, if there hasn't, can someone suggest how
    such an examination could be achieved?


    Many thanks,



    addr: Computational Linguistics Group

          University of Oxford

          The Clarendon Institute

          Walton Street


          OX1 2HG



