[Corpora-List] Web/Corpora Questions

From: peetm (peet.morris@comlab.ox.ac.uk)
Date: Mon Oct 20 2003 - 16:37:02 MET DST


    I, like a lot of people, am interested in the idea of using the web as a
    data source for corpus construction.

     

    That said, I have some basic questions that I'd really appreciate hearing
    views on.

     

    1. What do (various groups of) users of corpora actually want, need or
    wish for from a corpus, and would 'web-text' meet these requirements?
    2. What are users' selection criteria when choosing a corpus?
    3. Does anyone know what kinds of texts are available on the web, of
    what quality, and in what quantities (is there any data on this)?
    4. How would one estimate the necessary size of a corpus (to be useful
    for some purpose) built from web-texts, using sampling theory etc.?
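
    To make '4' a little more concrete, here is a rough sketch of the kind of
    estimate I have in mind. It is only an illustration, not an established
    method for corpus design: it assumes that tokens behave like independent
    random draws (which real text does not, since words cluster within
    documents), so the result is best read as an optimistic lower bound. The
    function name and its parameters are my own invention.

```python
import math


def tokens_needed(p, rel_error, z=1.96):
    """Approximate corpus size (in tokens) for estimating an item's frequency.

    p         -- assumed true relative frequency of the item (0 < p < 1)
    rel_error -- tolerated error as a fraction of p (0.1 means within 10%)
    z         -- normal quantile for the confidence level (1.96 is about 95%)

    Uses the standard sample-size formula for a proportion,
    n = z^2 * p * (1 - p) / margin^2, under a binomial assumption.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if rel_error <= 0:
        raise ValueError("rel_error must be positive")
    margin = rel_error * p  # absolute margin of error on the proportion
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


if __name__ == "__main__":
    # An item occurring about 10 times per million tokens, wanted within 10%.
    print(tokens_needed(1e-5, 0.1))
```

    Under these assumptions, an item occurring about ten times per million
    tokens would need a corpus of roughly 38 million tokens to pin its
    frequency down to within 10%; rarer items need proportionally more.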

     

    If anyone knows of any papers on any/all of this - please do tell!

     

    I'd also be interested in opinions on the statement (in answer to '3'), 'who
    can tell?', i.e. that it's nonsensical even to ask '3' because, as the web is
    constantly changing, what can really be said about the quantity, quality and
    text-types available? Does this also invalidate the second part of '1': if
    one cannot tell what one might find, how could one judge ahead of time
    whether or not it would meet 'any' requirement?

     

    Lastly, I think that the web contains some text-types that are unique to it,
    e.g., chat-room and blog texts. However, I'm on a sticky wicket as I have
    no proof that such text-types actually differ from texts found in
    conventional corpora. Does anyone know if there has been any examination of
    this type of prose at all? Or, if there hasn't, can someone suggest how
    such an examination could be achieved?
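
    One possible starting point, offered only as a sketch, would be to compare
    word frequencies in a sample of web text against a conventional reference
    corpus using a log-likelihood (G2) statistic, and then look at which words
    are most unevenly distributed. The helper names below are hypothetical,
    both token lists are assumed to be non-empty and already tokenised, and
    word frequency is of course only one of many ways in which text-types
    might differ.

```python
import math
from collections import Counter


def log_likelihood(count_a, total_a, count_b, total_b):
    """Two-cell log-likelihood (G2) for one word across two corpora."""
    combined = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def most_distinctive(tokens_a, tokens_b, top=10):
    """Return the `top` (score, word) pairs with the highest G2 scores."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scored = [
        (log_likelihood(freq_a[w], total_a, freq_b[w], total_b), w)
        for w in set(freq_a) | set(freq_b)
    ]
    return sorted(scored, reverse=True)[:top]


if __name__ == "__main__":
    chat = "lol the lol".split()        # toy stand-in for chat-room text
    reference = "the the of".split()    # toy stand-in for a reference corpus
    for score, word in most_distinctive(chat, reference):
        print(f"{word}\t{score:.2f}")
```

    A high score says only that a word's frequency differs between the two
    samples, not why; with real data one would also need comparable sampling
    and some allowance for the number of words being tested.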

     

    Many thanks,

    peetm

    email: peet.morris@clg.ox.ac.uk

    addr: Computational Linguistics Group

          University of Oxford

          The Clarendon Institute

          Walton Street

          Oxford

          OX1 2HG

    =======================================

    Important: This email is intended for the use of the individual addressee(s)
    named above and may contain information that is confidential, privileged or
    unsuitable for overly sensitive persons with low self-esteem, no sense of
    humour or irrational religious beliefs.

    If you are not the intended recipient, then social etiquette demands that
    you fully appropriate the message without trace of the former sender and
    triumphantly claim it as your own. Leaving a former sender's signature on a
    "forwarded" email is very bad form and, while being only a technical breach
    of the Olympic ideal, does in fact constitute an irritating social faux pas.

    Further, sending this email to a colleague does not appear to breach the
    provisions of the Copyright Amendment (Digital Agenda) Act 2000 of the
    Commonwealth, because chances are none of the thoughts contained in this
    email are in any sense original...

    Finally, if you have received this email in error, shred it immediately,
    then add it to some nutmeg, egg whites and caster sugar. Whisk until stiff
    peaks form, then place it in a warm oven for 40 minutes. Remove promptly and
    let it stand for 2 hours before adding the decorative kiwi fruit and cream.
    Then notify me immediately by return email and eat the original message.

     


