RE: [Corpora-List] ACL proceedings paper in the American National Corpus

From: Martin Wynne (martin.wynne@ota.ahds.ac.uk)
Date: Mon Sep 30 2002 - 13:04:39 MET DST


    Nancy's posting set off some very different alarm bells for me. I would like
    to draw attention to what I think would be another problem with the
    inclusion of texts from ACL proceedings in the American National Corpus.

    Let me start with an interesting case which I came across some years ago.
    After hearing someone repeat the well-known fact that people don't say
    'powerful tea' in English, I thought it would be worth checking for
    empirical evidence for this. I searched for the phrase in the BNC, and got 3
    hits. All are from a text source listed as follows:

     "Large vocabulary semantic analysis for text recognition.
     Rose, Tony Gerard, u.p.. Sample containing about 42217 words of unpublished
    miscellanea (domain: applied science)"

    and they are discussions of exactly the same point, i.e. the fact that you
    don't say 'powerful tea'.

    (Incidentally, I also searched in the whole Bank of English and found no
    hits for "powerful tea", and 39 hits for "weak tea", so the original point
    is not disproven.)
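
    The check described above (search for a phrase, then see which source
    texts the hits come from) can be sketched roughly as follows. This is
    only an illustration under stated assumptions: the toy corpus, its source
    labels, and the helper name hits_by_source are all hypothetical, nothing
    here is taken from the BNC or the Bank of English, and plain substring
    matching stands in for a real concordancer.

```python
from collections import Counter

# Hypothetical toy corpus of (source_id, text) pairs. The sources and
# sentences are invented for illustration only.
TOY_CORPUS = [
    ("thesis_A", "Speakers prefer strong tea and never say powerful tea."),
    ("thesis_A", "The collocation powerful tea is a standard made-up example."),
    ("novel_B", "She poured him a cup of weak tea."),
    ("news_C", "He drank strong tea every morning."),
]


def hits_by_source(corpus, phrase):
    """Count case-insensitive substring occurrences of phrase per source.

    Assumption: naive substring matching, so 'powerful tea' would also
    match inside 'powerful team'. A real query would respect word boundaries.
    """
    counts = Counter()
    needle = phrase.lower()
    for source_id, text in corpus:
        n = text.lower().count(needle)
        if n:
            counts[source_id] += n
    return counts


if __name__ == "__main__":
    for phrase in ("powerful tea", "weak tea"):
        print(phrase, dict(hits_by_source(TOY_CORPUS, phrase)))
```

    Grouping hits by source is the step that matters: when every hit for a
    phrase comes from a single text that is discussing the phrase, the raw
    frequency says little about actual usage.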

    In ACL articles you will also get citations of made-up examples like this,
    plus listings of 'ungrammatical' sentences. Basically, this problem seems to
    boil down to the fact that you get a lot of 'mention' rather than 'use' of
    words and phrases in academic linguistic literature, and this could have a
    fairly significant effect on the results of linguistic analysis of the
    corpus. If one of the main reasons for building the corpus is to enable
    researchers to analyse naturally occurring American English, in order to see
    what does occur and what doesn't, then letting in lots of made-up example
    sentences and phrases would make it less fit for the proposed purpose.

    One way of avoiding this, and many other potential problems found in
    specialised language, would be to adopt an inclusion criterion for the
    corpus: texts should not be too technical in nature.

    __
    Martin Wynne
    martin.wynne@ota.ahds.ac.uk
    Linguistics Officer
    Oxford Text Archive

    Oxford University Computing Services
    13 Banbury Road
    Oxford
    UK - OX2 6NN
    Tel: +44 1865 283299
    Fax: +44 1865 273275


