Re: [Corpora-List] ACL proceedings paper in the American National Corpus

From: Nancy Ide (ide@cs.vassar.edu)
Date: Tue Oct 01 2002 - 20:27:10 MET DST


    On Tuesday, October 1, 2002, at 04:08 AM, Michal Sulc wrote:

    > I have read some remarks to the first question given by Ms Ide. But
    > nowhere there was any distinction between "points of view" that I
    > consider important here. Distinction between the corpus of "production"
    > (where we are interested who wrote the text in the question - whether he
    > or she is "really" American) and "reception" (where we are interested in
    > texts that are read by Americans and has an influence on them).
    > What do ANC-builders prefer?

    I think that we "ANC-builders" are working to satisfy the "ANC-users"
    ;-), but this is my own take on the issue:

    The idea is to have a corpus that includes data from which one can
    gather information about how American English is commonly used, perhaps
    particularly in mainstream publications. Likely, you are trying to
    produce some publication that will provide guidance on word use,
    spelling, syntactic constructions, etc. of the kind that would best help
    a reader sound like a native speaker and fully understand texts written
    by and for American English speakers. Or, in the case of a
    computational linguist, you want to be able to recognize or generate
    lexical items or syntactic constructions that are common in, or typical
    of, American English--especially those which differ from, say, British
    English. Beyond this, you get into things that are correct, by American
    "rules" of grammar and usage, and perfectly understandable, but "just
    not the way we would phrase it". This is usually the way in which even
    the most proficient non-native speaker will eventually betray himself or
    herself, so it is certainly of interest for ESL.

    So I would say that "production" is what we should be interested in for
    the ANC. While Americans may be exposed to lots of material that shows
    marks of being non-native American (we are certainly exposed to a lot
    of British English texts), the interest, at least for those who want to
    describe, recognize/understand, or generate American English would only
    arise after the influence, if there is any, becomes evident by cropping
    up significantly in texts produced by native speakers of American
    English.

    Footnote to the above: the plan for the ANC (dependent, of course, on
    funding) is to add at least 10 million words every five years,
    consisting of data produced during those five years. This would yield a
    sort of "archaeological store" of American English in temporal layers
    and enable consideration of the "reception" influence you mention
    (albeit after the fact).

    =======================================================

    Nancy Ide

    Professor and Chair
    Department of Computer Science, Vassar College
    Poughkeepsie, NY 12604-0520 USA
    Tel: +1 845 437-5988 Fax: +1 845 437-7498
    ide@cs.vassar.edu

    Chercheur Associe
    Equipe Langue et Dialogue, LORIA/CNRS
    Campus Scientifique - BP 239
    54506 Vandoeuvre-les-Nancy FRANCE
    Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
    ide@loria.fr

    =======================================================
