Re: Corpora: Question about a Brown Corpus tag

From: E S Atwell (eric@comp.leeds.ac.uk)
Date: Thu Sep 14 2000 - 12:37:44 MET DST


    Dirk,
    I can see fairly simple "linguistic common sense criteria" to explain the
    distinction you query:

    - a preposition introduces a noun phrase
    - a subordinating conjunction introduces a subordinate clause
    - SOME words can belong to more than one class, eg "until" can introduce
    both
    - but since there are many other words which only introduce NPs (eg
    "with") or only introduce clauses (eg "unless") we need separate classes
    for these two cases

    - coordinating conjunctions "and", "or" (and arguably "but") can connect 2
    words/phrases of virtually any class (and even 2 different classes, eg
    "until tomorrow and the morning comes"), so you might suggest there should
    be separate PoS for NP_coord and Phrase_coord
    - but there aren't lots of words (at least in English) which are ONLY
    NP_coord or ONLY Phrase_coord, so there's no point in creating separate
    classes and saying "and", "or", "but" are all ambiguous between the two.
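
    The criterion in the two lists above can be sketched roughly in code. This
    is only an illustration under stated assumptions: the toy lexicons, the
    context labels and the function names are hypothetical, and "many words"
    is simplified to a minimum count (min_exclusive) that you would tune.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicons (not taken from any real tagset): each word maps
# to the kinds of unit it can introduce or connect.
INTRODUCERS: Dict[str, Set[str]] = {
    "until": {"NP", "clause"},   # can introduce both
    "with": {"NP"},              # only introduces noun phrases
    "unless": {"clause"},        # only introduces clauses
}
COORDINATORS: Dict[str, Set[str]] = {
    "and": {"NP", "phrase"},
    "or": {"NP", "phrase"},
    "but": {"NP", "phrase"},
}


def exclusive_words(lexicon: Dict[str, Set[str]], context: str) -> List[str]:
    """Return the words that occur ONLY in the given context."""
    return sorted(w for w, ctxs in lexicon.items() if ctxs == {context})


def worth_splitting(lexicon: Dict[str, Set[str]], a: str, b: str,
                    min_exclusive: int = 1) -> bool:
    """Separate classes pay off only if each context has its own exclusive words.

    min_exclusive is an assumed threshold standing in for "many words".
    """
    return (len(exclusive_words(lexicon, a)) >= min_exclusive
            and len(exclusive_words(lexicon, b)) >= min_exclusive)


if __name__ == "__main__":
    print(worth_splitting(INTRODUCERS, "NP", "clause"))    # True
    print(worth_splitting(COORDINATORS, "NP", "phrase"))   # False
```

    With these toy entries, the preposition/subordinating-conjunction split is
    justified by "with" and "unless", while no coordinator is exclusive to one
    context, so a single class is enough there.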

    I'm not a German linguist, but my guess about "entlang" is that if you
    accept the more general definition "a preposition introduces a noun
    phrase" then it's covered; it just happens that PoS-tagging seems to have
    got off the ground first for English, and consequently PoS-tagsets for
    other languages have adapted English PoS categories and nomenclature.

    I'm not really a theoretical linguist at all - I hope there's a
    theoretical linguist out there who can give a better explanation than
    mine!
          Eric

    Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Tutor
    School of Computing, University of Leeds, LEEDS LS2 9JT
    TEL: (44)113-2335430 FAX: (44)113-2335468
    WWW: http://www.comp.leeds.ac.uk/eric EMAIL: eric@comp.leeds.ac.uk

    cf:

    >
    > So what could be the linguistic reasons that Eric was mentioning? For me
    > (with a rather limited linguistic background) the "traditional" criteria
    > for POS determination look quite arbitrary or let's say heuristic.
    >
    > I cannot, for instance, see any advantage of separating "until" in:
    > * until tomorrow (preposition)
    > * until the morning comes (subordinating conjunction)
    >
    > while not separating "and" in:
    > * you and me (coordinating conjunction)
    > * I go and see (coordinating conjunction)
    >
    > or "with" in:
    > * to see with a telescope (preposition)
    > * the man with the telescope (preposition).
    >
    > Or why should I call the German "entlang" (along) a PREposition,
    > even if it is behind the noun phrase:
    > * den Fluss entlang (along the river)
    >
    > --------------------------
    >
    > But, I am sure that there is theoretic linguistic work about POS
    > categorization without these kinds of inconsistencies. And I am almost
    > sure that people who tag corpora not only think about the accuracy of
    > their results, but also about the needs of future users or at least
    > about linguistic credibility.
    >
    > And therefore I don't understand why connective Parts of Speech (like
    > relative pronouns, conjunctions, conjunctive adverbs... ) are modelled
    > in such a neglectful way in all the corpora I have seen so far.
    >
    > Or are there maybe approaches I am not aware of?
    > Or is it maybe too difficult or even impossible to make it "good"?
    >
    > --------------------------
    >
    > Dirk Ludtke
    >
    > Language Media Lab
    > Kyoto University


