Re: [Corpora-List] Syntactic zeros in a corpus: possible solutions

From: Ken Litkowski (ken@clres.com)
Date: Mon Aug 28 2006 - 18:45:27 MET DST

    While I would agree that syntactic zeros are not theory-neutral and that
    it is necessary to itemize them in some way, I don't think the tag-based
    approach is so dire as suggested. The parser that I use, developed by
    Ned Irons (a co-inventor of syntax-directed compiling), envisions their
    recognition in quite a regular fashion. The parser is an augmented
    transition network, in which a key part of a transition to a next parse
    state is "additional processing". The additional processing takes two
    primary forms: (1) tests that particular conditions are met (e.g.,
    subject-verb agreement) and (2) annotations to be attached to
    (potential) nodes of the parse tree. Among the many annotations, there
    is one labeled "filler". Attached to this label are sublabels, giving
    specifications of whether the filler should be optional, an object, or
    an adjective (say a question filler). In the tests that are performed,
    checks are made on whether the fillers are indeed filled elsewhere in
    the sentence (and perhaps they're not, but rather constitute an
    elision). It seems to me that this approach can be primarily
    data-driven. If the parser doesn't grok a sentence, there is a good
    chance that syntactic zeros are present and the grammar needs to be modified
    accordingly. (In parsing hundreds of thousands of sentences, where I
    generally only have time to get an "impression" of what's going wrong,
    these cases seem quite prevalent. Unfortunately, I can't give a more
    precise estimate.)
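
    To make the idea concrete, here is a minimal sketch in Python. It is
    not the Irons parser and not an ATN; the names Filler, Constituent,
    unfilled_slots and TRANSITIVE are all invented for illustration. It
    only shows the kind of test described above: given the "filler"
    annotations a clause carries, report the required ones that nothing
    in the clause fills, as candidates for a syntactic zero.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Filler:
    """Hypothetical 'filler' annotation: a slot that a clause expects to be filled."""
    role: str               # e.g. "subject", "verb", "object"
    optional: bool = False  # sublabel: may the slot legitimately stay empty?


@dataclass
class Constituent:
    """A parsed piece of the clause and the slot it fills."""
    role: str
    text: str


def unfilled_slots(fillers: list[Filler], constituents: list[Constituent]) -> list[str]:
    """Return required roles that no constituent fills (candidate syntactic zeros)."""
    supplied = {c.role for c in constituents}
    return [f.role for f in fillers if not f.optional and f.role not in supplied]


# Slots expected by a simple transitive clause (an assumption for this example).
TRANSITIVE = [Filler("subject"), Filler("verb"), Filler("object", optional=True)]

# Second conjunct of "Mary bought a book and John a magazine": no overt verb.
gapped = [Constituent("subject", "John"), Constituent("object", "a magazine")]
print(unfilled_slots(TRANSITIVE, gapped))  # ['verb']
```

    A real grammar would attach such annotations per transition and
    would also look for the filler elsewhere in the sentence before
    concluding that something has been elided.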

            Ken

    Mikhail Kopotev wrote:

    > Dear List-members.
    >
    > Thanks to all who answered me.
    >
    > Summarizing the answers, I will provide some possible solutions.
    >
    > Syntactic zeros are, without doubt, a question of the theory we use to
    > annotate material. Opinions range from a “complete” list of syntactic
    > zeros to the denial of the phenomenon altogether. Since our corpus
    > (like many other corpora) is used by many kinds of users, such as
    > teachers, interpreters, and students, who may not be familiar with
    > modern syntactic theories, we should consider a more “traditional”
    > annotation scheme. In other words, in syntactic annotation we should
    > follow the principle formulated by G. Leech: “consensual,
    > theory-neutral analysis of the data”. In the case of Russian the
    > matter seems to be even more complicated than in English, because
    > there are at least three predominant theories circulating in Russian
    > linguistics. All three postulate syntactic zeros, and all three have
    > different lists of them.
    > Thus, since the theoretical question has no agreed answer, I think it
    > would be better to stop discussing it so as not to start a flame war
    > here. Let’s assume that “the Holy Grail” does exist, at least within
    > certain theoretical frameworks. So, how do we locate it?
    >
    > Two approaches seem to be relevant in this respect.
    > 1. A tag-based approach postulates a list of zeros or (pre)formulated
    > rules, according to which an NLP system can (automatically or manually)
    > recognize a zero element and insert a special “zero” tag into the text.
    > This is, in fact, a commonly used way to work with zeros. Its
    > advantages are a systematic way of annotating that can be presented in
    > a user-friendly form, and the possibility (?) of tuning a system to
    > recognize clauses that contain zeros.
    > Its weakness is that a user must be familiar (and, in all probability,
    > must agree) with the theory the annotation scheme is based on. Since
    > a theory-neutral annotation scheme does not exist, such a corpus will
    > be more a battlefield than a place to search and collect material.
    >
    > 2. A search-based approach is based on a query language that lets
    > users search for clauses NOT containing certain elements (such as
    > {SELECT “all clauses” FROM “the text” WHERE “verb” <> “y”} for verb
    > ellipsis). The more accurate and clear the annotation, the more usable
    > this approach is. Its advantage is a theory-independent search (to be
    > more precise, a user can search according to his/her own theoretical
    > background). The main disadvantage is that a query will return (a lot
    > of) irrelevant examples. Another weakness is that in a rather big corpus
    > such a query takes a long time to respond, but that is a technical, not
    > a linguistic, problem.
    > Of course, it is possible to create a corpus that integrates both
    > approaches.
    >
    > Any comments will be warmly appreciated.
    >
    > Mikhail Kopotev
    > Researcher
    > Department of Slavonic
    > and Baltic Languages and Literatures
    > University of Helsinki
    >
    >
    >
    >
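    The search-based approach in point 2 of the quoted message can be
    sketched in a few lines of Python. This is only an illustration
    under assumptions of my own: clauses are already segmented and
    stored as lists of (word, tag) pairs, the tag names are invented
    for the example, and clauses_without is a hypothetical helper, not
    part of any existing corpus query language.

```python
def clauses_without(clauses, tag):
    """Return the clauses in which no token carries the given tag."""
    return [clause for clause in clauses if all(t != tag for _, t in clause)]


# A toy corpus: each clause is a list of (word, tag) pairs (assumed tag set).
corpus = [
    [("Mary", "NOUN"), ("bought", "VERB"), ("a", "DET"), ("book", "NOUN")],
    [("John", "NOUN"), ("a", "DET"), ("magazine", "NOUN")],  # verb ellipsis
    [("Chapter", "NOUN"), ("one", "NUM")],                   # heading, no ellipsis
]

for clause in clauses_without(corpus, "VERB"):
    print(" ".join(word for word, _ in clause))
```

    The query returns both the elliptical clause and the heading, which
    is exactly the disadvantage noted in the quoted message: the user
    still has to sort relevant hits from irrelevant ones.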

    -- 
    Ken Litkowski                     TEL.: 301-482-0237
    CL Research                       EMAIL: ken@clres.com
    9208 Gue Road
    Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com
    


