[Corpora-List] Syntactic zeros in a corpus: possible solutions

From: Mikhail Kopotev (mihail.kopotev@helsinki.fi)
Date: Mon Aug 28 2006 - 13:23:32 MET DST

    Dear list members,

    Thanks to all who answered me.

    Summarizing the answers, I will provide some possible solutions.

    Syntactic zeros are, without doubt, a matter of the theory we use to
    annotate the material. Opinions range from a “complete” list of
    syntactic zeros to outright denial of the phenomenon. Since our corpus
    (like many other corpora) is used by many users, such as teachers,
    interpreters, students etc., who might not be familiar with modern
    syntactic theories, we should consider a more “traditional”
    annotation scheme. In other words, when it comes to syntactic
    annotation, we should follow the principle formulated by G. Leech:
    “consensual, theory-neutral analysis of the data”. In the case of
    Russian, the matter seems to be even more complicated than in English,
    because there are at least three predominant theories circulating in
    Russian linguistics. All three postulate syntactic zeros, and all three
    have different lists of them.
    Thus, since the theoretical question has no common answer, I think it
    would be better to stop discussing it so as not to start a flame war
    here. Let’s assume that “the Holy Grail” does exist, at least within
    certain theoretical frameworks. So, how do we locate it?

    Two approaches seem to be relevant in this respect.
    1. A tag-based approach postulates a list of zeros or (pre)formulated
    rules, according to which an NLP system can (automatically or manually)
    recognize a zero element and insert a special “zero” tag into the text.
    This is, in fact, a commonly used way to work with zeros. Its advantages
    are a systematic way of annotating that can be presented in a
    user-friendly form, and the possibility (?) of tuning a system to
    recognize clauses that contain zeros.
    Its weakness is that a user has to be familiar (and, in all
    probability, has to agree) with the theory the annotation scheme is
    based on. Since a theory-neutral annotation scheme does not exist, such
    a corpus will be a battlefield rather than a place to search for and
    collect material.
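
    As a rough illustration of the tag-based idea, here is a minimal
    Python sketch. Everything in it is an assumption made for the example:
    the (word, POS) token format, the tag names, the “ZERO_COP” label and
    the deliberately naive rule (a verbless clause with two nominals gets a
    zero copula). It is not taken from any of the theories mentioned above.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, part-of-speech tag); format is assumed

# Hypothetical zero element and tag name, chosen only for this sketch.
ZERO_COPULA: Token = ("Ø", "ZERO_COP")
NOMINAL_TAGS = {"NOUN", "PRON", "ADJ"}


def tag_zero_copula(clause: List[Token]) -> List[Token]:
    """Return a copy of the clause with a zero-copula token inserted.

    Naive illustrative rule: if the clause has no verb and at least two
    nominal tokens, insert the zero token right after the first nominal.
    Otherwise the clause is returned unchanged.
    """
    has_verb = any(pos == "VERB" for _, pos in clause)
    nominal_positions = [
        i for i, (_, pos) in enumerate(clause) if pos in NOMINAL_TAGS
    ]
    if has_verb or len(nominal_positions) < 2:
        return list(clause)
    insert_at = nominal_positions[0] + 1
    return clause[:insert_at] + [ZERO_COPULA] + clause[insert_at:]


# "Он студент" ("He is a student"): no overt copula in the present tense.
clause = [("Он", "PRON"), ("студент", "NOUN")]
tagged = tag_zero_copula(clause)
```

    Note that the rule itself already encodes a theoretical decision
    (that there is a zero, and where it sits), which is exactly the
    weakness described above.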

    2. A search-based approach relies on a query language that allows
    users to search for clauses NOT containing certain elements (such as
    {SELECT “all clauses” FROM “the text” WHERE “verb” <> “y”} for verb
    ellipsis). The more accurate and clear the annotation is, the more
    usable this approach becomes. Its advantage is theory-independent
    search (to be more precise, users can search according to their own
    theoretical background). The main disadvantage is that a query will
    return (a lot of) irrelevant examples. Another weakness is that in a
    fairly large corpus such a query takes a long time to run, but this is
    a technical, not a linguistic, problem.
    Of course, it is possible to create a corpus that integrates both
    approaches.
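
    The search-based approach (2) can be sketched in a few lines of
    Python. This is only an illustration under stated assumptions: the
    corpus is a small in-memory list of clauses, each clause is a list of
    (word, POS) pairs, and the function name and POS tags are hypothetical
    rather than part of any existing corpus tool.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word form, part-of-speech tag); format is assumed
Clause = List[Token]


def clauses_without(clauses: Iterable[Clause], pos: str) -> List[Clause]:
    """Return the clauses in which no token carries the given POS tag."""
    return [c for c in clauses if all(p != pos for _, p in c)]


# Toy corpus of three already-segmented, POS-tagged clauses.
corpus = [
    [("Он", "PRON"), ("читает", "VERB"), ("книгу", "NOUN")],  # has a verb
    [("Я", "PRON"), ("домой", "ADV")],  # "I (am going) home": verb omitted
    [("Да", "PART")],                   # "Yes": verbless, but no ellipsis
]

# Theory-independent query: every clause that contains no verb.
hits = clauses_without(corpus, "VERB")
```

    Here the query returns both verbless clauses. The second hit (“Да”)
    is the kind of irrelevant example mentioned above: the user still has
    to sort the results according to his/her own theoretical background.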

    Any comments will be warmly appreciated.

    Mikhail Kopotev
    Researcher
    Department of Slavonic
    and Baltic Languages and Literatures
    University of Helsinki


