two queries

Marco Antonio Da Rocha (marco@cogs.susx.ac.uk)
Fri, 26 Jul 1996 10:51:10 +0100 (BST)

Dear all,

I apologise for any duplicate messages you may receive; I am posting this to
two lists.

I have two problems at present which I must eventually solve to carry on with
the research I am doing. I have created an annotation scheme to analyse
anaphora in spoken language corpora. I am working with the London-Lund Corpus
and a corpus of Brazilian Portuguese dialogues. Each case of anaphora is
analysed according to four properties, namely:

type of anaphor - categories used combine traditional grammar (such as
personal pronoun) with some additions (such as nonpronominal
NPs and VP ellipsis); I include all forms of anaphora, as
the frequency with which they occur is an important element in
the research

type of antecedent - basically the explicit/implicit dichotomy with the
addition of nonreferential (meaning there is no
antecedent to speak of) and discourse implicit (for
constructed antecedents)

topical status - classifies the antecedent according to a hierarchy of
elements in a topical structure; these elements include
discourse topic (global topic for the dialogue), segment
topic, subsegment topic and thematic elements (elements
related to the current topic)

processing strategy - an attempt to introduce a psycholinguistic slant into the
classification; the purpose is to improve the quality of
information given in type of anaphor, as anaphors of the
same type, and even identical word forms, may be processed
in different ways (consider `it' and `that')

I have now gathered approximately 3000 cases for English and will do the same
for Portuguese. My plan is to use statistical procedures to analyse the data
as classified by the four properties and attempt to establish associations,
interactions and whatever else may be useful for the analysis of anaphoric
relations. However, the cross-tabulation unavoidably leads to low-frequency
cells and several null cells. For instance, implicit antecedents virtually
never occur for possessives. I doubt this would be much improved by adding
more data, and of course there is a limit to the number of cases, as this is
just a PhD thesis. Sparse and empty cells cause the usual problems for most,
if not all, statistical procedures appropriate for categorical data.
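
As a rough illustration of the sparsity problem described above, the Python
sketch below cross-tabulates two of the four properties and lists the null
cells and the cells with a low expected count under independence. The counts
are invented for illustration and are not taken from the corpus; the function
name sparse_cells and the threshold of 5 (a common rule of thumb for
chi-square tests) are assumptions, not part of the annotation scheme.

```python
from collections import Counter

# Illustrative counts only; these are NOT figures from the corpus.
observations = (
    [("personal pronoun", "explicit")] * 40
    + [("personal pronoun", "implicit")] * 10
    + [("possessive", "explicit")] * 12
    # no ("possessive", "implicit") cases: this cell is null
)


def sparse_cells(pairs, threshold=5.0):
    """Return (null_cells, low_cells) for a two-way table of (row, col) pairs.

    null_cells: cells with an observed count of zero.
    low_cells:  cells whose expected count under independence
                (row total * column total / n) is below `threshold`.
    """
    counts = Counter(pairs)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    n = sum(counts.values())
    row_tot = {r: sum(counts[(r, c)] for c in cols) for r in rows}
    col_tot = {c: sum(counts[(r, c)] for r in rows) for c in cols}
    cells = [(r, c) for r in rows for c in cols]
    null_cells = [cell for cell in cells if counts[cell] == 0]
    low_cells = [(r, c) for r, c in cells
                 if row_tot[r] * col_tot[c] / n < threshold]
    return null_cells, low_cells


null_cells, low_cells = sparse_cells(observations)
print("null cells:", null_cells)
print("expected count below 5:", low_cells)
```

With four properties crossed at once, the number of cells grows
multiplicatively, so a check of this kind would flag many more cells than in
this two-way toy example.
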

The solution I thought of was to exclude categories whenever they were
irrelevant for the analysis of a certain type of anaphor and reintroduce
them whenever they were relevant, that is, whenever the cells produced in
cross-tabulation were not null. The implicit antecedent category would
therefore be left out when possessives are being analysed. I might even
conclude that antecedents are invariably explicit for possessives, except
for a few unusual cases in collocations. I am afraid that this withdrawing
and reintroducing of categories in variables may be inappropriate as a
methodology for the use of statistical techniques. I am also unsure about
the validity of comparing results obtained with this sort of manipulation.
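
The mechanics of withdrawing a category can be sketched as follows, purely as
an illustration and not as an answer to whether the procedure is valid. The
sketch takes a table for possessives only, drops any antecedent category that
never occurs, and computes a Pearson chi-square statistic on the reduced
table. All counts are invented, and the helper names drop_empty and
pearson_chi2 are hypothetical.

```python
def drop_empty(rows, cols, counts):
    """Keep only the rows and columns whose marginal total is non-zero."""
    keep_rows = [r for r in rows
                 if sum(counts.get((r, c), 0) for c in cols) > 0]
    keep_cols = [c for c in cols
                 if sum(counts.get((r, c), 0) for r in keep_rows) > 0]
    return keep_rows, keep_cols


def pearson_chi2(rows, cols, counts):
    """Pearson chi-square statistic and degrees of freedom for a two-way table.

    Assumes every row and column total is non-zero; a zero margin would make
    an expected count zero and the statistic undefined.
    """
    row_tot = {r: sum(counts.get((r, c), 0) for c in cols) for r in rows}
    col_tot = {c: sum(counts.get((r, c), 0) for r in rows) for c in cols}
    n = sum(row_tot.values())
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = row_tot[r] * col_tot[c] / n
            stat += (counts.get((r, c), 0) - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(cols) - 1)
    return stat, dof


# Invented counts for possessives only: antecedent type x topical status.
possessive_counts = {
    ("explicit", "segment topic"): 20,
    ("explicit", "thematic element"): 10,
    ("discourse implicit", "segment topic"): 5,
    ("discourse implicit", "thematic element"): 15,
}
antecedent_types = ["explicit", "implicit", "discourse implicit"]
topical_statuses = ["segment topic", "thematic element"]

# "implicit" never occurs for possessives here, so it is withdrawn.
kept_rows, kept_cols = drop_empty(antecedent_types, topical_statuses,
                                  possessive_counts)
stat, dof = pearson_chi2(kept_rows, kept_cols, possessive_counts)
print(kept_rows, round(stat, 3), dof)
```

Note that the degrees of freedom depend on which categories were kept, so
statistics from tables with different category sets are not directly
comparable, which is part of the concern raised above.
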

The second query concerns the third property, related to topicality. I began
analysing the data by instinctively deciding what the topic was. I soon
realised my decision could be revised the next day and then again the day
after. I eventually developed some personal idiosyncrasies for making such
decisions, but other people are quite likely to come to different conclusions.
This may render the annotation scheme and the results obtained useless for
anyone other than me. I decided I should choose an existing method of
automatic topic tracking, although we know such methods may often be
unsuccessful. I came across Hoey (1991) as a relatively straightforward way of
dealing with the problem. More sophisticated approaches - such as centering
theory - are extremely difficult to adapt to real-life dialogues. I have the
feeling I would end up with the same inaccuracy problem, with different
choices by different people. I should like to know about other ways of
handling topic tracking, particularly those which can be "easily transported",
as far as this is possible at all.

I would like to "hear" suggestions from you on both queries or on either one
of them. I tried to keep this message as short as possible, so many of you
may be unable to understand what I have in mind. I am ready to clarify as
needed. Suggestions may be addressed directly to me, and I can post them to
the list afterwards if there is interest. This might be particularly
worthwhile for the second query. Thank you in advance.

Marco A E Rocha
University of Sussex
School of Cognitive and Computing Sciences
Falmer, Brighton
BN1 9QH - U.K.
tel.: +44 +01273 678052
fax: +44 +01273 671320
e-mail: marco@cogs.susx.ac.uk