Re: Corpora: Lexicon development for MT

Ted E. Dunning (ted@aptex.com)
Thu, 10 Sep 1998 09:38:38 -0700

>>>>> "ak" == Adam Kilgarriff <Adam.Kilgarriff@itri.brighton.ac.uk> writes:

ak> I am aware of the widespread use of templates, and the use (eg
ak> at New Mexico) of inheritance, and of sophisticated techniques
ak> based on parallel corpora for extracting translation
ak> equivalents for terminology, but these are only accidentally
ak> likely to help with this particular problem.

your sentence is somewhat ambiguous here. it seems that you are saying
that templates, inheritance and the use of parallel corpora are, singly
and together, only accidentally likely to help with word form selection.

if that is what you meant, then your opinion of the field is odd and
idiosyncratic.

for instance, the use of inheritance and other similar techniques at
New Mexico State is intended precisely to minimize the number of rules
which must be crafted to build real MT systems.

similarly, the use of parallel corpora at NMSU is intended primarily
to assist in semi-automated translation and in cross-lingual text
retrieval. for cross-lingual text retrieval, parallel corpora are
the *primary* mechanism in the NMSU system for word translation
disambiguation.

more on this later.

ak> Much Word Sense Disambiguation work is in principle relevant,
ak> but, with the honourable exception of Dagan and Itai (CL 20
ak> (4), 1994) it is not clear whether any of it can be tailored
ak> to the specific needs of an MT system (and I do not believe
ak> any of it has been).

excuse me? i must not understand what you mean by word sense
disambiguation or perhaps what you mean by MT system.

what about the work of Mercer, Brown and the others at IBM who crafted
an entire MT system around the concept that parallel corpora could
provide both lexicon and disambiguation? if you look at their work,
their methods are fundamentally designed to resolve ambiguity in
translation via the use of parallel corpora.

the methods pioneered by the IBM group have been extended greatly by
many others. their basic methods were used by a number of other
researchers, including Gale, Church, Yarowsky, Bruce, Stevenson and
Wilks. not all of these researchers used the same definition of
word sense (some used dictionary senses rather than alternative
translations), but essentially all of them used context of usage to
statistically resolve ambiguity. all of these systems could easily
have been integrated back into the IBM Candide system if desired.
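
to make "context of usage" concrete, here is a minimal sketch of the
general idea, not the IBM method or any of the systems named above. it
assumes a plain naive bayes classifier with add-one smoothing, and the
training pairs are made-up toy data standing in for what word-aligned
parallel text would supply: context words seen near english "bank",
labelled with the french word it was aligned to. the names EXAMPLES,
train and choose_translation are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Made-up toy data (an assumption, not from any real corpus):
# (context words near English "bank", aligned French translation).
EXAMPLES = [
    (["river", "water", "shore"], "rive"),
    (["muddy", "river", "fishing"], "rive"),
    (["money", "loan", "account"], "banque"),
    (["account", "interest", "money"], "banque"),
    (["loan", "manager", "interest"], "banque"),
]


def train(examples):
    """Count how often each translation occurs and which context words accompany it."""
    translation_counts = Counter()
    context_counts = defaultdict(Counter)
    vocabulary = set()
    for context, translation in examples:
        translation_counts[translation] += 1
        for word in context:
            context_counts[translation][word] += 1
            vocabulary.add(word)
    return translation_counts, context_counts, vocabulary


def choose_translation(context, model):
    """Return the translation with the highest naive Bayes log score for this context."""
    translation_counts, context_counts, vocabulary = model
    total = sum(translation_counts.values())
    best, best_score = None, float("-inf")
    for translation, count in translation_counts.items():
        # log prior of the translation
        score = math.log(count / total)
        # add-one smoothing over the context vocabulary
        denominator = sum(context_counts[translation].values()) + len(vocabulary)
        for word in context:
            if word not in vocabulary:
                continue  # ignore context words never seen in training
            score += math.log((context_counts[translation][word] + 1) / denominator)
        if score > best_score:
            best, best_score = translation, score
    return best


model = train(EXAMPLES)
print(choose_translation(["river", "fishing"], model))
print(choose_translation(["loan", "money"], model))
```

the point of the sketch is only that the label being predicted can be
an alternative translation just as easily as a dictionary sense; the
statistics over context are the same in either case.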

how is it that you claim that little of this work can be tailored to
the specific needs of an MT system?