Design, Construction, and Evaluation of Systems with an MT Component
Wednesday, October 28, 1998 (preceding the AMTA 98 conference)
Sheraton Bucks County Hotel, Langhorne, Pennsylvania
As the strengths and weaknesses of machine translation (MT) engines
have become better understood and accepted, there has been a marked
increase in the development of computer systems with an embedded
MT component. One consequence of this shift to "embedded MT" is that
researchers, developers, and users have begun pushing the limits on
the input that such systems will accept for translation. This has
brought a class of problems to the surface: any input---whether it
appears on paper, in electronic form on-line, or mixed in with another
medium such as graphics or video---will bring with it some unknown mix
of noisy natural language data as well as non-linguistic data. How
are systems with an MT component to be designed and evaluated
given the challenge this input brings?
The objective of this workshop is to examine and evaluate techniques
for adjusting this "linguistic impedance mismatch" between the
real-world input and the natural language input expected by various
MT engines. Thus the workshop will focus on computational
approaches to preprocessing system input for MT engines and on
statistical methods for evaluating systems with an embedded MT
component.
Linguistic Preprocessing in Image Data
For researchers working with image data, an effort is currently underway
to augment OCR (optical character recognition) engines with
linguistic data as they recognize and convert bitmap data into
characters---similar to what has already been done in speech
recognition with linguistic data in HMMs (hidden Markov models).
Other OCR researchers have also experimented with image-level
early topic detection using word-shape recognition. In principle,
this could provide a first-step filtering of documents into a
more homogeneous MT input set, a desirable goal for MT evaluation.
Thus we expect that individuals working with or intending to
incorporate OCR into their computer systems will be interested in
this new area.
Linguistic Preprocessing in Online Data
For those working with online input, even though the characters are
already present, there often still remains the task of preprocessing
meaningful, symbolic character strings that are not a part of the text
to be translated. For some systems, the rules for identifying and
encapsulating or removing such strings may need to be hand-crafted
over time as MT engine limitations surface. For others, a combination
of hand-crafted rules and statistically trained NL models has worked.
Many have observed that HTML annotations, alphanumeric items, and
spreadsheet and word-processing codes are harder to weed out than
originally expected.
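
As an illustration of the "identify and encapsulate" step described
above, the following is a minimal sketch and not any particular
system's method. It assumes that the strings to protect are HTML-style
tags and mixed letter-digit items; the pattern list, the `__N__`
placeholder format, and the function names are all hypothetical.

```python
import re

# Hypothetical patterns for strings that should bypass the MT engine.
# A real system would extend this list as engine limitations surface.
NON_TEXT = re.compile(
    r"<[^>]+>"                            # HTML-style tags
    r"|\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b"   # items mixing letters and digits
)


def encapsulate(text):
    """Replace non-text strings with numbered placeholders.

    Returns the masked text and the list of strings that were removed,
    in order of appearance.
    """
    saved = []

    def _swap(match):
        saved.append(match.group(0))
        return f"__{len(saved) - 1}__"

    return NON_TEXT.sub(_swap, text), saved


def restore(text, saved):
    """Put the saved strings back in place of their placeholders."""
    return re.sub(r"__(\d+)__", lambda m: saved[int(m.group(1))], text)
```

The masked text would be passed to the MT engine and `restore` applied
to its output. This only works if the engine leaves the placeholders
untouched, which is itself an assumption that would need testing per
engine.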
Research efforts with low-density and less commonly taught
languages, as well as more common ones, encounter a substantial
problem with variation in spelling conventions and transcription
preferences. This is frequently the case, for example, for natural
languages that are primarily spoken rather than written. Researchers
working on this class of problem have built variants of spell-checker
components that standardize words to one orthography (spelling
convention) before submitting them to an MT engine. An idea that has
arisen for this component is to build in an option to adjust the level of
correction---as would be relevant when input after OCR nonetheless
varies from very noisy to relatively clean.
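
A normalizer with an adjustable level of correction might be sketched
as follows. This is an assumption-laden illustration: the tiny lexicon
is invented, and the standard library's `difflib` similarity ratio
stands in for whatever matching method an actual component would use.

```python
import difflib

# Hypothetical lexicon of standard spellings; a real one would be far larger.
STANDARD = ["colour", "translate", "machine"]


def normalize(tokens, lexicon=STANDARD, min_similarity=0.8):
    """Map each token to its closest standard spelling, if close enough.

    `min_similarity` (0 to 1) is the adjustable knob: lower it to correct
    more aggressively on noisy input, raise it to leave clean input alone.
    """
    out = []
    for tok in tokens:
        match = difflib.get_close_matches(
            tok.lower(), lexicon, n=1, cutoff=min_similarity
        )
        # Keep the original token when nothing in the lexicon is close enough.
        out.append(match[0] if match else tok)
    return out
```

With the default threshold, `normalize(["color", "machnie", "xyz"])`
maps the first two tokens to "colour" and "machine" and leaves "xyz"
unchanged; raising `min_similarity` to 0.9 leaves "machnie" uncorrected.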
Evaluation of Embedded MT Systems
Among those working on statistical methods for evaluating
systems with an embedded MT component, we have seen two distinct
trends. One group of statisticians has begun looking for
appropriate models from outside the world of MT evaluation,
examining the efforts by others to take distinct metrics for
components and combine them for an overall system-level
metric using fuzzy mathematics. Another group of researchers
is looking instead at developing a one-dimensional scale
for ranking MT engines along a continuum defined by system-level
function. That approach, for example, might rank one engine
as good enough for filtering documents, while another engine
deemed more linguistically robust would be ranked higher because
it could generate a good enough initial translation for
subsequent post-editing. We welcome other functional evaluations
of MT components and computer systems with embedded MT components
as well.
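
To make the first trend concrete, here is one way component metrics
could be combined into a system-level score. The call does not say
which fuzzy operators are used, so this sketch assumes the standard
min/max operators for fuzzy "and" and "or"; the pipeline shape, the
function name, and the scores are hypothetical.

```python
def system_score(ocr, normalizer, engines):
    """Combine component scores in [0, 1] into one system-level score.

    Assumes a pipeline that needs OCR AND normalization AND at least
    one of several alternative MT engines:
      fuzzy AND is taken as min (limited by the weakest required part),
      fuzzy OR  is taken as max (the best available alternative counts).
    """
    best_engine = max(engines)                # fuzzy OR over alternatives
    return min(ocr, normalizer, best_engine)  # fuzzy AND over required parts
```

For example, `system_score(0.9, 0.8, [0.6, 0.7])` gives 0.7: the better
of the two engines is the limiting component.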
Submitters are invited to send in a short paper, not more than 5 pages,
addressing one or more of the three areas discussed above. Papers
should define the problem in an embedded MT system that is the focus
of the work, describe the embedded MT system design (a simple sketch)
with sample input data where relevant, and present their approach
to the problem. Work at various stages of completion is acceptable;
we expect the current status of the work to be made clear. Submission of
end-to-end output of an embedded MT system is especially encouraged.
The papers will be collected and distributed to participants of the
workshop.
Ideally, the result of the workshop will be a clearer delineation of:
(1) the range of linguistic preprocessing problems
(2) the range of designs in embedded MT systems
(3) how these problems are treated in different embedded MT systems and
(4) the metrics that are being used to evaluate these systems and their
Notice of interest in participation: July 10, 1998
Please identify which of the three areas you intend to address:
preprocessing in image data, preprocessing in online data,
evaluation of embedded MT systems.
Position paper submission: August 10, 1998 NOTE: Now, August 24, 1998
Notifications: September 10, 1998 NOTE: Now, September 17, 1998
Final copies of papers: October 10, 1998
Workshop: October 28, 1998
Submissions may be in printed or electronic form.
Submissions should be sent to:
Clare Voss
Army Research Laboratory
2800 Powder Mill Road
Adelphi, MD 20783
phone: (301) 394-5615
fax: (301) 394-3903
The registration fee for the conference is $50. Non-presenters will
be accepted on a first-come, first-served basis. We strongly encourage
the participation of embedded MT system users, as well as members of
the research and development communities.
A copy of the call, the registration form, and further
update information is available via a link at:
Look for the Conferences and Workshop link.