Translation Corpus Aligner, version 2

An interactive sentence aligner

Knut Hofland, Øystein Reigem

The Department of Culture, Language and Information Technology (Aksis)/UNIFOB, University of Bergen – Norway

{knut.hofland,oystein.reigem}@aksis.uib.no

Keywords: sentence alignment, translation corpus, anchor list, XML, Java

1. Introduction

We will demonstrate an interactive sentence alignment program. The program uses several kinds of information to link the sentences of an original text to the sentences of its translation. The main source of information is a small bilingual lexicon, called an anchor list. The suggested alignments can be adjusted while the program runs, by means of a graphical user interface. The program is a new version of the program used in the English-Norwegian Parallel Corpus (ENPC) project.

2. The old version

TCA, the Translation Corpus Aligner (Hofland & Johansson 1998), was written for the ENPC project (1993-97) and was used to align the sentences in a hundred pairs of texts (a total of 2.6 million words). The program was also used by similar projects in Sweden and Finland, and for other language pairs such as English-Dutch, English-Portuguese and English-German. It is currently used in the Oslo Multilingual Corpus (OMC) project (English, Norwegian, German and French texts). The program has a command line interface, and the result files have to be corrected manually in a text editor by changing the value of the corresp attribute of each incorrectly aligned sentence. The program also has shortcomings with regard to the size of the anchor list and the encoding of the texts.

3. The new version

The main features of the new program:

Supports different character encodings (ISO-8859 or UTF-8)
The texts have to be marked up in XML
Written in Java; runs under Windows, Macintosh and Linux
General truncation in the anchor list (regular expressions) and no limit on its size
Similarity of words and proper nouns used as additional anchor “points”
User selectable units to align
Interactive correction of alignment
Several output formats, TEI and Paraconc/Multiconcord
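
To illustrate how truncated anchor entries can be matched with regular expressions, here is a minimal sketch. It is not the program's own code: the class AnchorListSketch, the entry format with "*" as the truncation sign, and the simple hit count are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; names and entry format are illustrative assumptions.
class AnchorListSketch {

    /** One anchor entry: a source-language term paired with a target-language term. */
    static final class AnchorEntry {
        final Pattern source;
        final Pattern target;

        AnchorEntry(String sourceTerm, String targetTerm) {
            this.source = toPattern(sourceTerm);
            this.target = toPattern(targetTerm);
        }
    }

    /** Converts a term such as "hous*" to a regex; '*' stands for any run of letters. */
    static Pattern toPattern(String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (c == '*') {
                regex.append("\\p{L}*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    /** True if at least one token matches the pattern in full. */
    static boolean anyMatch(Pattern pattern, String[] tokens) {
        for (String token : tokens) {
            if (pattern.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }

    /** Counts the entries whose source term occurs in one sentence and target term in the other. */
    static int countAnchorHits(List<AnchorEntry> entries, String sourceSentence, String targetSentence) {
        String[] sourceTokens = sourceSentence.split("[^\\p{L}\\p{N}]+");
        String[] targetTokens = targetSentence.split("[^\\p{L}\\p{N}]+");
        int hits = 0;
        for (AnchorEntry entry : entries) {
            if (anyMatch(entry.source, sourceTokens) && anyMatch(entry.target, targetTokens)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<AnchorEntry> entries = Arrays.asList(
                new AnchorEntry("hous*", "hus*"),
                new AnchorEntry("car*", "bil*"));
        System.out.println(countAnchorHits(entries, "The houses were old.", "Husene var gamle."));
    }
}
```

A sentence pair with more anchor hits is a stronger candidate for alignment. How the program actually weighs such hits against its other sources of information is not covered by this sketch.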

The program requires the input texts to be divided into sentences (and/or other units to align). Each unit must have a unique id. Two utility programs are included for these purposes.
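
The requirement that each unit has a unique id can be checked before alignment. The following sketch shows one possible check using the standard Java XML parser. The element name "s" and the attribute name "id" are assumptions for the example, as is the class UnitIdCheckSketch; the utility programs distributed with the aligner may work differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch; element and attribute names are illustrative assumptions.
class UnitIdCheckSketch {

    static final String MISSING = "(missing id)";

    /** Returns the ids that occur more than once, plus a marker if any unit lacks an id. */
    static Set<String> findProblemIds(String xml, String unitTag) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        NodeList units = doc.getElementsByTagName(unitTag);
        Set<String> seen = new HashSet<>();
        Set<String> problems = new LinkedHashSet<>();
        for (int i = 0; i < units.getLength(); i++) {
            // getAttribute returns an empty string when the attribute is absent
            String id = ((Element) units.item(i)).getAttribute("id");
            if (id.isEmpty()) {
                problems.add(MISSING);
            } else if (!seen.add(id)) {
                problems.add(id);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String xml = "<text><p><s id=\"s1\">A.</s><s id=\"s2\">B.</s><s id=\"s1\">C.</s></p></text>";
        System.out.println(findProblemIds(xml, "s"));
    }
}
```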

The most important change from the old version is the interactive user interface. In its basic mode the program suggests one alignment at a time. The user may accept the suggestion as it is, or change it by adding and removing sentences, using button clicks or shortcut keys. Details about the program's reasons for choosing a particular alignment are displayed in the user interface and are also written to a log file. In its "skip 1-1" mode the program runs the alignment process automatically, pausing for user action only when the suggestion is not a 1-1 alignment (i.e., when it is 1-0, 0-1, 2-1 or 1-2). A third, fully automatic mode will also be implemented. A sample of the user interface is found at http://gandalf.aksis.uib.no/tca2/img/tca2.jpg. The time needed to produce a 100% correct sentence-aligned corpus will be reduced compared with the old version. The output of the program can also be read by the IMS Corpus Workbench program after a simple transformation.
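
The decision made in the "skip 1-1" mode can be sketched as follows. This is only an illustration of the rule described above: the class SkipOneToOneSketch and its Suggestion type are hypothetical and are not taken from the program.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the "skip 1-1" rule; not the program's own code.
class SkipOneToOneSketch {

    /** A suggested alignment: the number of source units linked to the number of target units. */
    static final class Suggestion {
        final int sourceUnits;
        final int targetUnits;

        Suggestion(int sourceUnits, int targetUnits) {
            this.sourceUnits = sourceUnits;
            this.targetUnits = targetUnits;
        }
    }

    /** In "skip 1-1" mode only suggestions other than 1-1 are shown to the user. */
    static boolean needsUserAction(Suggestion suggestion) {
        return !(suggestion.sourceUnits == 1 && suggestion.targetUnits == 1);
    }

    /** Counts how many times the aligner would pause for a given run of suggestions. */
    static int countPauses(List<Suggestion> suggestions) {
        int pauses = 0;
        for (Suggestion suggestion : suggestions) {
            if (needsUserAction(suggestion)) {
                pauses++;
            }
        }
        return pauses;
    }

    public static void main(String[] args) {
        List<Suggestion> run = Arrays.asList(
                new Suggestion(1, 1), new Suggestion(1, 2),
                new Suggestion(1, 1), new Suggestion(0, 1));
        System.out.println("Pauses: " + countPauses(run));
    }
}
```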

The program has been used to align an English-Spanish parallel corpus (ACTRES). In the OMC project it is currently being tested on Norwegian and Russian texts. It is also being tested at the University of Tromsø for aligning texts in the Sami languages and the other languages used in the Barents region.

References

Hofland, K. and S. Johansson (1998). The Translation Corpus Aligner: A program for automatic alignment of parallel texts. In S. Johansson and S. Oksefjell (eds), Corpora and Crosslinguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi, 87-100. Available at: http://khnt.hd.uib.no/files/align.htm

ENPC home page: http://www.hf.uio.no/ilos/forskning/forskningsprosjekter/enpc/

OMC home page: http://www.hf.uio.no/forskningsprosjekter/sprik/