[Corpora-List] Summary: lexicographic tools for parallel/comparable corpora

From: Joerg Tiedemann (tiedeman@let.rug.nl)
Date: Fri Feb 23 2007 - 13:20:49 MET

Next message: Joerg Tiedemann: "[Corpora-List] SMT models, Europarl fr <-> *"

Previous message: Kiril Simov: "[Corpora-List] Second Call for Workshop Proposals"
Next in thread: Ramesh Krishnamurthy: "Re: [Corpora-List] Summary: lexicographic tools for parallel/comparable corpora"
Reply: Olivier Kraif: "Re: [Corpora-List] Summary: lexicographic tools for parallel/comparable corpora"

Here is a summary of responses to my question:
"I'm looking for information about tools for the lexicographic use of
parallel and comparable corpora."

Short summary:

First of all, there do not seem to be many lexicographic projects that use
parallel/comparable corpora. Raphael Salkie pointed me to the Dictionnaire
canadien bilingue, for which parallel corpora were used (back in 1996).
There are papers that discuss the use of parallel/comparable corpora in
dictionary building (e.g. Corréard (2005) and Krishnamurthy (2005)), but
they do not mention any projects explicitly. The main problem seems to be
the lack of "clean", suitable data in reasonable quantities (pointed out
by several people). Adam Kilgarriff and his team used monolingual corpora
and his SketchEngine for bilingual lexicography (English-Irish), which I
believe is a step towards using comparable corpora, but he points out that
"... we're a fair way off from `bilingual word sketches' ...". Lieve
Macken reminded me that the topic is closely related to multilingual
terminology extraction, and there is, of course, a rich literature about it
(some references below).

Here are some pointers I got about available tools:

ParaConc: a commercial parallel concordancer (athel.com)

There is an online implementation of the Vanilla-aligner at
http://www2.lael.pucsp.br/corpora/alinhador/index.html and an online
parallel concordancer at
http://www2.lael.pucsp.br/corpora/parallelconc/index.html
used by students

Thomas Schmidt used a combination of a parallel concordancing tool and a
lexicographic annotation tool for the construction of a multilingual
football dictionary (www.kicktionary.de)

The Finnish translation technology company Masterin has a bilingual term
extractor that builds a raw bilingual translation lexicon from translation
memory databases.
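
To make the idea of a "raw bilingual translation lexicon" concrete, here
is a minimal sketch of one generic way to get candidate word pairs out of
aligned segments: score co-occurring words with the Dice coefficient. This
is only an illustration and not a description of how Masterin's extractor
(or any other tool listed here) works; the function name dice_lexicon, the
min_score threshold and the toy data are my own assumptions.

```python
from collections import Counter
from itertools import product


def dice_lexicon(segment_pairs, min_score=0.5):
    """Return {source_word: (target_word, score)} from aligned segment pairs.

    Hypothetical helper: whitespace tokenisation and the threshold are
    simplifying assumptions, not part of any tool mentioned in the summary.
    """
    src_count = Counter()
    tgt_count = Counter()
    pair_count = Counter()
    for src, tgt in segment_pairs:
        # Count each word type at most once per segment.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))

    best = {}
    for (s, t), both in pair_count.items():
        # Dice coefficient: 2 * joint count / (source count + target count).
        score = 2.0 * both / (src_count[s] + tgt_count[t])
        if score >= min_score and score > best.get(s, ("", 0.0))[1]:
            best[s] = (t, score)
    return best


# Toy "translation memory" of aligned English-German segments (made up).
toy_memory = [
    ("the house", "das haus"),
    ("the garden", "der garten"),
    ("a house", "ein haus"),
    ("a garden", "ein garten"),
]
lexicon = dice_lexicon(toy_memory)
for source_word, (target_word, score) in sorted(lexicon.items()):
    print(source_word, target_word, round(score, 2))
```

On real data such a list is very noisy (function words, multi-word terms,
inflection), which is presumably why these lexicons are called "raw".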

Grigori Sidorov has a research tool that performs lexical-based alignment
for English-Spanish parallel corpora.

CLaRK is an XML based system for corpora development with support for
document synchronization to be used to navigate through parallel corpora.
http://www.bultreebank.org/clark/

A web-based corpus interface: http://corpus.leeds.ac.uk/internet.html
(software available at http://csar.sourceforge.net/) - I'm not sure about
its support for parallel and comparable corpora ...

From my own experience, I can add some more related tools:

various implementations of Gale & Church's sentence alignment algorithm
(e.g. http://nl.ijs.si/telri/Vanilla/)
Melamed's GMA (http://nlp.cs.nyu.edu/GMA/)
Hunalign (http://mokk.bme.hu/resources/hunalign)
Champollion Tool Kit (http://champollion.sourceforge.net/)
Berger's align tool (http://www.cse.unt.edu/~rada/wa/tools/aberger/)
Moore's sentence aligner (http://research.microsoft.com/users/bobmoore/)
GIZA++
(http://www-i6.informatik.rwth-aachen.de/Colleagues/och/software/GIZA++.html)
Twente word aligner
(http://wwwhome.cs.utwente.nl/~irgroup/align/download.html), now in the
NATools package (http://natura.di.uminho.pt/natura/natura?&topic=NATools)
ILink (http://www.ida.liu.se/~nlplab/ILink/)
K-vec++ (http://www.d.umn.edu/~tpederse/parallel.html)
CWB from IMS Stuttgart, with support for aligned corpora
(http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/)
Uplug (http://sourceforge.net/projects/uplug)
... there are more tools for visualization and manual alignment ...

I probably forgot a lot of links (that's why I asked on the list) - feel
free to remind me!
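
For readers who have not seen length-based sentence alignment (the idea
behind the Gale & Church implementations listed above), here is a rough
sketch: sentences are grouped into "beads" (1-1, 1-2, 2-1, ...) by dynamic
programming, and a bead is cheap when the character lengths on both sides
are similar. This is a simplified illustration, not the code of any of the
tools above; the prior probabilities and the variance are assumed values
loosely following those reported by Gale & Church, and the names PRIORS,
VARIANCE, match_cost and align are mine.

```python
import math

# Assumed bead-type priors and variance (illustrative values only).
PRIORS = {(1, 1): 0.89, (1, 0): 0.00495, (0, 1): 0.00495,
          (2, 1): 0.0445, (1, 2): 0.0445, (2, 2): 0.011}
VARIANCE = 6.8  # assumed variance of the length difference per character


def match_cost(src_len, tgt_len, prior):
    """Negative log probability of pairing src_len with tgt_len characters."""
    mean = (src_len + tgt_len) / 2.0
    if mean == 0:
        return -math.log(prior)
    z = abs(src_len - tgt_len) / math.sqrt(VARIANCE * mean)
    # Two-sided tail of a standard normal, floored to avoid log(0).
    tail = max(math.erfc(z / math.sqrt(2.0)), 1e-12)
    return -math.log(prior * tail)


def align(src, tgt):
    """Return a list of (src_indices, tgt_indices) beads covering both texts."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            # Try to extend the best path ending at (i, j) by one bead.
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + match_cost(src_len, tgt_len, prior)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]


# Made-up example: the second English sentence is split in two in German.
src = ["Hello.",
       "This is a rather long second sentence about corpora.",
       "Bye."]
tgt = ["Hallo.",
       "Dies ist ein ziemlich langer zweiter Satz.",
       "Er handelt von Korpora.",
       "Tschüss."]
beads = align(src, tgt)
for src_ids, tgt_ids in beads:
    print(src_ids, tgt_ids)
```

The real tools add a lot on top of this (lexical cues in Hunalign and
Champollion, for instance), so this should only be read as the basic idea.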

Some references to literature I got:

Corréard, M.-H. 2005. Bilingual Lexicography. In K. Brown (ed.)
Encyclopedia of Language and Linguistics, 2nd Edn., Vol. 1, (Oxford:
Elsevier), 787-796.

Krishnamurthy, R. 2005. Corpus Lexicography. In K. Brown (ed.)
Encyclopedia of Language and Linguistics, 2nd Edn., Vol. 3, (Oxford:
Elsevier), 250-254.

Roberts, R.P. 1996. Parallel Text Analysis and Bilingual
Lexicography. Available from http://www.dico.uottawa.ca/articles-fr.htm

I. Dan Melamed's 2001 book/dissertation "Empirical Methods for Exploiting
Parallel Texts", MIT Press. There is a lot more on his website:
http://cs.nyu.edu/~melamed/

Dan Tufis, Ana Maria Barbu, Radu Ion, Extracting Multilingual Lexicons
from Parallel Corpora, Computers and the Humanities, Volume 38, Issue 2,
May 2004, Pages 163-189
(http://dx.doi.org/10.1023/B:CHUM.0000031172.03949.48) ISSN 0010-4817

Dan Tufis 'A cheap and fast way to build useful translation lexicons' in
Proceedings of the 19th International Conference on Computational
Linguistics, COLING2002, Taipei, 25-30 August, 2002, pp. 1030-1036, ISBN
1-55860-894

(more papers on Dan Tufis's homepage: http://www.racai.ro/~tufis/)

Alexander Gelbukh and Grigori Sidorov. Alignment of Paragraphs in
Bilingual Texts using Bilingual Dictionaries and Dynamic Programming.
Lecture Notes in Computer Science, N 4225, Springer-Verlag, 2006, pp
824-833.

Two links about "bilingual terminology extraction on comparable corpora":
acl.ldc.upenn.edu/P/P04/P04-1067.pdf
acl.ldc.upenn.edu/acl2003/iral/ps/Sadat.ps

Thanks for the responses from:

Marie-Paule Jacques <marie-paule.jacques@lipn.univ-paris13.fr>
Michael Barlow <mi.barlow@auckland.ac.nz>
Thomas Schmidt <thomas.schmidt@uni-hamburg.de>
Tony Berber Sardinha <tony4@uol.com.br>
Mickel Grönroos <mickel.gronroos@masterin.com>
Grigori Sidorov <sidorov@cic.ipn.mx>
Raphael Salkie <R.M.Salkie@bton.ac.uk>
Dan Tufis <tufis@racai.ro>
Serge Sharoff <s.sharoff@leeds.ac.uk>
Adam Kilgarriff <adam@lexmasterclass.com>
Kiril Simov <kivs@bultreebank.org>
Alex Murzaku <lissus@gmail.com>
Lieve Macken <lieve.macken@hogent.be>

Jörg

***********/\/\/\/\/\/\/\/\/\/\/\************************************
** Jörg Tiedemann tiedeman@let.rug.nl **
** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
** 9712 EK Groningen fax: +31 (0)50-363 6855 **
*************************************/\/\/\/\/\/\/\/\/\/\/\**********


This archive was generated by hypermail 2b29 : Fri Feb 23 2007 - 13:18:57 MET