Re: [Corpora-List] Summary: lexicographic tools for parallel/comparable corpora

From: Olivier Kraif (Olivier.Kraif@u-grenoble3.fr)
Date: Mon Feb 26 2007 - 12:13:10 MET

    Hello Joerg,
    thank you for this useful summary.
    I did not reply earlier to your question because I thought that there
    were some tools more specifically designed for lexicographers.
    But it appears that most of the links you give concern generic tools for
    handling multilingual corpora.

    Here is one more link: Alinea is a free aligner and parallel
    concordancer that was evaluated in the last Arcade 2 campaign.
    It obtained results close to the best system (around 98% F-measure for
    European language pairs) and proved particularly robust even
    for very "distant" language pairs (French-Chinese, French-Arabic,
    French-Farsi, etc.).
    Alinea can handle POS-tagged texts, complex search expressions
    (regular expressions over tags and lemmas), word-to-word alignment, and
    bilingual lexicon extraction.
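
    For readers unfamiliar with the F-measure quoted above, here is a
    minimal sketch of how such a score is commonly computed for alignment
    output: precision and recall over the set of proposed links, combined
    as their harmonic mean. The function name and the toy link sets are
    hypothetical, and the exact scoring granularity used in Arcade 2 may
    well differ from this.

```python
def f_measure(proposed, reference):
    """Return (precision, recall, F1) for two collections of alignment links."""
    proposed, reference = set(proposed), set(reference)
    correct = len(proposed & reference)  # links found in both sets
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: each link pairs a source sentence index
# with a target sentence index.
reference = {(0, 0), (1, 1), (2, 2), (3, 3)}
proposed = {(0, 0), (1, 1), (2, 3), (3, 3)}

p, r, f = f_measure(proposed, reference)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

    With three of four proposed links correct, precision, recall and F
    all come out at 0.75 in this toy case.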

    http://w3.u-grenoble3.fr/kraif/index.php?option=com_content&task=view&id=27&Itemid=43

    On the same site you can find:
    - a review of links about tools:
    http://w3.u-grenoble3.fr/kraif/index.php?option=com_content&task=view&id=23&Itemid=41
    - links about corpora:
    http://w3.u-grenoble3.fr/kraif/index.php?option=com_content&task=view&id=20&Itemid=36
    - and links about sources of parallel texts:
    http://w3.u-grenoble3.fr/kraif/index.php?option=com_content&task=view&id=22&Itemid=38

    Best regards

    Olivier

    > Here is a summary of responses to my question:
    > "I'm looking for information about tools for the lexicographic use of
    > parallel and comparable corpora."
    >
    >
    > Short summary:
    >
    > First of all there do not seem to be many lexicographic projects that use
    > parallel/comparable corpora. Raphael Salkie pointed me to the Dictionnaire
    > canadien bilingue for which parallel corpora were used (back in 1996).
    > There are papers talking about the use of parallel/comparable corpora in
    > dictionary building (e.g. Corréard (2005) and Krishnamurthy (2005)) but
    > there are no projects mentioned explicitly. The main problem seems to be
    > the lack of "clean", suitable data in reasonable quantities (pointed out
    > by several people). Adam Kilgarriff and his team used monolingual corpora
    > and his SketchEngine for bilingual lexicography (English-Irish) (which is
    > a step towards using comparable corpora I believe) but he points out that
    > "... we're a fair way off from `bilingual word sketches' ...". Lieve
    > Macken reminded me that the topic is closely related to multilingual
    > terminology extraction and there is, of course, a rich literature about it
    > (some references below).
    >
    >
    >
    > Here are some pointers I got about available tools:
    >
    >
    > ParaConc: a commercial parallel concordancer (athel.com)
    >
    > There is an online implementation of the Vanilla-aligner at
    > http://www2.lael.pucsp.br/corpora/alinhador/index.html and an online
    > parallel concordancer at
    > http://www2.lael.pucsp.br/corpora/parallelconc/index.html
    > used by students
    >
    > Thomas Schmidt used a combination of a parallel concordancing tool and a
    > lexicographic annotation tool for the construction of a multilingual
    > football dictionary (www.kicktionary.de)
    >
    > The Finnish translation technology company Masterin has a bilingual term
    > extractor that builds a raw bilingual translation lexicon from translation
    > memory databases.
    >
    > Grigori Sidorov has a research tool that performs lexical-based alignment
    > for English-Spanish parallel corpora.
    >
    > CLaRK is an XML-based system for corpora development with support for
    > document synchronization, which can be used to navigate through parallel corpora.
    > http://www.bultreebank.org/clark/
    >
    > A web-based corpus interface: http://corpus.leeds.ac.uk/internet.html
    > (software available at http://csar.sourceforge.net/) - I'm not sure about
    > its support for parallel and comparable corpora ...
    >
    >
    > From my own experience, I can add some more related tools:
    >
    > various implementations of Gale&Church's sentence alignment algorithm
    > (e.g. http://nl.ijs.si/telri/Vanilla/),
    > Melamed's GMA (http://nlp.cs.nyu.edu/GMA/),
    > Hunalign (http://mokk.bme.hu/resources/hunalign),
    > Champollion Tool Kit (http://champollion.sourceforge.net/)
    > Berger's align tool (http://www.cse.unt.edu/~rada/wa/tools/aberger/)
    > Moore's sentence aligner (http://research.microsoft.com/users/bobmoore/)
    > GIZA++
    > (http://www-i6.informatik.rwth-aachen.de/Colleagues/och/software/GIZA++.html)
    > Twente word aligner
    > (http://wwwhome.cs.utwente.nl/~irgroup/align/download.html), now in the
    > NATools package (http://natura.di.uminho.pt/natura/natura?&topic=NATools)
    > ILink (http://www.ida.liu.se/~nlplab/ILink/),
    > K-vec++ (http://www.d.umn.edu/~tpederse/parallel.html)
    > CWB from IMS Stuttgart with support for aligned corpora
    > (http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/)
    > Uplug (http://sourceforge.net/projects/uplug)
    > ... there are more tools for visualization and manual alignment ...
    >
    > I probably forgot a lot of links (that's why I asked on the list) - feel
    > free to remind me!
    >
    >
    >
    > Some references to literature I got:
    >
    > Corréard, M.-H. 2005. Bilingual Lexicography. In K. Brown (ed.)
    > Encyclopedia of Language and Linguistics, 2nd Edn., Vol. 1, (Oxford:
    > Elsevier), 787-796.
    >
    > Krishnamurthy, R. 2005. Corpus Lexicography. In K. Brown (ed.)
    > Encyclopedia of Language and Linguistics, 2nd Edn., Vol. 3, (Oxford:
    > Elsevier), 250-254.
    >
    > Roberts, R.P. 1996. Parallel Text Analysis and Bilingual
    > Lexicography. Available from http://www.dico.uottawa.ca/articles-fr.htm
    >
    > I. Dan Melamed's 2001 book/dissertation "Empirical Methods for Exploiting
    > Parallel Texts", MIT Press. There is a lot more on his website
    > http://cs.nyu.edu/~melamed/ .
    >
    > Dan Tufis, Ana Maria Barbu, Radu Ion, Extracting Multilingual Lexicons
    > from Parallel Corpora, Computers and the Humanities, Volume 38, Issue 2,
    > May 2004, Pages 163-189
    > (http://dx.doi.org/10.1023/B:CHUM.0000031172.03949.48) ISSN 0010-4817
    >
    > Dan Tufis 'A cheap and fast way to build useful translation lexicons' in
    > Proceedings of the 19th International Conference on Computational
    > Linguistics, COLING2002, Taipei, 25-30 August, 2002, pp. 1030-1036, ISBN
    > 1-55860-894
    >
    > (more papers on Dan Tufis's homepage http://www.racai.ro/~tufis/)
    >
    > Alexander Gelbukh and Grigori Sidorov. Alignment of Paragraphs in
    > Bilingual Texts using Bilingual Dictionaries and Dynamic Programming.
    > Lecture Notes in Computer Science, N 4225, Springer-Verlag, 2006, pp
    > 824-833.
    >
    > Two links about "bilingual terminology extraction on comparable corpora":
    > acl.ldc.upenn.edu/P/P04/P04-1067.pdf
    > acl.ldc.upenn.edu/acl2003/iral/ps/Sadat.ps
    >
    >
    > Thanks for responses:
    >
    > Marie-Paule Jacques <marie-paule.jacques@lipn.univ-paris13.fr>
    > Michael Barlow <mi.barlow@auckland.ac.nz>
    > Thomas Schmidt <thomas.schmidt@uni-hamburg.de>
    > Tony Berber Sardinha <tony4@uol.com.br>
    > Mickel Grönroos <mickel.gronroos@masterin.com>
    > Grigori Sidorov <sidorov@cic.ipn.mx>
    > Raphael Salkie <R.M.Salkie@bton.ac.uk>
    > Dan Tufis <tufis@racai.ro>
    > Serge Sharoff <s.sharoff@leeds.ac.uk>
    > Adam Kilgarriff <adam@lexmasterclass.com>
    > Kiril Simov <kivs@bultreebank.org>
    > Alex Murzaku <lissus@gmail.com>
    > Lieve Macken <lieve.macken@hogent.be>
    >
    >
    >
    >
    > Jörg
    >
    > ***********/\/\/\/\/\/\/\/\/\/\/\************************************
    > ** Jörg Tiedemann tiedeman@let.rug.nl **
    > ** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
    > ** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
    > ** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
    > ** 9712 EK Groningen fax: +31 (0)50-363 6855 **
    > *************************************/\/\/\/\/\/\/\/\/\/\/\**********
    >


