[Corpora-List] Summary: lexicographic tools for parallel/comparable corpora

From: Joerg Tiedemann (tiedeman@let.rug.nl)
Date: Fri Feb 23 2007 - 13:20:49 MET

    Here is a summary of responses to my question:
    "I'm looking for information about tools for the lexicographic use of
    parallel and comparable corpora."

    Short summary:

    First of all, there do not seem to be many lexicographic projects that use
    parallel/comparable corpora. Raphael Salkie pointed me to the Dictionnaire
    canadien bilingue, for which parallel corpora were used (back in 1996).
    There are papers that discuss the use of parallel/comparable corpora in
    dictionary building (e.g. Corréard (2005) and Krishnamurthy (2005)), but
    no projects are mentioned explicitly. The main problem seems to be
    the lack of "clean", suitable data in reasonable quantities (pointed out
    by several people). Adam Kilgarriff and his team used monolingual corpora
    and his SketchEngine for bilingual lexicography (English-Irish) (which is
    a step towards using comparable corpora, I believe), but he points out that
    "... we're a fair way off from `bilingual word sketches' ...". Lieve
    Macken reminded me that the topic is closely related to multilingual
    terminology extraction, and there is, of course, a rich literature about it
    (some references below).

    Here are some pointers I got about available tools:

    ParaConc: a commercial parallel concordancer (athel.com)

    There is an online implementation of the Vanilla-aligner at
    http://www2.lael.pucsp.br/corpora/alinhador/index.html and an online
    parallel concordancer at
    http://www2.lael.pucsp.br/corpora/parallelconc/index.html
    used by students

    Thomas Schmidt used a combination of a parallel concordancing tool and a
    lexicographic annotation tool for the construction of a multilingual
    football dictionary (www.kicktionary.de)

    The Finnish translation technology company Masterin has a bilingual term
    extractor that builds a raw bilingual translation lexicon from translation
    memory databases.

    Grigori Sidorov has a research tool that performs lexical-based alignment
    for English-Spanish parallel corpora.

    CLaRK is an XML based system for corpora development with support for
    document synchronization to be used to navigate through parallel corpora.
    http://www.bultreebank.org/clark/

    A web-based corpus interface: http://corpus.leeds.ac.uk/internet.html
    (software available at http://csar.sourceforge.net/) - I'm not sure about
    its support for parallel and comparable corpora ...
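
    To make the idea of a "raw bilingual translation lexicon" built from
    aligned segments more concrete, here is a minimal sketch in Python. It
    is not the method of any tool listed above (those are not described in
    the responses); it simply scores word pairs by the Dice coefficient of
    their co-occurrence in aligned segment pairs. The function name
    dice_lexicon, the min_score threshold and the toy data are all
    assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def dice_lexicon(segment_pairs, min_score=0.5):
    """Score word pairs by the Dice coefficient over aligned segments.

    segment_pairs: iterable of (source_text, target_text) tuples.
    Returns a dict mapping (source_word, target_word) to a score in (0, 1].
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src, tgt in segment_pairs:
        # Count each word type once per segment (presence, not frequency).
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))

    lexicon = {}
    for (s, t), both in pair_freq.items():
        # Dice = 2 * joint count / (source count + target count).
        score = 2 * both / (src_freq[s] + tgt_freq[t])
        if score >= min_score:
            lexicon[(s, t)] = score
    return lexicon


# Hypothetical toy translation memory (English-German), illustration only.
tm = [
    ("the goal", "das Tor"),
    ("the referee", "der Schiedsrichter"),
    ("a goal", "ein Tor"),
]
lex = dice_lexicon(tm, min_score=0.9)
```

    On this toy data the pair (goal, tor) gets the maximum score, but so
    does the spurious pair (referee, der), because both words occur only
    once and in the same segment. That is why such a lexicon is "raw": with
    little data it needs frequency thresholds or manual checking.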

    From my own experience, I can add some more related tools:

    various implementations of Gale&Church's sentence alignment algorithm
    (e.g. http://nl.ijs.si/telri/Vanilla/),
    Melamed's GMA (http://nlp.cs.nyu.edu/GMA/),
    Hunalign (http://mokk.bme.hu/resources/hunalign),
    Champollion Tool Kit (http://champollion.sourceforge.net/)
    Berger's align tool (http://www.cse.unt.edu/~rada/wa/tools/aberger/)
    Moore's sentence aligner (http://research.microsoft.com/users/bobmoore/)
    GIZA++
    (http://www-i6.informatik.rwth-aachen.de/Colleagues/och/software/GIZA++.html)
    Twente word aligner
    (http://wwwhome.cs.utwente.nl/~irgroup/align/download.html), now in the
    NATools package (http://natura.di.uminho.pt/natura/natura?&topic=NATools)
    ILink (http://www.ida.liu.se/~nlplab/ILink/),
    K-vec++ (http://www.d.umn.edu/~tpederse/parallel.html)
    CWB from IMS Stuttgart with support for aligned corpora
    (http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/)
    Uplug (http://sourceforge.net/projects/uplug)
    ... there are more tools for visualization and manual alignment ...
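
    For readers who have not used the sentence aligners listed above, the
    following Python sketch shows the general idea of length-based
    alignment with dynamic programming. It is a deliberate simplification:
    Gale & Church use a probabilistic cost on character lengths, whereas
    this sketch uses the absolute length difference plus fixed penalties.
    The function name align_by_length, the penalty values and the example
    sentences are assumptions chosen for illustration.

```python
def align_by_length(src, tgt, skip_penalty=50, merge_penalty=10):
    """Align two sentence lists by character length (simplified sketch).

    Returns a list of (source_indices, target_indices) alignment beads.
    """
    # Bead types: (source sentences consumed, target sentences consumed,
    # penalty). 1-1 is free; skips and merges cost extra.
    beads = [(1, 1, 0), (1, 0, skip_penalty), (0, 1, skip_penalty),
             (2, 1, merge_penalty), (1, 2, merge_penalty)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(src_len - tgt_len) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Follow the back-pointers from the end to recover the best path.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    result.reverse()
    return result


# Hypothetical example: two English sentences merged into one German one.
english = ["The match ended.", "It was late.", "Fans went home."]
german = ["Das Spiel war zu Ende, es war spät.", "Die Fans gingen heim."]
alignment = align_by_length(english, german)
# alignment == [([0, 1], [0]), ([2], [1])]
```

    The first two English sentences are grouped into a 2-1 bead because
    their combined length is much closer to the first German sentence than
    either is alone. Real aligners are more robust; some of those listed
    above also use lexical information in addition to length.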

    I probably forgot a lot of links (that's why I asked on the list) - feel
    free to remind me!

    Some references to literature I got:

    Corréard, M.-H. 2005. Bilingual Lexicography. In K. Brown (ed.)
    Encyclopedia of Language and Linguistics,  2nd Edn., Vol. 1, (Oxford:
    Elsevier), 787-796.

    Krishnamurthy, R.  2005. Corpus Lexicography. In K. Brown (ed.)
    Encyclopedia of Language and Linguistics,  2nd Edn., Vol. 3, (Oxford:
    Elsevier), 250-254.

    Roberts, R.P. 1996. Parallel Text Analysis and Bilingual
    Lexicography. Available from http://www.dico.uottawa.ca/articles-fr.htm

    I. Dan Melamed's 2001 book/dissertation "Empirical Methods for Exploiting
    Parallel Texts", MIT Press. There is a lot more on his website
    http://cs.nyu.edu/~melamed/ .

    Dan Tufis, Ana Maria Barbu, Radu Ion, Extracting Multilingual Lexicons
    from Parallel Corpora, Computers and the Humanities, Volume 38, Issue 2,
    May 2004, Pages 163-189
    (http://dx.doi.org/10.1023/B:CHUM.0000031172.03949.48) ISSN 0010-4817

    Dan Tufis 'A cheap and fast way to build useful translation lexicons' in
    Proceedings of the 19th International Conference on Computational
    Linguistics, COLING2002, Taipei, 25-30 August, 2002, pp. 1030-1036, ISBN
    1-55860-894

    (more papers on Dan Tufis's homepage http://www.racai.ro/~tufis/)

    Alexander Gelbukh and Grigori Sidorov. Alignment of Paragraphs in
    Bilingual Texts using Bilingual Dictionaries and Dynamic Programming.
    Lecture Notes in Computer Science, N 4225, Springer-Verlag, 2006, pp
    824-833.

    Two links about "bilingual terminology extraction on comparable corpora":
    acl.ldc.upenn.edu/P/P04/P04-1067.pdf
    acl.ldc.upenn.edu/acl2003/iral/ps/Sadat.ps

    Thanks for responses:

    Marie-Paule Jacques <marie-paule.jacques@lipn.univ-paris13.fr>
    Michael Barlow <mi.barlow@auckland.ac.nz>
    Thomas Schmidt <thomas.schmidt@uni-hamburg.de>
    Tony Berber Sardinha <tony4@uol.com.br>
    Mickel Grönroos <mickel.gronroos@masterin.com>
    Grigori Sidorov <sidorov@cic.ipn.mx>
    Raphael Salkie <R.M.Salkie@bton.ac.uk>
    Dan Tufis <tufis@racai.ro>
    Serge Sharoff <s.sharoff@leeds.ac.uk>
    Adam Kilgarriff <adam@lexmasterclass.com>
    Kiril Simov <kivs@bultreebank.org>
    Alex Murzaku <lissus@gmail.com>
    Lieve Macken <lieve.macken@hogent.be>

    Jörg

    ***********/\/\/\/\/\/\/\/\/\/\/\************************************
    ** Jörg Tiedemann tiedeman@let.rug.nl **
    ** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
    ** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
    ** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
    ** 9712 EK Groningen fax: +31 (0)50-363 6855 **
    *************************************/\/\/\/\/\/\/\/\/\/\/\**********


