Re: [Corpora-List] Summary: lexicographic tools for parallel/comparable corpora

From: Ramesh Krishnamurthy (r.krishnamurthy@aston.ac.uk)
Date: Fri Feb 23 2007 - 14:38:08 MET

    Dear Joerg

    The Oxford-Hachette French Dictionary (1994) was "based on
    two electronic text collections, one French and one English,
    each containing over 10 million words" (cover flap).

    Best
    Ramesh

    At 12:20 23/02/2007, Joerg Tiedemann wrote:

    >Here is a summary of responses to my question:
    >"I'm looking for information about tools for the lexicographic use of
    >parallel and comparable corpora."
    >
    >
    >Short summary:
    >
    >First of all, there do not seem to be many lexicographic projects that use
    >parallel/comparable corpora. Raphael Salkie pointed me to the Dictionnaire
    >canadien bilingue, for which parallel corpora were used (back in 1996).
    >There are papers talking about the use of parallel/comparable corpora in
    >dictionary building (e.g. Corréard (2005) and Krishnamurthy (2005)) but
    >there are no projects mentioned explicitly. The main problem seems to be
    >the lack of "clean", suitable data in reasonable quantities (pointed out
    >by several people). Adam Kilgarriff and his team used monolingual corpora
    >and his SketchEngine for bilingual lexicography (English-Irish) (which is
    >a step towards using comparable corpora I believe) but he points out that
    >"... we're a fair way off from `bilingual word sketches' ...". Lieve
    >Macken reminded me that the topic is closely related to multilingual
    >terminology extraction, and there is, of course, a rich literature about it
    >(some references below).
    >
    >
    >
    >Here are some pointers I got about available tools:
    >
    >
    >ParaConc: a commercial parallel concordancer (athel.com)
    >
    >There is an online implementation of the Vanilla-aligner at
    >http://www2.lael.pucsp.br/corpora/alinhador/index.html and an online
    >parallel concordancer at
    >http://www2.lael.pucsp.br/corpora/parallelconc/index.html
    >used by students
    >
    >Thomas Schmidt used a combination of a parallel concordancing tool and a
    >lexicographic annotation tool for the construction of a multilingual
    >football dictionary (www.kicktionary.de)
    >
    >The Finnish translation technology company Masterin has a bilingual term
    >extractor that builds a raw bilingual translation lexicon from translation
    >memory databases.
    >
    >Grigori Sidorov has a research tool that performs lexical-based alignment
    >for English-Spanish parallel corpora.
    >
    >CLaRK is an XML based system for corpora development with support for
    >document synchronization to be used to navigate through parallel corpora.
    >http://www.bultreebank.org/clark/
    >
    >A web-based corpus interface: http://corpus.leeds.ac.uk/internet.html
    >(software available at http://csar.sourceforge.net/) - I'm not sure about
    >its support for parallel and comparable corpora ...
    >
    >
    >From my own experience, I can add some more related tools:
    >
    >various implementations of Gale&Church's sentence alignment algorithm
    >(e.g. http://nl.ijs.si/telri/Vanilla/),
    >Melamed's GMA (http://nlp.cs.nyu.edu/GMA/),
    >Hunalign (http://mokk.bme.hu/resources/hunalign),
    >Champollion Tool Kit (http://champollion.sourceforge.net/)
    >Berger's align tool (http://www.cse.unt.edu/~rada/wa/tools/aberger/)
    >Moore's sentence aligner (http://research.microsoft.com/users/bobmoore/)
    >GIZA++
    >(http://www-i6.informatik.rwth-aachen.de/Colleagues/och/software/GIZA++.html)
    >Twente word aligner
    >(http://wwwhome.cs.utwente.nl/~irgroup/align/download.html) now in the NA
    >Tools package (http://natura.di.uminho.pt/natura/natura?&topic=NATools)
    >ILink (http://www.ida.liu.se/~nlplab/ILink/),
    >K-vec++ (http://www.d.umn.edu/~tpederse/parallel.html)
    >CWB from IMS Stuttgart with support for aligned corpora
    >(http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/)
    >Uplug (http://sourceforge.net/projects/uplug)
    >... there are more tools for visualization and manual alignment ...
    >
    >I probably forgot a lot of links (that's why I asked on the list) - feel
    >free to remind me!
    >
    >
    >
    >Some references to literature I got:
    >
    >Corréard, M.-H. 2005. Bilingual Lexicography. In K. Brown (ed.)
    >Encyclopedia of Language and Linguistics, 2nd Edn., Vol. 1, (Oxford:
    >Elsevier), 787-796.
    >
    >Krishnamurthy, R. 2005. Corpus Lexicography. In K. Brown (ed.)
    >Encyclopedia of Language and Linguistics, 2nd Edn., Vol. 3, (Oxford:
    >Elsevier), 250-254.
    >
    >Roberts, R.P. 1996. Parallel Text Analysis and Bilingual
    >Lexicography. Available from http://www.dico.uottawa.ca/articles-fr.htm
    >
    >I. Dan Melamed's 2001 book/dissertation "Empirical Methods for Exploiting
    >Parallel Texts", MIT Press. There is a lot more on his website
    >http://cs.nyu.edu/~melamed/ .
    >
    >Dan Tufis, Ana Maria Barbu, Radu Ion, Extracting Multilingual Lexicons
    >from Parallel Corpora, Computers and the Humanities, Volume 38, Issue 2,
    >May 2004, Pages 163-189
    >(http://dx.doi.org/10.1023/B:CHUM.0000031172.03949.48) ISSN 0010-4817
    >
    >Dan Tufis 'A cheap and fast way to build useful translation lexicons' in
    >Proceedings of the 19th International Conference on Computational
    >Linguistics, COLING2002, Taipei, 25-30 August, 2002, pp. 1030-1036, ISBN
    >1-55860-894
    >
    >(more papers on Dan Tufis's homepage http://www.racai.ro/~tufis/)
    >
    >Alexander Gelbukh and Grigori Sidorov. Alignment of Paragraphs in
    >Bilingual Texts using Bilingual Dictionaries and Dynamic Programming.
    >Lecture Notes in Computer Science, N 4225, Springer-Verlag, 2006, pp
    >824-833.
    >
    >Two links about "bilingual terminology extraction on comparable corpora":
    >acl.ldc.upenn.edu/P/P04/P04-1067.pdf
    >acl.ldc.upenn.edu/acl2003/iral/ps/Sadat.ps
    >
    >
    >Thanks for responses:
    >
    >Marie-Paule Jacques <marie-paule.jacques@lipn.univ-paris13.fr>
    >Michael Barlow <mi.barlow@auckland.ac.nz>
    >Thomas Schmidt <thomas.schmidt@uni-hamburg.de>
    >Tony Berber Sardinha <tony4@uol.com.br>
    >Mickel Grönroos <mickel.gronroos@masterin.com>
    >Grigori Sidorov <sidorov@cic.ipn.mx>
    >Raphael Salkie <R.M.Salkie@bton.ac.uk>
    >Dan Tufis <tufis@racai.ro>
    >Serge Sharoff <s.sharoff@leeds.ac.uk>
    >Adam Kilgarriff <adam@lexmasterclass.com>
    >Kiril Simov <kivs@bultreebank.org>
    >Alex Murzaku <lissus@gmail.com>
    >Lieve Macken <lieve.macken@hogent.be>
    >
    >
    >
    >
    >Jörg
    >
    >***********/\/\/\/\/\/\/\/\/\/\/\************************************
    >** Jörg Tiedemann tiedeman@let.rug.nl **
    >** Alfa-Informatica http://www.let.rug.nl/~tiedeman **
    >** Rijksuniversiteit Groningen Harmoniegebouw, room 1311-429 **
    >** Oude Kijk in 't Jatstraat 26 phone: +31 (0)50-363 5935 **
    >** 9712 EK Groningen fax: +31 (0)50-363 6855 **
    >*************************************/\/\/\/\/\/\/\/\/\/\/\**********

    Ramesh Krishnamurthy

    Lecturer in English Studies, School of Languages
    and Social Sciences, Aston University, Birmingham B4 7ET, UK
    [Room NX08, North Wing of Main Building] ; Tel:
    +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766
    http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp

    Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/


