RE: [Corpora-List] special purpose corpora

From: Luis Sarmento (parapraxe@excite.com)
Date: Thu Sep 23 2004 - 16:16:20 MET DST


     Dear Chelo, I think you might find the following reference useful:

    Sarmento, Luís, Belinda Maia & Diana Santos. "The Corpógrafo - a Web-based environment for corpora research", in Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa & Raquel Silva (eds.), Proceedings of LREC'2004, Fourth International Conference on Language Resources and Evaluation (Lisboa, 26-28 May 2004), pp. 449-52. You can download the PDF file from Linguateca's web site at: http://www.linguateca.pt/Diana/download/SarmentoMaiaSantosLREC2004.pdf

    In this paper we describe the Corpógrafo, a web-based environment that we have been developing for almost two years. The Corpógrafo allows users to create their own personal (and private) corpora by uploading various types of files (PDF, PostScript, HTML, Word, RTF...) to our web server. Once a specific corpus has been collected (containing any combination of the uploaded files), users can perform a variety of standard corpus search operations (regular expression concordancing, KWIC, N-gram analysis) and also extract terminology from the corpus by using a combination of statistical algorithms and lexical filters, built for Portuguese, English, Spanish and Italian (French and German are also supported, but the results are not as good).
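    For readers unfamiliar with these search operations, here is a minimal Python sketch of regular-expression KWIC concordancing and N-gram counting. It is only an illustration of the general techniques, not the Corpógrafo's own code; the function names (tokenize, kwic, ngram_counts) and the context width are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercased word tokens; in Python 3, \w also matches accented letters.
    return re.findall(r"\w+", text.lower())

def kwic(text, pattern, width=30):
    """Return (left context, match, right context) for every regex match."""
    rows = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows

def ngram_counts(tokens, n=2):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

if __name__ == "__main__":
    sample = ("The corpus search tool builds a corpus index. "
              "A corpus index speeds up corpus search.")
    for left, hit, right in kwic(sample, r"corp(us|ora)"):
        print(f"{left:>30} [{hit}] {right}")
    print(ngram_counts(tokenize(sample), 2).most_common(3))
```

    Frequent N-grams found this way are the sort of raw material that statistical term extraction typically starts from, before any lexical filtering.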

    All extracted terminology can be automatically stored in specific terminology databases (created by the user) for further knowledge extraction. These databases allow the user to set and manage meta-information about each term, as defined by the ISO standard. The Corpógrafo will also help the user find definitions for terms and possible semantic relations among them (at the moment only meronymy and hyponymy) by searching the corpus again for specific patterns and clues and presenting possible candidates to the user for validation. Users may also manually identify bilingual equivalents in order to create multilingual terminological databases.
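    To give an idea of what pattern-based search for semantic relations can look like, here is a small Python sketch that proposes hyponymy candidates from a single English lexical pattern ("X such as A, B and C"). The pattern, the function name hyponym_candidates and the restriction to single-word terms are assumptions chosen for illustration; the patterns and clues actually used by the Corpógrafo are not shown here.

```python
import re

# Assumed illustrative pattern: "<hypernym> such as <term>, <term> and <term>".
# Only single-word terms are handled; the lookahead stops the comma-separated
# list before a final "and"/"or".
PATTERN = re.compile(
    r"(\w+) such as (\w+(?:, (?!and |or )\w+)*(?:,? (?:and|or) \w+)?)"
)

def hyponym_candidates(text):
    """Return (hyponym, hypernym) candidate pairs for later manual validation."""
    pairs = []
    for m in PATTERN.finditer(text):
        hypernym = m.group(1)
        # Split the enumerated terms on commas and on the final conjunction.
        for hyponym in re.split(r",? (?:and|or) |, ", m.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs

if __name__ == "__main__":
    print(hyponym_candidates("We studied metals such as iron, copper and zinc."))
```

    A pattern this simple will over-generate, which is why the candidates are only proposed and the user has the final say.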

    We will be releasing Version 2 of Corpógrafo in late October, with additional and revised functionality and a more user-friendly interface. For now, please have a look at www.linguateca.pt/corpografo/ and give it a try (you need to register with the Corpógrafo before using it). The web interface is in Portuguese, but user documentation is available in both Portuguese and English. At the moment, Corpógrafo is being regularly used by approximately 40 users, who have been doing terminological research in a variety of knowledge domains based on their own personal domain-specific corpora.

    I hope this helps.

    Regards,

    Luís Sarmento

    las@letras.up.pt

    Linguateca

     

    --- On Wed 09/22, Chelo Vargas <Chelo.Vargas@ua.es> wrote:

    From: Chelo Vargas [mailto:Chelo.Vargas@ua.es]
    To: CORPORA@HD.UIB.NO
    Date: Wed, 22 Sep 2004 07:00:43 +0200
    Subject: [Corpora-List] special purpose corpora

    Dear all,

    I am looking for literature dealing with the design and compilation of special purpose corpora, more specifically, corpora with a terminographical purpose. The references I already have are:

    Pearson, J. (1998): Terms in Context; Meyer,


