[Corpora-List] LEXICOGRAPHIC SOFTWARE: courses

From: Adam Kilgarriff (adam.kilgarriff@itri.brighton.ac.uk)
Date: Wed Oct 02 2002 - 07:33:30 MET DST

Next message: Kiril Simov: "[Corpora-List] Proceedings of the Treebanks and Linguistic Theories 2002 Workshop"

Previous message: Josephine Lo: "[Corpora-List] Concordancer for Chinese (Summary of reply)"
In reply to: delucca@nilc.icmc.usp.br: "[Corpora-List] LEXICOGRAPHIC SOFTWARE TOOLS"

                     ===========================
                        Lexicographic software
                            Short Courses
                              Responses
                     ===========================

J. L. DeLucca writes:
> We would like to hear from you WHAT computational tools you use at the
> present time for developing your LEXICOGRAPHIC projects.

This email provoked a number of responses on the lists above. It's a
topic we have deep (and sometimes bitter) experience of and will be
addressing in depth in an upcoming short course:

LCM04 Computers and Lexicography
11-14 November 2002
ITRI, Brighton, England
Tutors:
    Adam Kilgarriff
    David Tugwell
Guest lectures from:
    Steve Crowdy, Longman Dictionaries / Pearson Education
    Laura Elliot, Oxford University Press

For details and bookings see

http://www.itri.brighton.ac.uk/lexicom

In response to some of the earlier responses to the mailout:

(1) Ramesh Krishnamurthy presents desiderata for corpus resources,
corpus querying, and a "Dictionary Writing System" (attached
below). While Ramesh's list is useful as a starting point, it is
clearly not a full specification and does not present requirements in
relation to, e.g., critical database issues such as 'sort' and
'cross-reference' functionality. Steve Crowdy has worked extensively
on two full specifications, both implemented and used in large-scale
dictionary production environments, which he will be talking about.
(One of these is now commercially available.)

As John Williams notes (also attached below), it is useful to distinguish the
Dictionary Writing System from the Corpus Query System (a read-only
package, from the lexicographer's viewpoint, in which language corpora
are loaded and can be viewed flexibly). The course mentioned above
covers the former. Another Brighton course (website, bookings as
above) covers the latter:

LCM07 Corpus Design and Use
2-5 December 2002
Tutors:
Adam Kilgarriff
Michael Rundell
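
To make the Corpus Query System side of that distinction concrete, here
is a minimal keyword-in-context (KWIC) sketch in Python. It is an
illustration only, not the design of any package mentioned in this
message; the function name kwic, the sample text and the width
parameter are all assumptions made for the example.

```python
import re


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for each whole-word match of keyword.

    Illustrative only: a real corpus query system would work on an
    indexed, tokenised corpus rather than on a raw string.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Hypothetical sample text, standing in for a loaded corpus.
sample = "The bank raised rates. She sat on the river bank. Banking is dull."

for line in kwic(sample, "bank", width=20):
    print(line)
```

The sketch is read-only with respect to the text, which matches the
lexicographer's view of the corpus described above. Note that it
matches the exact word form only ("Banking" is not returned), so
lemmatised lookup would need more than this.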

(2) Baden Hughes and others listed the software they use, for example:

> >Languages:
> >Perl, C++, NLP++, VB, Java, Tcl
> >
> >Applications & Utilities:
> >TeX, sed, awk, grep, FileMaker Pro, MySQL, Excel, Word
>

While these are all salient for various aspects of dictionary-making,
they fall far short of being an environment designed to help
lexicographers efficiently produce a large, coherent and consistent
dictionary. Most lexicographers are not programmers, and want a single
tool for writing a dictionary which takes care of the growing lexical
database in such a way that they need not think about it, but can get
on with the job of analysing meaning and writing entries.

?? WORKSHOP ??

If there is sufficient interest in the topic, we could append a
workshop, for pooling ideas and experiences of Dictionary Writing
Systems, to the LCM04 short course. If you would be interested and in a
position to come to Brighton for it (most likely dates: around 15/16
Nov), do let me know, with the dates you could make. If there are
enough responses, I'll organise it.

Adam Kilgarriff

also for Sue Atkins, Michael Rundell (Lexicography Masterclass Ltd)
and Lexicom group, University of Brighton

Ramesh Krishnamurthy writes:
> Dear Dr De Lucca
>
> I have drawn up a checklist from my 15 years' experience in corpus-based computational lexicography.
> I hope this helps.
>
> If you are going to create software for the whole process from raw data to publishing
> of a dictionary/reference book, I think these would be my requirements.
> Every process should be automated to the maximum, with allowance for human intervention
> or input of preferences.
>
> 1. for monolingual dictionaries, a large corpus of L1
> 2. for bilingual dictionaries, a large corpus of L1 and L2, with pointers in both directions to find
> suggested equivalent words and phrases
> 3. lemmatized frequency lists, to decide which words are important enough to include in the dictionary,
> and which forms are significant, etc
> 4. based on the frequency lists, a spelling checker, giving variant spellings
> 5. pronunciation, with regional variations; concordanced tone units to hear word pronunciation in context
> 6. statistics for regional variations
> 7. statistics for genre distribution: is the wordform used in all types of text, or mainly in speech,
> mainly in newspapers, mainly in novels, etc
> 8. grammar - wordclass identification, colligation, grammar patterns (valency, complementation, etc);
> with frequencies, regional variations, and genre-distribution
> 9. collocation: individual collocates, lexical phrases, etc; with frequencies, regional variations, and genre-distribution
> 10. semantics - hypernyms, hyponyms, synonyms (i.e. thesaurus), antonyms
> 11. pragmatics - any relevant information
> 12. selected examples for each point from 3 onwards; large corpora yield hundreds or thousands of examples, so
> 13. spoken data: typical speaker, context, interlocutor, etc
> 14. concordancer to allow access to raw data and ability to check the information given from point 3 onwards
> 15. automatic cut-and-paste to dictionary or reference book database
> 16. customizable database templates for reference books
> 17. validation routines to ensure database entry fields contain correct information and are in correct sequence
> 18. ability to interrogate database on any field or subfield, to count entries, check that editorial policies have been followed,
> check cross-references, check that examples contain the headword, etc
> 19. automatic conversion from database to typesetting formats - columnation, page numbering, headers and footers, widows and orphans, typefaces, etc
> 20. progress monitoring - which processes have been completed (e.g. compilation, editing, proofreading), which words have been done, who did them, when, etc
>
> All the tools should be flexible, to allow users to cater for local variations in any feature, from orthographic form (capitalization, punctuation, contractions, etc)
> to size of field in the databases, etc.
>
> Best wishes
> Ramesh
>
> Ramesh Krishnamurthy
> Consultant, Collins Cobuild and Bank of English Corpus;
> Honorary Research Fellow, Centre for Corpus Linguistics, University of Birmingham;
> Honorary Research Fellow, Computational Linguistics Research Group, University of Wolverhampton.
>
>
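
Points 17 and 18 of the checklist above (validation routines, checking
cross-references, checking that examples contain the headword) can be
sketched in a few lines of Python. This is a minimal illustration under
stated assumptions, not a description of any existing Dictionary
Writing System: the entry structure, the field names "examples" and
"see_also", and the function name validate are all hypothetical.

```python
import re

# Hypothetical entry records; the field names are illustrative only.
entries = {
    "bank": {
        "examples": ["She sat on the river bank."],
        "see_also": ["shore"],
    },
    "shore": {
        "examples": ["They walked along the coast."],
        "see_also": ["bank", "beach"],
    },
}


def validate(entries):
    """Return a sorted list of (headword, problem) pairs."""
    problems = []
    for headword, entry in entries.items():
        # Cross-reference check: every target must itself be a headword.
        for target in entry.get("see_also", []):
            if target not in entries:
                problems.append(
                    (headword, f"dangling cross-reference: {target}")
                )
        # Example check: each example should contain the headword
        # as a whole word (exact form only; no lemmatisation here).
        pattern = re.compile(
            r"\b" + re.escape(headword) + r"\b", re.IGNORECASE
        )
        for example in entry.get("examples", []):
            if not pattern.search(example):
                problems.append(
                    (headword, f"example lacks headword: {example}")
                )
    return sorted(problems)


for headword, problem in validate(entries):
    print(f"{headword}: {problem}")
```

Because the example check compares exact word forms, an example that
uses an inflected form of the headword would be flagged; a production
check would presumably need the lemmatisation mentioned in point 3.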

John Williams writes:
>
> Dear Dr De Lucca,
>
> As a former colleague of Ramesh, I haven't got much to add to his very
> comprehensive checklist. But you may wish to consider to what extent the
> data requirements (approx. points 1-14 of Ramesh's list) need to be
> integrated with the compilation package proper (approx. points 15-20).
> For maximum reusability, you may want to separate the two components,
> and maybe your brief only covers the latter.
>
> I would add a couple of things:
> - since many big dictionary projects today are compiled by dispersed
> teams working on their own computers, the software ideally needs to be
> platform-independent, and include some kind of networking facility for
> ease of file transfer;
> - again, for maximum reusability and flexibility, the software should
> allow the project manager to define his/her own tagset (though a basic
> tagset should be included initially). There's a Croatian package called
> Softlex that allows precisely this.
>
> Best wishes,
>
> John
>
>
>
> --
>
> John Williams
>
> Freelance Lexicographer
>
> Tel/Fax: (+44) (0)151 733 5459
> Mobile: (+44) (0)7968 027829
>
> Web: http://www.eflex-mcmail.com
>
> E-mail: johnw@whoever.com
>

> ----- Original Message -----
> From: delucca@nilc.icmc.usp.br
> To: corpora@hd.uib.no
> Cc: delucca@usp.br
> Subject: [Corpora-List] Dictionary Creation Software
>
> Dear Colleagues,
>
> We are a team of researchers in Computational Linguistics and, at the
> present time, we are working on constructing software tools for making
> dictionaries.
>
> We would like to hear from those who have experience of compiling
> dictionaries and vocabularies about the following: WHAT you would like,
> would need, and would hope for from Dictionary Creation Software. What types
> of tool would be essential for making dictionaries, vocabularies and any
> other type of reference work? A concordancer? A spelling checker?
> Pronunciation?
>
> We look forward to hearing from you with great interest.
>
> Thank you very much in advance for your advice.
>
> Sincerely
>
>
>
>
> J.L. DeLucca, PhD
>
> Interinstitutional Center for Research and Development in Computational
> Linguistics (NILC)
> Sao Paulo University
>
>

--
NEW!! MSc and Short Courses in Lexical Computing and Lexicography
Info at

http://www.itri.brighton.ac.uk/lexicom

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow                     tel: (44) 1273 642919
Information Technology Research Institute       (44) 1273 642900
University of Brighton                     fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ           email: Adam.Kilgarriff@itri.bton.ac.uk
UK                   http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

