[Corpora-List] LEXICOGRAPHIC SOFTWARE: courses

From: Adam Kilgarriff (adam.kilgarriff@itri.brighton.ac.uk)
Date: Wed Oct 02 2002 - 07:33:30 MET DST


                         ===========================
                            Lexicographic software
                                Short Courses
                                  Responses
                         ===========================

    J. L. DeLucca writes:
    > We would like to hear from you WHAT computational tools you use at the
    > present time for developing your LEXICOGRAPHIC projects.

    This email provoked a number of responses on the lists above. It's a
    topic we have deep (and sometimes bitter) experience of and will be
    addressing in depth in an upcoming short course:

     LCM04 Computers and Lexicography
     11-14 November 2002
     ITRI, Brighton, England
     Tutors:
        Adam Kilgarriff
        David Tugwell
     Guest lectures from:
        Steve Crowdy, Longman Dictionaries / Pearson Education
        Laura Elliot, Oxford University Press

    For details and bookings see

        http://www.itri.brighton.ac.uk/lexicom

    In response to some of the earlier responses to the mailout:

    (1) Ramesh Krishnamurthy presents desiderata for both corpus resources
    and corpus querying, and for a "Dictionary Writing System" (attached
    below). While Ramesh's list is useful as a starting point, it is
    clearly not a full specification and does not present requirements in
    relation to, e.g., critical database issues such as 'sort' and
    'cross-reference' functionality. Steve Crowdy has worked extensively
    on two full specifications, both implemented and used in large-scale
    dictionary production environments, which he will be talking about.
    (One of these is now commercially available.)
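
    To illustrate why 'sort' counts as a critical database issue: a plain
    character-code sort puts "Zebra" before "apple" and separates "cafe"
    from "café". The Python sketch below is an illustration only, not
    taken from either of Steve Crowdy's specifications. The function name
    and the letter-by-letter policy (ignore case, accents, spaces and
    hyphens) are assumptions; a real house style may differ.

```python
import unicodedata


def headword_sort_key(headword: str) -> tuple[str, str]:
    """Letter-by-letter key (assumed policy): ignore case, accents,
    spaces and punctuation when ordering headwords."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", headword)
    # Keep letters and digits only; combining marks, spaces, hyphens drop out.
    letters = "".join(ch for ch in decomposed if ch.isalnum())
    # The original form is a tie-breaker, so the order is deterministic.
    return (letters.casefold(), headword)


headwords = ["co-op", "Zebra", "café", "cafe", "coop", "apple pie", "Apple"]
sorted_headwords = sorted(headwords, key=headword_sort_key)
print(sorted_headwords)
```

    A production system would more likely use locale-specific collation
    rules, since alphabetical order differs between languages.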

    As John Williams notes (also attached below), it is useful to distinguish
    the Dictionary Writing System from the Corpus Query System (a read-only
    package, from the lexicographer's viewpoint, in which language corpora
    are loaded and can be viewed flexibly). The course mentioned above
    covers the former. Another Brighton course (website, bookings as
    above) covers the latter:

     LCM07 Corpus Design and Use
     2-5 December 2002
     Tutors:
        Adam Kilgarriff
        Michael Rundell
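
    As a rough illustration of the read-only "viewing" side of a Corpus
    Query System, here is a minimal keyword-in-context (KWIC) concordance
    sketch in Python. The function name, the context width and the sample
    text are assumptions made for the example; real systems work from
    indexed, lemmatized corpora rather than a raw string.

```python
import re


def kwic(text: str, keyword: str, width: int = 30) -> list[str]:
    """Return one keyword-in-context line per whole-word match
    (case-insensitive), with `width` characters of context each side."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


sample = "The bank raised rates. She sat on the river bank. Banking is dull."
concordance = kwic(sample, "bank", width=12)
for line in concordance:
    print(line)
```

    Note that "Banking" is not matched: without lemmatization, a search
    for a headword finds only the exact word form.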

    (2) Baden Hughes and others listed software they used, as here:

    > >Languages:
    > >Perl, C++, NLP++, VB, Java, Tcl
    > >
    > >Applications & Utilities:
    > >TeX, sed, awk, grep, FileMaker Pro, MySQL, Excel, Word
    >

    While these are all salient for various aspects of dictionary-making,
    they fall far short of being an environment designed to help
    lexicographers efficiently produce a large, coherent and consistent
    dictionary. Most lexicographers are not programmers, and want a single
    tool for writing a dictionary which takes care of the growing lexical
    database in such a way that they need not think about it, but can get
    on with the job of analysing meaning and writing entries.

                             ?? WORKSHOP ??

    If there is sufficient interest in the topic, we could append a
    workshop, for pooling ideas and experiences of Dictionary Writing
    Systems, to the LCM04 short course. If you would be interested and in a
    position to come to Brighton for it (most likely dates: around 15/16
    Nov), do let me know, with the dates you could make: if there are
    enough responses, I'll organise it.

       Adam Kilgarriff

       also for Sue Atkins, Michael Rundell (Lexicography Masterclass Ltd)
       and Lexicom group, University of Brighton

    Ramesh Krishnamurthy writes:
    > Dear Dr De Lucca
    >
    > I have drawn up a checklist from my 15 years' experience in corpus-based computational lexicography.
    > I hope this helps.
    >
    > If you are going to create software for the whole process from raw data to publishing
    > of a dictionary/reference book, I think these would be my requirements.
    > Every process should be automated to the maximum, with allowance for human intervention
    > or input of preferences.
    >
    > 1. for monolingual dictionaries, a large corpus of L1
    > 2. for bilingual dictionaries, a large corpus of L1 and L2, with pointers in both directions to find
    > suggested equivalent words and phrases
    > 3. lemmatized frequency lists, to decide which words are important enough to include in the dictionary,
    > and which forms are significant, etc
    > 4. based on the frequency lists, a spelling checker, giving variant spellings
    > 5. pronunciation, with regional variations; concordanced tone units to hear word pronunciation in context
    > 6. statistics for regional variations
    > 7. statistics for genre distribution: is the wordform used in all types of text, or mainly in speech,
    > mainly in newspapers, mainly in novels, etc
    > 8. grammar - wordclass identification, colligation, grammar patterns (valency, complementation, etc);
    > with frequencies, regional variations, and genre-distribution
    > 9. collocation: individual collocates, lexical phrases, etc; with frequencies, regional variations, and genre-distribution
    > 10. semantics - hypernyms, hyponyms, synonyms (i.e. thesaurus), antonyms
    > 11. pragmatics - any relevant information
    > 12. selected examples for each point from 3 onwards; large corpora yield hundreds or thousands of examples, so
    > 13. spoken data: typical speaker, context, interlocutor, etc
    > 14. concordancer to allow access to raw data and ability to check the information given from point 3 onwards
    > 15. automatic cut-and-paste to dictionary or reference book database
    > 16. customizable database templates for reference books
    > 17. validation routines to ensure database entry fields contain correct information and are in correct sequence
    > 18. ability to interrogate database on any field or subfield, to count entries, check that editorial policies have been followed,
    > check cross-references, check that examples contain the headword, etc
    > 19. automatic conversion from database to typesetting formats - columnation, page numbering, headers and footers, widows and orphans, typefaces, etc
    > 20. progress monitoring - which processes have been completed (e.g. compilation, editing, proofreading), which words have been done, who did them, when, etc
    >
    > All the tools should be flexible, to allow users to cater for local variations in any feature, from orthographic form (capitalization, punctuation, contractions, etc)
    > to size of field in the databases, etc.
    >
    > Best wishes
    > Ramesh
    >
    > Ramesh Krishnamurthy
    > Consultant, Collins Cobuild and Bank of English Corpus;
    > Honorary Research Fellow, Centre for Corpus Linguistics, University of Birmingham;
    > Honorary Research Fellow, Computational Linguistics Research Group, University of Wolverhampton.
    >
    >
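
    Points 17 and 18 of the checklist above (validation routines, checking
    cross-references, checking that examples contain the headword) can be
    sketched in a few lines of Python. This is an illustration under
    stated assumptions, not any existing system: the Entry record, its
    field names and the validate function are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical, much-simplified dictionary entry."""
    headword: str
    examples: list[str] = field(default_factory=list)
    cross_refs: list[str] = field(default_factory=list)


def validate(entries: list[Entry]) -> list[str]:
    """Report examples lacking the headword and cross-references
    that point at no existing entry."""
    known = {e.headword.casefold() for e in entries}
    problems = []
    for e in entries:
        # Crude inflection handling: headword followed by any word characters.
        pattern = re.compile(rf"\b{re.escape(e.headword)}\w*", re.IGNORECASE)
        for ex in e.examples:
            if not pattern.search(ex):
                problems.append(f"{e.headword}: example lacks headword: {ex!r}")
        for target in e.cross_refs:
            if target.casefold() not in known:
                problems.append(f"{e.headword}: dangling cross-reference: {target}")
    return problems


entries = [
    Entry("run", ["She runs every day.", "He ran home."], ["jog"]),
    Entry("jog", ["They jog in the park."], ["sprint"]),
]
problems = validate(entries)
for problem in problems:
    print(problem)
```

    The false alarm on "He ran home." shows why such checks need the
    lemmatization of point 3: an irregular form cannot be recognised by
    string matching alone.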

    John Williams writes:
    >
    > Dear Dr De Lucca,
    >
    > As a former colleague of Ramesh, I haven't got much to add to his very
    > comprehensive checklist. But you may wish to consider to what extent the
    > data requirements (approx. points 1-14 of Ramesh's list) need to be
    > integrated with the compilation package proper (approx. points 15-20).
    > For maximum reusability, you may want to separate the two components,
    > and maybe your brief only covers the latter.
    >
    > I would add a couple of things:
    > - since many big dictionary projects today are compiled by dispersed
    > teams working on their own computers, the software ideally needs to be
    > platform-independent, and include some kind of networking facility for
    > ease of file transfer;
    > - again, for maximum reusability and flexibility, the software should
    > allow the project manager to define his/her own tagset (though a basic
    > tagset should be included initially). There's a Croatian package called
    > Softlex that allows precisely this.
    >
    > Best wishes,
    >
    > John
    >
    >
    >
    > --
    >
    > John Williams
    >
    > Freelance Lexicographer
    >
    > Tel/Fax: (+44) (0)151 733 5459
    > Mobile: (+44) (0)7968 027829
    >
    > Web: http://www.eflex-mcmail.com
    >
    > E-mail: johnw@whoever.com
    >

    > ----- Original Message -----
    > From: delucca@nilc.icmc.usp.br
    > To: corpora@hd.uib.no
    > Cc: delucca@usp.br
    > Subject: [Corpora-List] Dictionary Creation Software
    >
    > Dear Colleagues,
    >
    > We are a team of researchers in Computational Linguistics and, at the
    > present time, we are working on constructing software tools for making
    > Dictionaries.
    >
    > We would like to hear from those who have experience of compiling
    > dictionaries and vocabularies about the following: WHAT you would like,
    > would need, and would hope for in Dictionary Creation Software. What type
    > of tools would be essential for making dictionaries, vocabularies and any
    > other type of reference work? A concordancer? A spelling checker?
    > Pronunciation?
    >
    > We look forward to hearing from you with great interest.
    >
    > Thank you very much in advance for your advice.
    >
    > Sincerely
    >
    >
    >
    >
    > J.L. DeLucca, PhD
    >
    > Interinstitutional Center for Research and Development in Computational
    > Linguistics (NILC)
    > Sao Paulo University
    >
    >

    -- 
    NEW!! MSc and Short Courses in Lexical Computing and Lexicography
    Info at http://www.itri.brighton.ac.uk/lexicom

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    Adam Kilgarriff
    Senior Research Fellow                     tel: (44) 1273 642919
    Information Technology Research Institute       (44) 1273 642900
    University of Brighton                     fax: (44) 1273 642908
    Lewes Road
    Brighton BN2 4GJ           email: Adam.Kilgarriff@itri.bton.ac.uk
    UK                    http://www.itri.bton.ac.uk/~Adam.Kilgarriff
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


