[Corpora-List] Developing Linguistic Corpora: a guide to good practice

From: Martin Wynne (martin.wynne@oucs.ox.ac.uk)
Date: Mon Oct 10 2005 - 16:33:22 MET DST


    The Arts and Humanities Data Service (AHDS) have published 'Developing
    Linguistic Corpora', edited by Martin Wynne of the Oxford Text Archive.
    This is the latest in the series of AHDS Guides to Good Practice.

    The printed book can be ordered online from Oxbow Books
    (http://www.oxbowbooks.com/) for £15 plus post and packing, and the full
    text is available for free online at http://ahds.ac.uk/linguistic-corpora/.

    In this volume, a selection of leading experts offers advice to help
    readers ensure that their corpus is well designed and fit for its
    intended purpose.

    As John Sinclair writes in the first chapter: "A corpus is a remarkable
    thing, not so much because it is a collection of language text, but
    because of the properties that it acquires if it is well-designed and
    carefully-constructed."

    The collection includes the following chapters:

    * 'Corpus and text: basic principles' by John Sinclair
    * 'Adding linguistic annotation' by Geoffrey Leech
    * 'Metadata for corpus work' by Lou Burnard
    * 'Character encoding in corpus construction' by Tony McEnery and
    Richard Xiao
    * 'Spoken language corpora' by Paul Thompson
    * 'Archiving, distribution and preservation' by Martin Wynne

    John Sinclair sets out ten principles for corpus design, plus a new
    definition of a corpus. Geoffrey Leech offers a taxonomy of types of
    annotations as well as clear guidelines and some provisional standards
    for annotation at various linguistic levels. Lou Burnard explains the
    different types of metadata which can be provided for a corpus, and
    gives examples of how these can be implemented using the Text Encoding
    Initiative guidelines. Tony McEnery and Richard Xiao take on the tricky
    issue of encoding characters in languages other than English, giving an
    historical overview of the various solutions, leading to a discussion of
    how to use Unicode today in encoding corpus texts. Paul Thompson draws
    on his experience in developing the British Academic Spoken English
    (BASE) corpus to set out the stages involved in the development and
    exploitation of a corpus of speech, covering data collection,
    transcription, markup and annotation, and access. In chapter six, Martin
    Wynne explains how good planning and design can help to ensure the
    ongoing availability and usefulness of a corpus.

    This and other guides in the series are available from
    http://www.ahds.ac.uk/creating/guides/.

    AHDS Literature, Languages and Linguistics is hosted by the Oxford Text
    Archive, and is the repository for many freely available corpora in
    several languages, including English, French, German, Italian, Chinese
    and a variety of South Asian languages. There are also historical
    corpora, such as the Old English Corpus, the Helsinki Corpus of English
    Texts and the Lampeter Corpus of Early Modern English Tracts. These
    resources can be found via the experimental new AHDS cross-subject
    catalogue at http://www.ahds.ac.uk/, and at the OTA website at
    http://www.ota.ox.ac.uk. A listing of corpora is at
    http://www.ota.ox.ac.uk/search/search.perl?misc=corpus. Note that some
    of these resources are available for immediate download and others
    require the user to write in for permission to download them.

    Regards,
    Martin

    -- 
    Martin Wynne
    Head of the Oxford Text Archive and
    AHDS Literature, Languages and Linguistics
    

    Oxford University Computing Services
    13 Banbury Road
    Oxford UK - OX2 6NN
    Tel: +44 1865 283299
    Fax: +44 1865 273275
    martin.wynne@oucs.ox.ac.uk


