[Corpora-List] Developing Linguistic Corpora: a guide to good practice

From: Martin Wynne (martin.wynne@oucs.ox.ac.uk)
Date: Mon Oct 10 2005 - 16:33:22 MET DST

The Arts and Humanities Data Service (AHDS) have published 'Developing
Linguistic Corpora', edited by Martin Wynne of the Oxford Text Archive.
This is the latest in the series of AHDS Guides to Good Practice.

The printed book can be ordered online from Oxbow Books
(http://www.oxbowbooks.com/) for £15 plus post and packing, and the full
text is available for free online at http://ahds.ac.uk/linguistic-corpora/.

In this volume, a selection of leading experts offer advice to help the
reader to ensure that their corpus is well-designed and fit for the
intended purpose.

As John Sinclair writes in the first chapter: "A corpus is a remarkable
thing, not so much because it is a collection of language text, but
because of the properties that it acquires if it is well-designed and
carefully-constructed."

The collection includes the following chapters:

* 'Corpus and text: basic principles' by John Sinclair
* 'Adding linguistic annotation' by Geoffrey Leech
* 'Metadata for corpus work' by Lou Burnard
* 'Character encoding in corpus construction' by Tony McEnery and
Richard Xiao
* 'Spoken language corpora' by Paul Thompson
* 'Archiving, distribution and preservation' by Martin Wynne

John Sinclair sets out ten principles for corpus design, plus a new
definition of a corpus. Geoffrey Leech offers a taxonomy of types of
annotations as well as clear guidelines and some provisional standards
for annotation at various linguistic levels. Lou Burnard explains the
different types of metadata which can be provided for a corpus, and
gives examples of how these can be implemented using the Text Encoding
Initiative guidelines. Tony McEnery and Richard Xiao take on the tricky
issue of encoding characters in languages other than English, giving an
historical overview of the various solutions, leading to a discussion of
how to use Unicode today in encoding corpus texts. Paul Thompson draws
on his experience in developing the British Academic Spoken English
(BASE) corpus to set out the stages involved in the development and
exploitation of a corpus of speech, covering data collection,
transcription, markup and annotation, and access. In chapter six, Martin
Wynne explains how good planning and design can help to ensure the
ongoing availability and usefulness of a corpus.

This and other guides in the series are available from
http://www.ahds.ac.uk/creating/guides/.

AHDS Literature, Languages and Linguistics is hosted by the Oxford Text
Archive, and is the repository for many freely available corpora in
several languages, including English, French, German, Italian, Chinese
and a variety of South Asian languages. There are also historical
corpora, such as the Old English Corpus, the Helsinki Corpus of English
Texts and the Lampeter Corpus of Early Modern English Tracts. These
resources can be found via the experimental new AHDS cross-subject
catalogue at http://www.ahds.ac.uk/, and at the OTA website at
http://www.ota.ox.ac.uk. A listing of corpora is at
http://www.ota.ox.ac.uk/search/search.perl?misc=corpus. Note that some
of these resources are available for immediate download and others
require the user to write in for permission to download them.

Regards,
Martin

--
Martin Wynne
Head of the Oxford Text Archive and
AHDS Literature, Languages and Linguistics

Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275
martin.wynne@oucs.ox.ac.uk
