Corpora: LREC Workshop

Nancy M. Ide (ide@cs.vassar.edu)
Sat, 18 Dec 1999 11:39:05 -0500 (EST)

+**+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+**+

Second International Conference on Language Resources and Evaluation
(LREC 2000)

Athens, Greece

Pre-Conference Workshop Announcement and Call for Participation

Data Architectures and Software Support for Large Corpora:
Towards an American National Corpus

Monday, May 29, 2000

http://www.cs.vassar.edu/~ide/anc/lrec.html

*+**+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+**

Description
-----------

Several software systems for linguistic annotation, search, and
retrieval of large corpora have been developed within the natural
language processing community over the past several years, including
LT-XML (Edinburgh), GATE (Sheffield), IMS Corpus Workbench
(Stuttgart), Alembic Workbench (Mitre), MATE
(Edinburgh/Odense/Stuttgart), Silfide (Loria/CNRS), SARA (BNC), and
several others. In support of this development, there have also been
efforts to develop standards for encoding and for various kinds of
linguistic annotation, as well as data architectures (e.g., TIPSTER,
TalkBank). Still other developments, such as the
introduction of XML and the powerful XSL transformation language and
work on semi-structured data (e.g., the work of the Lore group at
Stanford), have also impacted the ways in which corpora and other
linguistic resources can be represented, stored, and accessed.

Current systems for the annotation and exploitation of linguistic
corpora vary widely in the fundamental design of their formats, data,
and tools. A primary reason for this diversity is that
most developers of formats and systems are concerned with only one
aspect of the creation/annotation/exploitation process. However, in
order to work effectively to develop commonality, the phases of the
process must be considered as a whole. This demands bringing together
researchers and developers from a variety of domains in text, speech,
video, etc., many of whom have previously had little or no contact
with one another.

This workshop is intended to bring these groups together to look
broadly at the technical issues that bear on the development of
software systems for the annotation and exploitation of linguistic
resources. The goal is to lay the groundwork for the definition of a
data and system architecture to support corpus annotation and
exploitation that can be widely adopted within the community. Among
the issues to be addressed are:

o layered data architectures
o system architectures for distributed databases
o support for plurality of annotation schemes
o impact and use of XML/XSL
o support for multimedia, including speech and video
o tools for creation, annotation, query and access of corpora
o mechanisms for linkage of annotation and primary data
o applicability of semi-structured data models, search and query
systems, etc.
o evaluation/validation of systems and annotations

The motivation for this workshop is the American National Corpus (ANC)
effort, which should begin corpus creation within the year. We
anticipate that the ANC will provide a significant resource for
natural language processing, and we therefore seek to identify
state-of-the-art methods for its creation, annotation, and
exploitation. Also, because the ANC will be a national and freely
available resource, its data and system architecture is likely to
become a de facto standard. We therefore hope to draw together leading
researchers and
developers to establish a basis for the design of a system to support
the creation and use of the ANC.

A "Birds of a Feather" session for those interested in the ANC project
will be held immediately following the workshop.

Submission information
----------------------

Submissions should address one or more of the listed
topics. Descriptions of planned or existing systems are acceptable, but
they should be situated in the larger context of the issues the
workshop addresses, e.g., by outlining the strengths and/or weaknesses
of the system and/or data formats, comparing it with alternative
approaches, etc.

A 3000-4500 word abstract in English should be submitted by e-mail to
Nancy Ide (ide@cs.vassar.edu) in plain ASCII text format and with the
subject line "LREC WORKSHOP SUBMISSION : <First author's name>". Each submission
should include title; author(s); affiliation(s); and contact author's
e-mail address, postal address, telephone and fax numbers.

February 15, 2000 : Submissions due
March 15, 2000 : Results transmitted to authors
April 15, 2000 : Final Papers due
May 29, 2000 : Workshop

Organizing Committee
--------------------

Nancy Ide (contact)
Department of Computer Science
Vassar College
Poughkeepsie, New York 12604-0520 USA
Tel : +1 914 437 5988
Fax : +1 914 437 7498
Email : ide@vassar.edu

Laurent Romary
LORIA/CNRS
Campus Scientifique - BP 239
54506 Vandoeuvre-lès-Nancy FRANCE
Tel : +33 (0)3 83 59 30 00
Fax : +33 (0)3 83 27 83 19
Email : romary@loria.fr

Henry S. Thompson
Human Communication Research Centre
2 Buccleuch Place
Edinburgh EH8 9LW
SCOTLAND
Tel : +44 (131) 650 4440
Fax : +44 (131) 650 4587
Email : ht@cogsci.ed.ac.uk

Program Committee
-----------------

Steven Bird, Linguistic Data Consortium
Patrice Bonhomme, LORIA/CNRS
Roy Byrd, IBM Corporation
Jean Carletta, HCRC Edinburgh
Ulrich Heid, IMS Stuttgart
Hamish Cunningham, Sheffield
David Day, Mitre Corporation
Robert Gaizauskas, Sheffield
Ralph Grishman, New York University
Nancy Ide, Vassar College (Chair)
Masato Ishizaki, JAIST
Dan Jurafsky, University of Colorado at Boulder
Tony McEnery, Lancaster
David McKelvie, HCRC Edinburgh
Laurent Romary, LORIA/CNRS
Gary Simons, Summer Institute of Linguistics
Henry Thompson, HCRC Edinburgh
Yorick Wilks, Sheffield
Peter Wittenburg, Max Planck Institute
Remi Zajac, New Mexico State University