Corpora: Technical Report: A Formal Framework for Linguistic Annotation

Steven Bird (sb@unagi.cis.upenn.edu)
Wed, 3 Mar 1999 11:21:14 -0500 (EST)

Announcing a technical report on linguistic annotation. See below
for download information. Our apologies for any duplicate messages.

A Formal Framework for Linguistic Annotation
Steven Bird & Mark Liberman

Abstract

`Linguistic annotation' covers any descriptive or analytic notations
applied to raw language data. The basic data may be in the form of
time functions - audio, video and/or physiological recordings - or it
may be textual. The added notations may include transcriptions of all
sorts (from phonetic features to discourse structures), part-of-speech
and sense tagging, syntactic analysis, `named entity' identification,
co-reference annotation, and so on. While there are several ongoing
efforts to provide formats and tools for such annotations and to
publish annotated linguistic databases, the lack of widely accepted
standards is becoming a critical problem. Proposed standards, to the
extent they exist, have focussed on file formats. This paper focuses
instead on the logical structure of linguistic annotations. We survey
a wide variety of existing annotation formats and demonstrate a common
conceptual core, the annotation graph. This provides a formal
framework for constructing, maintaining and searching linguistic
annotations, while remaining consistent with many alternative data
structures and file formats.
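
The abstract names the annotation graph but does not define it. As a rough illustration only, the sketch below assumes that such a graph consists of nodes carrying optional time offsets and of arcs carrying an annotation type and a label. The class names, field names and the `find` helper are hypothetical and are not taken from the report.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Arc:
    src: str    # id of the start node
    dst: str    # id of the end node
    kind: str   # annotation type, e.g. "word" or "phrase" (illustrative names)
    label: str  # annotation content


@dataclass
class AnnotationGraph:
    # node id -> time offset in seconds, or None if the node has no time anchor
    times: Dict[str, Optional[float]] = field(default_factory=dict)
    arcs: List[Arc] = field(default_factory=list)

    def add_node(self, node_id: str, time: Optional[float] = None) -> None:
        self.times[node_id] = time

    def add_arc(self, src: str, dst: str, kind: str, label: str) -> None:
        if src not in self.times or dst not in self.times:
            raise KeyError("both endpoints must be added before the arc")
        self.arcs.append(Arc(src, dst, kind, label))

    def find(self, kind: str, start: float, end: float) -> List[Arc]:
        """Return arcs of one type whose timed span overlaps (start, end)."""
        hits = []
        for arc in self.arcs:
            t0, t1 = self.times[arc.src], self.times[arc.dst]
            if arc.kind != kind or t0 is None or t1 is None:
                continue  # skip other types and arcs without time anchors
            if t0 < end and t1 > start:
                hits.append(arc)
        return hits


# Construct a small graph: two word arcs and one phrase arc spanning both.
g = AnnotationGraph()
g.add_node("n0", 0.00)
g.add_node("n1", 0.35)
g.add_node("n2", 0.80)
g.add_arc("n0", "n1", "word", "the")
g.add_arc("n1", "n2", "word", "cat")
g.add_arc("n0", "n2", "phrase", "NP")

# Search: which words overlap the interval from 0.3 s to 0.5 s?
words = [a.label for a in g.find("word", 0.3, 0.5)]
```

Because different annotation types are simply arcs over shared nodes, a word-level and a phrase-level annotation can coexist in one structure and be searched by time span, whatever file format stores them.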

49pp, download from: [http://xxx.lanl.gov/abs/cs.CL/9903003]
Formats: PDF (336 KB), PostScript (161 KB), DVI (134 KB), LaTeX (112 KB)

For an online survey and extensive links, visit the
Linguistic Annotations Page: [http://www.ldc.upenn.edu/annotation]

@TechReport{BirdLiberman99,
author={Steven Bird and Mark Liberman},
title={A Formal Framework for Linguistic Annotation},
institution={Department of Computer and Information Science,
University of Pennsylvania},
year=1999,
number={MS-CIS-99-01},
note={[xxx.lanl.gov/abs/cs.CL/9903003]}
}

Please send comments to: sb@ldc.upenn.edu, myl@ldc.upenn.edu

Regards,
Steven Bird & Mark Liberman

--
Steven.Bird@ldc.upenn.edu  http://www.ldc.upenn.edu/sb
Assoc Director, LDC; Adj Assoc Prof, CIS & Linguistics
Linguistic Data Consortium, University of Pennsylvania
3615 Market St, Suite 200, Philadelphia, PA 19104-2608