THE TEXT ENCODING INITIATIVE: A PROGRESS REPORT

Lou Burnard

The first public draft of the Text Encoding Initiative's Guidelines for the Encoding and Interchange of Machine-Readable Texts is now available. Parts of the Guidelines were presented in preliminary form in June 1990 at the annual conferences of the ACH/ALLC in Siegen (Germany) and of the ACL in Pittsburgh (USA).
The first full publication of the Guidelines took place in August, marking the end of the first phase of the TEI's work. At the time of writing, over 500 copies of the Draft Guidelines have already been distributed for comment to interested scholars, researchers, librarians and computing specialists around the world. This article describes briefly the background to the TEI itself, summarizes the current content of the TEI Guidelines and outlines the work plan for the remaining two years of the project.

THE TEI

The Text Encoding Initiative (TEI) arose out of a planning conference convened by the Association for Computers and the Humanities (ACH) at Poughkeepsie, New York, in November 1987. It is a major international research project, sponsored jointly by the ACH, the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC), with the further participation of numerous other organizations and learned societies, and funded by the US National Endowment for the Humanities, DG XIII of the Commission of the European Community and the Andrew W Mellon Foundation.
Its task is to develop and disseminate a clearly defined format for the interchange of machine-readable texts among researchers, so as to allow easier and more efficient sharing of resources for textual computing and natural language processing. In addition, the TEI has taken on the task of making recommendations about which textual features should be distinguished when encoding texts from scratch, to help ensure that the resulting texts are as useful as possible to the research community. The current Guidelines thus have two closely related goals: a 'how', defining a format for text interchange among researchers, and a 'what', recommending specific practices in the encoding of new texts.
The availability of an international standard for the description of encoding schemes (the Standard Generalized Markup Language - SGML, ISO 8879, 1986) and its increasing recognition within both the commercial and academic text processing communities encourage the belief that this effort at standardization may succeed where previous ones have failed. SGML provides a simple method of text encoding which consistently distinguishes markup from the content of the text being marked up. It allows texts of any kind to be marked up as an ordered hierarchy of typed objects and is designed to be independent both of hardware and software environments and of the application to which encoded texts are to be put.
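The idea of an ordered hierarchy of typed objects, kept distinct from the text they contain, can be illustrated with a minimal sketch. It is not taken from the Guidelines: the tag names in the sample are hypothetical, and Python's html.parser module is used only as a convenient tokenizer for well-nested tags. It is not an SGML parser and knows nothing of DTDs or tag omission.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Collects tagged text into an ordered tree of typed nodes.

    Assumption: the input is well nested and every element has an
    explicit end tag. Real SGML permits much more than this.
    """

    def __init__(self):
        super().__init__()
        self.root = {"type": "#document", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Markup: open a new typed node under the current one.
        node = {"type": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Markup: close the current node if the end tag matches it.
        if len(self.stack) > 1 and self.stack[-1]["type"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Content: character data is stored as a plain string child.
        if data.strip():
            self.stack[-1]["children"].append(data)


def parse(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


def content(node):
    """Return the text of a node with all markup removed."""
    if isinstance(node, str):
        return node
    return "".join(content(child) for child in node["children"])


# Hypothetical sample; the tag names are illustrative only.
sample = "<text><body><p>Sent from <name>Oxford</name> in 1990.</p></body></text>"
tree = parse(sample)
```

Because markup and content are held apart, the same tree can yield the bare text (via content) or be searched by element type, without either use depending on how the text happens to be displayed.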
During its first two-year cycle, the work of the TEI was done within four working committees, with membership drawn from a broad cross-section of the international scholarly community. One committee, with expertise in librarianship and archive management, concerned itself with problems of text documentation and produced detailed recommendations for the in-file encoding of cataloguing information about the electronic text itself, its source and the relationship between the two. A second committee, with technical expertise in formal language theory and in SGML itself, produced recommendations about how SGML should best be used and addressed the problems of conversion between the TEI and other encoding schemes.
As a first attempt to divide up the daunting task of producing recommendations for the immense variety of textual features which scholarship might need to encode, a distinction was made between textual features conventionally represented by typographic or other visible means on the one hand, and those which could be identified only by some analytic or interpretative act on the other. The former features were the responsibility of the Text Representation Committee, which produced a set of recommendations on ways of dealing with diverse character sets and laid ground rules for the encoding of textual features common to most types of continuous prose text, as well as exploring some specific types of (largely, but not exclusively, literary) texts.
For the latter, the initial focus was on linguistic analysis, since it was felt that this would provide the best basis for any more specialised analytic or interpretative efforts in the future.
Members of the committee responsible for these analytic features, most of whom had considerable expertise in computational and theoretical linguistics, developed a number of powerful theory-independent mechanisms for the encoding of analytic features in SGML, represented as tree structures or as parallel (but aligned) levels of analysis. A subgroup of this committee also worked on defining a standard for monolingual dictionaries.
The task of co-ordinating the work of the four committees, and of combining their drafts into the initial publication, is carried out by two editors, one European and one American. The project as a whole is managed by a steering committee, with two representatives from each of the three sponsoring organisations.
An Advisory Board, with representatives from 15 major learned and professional societies, endorsed the initial work plan at its first meeting in February 1989, and will also (all being well) endorse the final Guidelines document when it is available in June of 1992.

THE GUIDELINES - DRAFT 1

It should be stressed that the first draft of the Guidelines, despite its weighty appearance (nearly 300 pages of closely printed A4), is very much a discussion paper and far from being complete or definitive. At least one and probably two interim drafts will be produced over the next two years, as described further below. Some characteristics of the TEI approach which are unlikely to change are, however, already discernible. One is a focus on the encoding of the content of text, rather than its appearance, which is also a characteristic of SGML. Another is the rigorous application of Occam's razor: the TEI approach to the immense variety of text types in the real world is to attempt to define a comparatively small number of features which all texts share, and to allow for these to be used in combination with user-definable sets of more specialised features.
The current draft has eight main chapters, which are briefly summarized below.
Chapter 1 outlines the purpose and scope of the TEI scheme. As outlined above, its main goals are both to facilitate data interchange and to provide guidance for those creating new texts. The desiderata of simplicity, clarity, formal rigour, sufficient power for research purposes, conformance to international standards, and independence of software, hardware or application alike are stressed.
Chapter 2, recognising that SGML is not yet widely understood within the TEI community, provides a gentle introduction to the basic concepts of SGML. It also contains some more technical information about the ways in which the TEI scheme uses the standard.
Chapter 3 addresses the problems of character encoding and translation in a world dominated by the rival claims of ASCII and EBCDIC. If the goal is to provide machine-independent support for all writing systems of all languages, these problems are far from trivial. The specific recommendations made are that only a subset of the ISO-646 character set (sometimes known as ASCII) can currently be relied on for data interchange, and that this should be extended either by using the entity reference mechanism provided by SGML or by using transliteration schemes. It proposes a powerful but economical way of documenting such transliteration schemes by a formal Writing System Declaration.
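As an illustration of the entity reference approach, the following minimal sketch rewrites a text so that it contains only 7-bit characters. It rests on several assumptions that are not part of the Guidelines: the table covers only a handful of accented letters (using entity names from the ISO 8879 Latin 1 set), the function names are hypothetical, and the whole of ASCII is treated as safe, whereas the draft recommends relying only on a subset of ISO 646.

```python
import re

# A small, illustrative subset of ISO 8879 entity names.
ENTITY_NAMES = {
    "&": "amp",      # the delimiter itself must be escaped
    "é": "eacute",
    "è": "egrave",
    "à": "agrave",
    "ç": "ccedil",
    "ñ": "ntilde",
    "ö": "ouml",
    "ü": "uuml",
}
CHARACTERS = {name: char for char, name in ENTITY_NAMES.items()}


def to_interchange(text):
    """Replace characters outside the safe set with entity references."""
    pieces = []
    for char in text:
        if char in ENTITY_NAMES:
            pieces.append("&" + ENTITY_NAMES[char] + ";")
        elif ord(char) < 128:
            # Simplification: all of ASCII is treated as safe here.
            pieces.append(char)
        else:
            raise ValueError("no entity name known for %r" % char)
    return "".join(pieces)


def from_interchange(text):
    """Expand the entity references produced by to_interchange."""
    return re.sub(
        r"&([A-Za-z][A-Za-z0-9]*);",
        lambda match: CHARACTERS[match.group(1)],
        text,
    )


encoded = to_interchange("café à Zürich")
decoded = from_interchange(encoded)
```

Here encoded is "caf&eacute; &agrave; Z&uuml;rich" and decoded is the original string again. A transliteration scheme would differ mainly in the table used, which is the kind of information a Writing System Declaration is intended to document.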
Chapter 4 contains recommendations for in-file documentation of electronic texts adequate to the bibliographic needs of researchers, data archivists and librarians. It recommends that a special header be added to each file to perform a function analogous to that of the title page of a non-electronic text, and proposes sets of tags for information about the file itself, the source from which it was derived and how it was encoded.
Chapter 5, the largest chapter, attempts to define general-purpose structural and non-structural tags for continuous prose texts. It embodies a view of text as a hierarchic structure, divided into front, body and back matter, within which neutrally-named subdivisions may be tagged, down to a level corresponding with paragraphs, or other segments. It also allows for 'phrase level tags' to identify non-structural units contained arbitrarily within the lowest level of structural tags, and proposes tagging schemes for such features as notes, names, abbreviations, numbers, foreign or emphasised phrases, cross references, and hypertextual links. Other sections discuss ways of encoding textual variation and critical apparatus and of recording the rendering of arbitrary textual fragments within this overall framework. There is also some discussion of different ways of maintaining multiple referencing schemes within the same text. Finally, it contains some initial proposals for low-level structured items (termed 'crystals') which can be contained by, or between, structural tags, such as lists, citations, formulae, figures and tables.
Chapter 6 outlines a number of theory-independent mechanisms for representing all kinds of linguistic analyses of running text. It is probably the most daunting chapter for the non-specialist reader, though much of its content is of very wide relevance.
It argues that most, if not all, linguistic analyses can be represented as bundles of named, value-bearing, 'feature structures', which may be nested and grouped into sets or lists.
It proposes ways of supporting multiple and independently aligned analyses, chiefly by means of the ID/IDREF pointer mechanism native to SGML. It also contains some tagsets for such commonly occurring formalisms as tree structures and parts of speech.
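The two ideas in Chapter 6, feature structures and pointer-based alignment, can be modelled together in a minimal sketch. Everything in it is an assumption made for illustration: the identifiers, the feature names and values, and the Annotation class are hypothetical and do not reproduce the draft's tagsets. Each segment of text carries a unique identifier (the role played by an SGML ID), and each analysis refers to segments only by identifier (the role played by IDREF), so that several analyses can coexist without disturbing the text or one another.

```python
from dataclasses import dataclass

# Segments of the running text, keyed by unique identifier.
tokens = {"w1": "Time", "w2": "flies"}


@dataclass
class Annotation:
    target: str      # identifier of the segment being analysed
    features: dict   # named features; a value may itself be a nested structure


# Two independent analyses aligned to the same segments.
morphology = [
    Annotation("w1", {"pos": "noun", "agr": {"number": "singular"}}),
    Annotation("w2", {"pos": "verb", "agr": {"number": "singular", "person": "3"}}),
]
syntax = [
    Annotation("w1", {"role": "subject"}),
    Annotation("w2", {"role": "predicate"}),
]


def resolve(annotations, segments):
    """Pair each annotation with the text it points to.

    A pointer with no matching identifier is reported as an error,
    much as a validating parser would reject an unmatched IDREF.
    """
    resolved = []
    for annotation in annotations:
        if annotation.target not in segments:
            raise KeyError("dangling pointer: " + annotation.target)
        resolved.append((segments[annotation.target], annotation.features))
    return resolved


def analyses_of(identifier, **layers):
    """Gather every layer's features for one segment."""
    return {
        name: [a.features for a in layer if a.target == identifier]
        for name, layer in layers.items()
    }


aligned = analyses_of("w2", morphology=morphology, syntax=syntax)
```

Adding a third analysis requires only another list of annotations. Neither the text nor the existing analyses need to change, which is the practical benefit of aligning by pointer rather than by nesting.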
Chapter 7 considers in more detail particular aspects of some specific types of text. The text-types discussed in this draft are: language corpora and collections; verse, drama, and narrative; dictionaries; and office documents. In each case, an overview of the problems specific to these types of discourse is given, with some preliminary proposals for tags appropriate to them. This chapter is one that will be considerably revised and extended over the coming months, as its initial proposals are firmed up and as its scope is extended to other types of text.
Chapter 8 outlines a method by which the current Guidelines may be modified and extended, largely by introducing indirection into the Document Type Definitions (the formal SGML specifications for the TEI encoding scheme). Ease of extension and modification is an important design goal: both are expected and intended, and the final form of the Guidelines will facilitate them.
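The effect of such indirection can be suggested by a minimal sketch. It is a loose analogy in Python rather than a fragment of any TEI DTD, and the element names, the class name 'phrase' and the added tag 'gloss' are all hypothetical. Content models refer to a named class of elements instead of listing its members directly, so that redefining the class in one place changes every model that mentions it.

```python
def expand(model, classes):
    """Replace each %name reference with the current members of that class."""
    expanded = []
    for item in model:
        if item.startswith("%"):
            # Indirection: look the class up at expansion time.
            expanded.extend(expand(classes[item[1:]], classes))
        else:
            expanded.append(item)
    return expanded


# Hypothetical base scheme: two elements share one class of phrase-level tags.
base_classes = {"phrase": ["name", "abbr", "foreign"]}
models = {
    "p": ["#PCDATA", "%phrase"],
    "note": ["#PCDATA", "%phrase"],
}

# A local extension redefines the class without touching the models.
local_classes = {"phrase": base_classes["phrase"] + ["gloss"]}

standard_p = expand(models["p"], base_classes)
extended_p = expand(models["p"], local_classes)
extended_note = expand(models["note"], local_classes)
```

With the base classes a paragraph may contain name, abbr and foreign; with the local classes both paragraphs and notes also accept gloss, although neither content model was edited.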
Preliminary versions of a number of technical appendixes are provided in the current draft. These include annotated examples, illustrating the application of the TEI encoding scheme to a wide range of texts, formal SGML document type declarations (DTDs) for all the tags and groups of tags defined in the TEI scheme, and code pages for some commonly used character sets. Later drafts will extend and improve these initial versions considerably, and will also contain an alphabetical reference section with a summary of each tag, its attributes, its usage, and an example of its use, as well as full Writing System Declarations for a range of commonly used alphabets.

THE FUTURE

By making the Guidelines available now, in an admittedly incomplete state, it is hoped to stimulate the widest possible discussion of their proposals. Incomplete as they are, it is believed that they contain a good basis for extension, and that the basic approach they advocate is a sound one. Three methods are proposed by which the academic community as a whole can participate in the task of putting this claim to the test. Firstly, individual scholars are encouraged to read and report on the usability of the current draft of the Guidelines, which is being distributed free of charge. Secondly, individual research projects engaged in the creation of large textual resources may become Affiliated Projects of the TEI and attempt to put its recommendations into practice. Affiliated Projects, once approved by the TEI's Steering Committee, will be given access to internal drafts of the Initiative and may have a major role in shaping the content of the final Guidelines. Thirdly, specialist working groups will be set up to help in the task of drafting extensions to the Guidelines. The TEI has limited funds to help in the setting up of such specialist groups and is actively seeking volunteers with specialist knowledge to extend the coverage of the Guidelines.
Over the next two years, by drawing on the expertise of specialist working groups and the experience of the affiliated projects, the scope and depth of the current draft Guidelines should be extended considerably. Feedback from these and from individual respondents will be acted on to refine the current proposals. No standard can be imposed: it must be accepted by the community which it aims to serve. That can only come about as a result of the widest participation. The publication of the draft Guidelines is thus only the first step in a process of consultation which will continue for many months to come.

FOR MORE INFORMATION ...

If you would like more information about the TEI, a copy of the Guidelines, or simply to be kept informed about the progress of the Initiative, please get in touch with one of the editors at the addresses below. The TEI also maintains an electronic bulletin board on which news of all TEI activities is regularly posted. To subscribe, send an electronic mail message containing only the line SUBSCRIBE TEI-L Your Name to LISTSERV@UICVM.EARN.

Editorial addresses:

In Europe: Lou Burnard, Oxford University Computing Service, 13 Banbury Rd, Oxford OX2 6NN, UK.
e-mail: LOU@VAX.OXFORD.AC.UK
tel. +44 (865) 273238
fax 273275

Elsewhere: C.M. Sperberg-McQueen, University of Illinois at Chicago, Computer Center MC 135, Box 6998, Chicago IL 60680, USA.
e-mail: U35395@UICVM.EARN
tel. +1 (312) 996-2981
fax 996-6834

Lou Burnard has worked at Oxford University Computing Service since 1974. He is founder and director of the Oxford Text Archive, and Associate Editor of the Text Encoding Initiative.