RUTH

A CONCORDANCE-BASED TEXT ENCODING PROGRAM

Øystein Reigem

TEXT ENCODING

In many areas of research, especially in the humanities, there is a need to analyse texts. Where large and complex texts are involved, one would like to use the computer for that purpose. Some computer analyses can be made on the basis of the text itself, but in many cases this places too heavy a burden on the analysis tools - the computer programs. E.g. a program that were to measure the degree of irony in a literary text would have to be very intelligent indeed. The solution to this problem is to code the texts prior to the analysis, i.e to mark or "tag" the various text elements relevant to the study. The codes or marks or tags should contain the necessary information characterizing the semantics and morpho|ogy of the elements.

Within the humanities the need for text encoding has received a lot of attention lately, especially through the international Text Encoding Initiative (TEI) project. The need for coding arises not only when texts are to be analysed, but also when texts are to be exchanged between researchers, institutions or projects. TEI base their recommendations on the Standard Generalized Markup Language (SGML), and so the TEI recommendations specify a comprehensive encoding of various text features from the overall structure down to minute details. For more information on TEI and SGML see the article The Text Encoding Initiative: A Progress Report by Lou Burnard, Oxford University Computing Service, in Humanistiske Data 3-90; the report Living with Guidelines, from the European TEI Workshop in Oxford 1þ2 July by Donald Spaeth in this issue; and the forthcoming report (in Norwegian) from a seminar on text encoding arranged in Bergen 21þ22 June by the Norwegian Computing Centre for the Humanities (NCCH) and the Wittgenstein Archivesin Bergen.

Other areas with a marked (no pun intended) interest in text encoding, are the publishing industry and those concerned with the transmission of electronic documents (EDI). SGML offers portability and device and program independence.

ENCODING PROGRAMS

Coding can be done manually, but there is obviously a great need for programs helping the user in the process, both programs conforming to SGML (and TEI), and others (like RUTH). Still there are rather few SGML-conformant programs for encoding/editing, and also few SGML-conformant programs for other purposes like filtering, formatting, retrieval and analysis of encoded texts.

RUTH

RUTH is an interactive text encoding program. RUTH reads an uncoded (or partially coded) text file, permits the user to add, delete or alter codes, and finally writes the coded text to another file.

RUTH does not conform to SGML or TEI. The main difference is that RUTH does not handle the full range of features found in the SGML standard or TEI recommendations. RUTH can be used to encode "short" elements like words and expressions, and also points in the text.

The interface of RUTH is based on a KWIC concordance, which makes the program suitable for some purposes, and obviously unsuitable for others.

RUTH is not an acronym. The program is named after its first user
- Ruth Vatvedt Fjeld, at the Department of Norwegian Lexicography at the University of Oslo. RUTH was developed to assist Fjeld in her task of encoding law texts for readability studies. Fjeld works with computer analysis of Norwegian Law language in a project supported by the Norwegian Computing Centre for the Humanities (NCCH) and two departments at the Oslo University - Department of Norwegian Lexicography and the Research Institute for Computers and Law. RUTH was written with a broader range of applications in mind, however.

The program is written by the author of the article - \ystein Reigem, at the NCCH.

The following is a description of the main features of the program, advantages and limitations, and there is also a discussion of possible enhancements. The author would like to make contact with potential users, and will also happily accept suggestions for modifications.

TECHNICAL INFO

-	RUTH runs under DOS. RUTH does not run under Windows. RAM requirements are unknown as yet. Disk requirements: temporary files need a lot of disk space.
-	RUTH is written in Turbo Pascal 6.0 (but not OOP - object oriented programming). The program also uses routines from Borland's Database Toolbox 4.0 and Turbo Power's Turbo Professional 5.11.
-	Interface: menus, windows, arrow keys; no mouse.

CHARACTERISTICS AND LIMITATIONS

RUTH was designed with the following characteristics and limitations:

Concordance-based interface: The program displays the text as a KWIC concordance ordered by word and right context (see any illustration). In such a concordance, identical or similar words and expressions will be grouped together. The program is therefore suited to encode words and expressions, as long as the occurrences most often should be coded the same way. Encoding "long" elements like sentences may be difficult or impossible. The KWIC is scrollable both vertically and horizontally (horizontally limited to about 250 characters on either side of the keyword). (When a code is added, deleted or altered, all occurrences in the display are updated.)

Comment: The design of the program, files and internal structures does not exclude further enhancements in the form of an additional word processor type interface. Such an interface might for example allow longer elements to be encoded.

Static text: The text is supposed to be static. No changes can be made to the text itself. Only codes kan be added, deleted or altered.

Comment: This is a more fundamental characteristic of the program. For the program to be able to handle a dynamic text in addition to codes, large parts of the program as well as data structures and files would have to be redesigned.

Redefinable code structure: The structure of the codes is read from a user-defined file.

(Different code sets may be used for input, screen and output. The program may therefore also be used for simple code conversions.)

No linguistic knowledge; very little knowledge of the internal structure of the codes: In these respects the program is fairly stupid. See code variants.

Comment: One might consider special versions of the program with e.g. linguistic knowledge in a separat module. A general solution may be harder to develop.

Types of codes: Area or point codes: Two main types of codes are allowed - codes for continuous areas like words and expressions and codes for points in the text (i.e. points with no extension). Points may at present only be defined immediately to the left of words.

Comment: Further enhancements may allow discontinuous areas, such as expressions with embedded words or clauses, and the encoding of arbitrary areas and points.

Code variants: Both area and point codes may have variants with different representations

- different tag open and close
- different strings starting the tag itself
- varying degree of showing the tag.

E.g. codes for word may have the form <tag>word</tag>; or <tag>word<> or <tag>;word and codes for expressions the form <tag>word word word</tag>; or <Etag>word word word<E>.

Manouvering - word, line, page, jump: At all times one word in the concordance will be highlighted. This is the current word - the word to be coded if the user presses the proper key. The highlighting can be moved horizontally word by word, and vertically line by line or page by page. The user can also jump directly to another keyword or retrace the most recent moves.

Comment: The user may jump directly to any keyword. In that event, the keyword must be keyed in. A possible enhancement might be to allow the user to jump to the word highlighted. The possibility of jumping to the first occurrence of the next or previous keyword would also be useful.

Encoding of expressions: One can encode sequences of words (e.g. expressions) by extending the highlighting to cover more than one word horizontally before pressing the key for encoding. See illustration 1.

Comment: A possible enhancement of the program may allow encoding discontinuous expressions.

Encoding of more than one occurrence of areas or points: Similarly, by extending the highlighting vertically, one can encode several occurrences of words, expressions or points. All occurrences will be given the same code. See illustration 2.

Comment: A possible enhancement of the program may allow encoding in discontinuous sequences of lines.

Horizontal scrolling: The screen is horizontally scrollable 254 characters on either side of the KWIC keyword (starting point of keyword, to be precise).

Encoding not restricted to concordance keywords: Areas or points to be coded need not contain or be near the KWIC keyword. This means that single occurrences of elements can be encoded when encountered - without the user having to jump to another keyword. Another potential advantage is the coding of elements that are grouped by a keyword not contained in the element.

Encoding restricted to "short" elements
: Elements to be coded must at least be partially inside the display line limit.

Suggested codes: Areas: The program always suggests a tag based upon (i.e. consisting mainly of) the word or expression that is to be encoded. This is suitable for codes containing lemma. Both areas and points: The program suggests all codes already used for the element. If, for instance, three occurrences of the same word are highlighted, and one of them already has a code, this code will be among the suggestions. Se also code pick lists.

Code pick lists: During the coding process the program remembers the codes set by the user. The program maintains two lists, one for area codes and one for point codes. The user may select codes from these lists. See illustration 7.

Comment: The pick list mechanism is fairly primitive. Among possible enhancements are: The possibility of editing the lists, to save lists, to access previously saved lists, to access user-defined lists, to handle lists of unlimited lengths, etc.

Not limited to hierarchical encoding
: The program allows overlapping non-hierarchical codes.

Tightly nested or neighbouring codes: The program allows the same area to be encoded with several tightly nested codes, or the same point with several codes.

Input may be partially encoded: The program accepts input texts that are already coded. This allows a text to be encoded in more than one pass. It may also be used to speed up the process of encoding, by combining an uncoded and a coded text. If the same elements occur in the two files, the program may suggest codes found in the coded part of the text (see suggested codes). Specially prepared coded texts might also be considered, e.g. lists of encoded words and expressions, whose sole purpose are to provide codes to be suggested by the program. Such lists might be condensed versions of coded texts. The program cannot prepare such lists, however.
Comment: This might be used to give the program a simple form of learning capability. Enhancements may be necessary to provide the required functionality, e.g. the possibility of making parts of the input text static (so that the coded text parts cannot be altered).

Input syntax check: The program checks the input for syntax errors (partially encoded input text).

Comment: Not good enough now: slow, not robust and does not recover from errors of any severity. Coding of erroneous input should probably not be allowed, but errors should not abort the syntax checking process. The process might be speeded up with a finite state automaton.

Save and restore: The result (the encoded text) may at any time be saved to files and restored from these files. (These files are not text files like the ordinary output file.)

Online help: Help is planned but not implemented - just a few prototype windows. The help function will be context-sensitive. The help information will contain cross-references.

No undo function. The user may save the result at any time, however (see save and restore).

Illustration 1. RUTH displays the text as a KWIC concordance. (The illustration only shows a detail of the RUTH KWIC screen. See the encoding example at the end of the article for full screen images (illustrations 3-7).)

Illustration 2. By extending the highlighting horizontally before pressing the encoding key, the user can encode sequences of words (expressions). By extending the high<->lighting vertically, the user can encode several occurrences of words or expressions (or points) at the same time.

POSSIBLE ENHANCEMENTS

Some possible enhancements to the program have already been mentioned. Others are:

  -   Different orderings of the concordance: The KWIC is now ordered by keyword and right context, disregarding case and diacritics, and treating the separators between the words as one space.
In some cases one might want other options, such as:
- stop words, i.e. words that should be excluded as keywords, or plus words, i.e. the only words to be used as keywords
- reverse alphabetical order, sorting by left context, or sorting by alternating right and left context words
- definition of other character set orderings, multiple characters, etc
- ordering by substrings of words, e.g. end morphemes from a plus list. (Then the KWIC is not really a KWIC any more.)

  -   Several orderings at once: If different orderings were possible, one might even want to switch between orderings during one coding process. (Technically, this is a fairly simple enhancement.)
  -   More "precise" coding: In order to allow precise encoding of other elements than words, expressions and points immediately to the left of words, it might be useful to be able to adjust - character by character or code by code - the area or point to be encoded.
  -   Restrictions: The possibility of enforcing restrictions like hierarchical encoding only.

A SIMPLE ENCODING EXAMPLE

Illustration 3. The screen (for an uncoded text) may look like this. A word, e.g. around the top of the screen, will be highlighted. The highlighting can be moved from line to line and from word to word with the arrow keys, like a cursor. One can also jump directly to a certain keyword in the KWIC concordance. One can then extend (or shrink) the highlighting to contain all the words and lines one wants to encode (with the same code).

Illustration 4. Here the user has highlighted all occurrences of the keyword "Steve". The words do not have to be identical, as in this example; neither is encoding restricted to the KWIC keywords.

Illustration 5. The user presses the proper key(s), a small window pops up, and she writes the kode (in this example "NAME").

Illustration 6. Finally the user presses the Enter key. After a short while the screen is updated, showing the code on all occurrences of "Steve". Note that "Steve" in the last line is now also encoded, since this is the same occurrence as four lines earlier.

Illustration 7. This final illustration shows a code pick list, one of the mechanisms for making the encoding process more efficient. The program remembers all codes given during an encoding session, and the user may select codes from the pick list in stead of keying them in.