4 The LOB tagging suite2

Since the purpose was to make use of, and at the same time improve on, the automatic tagging of the Brown Corpus (undertaken at Brown University 1971-78)3, the first step was collecting and analysing data from the tagged Brown Corpus. The tagged Brown Corpus was kindly made available to us by Henry Kucera and Nelson Francis, who also provided us with a copy of the automatic tagging program TAGGIT written by Greene and Rubin (1971). An exploratory run of the program on the LOB Corpus suggested that a new approach to tag selection would be needed, if we were to improve substantially on TAGGIT's performance. For comparability with the Brown Corpus, we had decided to use largely the same set of tags as were used by TAGGIT; but in practice some changes were advisable, and as a result of these changes, the new tag set (see Section 3 and Appendix 4) consisted of 134 tags, as against Brown's 87. The chief advantage we derived from the Brown tagging project was that we were able to make substantial use of the tagged Brown Corpus itself, as a data base for our automatic tagging. Our study yielded lists of word-tag and suffix-tag associations which formed the nucleus of our Tag-Assignment program (see 4.2). Also, by means of a group of Context Collecting programs, we were able to derive from the corpus frequency lists of tag sequences, and these were later adopted for inclusion in our Tag-Selection program (see 4.3). The overall process of tagging can be divided into three stages:

Fig. 1

As may be expected with programs acting on unrestricted language input, the automatic tagging programs required both a pre-editing phase, where human investigators prepared the corpus for input, and a post-editing phase, where they corrected any errors made by automatic tagging. Manual pre-editing and post-editing were both, however, carried out with the aid of computer programs. The automatic tagging process can be broken down into three logically separable processes:

Fig. 2

For development purposes, it was convenient to write a separate program for each of the three processes; but it would be easy enough in principle to combine them all into a single program. Logically speaking, the Automatic Tagging divides into Tag Assignment (whereby each word in the corpus is assigned one or more possible tags), and Tag Selection (whereby a single tag is selected as the correct one in context, from the one or more alternatives generated by Tag Assignment).

It was as something of an afterthought that we added to the Tag Assignment program (WORDTAG) and the Tag Selection program (CHAINPROBS) a third, intermediate program (IDIOMTAG) to deal with various grammatically anomalous word-sequences which, without intending any technical usage of the term, we may call 'idioms'. 4

4.1 Pre-editing

At the start of the process, the Raw Corpus (the corpus in its orthographic form) existed in a 'horizontal' format; i.e. it read from left to right in the normal way. A Verticalisation Program converted this corpus into a 'Vertical Corpus' in which one word occurred beneath another in a vertical column. At the same time, the Verticalization Program made automatic changes to provide help later in the tagging. These included supplying missing punctuation, splitting enclitic words (n't, 'll, etc) from their predecessors, changing capital letters to lower case at the beginning of sentences, in headings, etc; and marking foreign words, formulae, and other exceptional features of the text. The Verticalization Program also created a number of columns alongside the text, so that various kinds of information (orthographic, lexical, syntactic) could be recorded for future users of the corpus.
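The conversion just described can be pictured with a minimal sketch. It is not the project's Verticalization Program: the function names, the tokenising pattern and the two-item enclitic list are assumptions made for illustration, and only two of the automatic changes (enclitic splitting and lower-casing of the sentence-initial capital) are shown.

```python
import re

# Hypothetical enclitic list; the text names only n't and 'll ("etc").
ENCLITICS = ("n't", "'ll")

def split_enclitic(token):
    """Split a trailing enclitic from its host word, e.g. didn't -> did + n't."""
    for enc in ENCLITICS:
        if token.lower().endswith(enc) and len(token) > len(enc):
            return [token[:-len(enc)], token[-len(enc):]]
    return [token]

def verticalise(sentence):
    """Return the words of one sentence as a list, one entry per 'vertical' line."""
    words = []
    # Words (letters and apostrophes) or single punctuation characters.
    for token in re.findall(r"[A-Za-z']+|[^\sA-Za-z']", sentence):
        words.extend(split_enclitic(token))
    # Lower-case the sentence-initial capital, as the text describes.
    if words and words[0][:1].isupper():
        words[0] = words[0][0].lower() + words[0][1:]
    return words

print("\n".join(verticalise("They didn't say John'll come.")))
```

A sentence beginning with a proper name would be lower-cased by this sketch as well, which is exactly the kind of case that had to be restored by hand.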

When the Verticalization of the corpus took place, another set of programs produced 'Editlists' of particular text features which had to be checked by a human editor to see whether they had to be altered in order to be suitable input to the Automatic Tagging. The most important lists were those of 'CAPITALS' (non-sentence-initial words beginning with a capital letter) and 'UNCAPITALS' (sentence-initial words whose capital letter would have been changed to lower case by the Verticalization program). For example, if a sentence began with a proper name such as John, the program changed this to john, and a manual editor had to change it back again. The reason for these changes in capitalisation was that the Automatic Tagging programs made use of word-initial capitals in deciding what kind of tags to assign to a word (most words beginning with a capital end up being tagged as proper nouns; see 7.7 and Appendix 3).

Although the majority of pre-editing changes were made automatically by the Verticalization program, Pre-editing proved to be a time-consuming process, especially since all pre-editing decisions had to be carefully standardized and entered in a 'Pre-editing Manual'. In a subsequent tagging project we are now trying to eliminate manual pre-editing, by enabling the automatic tagging programs to accept a word with an initial capital as a possible variant of a lower case word. 5

4.2 Tag Assignment

The simplest kind of Tag Assignment procedure would be just a look-up in a WORDLIST or a dictionary specifying the tag(s) associated with each word. In addition to such a Wordlist, the Brown Tagging Program TAGGIT has a SUFFIXLIST, or list of pairings of word-endings and tags (for example, the ending -ness is associated with nouns). We followed Brown in this, using a Wordlist of over 7,000 words, and a Suffixlist of approximately 660 word-endings.6 Further, the LOB Tag Assignment Program contains a number of procedures for dealing with words containing hyphens, words beginning with a capital letter, words ending with -s, with 's, etc. The advantages of having a SUFFIXLIST are that (a) the WORDLIST can be shortened, since words whose wordclass is predictable from their ending can be omitted from it; and (b) the set of words accepted by the program can be open-ended, and can even include neologisms, rare words, nonsense words, etc. These advantages also apply to the procedures for dealing with hyphenated and capitalized words.

The Tag Assignment Program reads each word in turn, and carries out a series of testing procedures, to decide how the word should be tagged. The procedures are crucially ordered, so that if one procedure fails to tag a word, the word drops through to the next procedure. In the rare cases where none of the tag-assignment procedures is successful, the word is given a set of default tags. The program's structure can be summarized at its simplest by listing the major procedures as follows (where W = the word currently being tagged):

(1) Is W in the WORDLIST?
    If so, assign the tags given in the WORDLIST.

(2) Is W a number, a single letter, or a letter preceded or followed by a number of digits?
    If so, assign special tags.

(3) Does W contain a hyphen?
    If so, carry out the special procedure APPLYHYPHEN.

(4) Does W have a word-initial capital (WIC)?
    If so, carry out the special procedure APPLYWIC.

(5) Does W end with one of the endings in the SUFFIXLIST?
    If so, assign the tags specified in the SUFFIXLIST.

(6) Does W end in -s?
    If so, apply an -s stripping procedure, and check again whether W is in the WORDLIST, or failing that, the SUFFIXLIST. If it is, apply the tags given in the WORDLIST or SUFFIXLIST, retaining only those tags which are compatible with -s.
    If not, assign default tags for words ending in -s.

(7) If none of the above apply, assign default tags for words not ending in -s.

APPLYHYPHEN and APPLYWIC are 'macroprocedures' which themselves consist of a set of tests comparable to those of the main program. For further details, see the Flowcharts in Appendices 1-3.
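The ordering of procedures (1) to (7) can be sketched as a simple cascade. This is an illustration only, not the WORDTAG program: the toy WORDLIST and SUFFIXLIST entries, the default tags, the S_FORMS mapping used for "tags compatible with -s", the SPECIAL placeholder tag, and the stand-in bodies of apply_hyphen and apply_wic are all assumptions.

```python
import re

# Toy data invented for illustration; the real lists held over 7,000 words
# and approximately 660 word-endings.
WORDLIST = {"the": ["ATI"], "work": ["NN", "VB"], "deal": ["NN", "VB"]}
SUFFIXLIST = {"ness": ["NN"], "ly": ["RB"]}
S_FORMS = {"NN": "NNS", "VB": "VBZ"}      # assumed -s compatible counterparts
DEFAULT = ["NN", "VB", "JJ"]              # placeholder default tags
DEFAULT_S = ["NNS", "VBZ"]                # placeholder defaults for -s words

def suffix_tags(word):
    """Return the tags of the longest matching ending, or None."""
    for start in range(1, len(word)):
        tags = SUFFIXLIST.get(word[start:])
        if tags:
            return tags
    return None

def apply_hyphen(w):
    """Stand-in for APPLYHYPHEN: tag by the part after the last hyphen."""
    return assign_tags(w.rsplit("-", 1)[1])[0]

def apply_wic(w):
    """Stand-in for APPLYWIC: treat the word as a proper noun."""
    return ["NP"]

def assign_tags(w):
    """Return (tags, step), where step is the procedure that tagged w."""
    if w in WORDLIST:                                              # (1)
        return WORDLIST[w], 1
    if re.fullmatch(r"\d+|[A-Za-z]|[A-Za-z]\d+|\d+[A-Za-z]", w):   # (2)
        return ["SPECIAL"], 2
    if "-" in w:                                                   # (3)
        return apply_hyphen(w), 3
    if w[:1].isupper():                                            # (4)
        return apply_wic(w), 4
    tags = suffix_tags(w)                                          # (5)
    if tags:
        return tags, 5
    if w.endswith("s"):                                            # (6)
        stem = w[:-1]
        tags = WORDLIST.get(stem) or suffix_tags(stem) or []
        compatible = [S_FORMS[t] for t in tags if t in S_FORMS]
        return (compatible or DEFAULT_S), 6
    return DEFAULT, 7                                              # (7)

for word in ["the", "42", "kindness", "works", "John", "blorf"]:
    print(word, assign_tags(word))
```

Because each test returns as soon as it succeeds, a word reaches a later procedure only when every earlier one has failed, which is the sense in which the procedures are ordered.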

The output of the Tag Assignment Program is a version of the Vertical Corpus, in which one or more grammatical tags (with accompanying rarity markers @ or % if appropriate)7 are entered alongside each word. As an additional useful feature, this program provides a diagnostic (in the form of an integer between 0 and 100) indicating the tagging decision which led to the tag-assignment of each word. This enables the efficacy of each procedure in the program to be monitored, so that any improvement effected by changes in the program can be measured and analysed. In this respect, the program is self-evaluating. It can also be readily updated through revisions to the Tag-set, Wordlist, or Suffixlist.

4.3 Tag Selection

If one part of the project can be said to have made a particular contribution to automatic language processing, it is the Tag Selection Program (CHAINPROBS), the structure of which is described in greater detail in Marshall (1983). This program operates on a principle quite different from that of the Tag Selection part of the program used on the Brown Corpus. The Brown program used a set of CONTEXT FRAME RULES, which eliminated tags on the current word if they were incompatible with tags on the words within a span of two to the left or two to the right of the current word (W). Thus assuming a sequence of words -2, -1, W, +1, +2, an attempt was made to disambiguate W on the evidence of tags already unambiguously assigned to words -2, -1, +1, or +2. The rules worked only if one or more of these words were unambiguously tagged, and consequently often failed on sequences of ambiguous words. Moreover, as many as 80% of the applications of the Context Frame Rules made use of only one word to the left or to the right of W. These observations, made by running the Brown Program over part of the LOB Corpus, led us to develop, as a prototype of the LOB Tag-Selection Program, a program which computes transitional probabilities between one tag and the next for all combinations of possible tags, and chooses the most likely path through a set of ambiguous tags on this basis.

Given a sequence of ambiguous tags, the prototype Tag-Selection Program computed all possible combinations of tag-sequences (i.e. all possible paths), building up a search tree. It treated each possible Tag Sequence or path as a first-order Markov chain, assigning to each path a probability relative to other paths, and reducing by a constant scaling factor the likelihood of sequences containing tags marked with a rarity marker @ or %. Our assumption was that the frequency of tag sequences in the Tagged Brown Corpus would be a good guide to the probability of such sequences in the LOB Corpus; these frequencies were therefore extracted from the Brown Corpus data, and adjusted to take account of changes we had made to the Brown Tag-set. We expected that the choice of tags on the basis of first-order probabilities would provide a rough-and-ready tag-selection procedure which would then have to be refined to take account of higher-order probabilities. It is generally assumed, following Chomsky (1957:18-25) that a first-order Markov process is an inadequate model of human language. We therefore found it encouraging that the success rate of this simple first-order probabilistic algorithm, when tried out on a sample of over 15,000 words of the LOB Corpus, was as high as 94%. An example of the output of this program (from Marshall 1983) is given in Fig 3:

Fig 3

this        DT
task        NN
involved    [VBD]/90 VBN/10 JJ@/0
a           AT
very        [QL]/99 JJB@/1
great       [JJ]/98 RB/2
deal        [NN]/99 VB/1
of          IN
detailed    [JJ]/98 VBN/2 VBD/0
work        [NN]/100 VB/0
for         [IN]/97 CS/3
the         ATI
committee   NN

In this output, the tags supplied by the Tag Assignment Program are accompanied by a probability expressed as a percentage. For example, the entry for the word involved ([VBD]/90 VBN/10 JJ@/0) indicates that the tag VBD 'past tense verb' has an estimated probability of 90%; that the tag VBN 'past participle' has an estimated probability of 10%; and that the tag JJ 'adjective' has an estimated probability of 0%. The symbol @ after JJ means that the Tag Assignment program has already marked the 'adjective' tag as rare for this word. The square brackets enclosing the 'past tense' tag indicate that this tag has been selected as correct by the Tag Selection Program. (The square brackets are used to indicate the preferred tag for every word that is marked as ambiguous; where the word has only one assigned tag, this marking is omitted as unnecessary.)
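The prototype's principle (enumerate every path, score each as a first-order Markov chain, scale down tags carrying a rarity marker) can be illustrated with a small sketch. It is not CHAINPROBS: the transition weights, the scaling constant RARE_SCALE and the FLOOR value for unseen tag pairs are invented, and deriving the per-tag percentages as each tag's share of the total path probability is an assumption about how figures like those in Fig 3 could be obtained.

```python
from itertools import product

# Invented transition weights for (previous tag, next tag); the project took
# its figures from tag-pair frequencies in the tagged Brown Corpus.
TRANS = {
    ("NN", "VBD"): 0.05, ("NN", "VBN"): 0.01, ("NN", "JJ"): 0.005,
    ("VBD", "AT"): 0.20, ("VBN", "AT"): 0.10, ("JJ", "AT"): 0.01,
}
RARE_SCALE = 0.1   # hypothetical constant applied to tags marked @ or %
FLOOR = 1e-6       # hypothetical weight for tag pairs missing from TRANS

def select_tags(candidates):
    """candidates: one list of possible tags per word.
    Returns the most likely path and, per word, each tag's share of the
    total probability over all paths."""
    scores = {}
    for path in product(*candidates):          # every combination of tags
        bare = [t.rstrip("@%") for t in path]
        p = 1.0
        for prev, nxt in zip(bare, bare[1:]):  # first-order transitions
            p *= TRANS.get((prev, nxt), FLOOR)
        for t in path:
            if t[-1] in "@%":                  # rarity marker present
                p *= RARE_SCALE
        scores[path] = p
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    shares = [
        {t: sum(p for path, p in scores.items() if path[i] == t) / total
         for t in tags}
        for i, tags in enumerate(candidates)
    ]
    return best, shares

best, shares = select_tags([["NN"], ["VBD", "VBN", "JJ@"], ["AT"]])
print(best)
print({t: round(100 * s) for t, s in shares[1].items()})
```

Exhaustive enumeration grows exponentially with the length of an ambiguous stretch, so a sketch of this kind is practical only because such stretches are bounded by unambiguously tagged words.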

An improved Tag Selection Program was developed as a result of an analysis of the errors made by the prototype program. We realised that an attempt to supplement the first-order transition matrix by a second-order matrix would lead to a vast increase in the amount of data to be handled as part of the program, with only a marginal increase in the program's success. A more practical approach would be to concentrate on those limited areas where failure to take account of longer sequences resulted in errors, and to introduce a scaling factor to adjust such sequences in the direction of the desired result. For instance, the occurrence of an adverb between two verb forms (as in has recently visited) often led to the mistaken selection of VBD rather than VBN for the second verb, and this mistake could be corrected by downgrading the likelihood of a triple consisting of the verb be or have followed by an adverb followed by a past tense verb. Similarly, many errors resulted from sequences such as live and work, where we would expect the same word-class to occur on either side of the coordinator - something which an algorithm using frequency of tag-pairs alone could not predict. This again could be handled by boosting or reducing the predicted likelihood of certain tag triples. A further useful addition to the program was an alternative method of calculating relative likelihood, making use of the probability of a word's belonging to a particular grammatical class, rather than the probability of the occurrence of a whole sequence of tags. This serves as a cross-check on the 'sequence-probability' method, and appears to be more accurate for some classes of cases. These improvements, together with the introduction of an Idiom Tagging Program (see 4.4 below), resulted in an overall success rate of between 96.0% and 97.0%. (This calculation excludes punctuation tags, which are automatically 'correct'.)
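The idea of adjusting selected tag triples, rather than adding a full second-order matrix, might look as follows. The table entries, the factor values and the function name are hypothetical; the tag labels HVZ, BEZ, RB and CC are used here only as illustrative names for a form of have, a form of be, an adverb and a coordinator.

```python
# Hypothetical scaling factors for a handful of tag triples: values below 1
# downgrade a triple, values above 1 boost it.
TRIPLE_SCALE = {
    ("HVZ", "RB", "VBD"): 0.1,   # have + adverb + past tense: unlikely
    ("BEZ", "RB", "VBD"): 0.1,   # be + adverb + past tense: unlikely
    ("VB", "CC", "VB"): 2.0,     # same word-class on both sides of a coordinator
}

def adjust(path, prob):
    """Multiply a path probability by the factor of every listed triple in it."""
    for triple in zip(path, path[1:], path[2:]):
        prob *= TRIPLE_SCALE.get(triple, 1.0)
    return prob

# 'has recently visited': the past-tense reading is downgraded, the
# past-participle reading is left untouched.
print(adjust(("HVZ", "RB", "VBD"), 0.5))
print(adjust(("HVZ", "RB", "VBN"), 0.5))
```

Only triples present in the table affect the result, so the first-order scores are unchanged everywhere else.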

4.4 Idiom Tagging

The third tagging program, which intervenes between the Tag Assignment and Tag Selection programs, is an Idiom Tagging Program (IDIOMTAG) developed as a means of dealing with idiosyncratic word sequences which would otherwise cause difficulty for the automatic tagging. One set of anomalous cases consists of sequences which are best treated, grammatically, as a single word: for example, in order that is tagged as a single conjunction, as to as a single preposition, and each other as a single pronoun. Another group consists of sequences in which a given word-type is associated with a neighbouring grammatical category; for example, preceding the preposition by, a word like invoked is usually a past participle rather than a past tense verb. The Idiom Tagging Program is flexible in the sorts of sequence it can recognize, and in the sorts of operation it can perform: it can look either at the tags associated with a word, or at the word itself; it can look at any combination of words and tags, with or without intervening words. It can delete tags, add tags, or change the probability of tags. It uses an Idiom Dictionary to which new entries may be added as they arise in the corpus. In theory, the program can handle any number of idiomatic sequences, and thereby anticipate likely mis-taggings by the Tag Selection Program; in practice, in the LOB Corpus tagging project, the program was used in a rather limited way, to deal with a few areas of difficulty. Although this program might seem to be an ad hoc device, it is worth bearing in mind that any fully automatic language analysis system has to come to terms with problems of lexical idiosyncrasy.
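The first group of cases (a word sequence treated as a single word) can be sketched as a dictionary look-up over the output of Tag Assignment. This is not IDIOMTAG: the dictionary format, the tag labels CS, IN and PN, and the choice to overwrite each word's candidate tags with the single idiom tag are assumptions made for the example.

```python
# Hypothetical Idiom Dictionary: word sequence -> tag for the whole sequence.
IDIOMS = {
    ("in", "order", "that"): "CS",   # conjunction
    ("as", "to"): "IN",              # preposition
    ("each", "other"): "PN",         # pronoun
}

def tag_idioms(words):
    """words: list of (word, [candidate tags]) pairs from Tag Assignment.
    Returns a new list in which every word of a matched idiom carries only
    the idiom's tag."""
    out = list(words)
    i = 0
    while i < len(out):
        for seq, tag in IDIOMS.items():
            window = tuple(w.lower() for w, _ in out[i:i + len(seq)])
            if window == seq:
                for j in range(i, i + len(seq)):
                    out[j] = (out[j][0], [tag])
                i += len(seq) - 1    # skip the rest of the matched idiom
                break
        i += 1
    return out

sentence = [("in", ["IN"]), ("order", ["NN", "VB"]),
            ("that", ["CS", "DT"]), ("he", ["PP3A"])]
for word, tags in tag_idioms(sentence):
    print(word, tags)
```

The second group of cases, where a neighbouring word changes the likelihood of a tag, would need entries that mix words and tags and adjust probabilities instead of replacing tags; that is left out of this sketch.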

4.5 Post-editing

The Vertical Corpus, after automatic tagging, contained, alongside each word, one or more grammatical tags, placed in order of their likelihood of occurring in this context. The tag selected by the program as the correct one was already indicated (see the example in 4.3). To simplify the task of the post-editor, a threshold was set below which the likelihood of error was low enough to be disregarded at the initial stage of post-editing. Sample analyses had shown that 60% of the text-words were unambiguously tagged; that of the 40% which were ambiguously tagged, 64% had a likelihood, as calculated by the Tag Selection Program, of more than 90%; and that these had only a 0.5% risk of being erroneous. This means that over the whole sample 86% of words could be treated as unambiguously tagged with less than 1% error. In these relatively safe cases, the output listing simply assumed the one tag to be correct, and gave alternative taggings only for the 14% of words for which the risk of error was relatively high.
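The 86% and 14% figures follow from the sample percentages quoted above, as the short calculation below shows. The variable names are ours, and the last line covers only the errors contributed by the high-likelihood ambiguous words, since the text gives no separate error rate for the unambiguously tagged words.

```python
unambiguous = 0.60   # words given a single tag
ambiguous = 0.40     # words given more than one tag
high_conf = 0.64     # share of ambiguous words with likelihood above 90%
risk = 0.005         # risk of error among those high-likelihood words

accepted = unambiguous + ambiguous * high_conf        # treated as settled
left_for_review = 1 - accepted                        # listed with alternatives
errors_from_high_conf = ambiguous * high_conf * risk  # share of all words

print(f"{accepted:.1%} accepted, {left_for_review:.1%} listed with alternatives")
print(f"{errors_from_high_conf:.3%} of all words wrongly accepted on this basis")
```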

The computer programs achieved a level of success in identifying the correct tag of between 96% and 97%. In spite of the very high success rate, there remained a very large number of errors to be corrected. Post-editing proved to be a laborious and time-consuming process. Initially, post-editors read through the running text to identify tagging errors. This was followed by a good deal of consistency checking based on concordance listings for selected words. All errors which were discovered were corrected and the two versions of the corpus (cf Section 2) and the KWIC concordance (cf