to accompany

The Wellington Corpus of Written New Zealand English


Laurie Bauer

Department of Linguistics

Victoria University of Wellington


Department of Linguistics

Victoria University of Wellington

P O Box 600


New Zealand


© 1993

ISBN 0-475-11019-6




The corpus described in this handbook was developed in the Department of Linguistics at Victoria University of Wellington in the years 1986-1992.

The idea of a New Zealand corpus had been around since the first half of the 1980s, was canvassed at a Linguistic Society of New Zealand Conference in Wellington in 1985 by Derek Davy, and was warmly supported by the Linguistic Society. In 1986 planning for such a project was begun by a group of people interested in the idea of a corpus from the Department of Linguistics and the English Language Institute. In 1987 a tentative start was made on collecting the material for the Press section.

At the same time, it was decided that we should collect material for a corpus of spoken New Zealand English, the idea being that two separate one-million-word corpora should be collected. As a way of dividing the work load, Janet Holmes took control of the organisation of the spoken corpus, while I took on the task of directing the collection of the written material.

The project has been generously supported by the Internal Grants Committee of Victoria University of Wellington, and by the (now defunct) University Grants Committee, for whose support we are extremely grateful.

We have also been helped considerably by the staff of Victoria University's Computer Services Centre, under the directorship of Frank March, and we should like to express our appreciation of the effort made by them in aid of this project.

We were fortunate to be able to employ a number of current and former Linguistics students as research assistants, and it is their work and care which have brought the project to a successful conclusion so quickly. I should like to thank Anna Adams, Debra Beckett, Rachel Dickinson, Katrina Foster, Lisa Matthewson, Ruth Pemberton, Mary Roberts, Shelley Robertson, Jane Sayers, Robert Sigley and Rowena Simpson for their hard work on this corpus.

Finally, we should like to thank the large number of copyright holders who allowed their texts to be included in the corpus.


Laurie Bauer

Wellington, New Zealand



Information about the Corpus

Appendix I: Uncoded Character Index

Appendix II: Non-English Orthography

Appendix III: 'Deviant' codes

Appendix IV: Comment Tags




Information about the Corpus

1 Aim

The basic aim of the Wellington Corpus of Written New Zealand English is to provide a computerised sample of written New Zealand English which will allow direct comparisons with the Brown University Corpus of American English, the Lancaster-Oslo/Bergen Corpus of British English and, especially, with the Macquarie Corpus of Australian English. Since the Australian Corpus was not available while the Wellington Corpus was being developed, the New Zealand Corpus was based largely on the LOB Corpus, both in terms of content and also in terms of coding practice, though with one extremely significant difference. Both the Brown and the LOB corpora collected material published in 1961. By the time planning for the Wellington corpus began, it was known that there was an Australian project underway which would use 1986 as its baseline. Since it was realised from the outset that comparisons with Australian data would be of vital importance if any distinct New Zealand variety of written English was to be established, the year 1986 was also taken as the baseline for the Wellington Corpus. However, not enough suitable material was published in New Zealand in 1986, and in practice the Wellington Corpus, while most of the material it uses was published in 1986 or 1987, covers the years 1986-1990.

In other respects, the Wellington Corpus should be directly comparable with the LOB corpus, and, it is hoped, with the Australian Corpus, although the realities of the publishing situation in New Zealand meant that we had to change some of the categories to a certain extent. These changes will be discussed below. Only minor differences in the coding exist between the LOB and the Wellington Corpora.

2 Distribution

This manual and the Wellington Corpus it describes are available at cost to bona fide researchers through the International Computer Archive of Modern English (ICAME) at the Norwegian Computing Centre for the Humanities, Bergen, Norway. The following conditions, which were explicitly stated for copyright holders, must be strictly observed:

1) No copies of the Wellington corpus material, or any part of the Wellington Corpus material, will be distributed under any circumstances without the written permission of the Department of Linguistics at Victoria University of Wellington or, acting on its behalf, ICAME (The International Computer Archive of Modern English, Bergen).

2) Print-outs of the Wellington Corpus material, or parts thereof, will be used only for bona fide research. Holders of copies of the corpus will not be permitted to reproduce texts or parts of texts for any purpose other than scholarly research without obtaining the written permission of the individual copyright holders, as listed in the manual accompanying the corpus.

3) Commercial publishers and other non-academic organisations wishing to make use of part or all of the corpus or a print-out thereof will have to obtain permission from all of the individual copyright holders involved.

4) Persons or institutions ordering copies of the material will be required to subscribe to these restrictions by signing a written contract before a copy is issued.

5) A careful record will be kept of all those who receive a copy of the corpus.

3 Organisation of the Corpus

As previously stated, the main text categories have been arranged to match those in the LOB Corpus as closely as possible. The overall organisation, and the parallel with the LOB Corpus, is shown in Table 1. Note that where the Wellington Corpus has Categories K and L for fiction, and these are not distinguished from each other in terms of content, the LOB Corpus has Categories K General Fiction, L Mystery and Detective Fiction, M Science Fiction, N Adventure and Western Fiction, P Romance and Love Story and R Humour. This difference reflects a large difference in the fiction publishing profiles of the two countries. Genuine mass-market fiction written in New Zealand tends to be published overseas. The fiction published in New Zealand, generally speaking, aims at a smaller market and is more consciously literary. This means that some of the categories used in the LOB Corpus are virtually absent from the New Zealand publishing scene, and those works which might fit are usually published overseas, and have thus been through the hands of overseas editors. Where possible, we avoided the influence of overseas editors by admitting only New Zealand published material. One consequence of this was that it was not felt that distinctions could be drawn between so many classes of fiction, and all fiction was correspondingly put together in a single category. Interestingly enough, one of the main types of fiction published in New Zealand was omitted, because it was not represented in the LOB Corpus. This is fiction aimed at children.


Table 1

Basic Structure of the Wellington Corpus, and Comparison with the LOB Corpus

Text categories                              Number of texts in category

                                             LOB Corpus    Wgton Corpus

A Press: reportage                               44             44
B Press: editorial                               27             27
C Press: reviews                                 17             17
D Religion                                       17             17
E Skills, trades and hobbies                     38             38
F Popular lore                                   44             44
G Belles lettres, biography, essays              77             77
H Miscellaneous (government documents,           30             30
  foundation reports, industry reports,
  college catalogue, industry house organ)
J Learned and scientific writings                80             80
K, L Fiction (K-R in LOB)                       126            126


A summary of the material within each of the categories of the Wellington Corpus is given below. For full details of the individual texts, see the section entitled Texts.

Category A (Press: reportage)


A01-10 Political Daily

A11-14 Political Weekly

A15-19 Sports Daily

A20-21 Sports Weekly

A22-28 Brief news items Daily

A29-30 Brief news items Weekly

A31-33 Financial Daily

A34 Financial Weekly

A35-42 Features Daily

A43-44 Features Weekly

Category B (Press: editorial)


B01-07 Institutional Daily

B08-10 Institutional Weekly

B11-17 Personal Daily

B18-20 Personal Weekly

B21-25 Letters Daily

B26-27 Letters Weekly

Category C (Press: reviews)


C01-14 Daily

C15-17 Weekly

Category D (Religion)


D01-09 Books

D10-17 Periodicals

Category E (Skills, trades and hobbies)

E01-33 Periodicals

E33-38 Books

Category F (Popular lore)


F01-22 Popular politics, psychology, sociology

F23-30 Popular history

F31-33 Popular health, medicine

F34-37 "Culture", popular anthropology

F38-44 Miscellaneous

Category G (Belles lettres, biography, essays)


G01-35 Biography

G36-41 Literary essays and criticism

G42-50 Arts

G51-77 General essays

Category H (Miscellaneous)


H01-12 Government documents

H13-14 Acts/treaties

H15-19 Proceedings/debates

H20-23 Other reports

H24-26 Foundation reports

H27-28 Industry report

H29 University catalogue

H30 Industrial house organ

Category J (Learned and scientific writings)


J01-12 Natural sciences

J13-17 Medicine

J18-21 Mathematics and computing

J22-25 Psychology

J26-30 Sociology

J31 Demography

J32-35 Linguistics

J36-39 Education

J40-47 Politics and economics

J48-50 Law

J51-54 Philosophy

J55-59 History

J60-63 Literary criticism

J64-67 Art

J68 Music

J69-80 Technology and engineering

Categories K and L (Fiction)



L01-27 General fiction. Not subcategorised.

4 Sources and Sampling

The Brown and LOB Corpora were both intended to be representative samples of the texts published in the relevant countries in 1961. The Wellington Corpus was designed far more with comparability than with representativeness in mind. Thus, rather than going back to first principles and taking a random sample of material published in New Zealand in 1986, the categories of the LOB Corpus were taken as given, and an attempt was made to match those categories as closely as possible. In some cases, this was not possible. The case of the fiction has already been mentioned, but there are other mis-matches between the Wellington Corpus and the LOB Corpus, which will become clear below. Perhaps the most important mis-match, again one that has already been mentioned, is that because of the small volume of published work in New Zealand (a country with a population of only 3.3m), it was not always possible to fill the requirements of the category from the year 1986 alone. In such cases, material was chosen from later years to make up the appropriate number of samples. This is most noticeable in the Fiction sections (Categories K and L), but also applies elsewhere.

Sections A, B, C (Press section)

These were collected by rigorous random sampling with probability proportional to size of the readership, within the overall section parameters of daily:weekly ratios and subject categories (here we followed the LOB corpus; as a result the composition of the section may not be entirely representative of 1980s NZ publications). The probability of any publication being selected was determined from its average weekly circulation in the year 30/09/86-30/09/87 (New Zealand Audit Bureau of Circulations, 1987). Local give-away papers funded by advertising revenue are not included in that publication, and so were not sampled.
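Sampling with probability proportional to size can be sketched as follows. This is an illustration only, not the procedure actually run for the corpus: the titles and circulation figures are invented, the function name is hypothetical, and drawing with replacement is an assumption (the description above does not say which was used).

```python
import random

# Invented weekly circulation figures; NOT taken from the Audit Bureau data.
circulation = {"Paper A": 250_000, "Paper B": 100_000, "Paper C": 50_000}

def draw_publications(circ, k, seed=None):
    """Draw k titles, with replacement, with probability proportional to circulation."""
    rng = random.Random(seed)
    titles = list(circ)
    weights = [circ[t] for t in titles]
    return rng.choices(titles, weights=weights, k=k)

sample = draw_publications(circulation, k=10, seed=1986)
print(sample)
```

On this scheme a title with five times the circulation of another is, on average, drawn five times as often, which is the sense in which the Press sample is weighted by readership.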

Section D (Religion)

(i) Periodicals. There were no circulation figures available for most of the religious periodicals published; hence each periodical title was given equal weighting. There was no attempt to adjust weighting for frequency of publication or length of each issue; this subsection is 'representative' only in the sense of 'including an example from each of as many sources as possible'. For each title, several 1986 issues were chosen at random; a numbered list of article titles was obtained from inspection of each of these issues for suitable material (discarding, e.g., articles syndicated from overseas, a very common problem in these publications) and the articles to be used were sampled randomly from this list.

(ii) Books. Books in this section were discovered by a search of specialist libraries in the Wellington area, backed up by consultation of the New Zealand Books in Print index. Selection was made from the appropriate titles randomly, with due consideration of the factors referred to above.

Section E (Skills & hobbies)

(i) Periodicals. The texts in this section were collected by a rigorous random sampling of the appropriate periodicals with probability proportional to the size of readership. The titles sampled and the size of readership were obtained from the Nielsen Media Directory.

(ii) Books. This is an area which it is difficult to approach through published sources, so that selections had to be made from the relevant sections of public (not university) libraries. Since very few extracts from books were required, it was a relatively simple matter to select randomly from those works discovered.

Section F (Popular lore)

(i) Periodicals. Unlike the Religion section, here there were no obvious specialist publications to sample, and instead, the Index to New Zealand Periodicals 1986 was used to find suitable articles by topic. Publications were used as their titles came up from this search, with two exceptions: periodicals which had previously refused permission to use any 1986 material were not sampled; and contributions from any one periodical were limited (arbitrarily) to less than 10% of the section, in line with the version of 'representativeness' practised in the Religion section (and to avoid domination by the one weekly publication included).

(ii) Books. A similar procedure was followed here, using the New Zealand Books in Print index (1986 and 1987) to search for titles by topic. This procedure was modified for anthologies. In many cases, not all works in an anthology were known to be by New Zealanders, or known not to have been previously published; here, we randomly selected complete items from those known to be suitable.

Section G

(a) Biography. The material for this section was collected in the first instance by checking on the New Zealand Bibliographic Network, with subsequent checks in other sources.

(b) Belles lettres. This material was collected as outlined above for Popular Lore.

Section H (Government and company publications)

A completely rigorous collection for Government documents would have used the Government Publications index; but as Victoria University is in the capital, and receives copies of most if not all non-classified publications by government departments, a search using the University Library computer catalogue was much more convenient, with little loss of sampling population. A list of all 1986-7 publications held at Victoria was compiled and inspected in random order, non-suitable publications (mostly foreign authors, some reprints from the 1970s) being discarded from the list as they turned up, until the total number needed was collected.

Section J

(i) Periodicals. The Index to New Zealand Periodicals was used as a starting point for the random selection of articles in the same relative proportions for the different academic fields as in the LOB Corpus. In a small number of cases (e.g. demography) no suitable article was found for 1986, so an article from a related field was substituted, or a suitable article was substituted from a subsequent year.

(ii) Books. The New Zealand Books in Print index and the helpful advice of librarians in University and specialised scientific libraries enabled a comprehensive list of titles to be assembled for the subsequent sampling of suitable extracts.

Sections K, L (Fiction)

As has already been mentioned, there is not the abundance or variety of genres in New Zealand writing to allow us to follow strictly the subclassification used in the LOB corpus, and in fact these two sections are a single undifferentiated unit. Because not enough material was published in 1986 to fill this section, collection was thrown open to all years up to 1990. Books were collected using whatever came to hand through a search of shelves in most local libraries. A measure of the success of this method is that only 12 works of fiction were missed out of those listed in the 1987 New Zealand Books in Print index, when that became available.

Anthologies were treated as outlined above in Popular Lore. Much the same procedure was used for periodicals sampled for this section.

Periodicals: There are in fact very few NZ publications which specialise in new local fiction; as a result, Landfall has been very heavily used in this section. Because there was no adequate way to search for them, one-off works perhaps present in certain periodicals such as the NZ Listener or Auckland Metro are not included.


The coding system used in the Wellington Corpus is based firmly on that used in the LOB Corpus. Some extra codes were added when necessary, and not all of the LOB codes were required. In a few cases we encountered difficulty in interpreting the LOB codes, so that our interpretation may not consistently match that of the LOB encoders.

1. Organisation of the material

1.1 The Corpus starts with the tag for the first text (see 1.2) and ends with the end of corpus symbol.

1.2 The text categories are included in the order given in Table 1. Each corpus text starts with a tag giving its running number in the corpus, its category and its number within the category, and ends with the end of text symbol.

1.3 Each line is prefixed with the category of the text, the number of the text in the category, a space, and then a three-figure line number. Thus B07 009 indicates line 9 in text number 7 in Category B.
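A program reading the corpus might separate the line prefix described in 1.3 along the following lines. This is a minimal sketch: the function name and the regular expression are illustrative assumptions and are not part of the corpus distribution, and the sample line after the prefix is invented.

```python
import re

# Category letter (A-H, J, K, L), two-digit text number, a space,
# then a three-figure line number, as described in 1.3.
PREFIX = re.compile(r"^([A-HJ-L])(\d{2}) (\d{3})")

def parse_prefix(line):
    """Return (category, text number, line number), or None if there is no prefix."""
    m = PREFIX.match(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))

print(parse_prefix("B07 009 invented text for illustration"))  # ('B', 7, 9)
```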

2. Textual material included/excluded

2.1 The text of a sample starts either with the first sentence beginning on the first page sampled or at the beginning of the first paragraph beginning on the first page sampled, and ends at the end of the sentence containing the 2,000th word. In the Press sections (Sections A, B, C) and when anthologies are sampled, initial headings are included in the sample. Word-counts were made within WordPerfect text-processing software before editing so that comment tags and coding symbols are not counted.

2.2 Headings are coded and included in the text (see 3.3 and 9), except initial headings in places other than those specified in 2.1.

2.3 Material highlighted in boxes or as titles to sections (except complete sentences repeated in the text) is included. Any introductory textual material is included.

2.4 Editorial material extraneous to the source is omitted.

2.5 Extra-textual material in the source such as diagrams, maps, lists, tables is excluded, and is represented by comment tags (see 17.1).

2.6 Footnotes and footnote references are excluded without comment.

2.7 Quotations over thirty words are consistently excluded, and are indicated by comment tags, unless they are contemporary, of New Zealand origin, and part of the text. Shorter quotations are generally included unless they are (a) in a foreign language or (b) in the language of a different period. Omitted quotations are marked by a tag.
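The cut-off rule in 2.1 (the sample ends at the end of the sentence containing the 2,000th word) can be approximated as below. This is a rough sketch under stated assumptions: words are taken to be whitespace-separated strings, and a sentence is taken to end at the first word ending in . ? or ! at or after the limit, so abbreviation points would be mistaken for full stops. The function name is hypothetical; the original counts were made in WordPerfect, not with code of this kind.

```python
import re

def cut_sample(text, limit=2000):
    """Return text up to the end of the sentence containing the limit-th word."""
    count = 0
    for m in re.finditer(r"\S+", text):
        count += 1
        # Stop at the first sentence-final word at or after the limit-th word.
        if count >= limit and m.group().endswith((".", "?", "!")):
            return text[:m.end()]
    return text  # fewer than `limit` words, or no terminal punctuation found

demo = "One two three. Four five six. Seven eight."
print(cut_sample(demo, limit=4))  # One two three. Four five six.
```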

3. Main coding key

3.1 The major principle underlying the coding is that alphanumeric characters have their expected value. Thus upper and lower case letters represent precisely those upper or lower case letters in the original. In only a few cases does a character not have its face value. These are treated in 3.2. In addition, in many cases a compound coding is required. Such codes always begin with an asterisk *. They are treated in 3.3.

3.2 Characters not having their face value.

* prefix for a compound symbol (see 3.3).

^ new sentence

_ begin list

| new paragraph or new line or blank line

" umlaut or diaeresis on preceding letter

' apostrophe (but not single quotation mark)

\ begin deviant word

{ begin deviant phrase or passage

} end deviant phrase or passage


Also note:


- hyphen or minus (but not dash)

. full stop or abbreviation marker or decimal point or multiplication sign

... ellipsis

3.3 Compound coding symbols.

These have a * or ** prefix.


*0 begin lower case roman

*1 begin italic (underlined) text

*2 begin capitalisation (roman)

*3 begin capitalisation (italic)

*4 begin bold face

*5 begin italic bold

*6 begin bold face capitalisation

*7 begin italic bold capitalisation

*8 begin script

*9 begin gothic


*@ degree symbol (°)

*= begin upper case roman numeral

**= begin lower case roman numeral


*+$ $

*- dash

*/ asterisk (*)

*| new section, not identical with new para but not otherwise


*{ open curly bracket

*} close curly bracket

*# end of corpus text

**# end of corpus

*? uncoded character (see list in Appendix I)


*" begin double quotes

**" end double quotes

*' begin single quotes

**' end single quotes

*< begin heading

*> end heading

**[ begin comment tag

**] end comment tag

*; begin subscript

**; end subscript

*: begin superscript

**: end superscript
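The compound symbols listed in 3.3 can be separated from the surrounding text with a single regular expression. The sketch below is an illustration, not a tool supplied with the corpus. The names are hypothetical; the treatment of *? as being followed by an index number is an assumption based on the example in 11.3; and the single-character markers of 3.2 (such as ^ and |) are deliberately left in the text.

```python
import re

# Two-asterisk codes are tried first so that '**"' is not read as '*' plus '*"'.
CODE = re.compile(
    r"\*\*[=#\"'\[\];:]"            # **= **# **" **' **[ **] **; **:
    r"|\*\+\$"                      # *+$
    r"|\*\?\d*"                     # *? with an optional index number (assumed)
    r"|\*[0-9@=\-/|{}#\"'<>;:]"     # remaining single-asterisk codes
)

def split_codes(coded):
    """Split a coded string into ('code', s) and ('text', s) pairs, in order."""
    parts, pos = [], 0
    for m in CODE.finditer(coded):
        if m.start() > pos:
            parts.append(("text", coded[pos:m.start()]))
        parts.append(("code", m.group()))
        pos = m.end()
    if pos < len(coded):
        parts.append(("text", coded[pos:]))
    return parts

for kind, value in split_codes('*"Let\'s go,**" she said.'):
    print(kind, repr(value))
```

Keeping the codes as separate items, rather than deleting them, lets a later step decide which ones matter for a given analysis (for example, keeping quotation codes but discarding typographical shifts).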

4 Typographical shifts

4.1 * followed by a digit indicates a typographical shift (see 3.3), i.e. the beginning of a section of text in a given type. Since upper and lower case letters are distinguished in the computer record (this was not the case with the first version of the Brown Corpus, for example), *0 is used as the symbol for ordinary Roman type. Thus we find |^*0Although rather than |^*2A*0lthough. Similarly, we find *2WELLINGTON rather than *2wellington.

4.2 The symbol marking a typographical shift (*0, *1 etc) occurs directly before the first character to which it applies, except that \ or *{ may follow it.

4.3 The symbols . , - and other punctuation marks are regarded as neutral between roman, italics and bold face.

4.4 An introductory typographical shift symbol always occurs before the first word of every text.

4.5 Note that typographical shift symbols may occur within words, e.g.


5 Capitalisation

5.1 See also 4.1. The character set used includes both upper and lower case letters. Continuing capitalisation is indicated redundantly by a typographical shift symbol.

5.2 Before short sequences of expected capitals, the shift symbol is usually omitted, e.g.

^\0Cr {0J.H.} Dennison said ...

Capitalised abbreviations are never coded as being capitalised.

5.3 Proper names are broadly identifiable as character sequences introduced by a capital letter which is not preceded by a sentence-initial marker (see 8).

6 Spacing and ordering

6.1 A single space or the end of a line indicates a typographic word-boundary in the source text.

6.2 A space follows the punctuation marks . , ; : ? **" **' as in printed texts, unless these are followed by another punctuation mark from this set. A space also precedes and follows a dash (*-).

6.3 No space is inserted before end-quote symbols (**" **') or following begin-quote symbols (*" *').

6.4 / occurring between words is followed by a space, except in the case of and/or, and he/she which are coded as single words. No space follows / in numerical expressions (see 15.2).

6.5 The ordering of punctuation symbol and marker, or of marker and marker, is immaterial if they apply at the same point in the text, except that

(a) The new-paragraph marker | precedes other markers at the same point in the text, e.g.

|^\0Cr Kane told ...

(b) With the exception of the paragraph marker mentioned in (a), the new-sentence marker ^ precedes other markers at the same point in the text.

(c) The beginning-of-headline marker *< replaces the new-paragraph marker, and thus precedes all other symbols.

(d) See also 4.2.

7 Paragraph/line division

7.1 The beginning of a new paragraph is indicated by | and by indentation in the printout.

7.2 The paragraph marker is used

(a) at the beginning of each new paragraph, whether or not the line was indented in the original;

(b) to mark significant new line distinctions, e.g. in lists.

7.3 Breaks in the text (as indicated by two-line space, asterisks, etc.) which are distinct from normal paragraph breaks are indicated by the compound symbol *|.

7.4 The paragraph marker always appears as the first character in a line.

7.5 Headings (see 9) are placed on a separate line. The same applies to comment tags (see 17) except **[SIC**] and **[ARB**] and to end-of-text and end-of-corpus markers.

8 Sentence-initial marking

8.1 The objects of sentence-initial marking are (1) to define suitable contexts for researchers who wish to use the corpus in that way; (2) to define basic units for parsing; (3) to simplify studies of sentence length; (4) to distinguish sentence-initial capitalisation from other types.

8.2 The sentence-initial marker ^ normally appears where a terminal punctuation mark (. ? !) is followed by a capital letter.

8.3 ^ is not used at the beginnings of headlines. If the headline contains a sentence division, the second sentence is marked with ^.

8.4 'Quasi-headlines' (see 9.4) are preceded by ^.

8.5 The use of ^ is problematical in connection with quotations. When a quotation is preceded by a reporting clause, ^ is used on the quotation; when the reporting clause follows, ^ is not used on the reporting clause:

^She said, ^*"Let's go.**"

^*"Let's go,**" she said.

8.6 A semi-colon is only treated as a mark separating sentences in the circumstances outlined in 8.9.

8.7 A particular problem in connection with sentence-initial marking is where a colon occurs followed by capitalisation. A ^ is inserted if one or more of the elements following the colon has the character of or includes a complete sentence. The marker is omitted in cases of enumerations where the items enumerated do not form or include complete sentences.

8.8 In cases of doubt, the marker ^ is excluded.

8.9 The LOB corpus used the symbol ~ for an 'included sentence'. We had so many problems with the conditions for using this, that the symbol was not used in the Wellington Corpus. Either ^ was used if an entire sentence with initial capitalisation was included in another, or no marking was used.

8.10 A begin-list marker _ was used to introduce word sequences without full syntactic structure. Note, however, that many lists were excluded (see 2.5). Included lists were part of the syntactic structure of the text in which they occurred; excluded lists show no or little syntactic structure or were extremely long.

8.11 Paragraph indicators in a text such as 1. a) B. etc. are included but are followed by sentence markers.
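For studies of sentence length (the third object listed in 8.1), the marker ^ gives a simple way of dividing a coded passage into units. The following is a minimal sketch with hypothetical function names. It assumes that the line prefixes have already been removed, counts whitespace-separated strings as words, and leaves compound codes such as *0 attached to the words they precede.

```python
def sentences(coded):
    """Split coded text at the sentence-initial marker ^, dropping empty pieces.

    Spaces and paragraph markers (|) at the edges of each piece are removed.
    """
    pieces = (piece.strip(" |") for piece in coded.split("^"))
    return [piece for piece in pieces if piece]

def sentence_lengths(coded):
    """Number of whitespace-separated words in each sentence unit."""
    return [len(s.split()) for s in sentences(coded)]

print(sentence_lengths("|^*0First one. ^Second one here."))  # [2, 3]
```

Note that under 8.5 a reporting clause and the quotation that follows it each carry their own ^, so a split of this kind treats them as two units.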

9 Headings

9.1 Headings are characterised by special typographical and linguistic features, and are thus specially marked. They are enclosed within *< *> and are placed on separate lines.

9.2 In source texts, headings are not always placed at the head of the portion of text to which they apply. (They may, for example, occur in the middle of an article, interrupting a sentence in the body of the text.) In such cases the headings are appropriately repositioned.

9.3 ^ is not used at the beginning of a heading, even if the heading has the structure of a complete sentence.

9.4 The brackets *< *> are only used for headings which are separated from the body of the text by being on a separate line. In other cases, the symbol ^ is used, e.g.

Problem: Improvements needed to staff cafeteria.

is coded

|^*4Problem: ^*0Improvements needed to staff cafeteria.

That is, such 'quasi-headlines' are treated as sentences. The same applies to the occurrence of the name of an author at the end of an article.

9.5 Running heads at the top or foot of a page and other editorial headings (e.g. "continued on p. ...") are ignored.
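Because headings are enclosed within *< and *>, they can be collected, or set aside, mechanically. The sketch below is illustrative only: the function name is hypothetical and the sample text is invented rather than drawn from the corpus.

```python
import re

# Non-greedy match between the begin-heading and end-heading codes.
HEADING = re.compile(r"\*<(.*?)\*>", re.DOTALL)

def headings(coded):
    """Return the contents of every *< ... *> heading, with other codes left in."""
    return [h.strip() for h in HEADING.findall(coded)]

sample_text = "*<*4Council backs plan*>\n|^*0The council voted last night."
print(headings(sample_text))  # ['*4Council backs plan']
```

Quasi-headlines (9.4) are not enclosed in these brackets and so are not found by a search of this kind.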

10 Quotations

10.1 There are at least two types of quotation, both of which it is important to be able to distinguish from ordinary text.

(a) Quotations from people who do not represent written New Zealand English in the appropriate years. In some cases these quotations come from earlier periods of English; frequently they come from varieties of English outside New Zealand.

(b) Genuine or supposed spoken forms, which may show features which are not standard in written texts. Fictional dialogue is one case of this type.

10.2 Separate markers are used for begin-quote and end-quote (see 3.3).

10.3 Where quotation marks are not used in the original, the quotations are tagged **[BEGIN QUOTE**] and **[END QUOTE**]. Where quotation marks are avoided in fiction the original punctuation was followed.

10.4 Quotations from sources showing markedly deviant forms may contain a deviant marker (see 12).

10.5 Quotations over 30 words in length were regularly replaced with the tag **[LONG QUOTATION**] if they were not clearly from contemporary New Zealand sources.

11 Foreign language material

11.1 The researcher may need to know what is foreign material in order to be able to exclude it from analysis. Unfortunately, this proved difficult to deal with in practice. Firstly, there was the question of Maori words, which are clearly not foreign (even if they are non-English), and where it is rarely clear precisely how assimilated into English they are. Maori words were, as a matter of principle, left unmarked. Secondly, the question of assimilation exists for words from other languages as well. While "doovay" might be a clearly English word, "duvet" may or may not be. In the end, we decided that only those words to which the authors themselves had drawn attention (by the use of italics or quotation marks, for instance) would be marked as foreign.

11.2 Foreign-language words are marked by the prefix \, e.g.


11.3 The brackets { } are used instead of \ for foreign expressions consisting of more than one word, e.g.

*1{folie a*?3 deux}

Note, however, that long foreign quotations are omitted (see 2.7).

11.4 The codings for the Greek alphabet are provided in Appendix II.

11.5 The LOB Corpus used the symbol \6 (or {6 ... }) for 'foreign expression widely used'. Again, this created difficulties, since not all such expressions were marked as foreign, and since 'widely used' is an extremely subjective criterion. This category was therefore dropped, and \6 was used instead for scientific Latin in the names of species and genera.

11.6 Foreign titles of books, operas, etc. are marked unless they consist simply of a name.

11.7 Foreign names are not marked as foreign unless the author has drawn attention to the foreignness (see 11.1).

11.8 For foreign abbreviations, see 13.13.

12 Other deviant material

12.1 The LOB Corpus provides a range of markings for various kinds of deviance from current standard English. Although we did not find the need for the full range of such markings, the numbering system used in LOB has been retained. A list of those used is given in Appendix III.

12.2 Non-current English is marked \1 (or {1 ...}).

12.3 What LOB calls non-standard English is marked \2 (or {2 ... }). This applies almost exclusively to the coding of impressions of dialect (usually not New Zealand dialect) shown by deviant spelling. This marking was used very sparingly.


{2Och aye, she's a bonnie wee lass}

12.4 Foreigner English, i.e. non-standard English spoken by non-native speakers of English, is coded \3 (or {3 }).

{3^Shure does seem lika box wid a big fela like me init.}

12.5 A miscellaneous category is marked as \5 (or {5 }). This is used for deviant forms which are not otherwise categorisable.

Unaccustomed as I am to {5 oogah zurrgh bloof}

12.6 No tagging is used when an apostrophe is used to shorten a word.

12.7 Dialogue passages containing occasional non-standard grammatical features such as double negation, yep for yes, or use of tag eh have not been marked.
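The numbered markers described in this section, together with \0 for abbreviations (see 13.2) and \6 for scientific Latin (see 11.5), can be located and labelled as sketched below. This is an illustration under stated assumptions: the names are hypothetical, a marked word is taken to run to the end of its alphanumeric characters, nested brackets are not handled, and foreign expressions marked without a digit (as in the example in 11.3) are not captured.

```python
import re

LABELS = {
    "0": "abbreviation",
    "1": "non-current English",
    "2": "non-standard English",
    "3": "foreigner English",
    "5": "miscellaneous",
    "6": "scientific Latin",
}

# Either \ + digit + word, or { + digit ... }.
# (?<!\*) skips the compound symbol *{, which stands for a literal bracket.
MARKED = re.compile(r"\\(\d)(\w+)|(?<!\*)\{(\d)([^{}]*)\}")

def marked_items(coded):
    """Return (label, content) pairs for each digit-marked word or passage."""
    items = []
    for m in MARKED.finditer(coded):
        if m.group(1) is not None:          # single marked word
            digit, content = m.group(1), m.group(2)
        else:                               # bracketed phrase or passage
            digit, content = m.group(3), m.group(4)
        items.append((LABELS.get(digit, "other"), content.strip()))
    return items

line = r"^\0Cr {0J.H.} Dennison said {2Och aye, she's a bonnie wee lass}"
for label, content in marked_items(line):
    print(label, "|", content)
```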

13 Abbreviations

13.1 Abbreviations are coded

(a) so that they can be distinguished from full vocabulary items;

(b) so that the abbreviation point can be recognised as distinct from a full stop.

13.2 The code for abbreviations is the deviant marker (see 11) followed by a zero, \0 or {0 ... }.

13.3 Abbreviations are coded as such whether or not they end in an abbreviation point. See also 13.6, 13.9 and 13.10.

13.4 A sequence of initials or abbreviations is marked {0 ... } rather than \0, whether or not the sequence contains spaces or abbreviation points, e.g.

\0Cr {0J.H.} Dennison represents Cr J.H. Dennison

{0NZPA} represents NZPA

{0lbw} represents lbw

18\0ft represents 18ft

\0No represents No (= number)

13.5 Typical abbreviations are initials as in J.B. Bolger, initialisms as in BCNZ and acronyms as in ASEAN.

13.6 Clipped words and short forms are not marked as abbreviations, e.g. Chev (= Chevrolet), didn't, etc. However, orthographic clippings which are not normally pronounced as such in speech are marked, as in \0para, \0Capt.

13.7 Ordinal numbers are not marked as abbreviations, i.e. 6th not 6\0th or \06th.

13.8 Chemical formulae are marked \0, e.g. \0N (nitrogen).

13.9 In some cases the same word may be treated differently. It is left unmarked if it appears to be treated as an ordinary vocabulary item, and no longer as an abbreviation, e.g.

O.K., OK marked

okay not marked

Anzus not marked

13.10 In some cases the use of an abbreviation point is exceptional, but a form is marked as an abbreviation solely because the abbreviation point is present:

ad. marked

ad unmarked

13.11 Any abbreviation point is placed within the brackets: {0U.S.A.}.

13.12 If an abbreviation occurs at the end of a sentence, it may not be clear whether . is to be treated as an abbreviation point or as a full stop. In cases of doubt the . is included within the abbreviation bracket.

13.13 Foreign abbreviations are marked in the same way as English abbreviations, and only as abbreviations, not as foreign: {0i.e.}.

13.14 An abbreviation marker can occur in the middle of a word, e.g. 25\0c.

13.15 Abbreviations ending in the middle of a word are marked as such: {0MP}s.
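
By way of illustration, the abbreviation coding described in 13.2-13.15 can be removed mechanically to recover the plain text. The following Python sketch is not part of the corpus software: the function name strip_abbreviation_codes and the regular expressions are assumptions made for this example, and it takes for granted that \0 and {0 ... } occur only as abbreviation markers.

```python
import re

# {0 ... } brackets a sequence of initials or abbreviations (13.4, 13.11).
_BRACKETED = re.compile(r"\{0([^{}]*)\}")
# \0 marks a single abbreviation, possibly in the middle of a word (13.14).
_SINGLE = re.compile(r"\\0")


def strip_abbreviation_codes(text: str) -> str:
    """Return the text with abbreviation markers removed (hypothetical helper)."""
    text = _BRACKETED.sub(r"\1", text)  # keep the bracketed content
    return _SINGLE.sub("", text)        # drop the two-character marker


if __name__ == "__main__":
    for coded in (r"\0Cr {0J.H.} Dennison", r"18\0ft", "{0MP}s"):
        print(coded, "represents", strip_abbreviation_codes(coded))
```

Applied to the examples in 13.4, the function gives back the forms shown after 'represents'.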

14 Hyphen and dash

14.1 The hyphen (-) is used within a word, and is not preceded or followed by a space.

14.2 The dash (*-) is preceded and followed by a space.

14.3 A line-end hyphen is only coded if it is clear from the source that the word is normally hyphenated at the appropriate point. If the word could be hyphenated at that point, but it is not clear whether the author would normally have hyphenated, the tag **[ARB**] for 'arbitrary hyphen' is inserted before the hyphen. Such arbitrary hyphens are listed in the section devoted to the appropriate text in the Texts section.

e.g. co**[ARB**]-ordination

14.4 :- as a punctuation mark is coded : *-, and is followed by a space.

14.5 - meaning 'to' (e.g. between numbers) is coded as a hyphen.

15 Mathematical expressions

15.1 Mathematical characters are, where possible, coded as themselves.

15.2 - between numerical expressions represents 'minus' (but see 14);

. between digits represents 'decimal point' or 'multiplication sign';

x in numerical expressions represents 'multiplication sign';

/ in numerical expressions represents a divisor in fractions; note the following expressions:

1/2 represents ½

61/2 represents the fraction 61 over 2

6 1/2 represents 6½
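
The role of the space in these fraction codings can be illustrated with a short Python sketch using the standard-library fractions module. The function name parse_coded_number is hypothetical, and the reading assumed here is that digits written directly before / form the numerator, while a space separates a whole number from a following fraction.

```python
from fractions import Fraction


def parse_coded_number(coded: str) -> Fraction:
    """Interpret a coded number such as '1/2', '61/2' or '6 1/2'.

    Assumption: a space separates a whole number from a fraction;
    without a space, everything before / is the numerator.
    """
    parts = coded.split()
    if len(parts) == 1:
        return Fraction(parts[0])            # '61/2' is 61 over 2
    if len(parts) == 2:
        whole, fraction = parts
        return int(whole) + Fraction(fraction)  # '6 1/2' is 6 plus 1/2
    raise ValueError(f"unexpected coded number: {coded!r}")


if __name__ == "__main__":
    for coded in ("1/2", "61/2", "6 1/2"):
        print(coded, "->", parse_coded_number(coded))
```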

15.3 Other mathematical characters are represented where necessary by entries from the 'Uncoded Character Index' (see Appendix I).

15.4 More complex mathematical expressions and equations are represented by **[FORMULA**]. The decision on whether to use this blanket coding is essentially a practical one, depending on the length of the coded string. Coded strings exceeding 30 characters in length are replaced.
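
The length rule in 15.4 amounts to a simple threshold test, sketched below in Python. The names code_expression and MAX_CODED_LENGTH are invented for this example, and it is an assumption that every character of the coded string, including spaces, counts towards the limit.

```python
FORMULA_TAG = "**[FORMULA**]"
MAX_CODED_LENGTH = 30  # strings longer than this are replaced (15.4)


def code_expression(coded: str) -> str:
    """Return the coded string, or the blanket tag if it is too long."""
    if len(coded) > MAX_CODED_LENGTH:
        return FORMULA_TAG
    return coded
```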

16 Typographical errors

16.1 Obvious typographical errors are corrected, e.g.

Form in the source: Corrected to:

liquorr liquor

embarassment embarrassment

used. to. used to.

set at at set at

manyyears many years

All corrections are listed under the text in question in the section Texts below.

16.2 Note that while some of these are clearly typographical errors, some spelling mistakes may be linguistically significant. They have nonetheless been corrected. Similarly, some variation between possible forms has been made consistent, especially with reference to punctuation (abbreviation points, hyphens, etc.).

16.3 Where there is doubt, or where we have recognised possible change in progress, or where it is not clear what correction should be made (if any), the tag **[SIC**] is inserted into the text. This is used for cases of aberrant syntax. All occurrences of this tag are listed under the text in question in the section Texts.

16.4 Errors in foreign names, words or expressions have been left as in the source without comment.

17 Comment tags

17.1 **[ ... **] marks comment tags. Such tags are used for explanatory comments on the text, e.g.

(a) The number of each text, given as a header to that text, e.g. **[001 TEXT A01**].

(b) Extra-textual material in the source (e.g. diagrams, lists) is represented by tags in the appropriate position.

(c) Formatting such as indentation which cannot otherwise be shown in the corpus is marked by a tag, e.g. **[BEGIN INDENTATION**] **[END INDENTATION**].

17.2 A list of comment tags is given in Appendix IV.

17.3 Comment tags except **[SIC**] and **[ARB**] are placed on separate lines at appropriate points in the text.
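
Because comment tags are always delimited by **[ and **], they can be located or removed with a single regular expression. The Python sketch below is offered only as an illustration; the names COMMENT_TAG, extract_comment_tags and remove_comment_tags are assumptions of this example, and it presumes that a tag never runs over a line break.

```python
import re

# A comment tag opens with **[ and closes with **] (17.1).
COMMENT_TAG = re.compile(r"\*\*\[(.*?)\*\*\]")


def extract_comment_tags(text: str) -> list[str]:
    """Return the contents of every comment tag, in order of appearance."""
    return COMMENT_TAG.findall(text)


def remove_comment_tags(text: str) -> str:
    """Return the text with all comment tags deleted."""
    return COMMENT_TAG.sub("", text)


if __name__ == "__main__":
    sample = "**[001 TEXT A01**]\nco**[ARB**]-ordination"
    print(extract_comment_tags(sample))
    print(remove_comment_tags("co**[ARB**]-ordination"))
```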

Appendix I: Uncoded Character Index


*?1 ¯ (macron) on preceding character
*?2 ´ (acute accent) on preceding character
*?3 ` (grave accent) on preceding character
*?4 ~ (tilde) on preceding character
*?5 ^ (circumflex) on preceding character
*?6 ¸ (cedilla) under preceding character

*?7 ' (single prime) superscripted to previous character
*?8 " (double prime) superscripted to previous character
*?11 / (slash) through preceding character (e.g. ≠).
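
For users who wish to display the text with its diacritics, the codes above can be converted to Unicode. The following Python sketch is one possible approach and not part of the corpus itself: the choice of Unicode code points, the table name CODE_TO_UNICODE and the function name decode_character_codes are all assumptions of this example. It also assumes that a code is never immediately followed by a digit belonging to the text.

```python
import re
import unicodedata

# Assumed Unicode equivalents for the codes listed in this appendix.
CODE_TO_UNICODE = {
    "1": "\u0304",   # combining macron
    "2": "\u0301",   # combining acute accent
    "3": "\u0300",   # combining grave accent
    "4": "\u0303",   # combining tilde
    "5": "\u0302",   # combining circumflex
    "6": "\u0327",   # combining cedilla
    "7": "\u2032",   # prime (a spacing character, placed after the letter)
    "8": "\u2033",   # double prime (likewise a spacing character)
    "11": "\u0338",  # combining long solidus overlay
}

_CODE = re.compile(r"\*\?(\d+)")


def decode_character_codes(text: str) -> str:
    """Replace *?n codes by Unicode marks and compose where possible."""
    def replace(match: re.Match) -> str:
        # Unknown codes are left exactly as they stand.
        return CODE_TO_UNICODE.get(match.group(1), match.group(0))

    return unicodedata.normalize("NFC", _CODE.sub(replace, text))


if __name__ == "__main__":
    print(decode_character_codes("cafe*?2"))
    print(decode_character_codes("fac*?6ade"))
```

Normalising to NFC combines a base letter and its mark into a single precomposed character wherever Unicode provides one.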


Appendix II: Non-English orthography

1. ö, ä, ü are coded as o", a", u" (see 3.2).

2. œ, æ are coded oe, ae.

3. ß in German orthography is coded as ss.

4. For representations of other characters, see Appendix I.

5. The following key is used in the transliteration of Greek:

Greek    Code       Greek    Code

Α α      A a        Ν ν      N n
Β β      B b        Ξ ξ      X x
Γ γ      G g        Ο ο      O o
Δ δ      D d        Π π      P p
Ε ε      E e        Ρ ρ      R r
Ζ ζ      Z z        Σ σ      S s
Η η      E e        Τ τ      T t
Θ θ      TH th      Υ υ      U u
Ι ι      I i        Φ φ      F f
Κ κ      K k        Χ χ      CH ch
Λ λ      L l        Ψ ψ      PS ps
Μ μ      M m        Ω ω      O o

' (soft breathing) on following character

` (rough breathing) on following character

Note: Greek characters are considered neutral between roman and italic. They may come within the scope of *0 or *1 according to the context.
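
The transliteration key can be expressed as a simple lookup table. The Python sketch below assumes that the key pairs the twenty-four letters of the Greek alphabet, in alphabetical order, with the Latin codes listed; the names GREEK_TO_CODE and transliterate_greek are invented for this example. Breathings and accents are not handled, and any character not in the key (including final sigma) is passed through unchanged.

```python
# Lower-case Greek letters in alphabetical order, with their codes.
_GREEK_LOWER = "αβγδεζηθικλμνξοπρστυφχψω"
_CODES_LOWER = [
    "a", "b", "g", "d", "e", "z", "e", "th", "i", "k", "l", "m",
    "n", "x", "o", "p", "r", "s", "t", "u", "f", "ch", "ps", "o",
]

GREEK_TO_CODE = {}
for greek, code in zip(_GREEK_LOWER, _CODES_LOWER):
    GREEK_TO_CODE[greek] = code
    GREEK_TO_CODE[greek.upper()] = code.upper()  # e.g. capital theta gives TH


def transliterate_greek(text: str) -> str:
    """Transliterate Greek letters according to the key above."""
    return "".join(GREEK_TO_CODE.get(char, char) for char in text)


if __name__ == "__main__":
    print(transliterate_greek("ψυχη"))
```

Note that the coding is not reversible: epsilon and eta are both coded E e, and omicron and omega are both coded O o.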

Appendix III: 'Deviant' codes


\ { } foreign word or expression

\0 {0 } abbreviation (see 13)

\1 {1 } non-current English (see 12.2)

\2 {2 } non-standard English (see 12.3)

\3 {3 } foreigner English (see 12.4)

\6 {6 } Latin names of biological species or genera (see 11.5)

\15 {15 } Greek alphabet
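
As an illustration of how these codes might be used in searching the corpus, the following Python sketch finds bracketed spans and reports their category. It is a minimal sketch under stated assumptions: the names DEVIANT_CODES and bracketed_spans are hypothetical, only the bracketed form { ... } is handled (not the single-word form with \), and the code \5, which is described in 12.5 but not listed above, has been added to the table. Where the digits are ambiguous, 15 is preferred to 1.

```python
import re

DEVIANT_CODES = {
    "": "foreign word or expression",
    "0": "abbreviation",
    "1": "non-current English",
    "2": "non-standard English",
    "3": "foreigner English",
    "5": "miscellaneous",  # from 12.5; not listed in this appendix
    "6": "Latin names of biological species or genera",
    "15": "Greek alphabet",
}

# An opening brace, an optional code number, then the marked text.
_BRACKETED = re.compile(r"\{(15|[012356]?)([^{}]*)\}")


def bracketed_spans(text: str) -> list[tuple[str, str]]:
    """Return (category, marked text) for each bracketed span."""
    return [
        (DEVIANT_CODES[code], body.strip())
        for code, body in _BRACKETED.findall(text)
    ]


if __name__ == "__main__":
    print(bracketed_spans("{2Och aye, she's a bonnie wee lass}"))
```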


Appendix IV: Comment Tags

The following is a list of the common comment tags used in the corpus:


**[001 TEXT A01**] ETC. (headings for corpus texts)





**[END BOX**]














The following list contains for each text

(a) bibliographical information

(b) information on copyright

(c) number of words in the sample

(d) notes on typographic errors in the text

(e) notes on linguistic oddities in the text

(f) occasional notes on any other peculiarities of the text, which may help the reader understand the language.

Typographical errors are corrected in the Corpus; the corrected version is listed with its line number, along with the original form. Thus

113 electricity [electricty]

indicates that the source text had electricty, and that this has been corrected to electricity on line 113 of the text as it appears in the corpus.

**[SIC**] tags are also listed, each with a brief comment in square brackets to say why the form has been marked. Thus

Sic: a another [grammar]

indicates that the grammar is the reason for drawing attention to a another.
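
Notes in these two formats are regular enough to be read by program. The Python sketch below assumes that each note occupies a single line laid out exactly as in the two examples above; the names Correction, parse_correction and parse_sic are invented for this illustration.

```python
import re
from typing import NamedTuple, Optional


class Correction(NamedTuple):
    line: int        # line number in the corpus text
    corrected: str   # form as it appears in the corpus
    original: str    # form found in the source


_CORRECTION = re.compile(r"^(\d+)\s+(.+?)\s+\[(.+)\]$")
_SIC = re.compile(r"^Sic:\s+(.+?)\s+\[(.+)\]$")


def parse_correction(note: str) -> Optional[Correction]:
    """Parse a note such as '113 electricity [electricty]'."""
    match = _CORRECTION.match(note.strip())
    if match is None:
        return None
    return Correction(int(match.group(1)), match.group(2), match.group(3))


def parse_sic(note: str) -> Optional[tuple[str, str]]:
    """Parse a note such as 'Sic: a another [grammar]'."""
    match = _SIC.match(note.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)  # (marked form, reason)


if __name__ == "__main__":
    print(parse_correction("113 electricity [electricty]"))
    print(parse_sic("Sic: a another [grammar]"))
```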

Other notes should be self-explanatory.

A note on copyright

Most of the texts from which samples were taken are under copyright. In all cases where the copyright holder could be traced, permission was sought to include the extract under the conditions outlined on pp. 1-2. In the very few cases where permission was refused, alternative extracts were sampled. In the following list, we indicate who provided permission to use the appropriate extracts. In a few cases, we have not been able to trace the copyright holder, and no information is provided in the following list. Unfortunately, in one of our mailings, we took a leaf from the LOB practice, and suggested that lack of reply would be deemed to give assent, and some of the authors may have felt that they did not need to reply. We were later informed that such a procedure was not legally binding, and we did attempt to contact everyone on several occasions, but not always with success. In some cases, the authors were members of writers' workshops which disbanded immediately after the publication of their work, and we have been unable to trace them; in other cases publishers have not been able to provide authors' addresses. We would be very pleased to hear from any person holding copyright to the samples we have used whom we have not already contacted.


Francis, W. Nelson 1964. Manual of Information to Accompany a Standard Sample of Present-Day Edited American English, for Use with Digital Computers. Providence, R.I.: Department of Linguistics, Brown University.

Index to New Zealand Periodicals 1986. 1987. Wellington: National Library of New Zealand.

Johansson, Stig 1978. Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers. Oslo: Department of English, University of Oslo.

New Zealand Audit Bureau of Circulations 1987. Summary of Audited Circulations, year ended Sept. 30 1987. Wellington: New Zealand Audit Bureau of Circulations.

New Zealand Books in Print 1987. Port Melbourne, Australia: D.W. Thorpe Pty Ltd. 15th edn.

Nielsen Media Directory. Auckland: A.C. Nielsen (N.Z.) Ltd.