English-Norwegian Parallel Corpus: Manual

1 Introduction

The main purpose of this manual is to describe the structure and explain the coding of the English-Norwegian Parallel Corpus.

1.1 Aim

The aim of the English-Norwegian Parallel Corpus (ENPC) project is to produce a computer corpus for use in contrastive analysis and translation studies. There is a core corpus consisting of original texts and translations (Norwegian to English and English to Norwegian), and a supplementary corpus consisting of English and Norwegian texts matched by genre.

1.2 Structure of the corpus

Figure 1 shows the schematic structure of the corpus. The core corpus is indicated by solid boxes, the supplementary corpus by broken ones. This structure makes several kinds of study possible.

Figure 1: The structure of the corpus

1.2.1 Core corpus

The core corpus contains original texts and their translations (English to Norwegian and Norwegian to English). In order to include material by a range of translators, the texts of the core corpus are limited to text extracts of some 10,000 - 15,000 words. The core corpus contains both fictional and non-fictional texts.

1.2.2 Supplementary corpus

The supplementary corpus contains original text extracts only. Both fiction and non-fiction are included.

1.3 Text selection

The texts of the corpus have been selected on the basis of the taxonomy in figure 2. Each text is categorised according to this taxonomy. For example, a novel intended for children has the code FC, and a popularised book on the Vikings has the code NPS.

                       --------------
                       | Children (C)
Fiction (F) ---------- | Detective (D)
                       | General (G)
                       --------------

                                    -------------------
                                    | Belles lettres (B)
                                    | Information (I)
                  |Popular (P)------|
                  |                 | Science (S)
                  |                 | Miscellaneous (M)
                  |                 --------------------
Non-fiction (N) --|
                  |                 ------------------
                  |                 | Acts (A)
                  |                 | Reports (R)
                  |Specialised (S)--|
                                    | Science (S)
                                    | Miscellaneous (M)
                                    -------------------

Figure 2: Taxonomy of text categories

1.4 Definition of text

1.4.1 Core corpus

The texts of the core corpus are mostly extracts from books. The extracts are between 10,000 and 15,000 words long (30 - 40 pages), and are taken from the beginning of the texts. The front matter, prefaces, forewords, list of contents, etc., are not included in the extracts. In some cases, introductions have been left out as well, e.g. introductions by scholars to works of fiction.

1.4.2 Supplementary corpus

1.5 Methodology

1.5.1 Core corpus

1.5.2 Supplementary corpus

1.6 Availability

To be allowed to store and use the corpus we have been subjected to strict copyright conditions, and the corpus can only be used for research. No commercial use is permitted. Use of the corpus is also limited to the institutions mentioned in the letters of permission. In Norway these are the Department of British and American Studies, University of Oslo, and the Computing Centre for the Humanities, University of Bergen. Scholars and students outside these institutions can only gain access to the corpus by visiting one of these places.

2 Coding

2.1 General principles

The coding of the texts is in broad agreement with the TEI guidelines for electronic texts, as presented in Sperberg-McQueen and Burnard (1994). Textual features are marked by tags enclosed within angle brackets. For example, a heading is marked by a start-tag <head> and an end-tag </head>. Tags may have attributes, to provide an identifier of the element or characterise it in some other way, e.g. <p id=p1> to identify a particular paragraph or <div type=chapter> to mark a chapter. Some tags do not enclose text, e.g. <pb n=2> marking a page break at a particular point in the text. So-called entity references (bounded by & and ;) can be used for a variety of purposes, e.g. to represent characters which are not available or to carry a grammatical tag. The specification of which tags, attributes, and entity references may occur in a particular type of document is called a document type definition.

The document type definition for the texts in the corpus differs in some respects from the TEI model. The differences are, however, mainly additions to the TEI model; a few new tags and entities have been introduced. These tags and entities can be found in the files ENPC.DTD and ENPC.ENT respectively. Together with ENPC.TXT, which invokes the appropriate TEI tag sets, they constitute the complete ENPC tag set (see Appendix 1).

The overall structure of an ENPC text is shown by this example:

<tei.2 id=AT1>
     <teiHeader type=text>
     </teiHeader>
     <text>
     </text>
</tei.2>

In other words, there are two main parts: a header and the main text. Every text has a unique identifier, in this case AT1 (indicating text 1 by Anne Tyler). The corresponding coding for the translation would be: <tei.2 id=AT1T>

The value of the identifier of the translated text is identical to that of the original, with the addition of a letter (T) marking it as a translation.
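
The identifier convention can be illustrated with a few lines of Python. This is only a sketch: the function names are hypothetical and not part of any ENPC tool, and it assumes that identifiers of original texts always end in a digit, so that a final T can only mark a translation.

```python
def translation_id(original_id):
    """Return the identifier of the translation of a text, e.g. AT1 -> AT1T."""
    return original_id + "T"


def is_translation(text_id):
    """Assumption: original identifiers end in a digit, so digit + T marks a translation."""
    return text_id.endswith("T") and text_id[-2:-1].isdigit()


def original_id(text_id):
    """Strip the translation marker, if present, to recover the original identifier."""
    return text_id[:-1] if is_translation(text_id) else text_id
```

With these helpers, the original and its translation can be paired in either direction, e.g. translation_id("AT1") gives AT1T and original_id("AT1T") gives AT1.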

2.2 Markup levels

A distinction is made between two levels:

level 1 (minimum coding): header, coding of main text structure (divisions, headings, paragraphs, s-units). Attributes for "rendition" may be omitted.

level 2: additional coding as outlined in this chapter

The aim is to code as many texts as possible according to level 2. The markup level of a text is specified in the encoding description of the header (see 2.3.2) if it is other than level 2.

2.3 The header

Each text is described by a header which has four main parts, in accordance with the TEI guidelines: a file description, an encoding description, a profile description, and a revision description. These are tagged as follows:

<teiHeader>
     <fileDesc></fileDesc>
     <encodingDesc></encodingDesc>
     <profileDesc></profileDesc>
     <revisionDesc></revisionDesc>
</teiHeader>

Header and main text structure

<tei.2 id=AT1>
     <teiHeader type=text>
          <fileDesc>
               <titleStmt>
                    <title>The Accidental Tourist: Extract in machine-readable
                    form</title>
                    <author>Anne Tyler</author>
                    <respStmt>
                         <resp>tagger</resp>
                         <name>BHL</name>
                    </respStmt>
               </titleStmt>
               <extent>12,000 words from beginning of text</extent>
               <publicationStmt><distributor>English-Norwegian Parallel Corpus (ENPC) Project</distributor></publicationStmt>
               <notesStmt><note resp=tag></note></notesStmt>
               <sourceDesc>
                    <biblStruct>
                         <monogr>
                              <author>Anne Tyler</author>
                              <respStmt>
                                   <resp></resp>
                                   <name></name>
                              </respStmt>
                              <title>The Accidental Tourist</title>
                              <imprint>
                                   <pubPlace>New York</pubPlace>
                                   <publisher>Alfred A. Knopf</publisher>
                                   <date>1985</date>
                              </imprint>
                         </monogr>
                    </biblStruct>
               </sourceDesc>
          </fileDesc>
          <encodingDesc>
               <p>Modified TEI P3. See the ENPC project manual.</p>
          </encodingDesc>
          <profileDesc>
               <langUsage><language>AmE</language></langUsage>
               <textClass><classCode>FG</classCode></textClass>
          </profileDesc>
     </teiHeader>
     <text>
          <body>
               <div1 type= id= >
                    <div2 type= id= >
                         <p id= >
                              <s id= corresp= ></s>
                         </p>
                    </div2>
               </div1>
          </body>
     </text>
</tei.2>

2.3.1 File description

Note that the <titleStmt> describes the machine-readable file, while the source text is specified in the <sourceDesc>. The title in the <titleStmt> should indicate that this is a machine-readable version and should not be identical to the title of the source text. The file description also specifies author, tagger, translator, publication information and the extent of the text extract.

Irregularities in the electronic text, e.g. omissions, are noted in the <notesStmt> (see 2.13.1 and 2.13.2).

2.3.2 Encoding description

The TEI encoding description may include a project description, editorial declarations (on correction, normalization, etc.), information on sampling, reference systems, and any classification schemes. In our case the encoding description can be very brief; it chiefly consists of a reference to the manual for the corpus, the markup level, and any additional comments on special features of encoding applying to the individual text.

In the early stages of the project the encoding description is limited to an indication of markup level and a description in prose of any special characteristics of the text.

2.3.3 Profile description

The profile description is of particular interest in the encoding of corpora, in that it makes it possible to describe each text in a very detailed manner. The present project will chiefly use the following main parts of the TEI profile description:

<langUsage><language> where the language/dialect of the text is described;

<textClass><classCode> where the text is classified in terms of a classification scheme.

The description under <langUsage><language> is in terms of labels like: American English (AmE), Australian English (AuE), British English (BrE), Canadian English (CaE), New Zealand English (NZE), etc. This section may also include observations on special linguistic features of the text (cf. 2.8 below).

The classification under <textClass><classCode> is in terms of the following scheme (see also 1.3):

Fiction:        Children (FC)
                Detective (FD)
                General (FG)
Non-fiction:    Popular:        Belles lettres (biography, memoirs) (NPB)
                                Information (information for the general public) (NPI)
                                Science (history, biology, etc.) (NPS)
                                Miscellaneous (NPM)
                Specialised:    Acts (NSA)
                                Reports (official reports) (NSR)
                                Science (history, biology, etc.) (NSS)
                                Miscellaneous (NSM)
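
For users who process the headers automatically, the scheme can be turned into a small lookup. The following Python sketch is an illustration only; the table and function names are hypothetical, and the labels are simply those of the scheme above.

```python
FICTION = {"C": "Children", "D": "Detective", "G": "General"}
AUDIENCE = {"P": "Popular", "S": "Specialised"}
POPULAR = {"B": "Belles lettres", "I": "Information",
           "S": "Science", "M": "Miscellaneous"}
SPECIALISED = {"A": "Acts", "R": "Reports",
               "S": "Science", "M": "Miscellaneous"}


def decode_class_code(code):
    """Expand a <classCode> value such as FC or NPS into its labels."""
    if len(code) == 2 and code[0] == "F" and code[1] in FICTION:
        return ("Fiction", FICTION[code[1]])
    if len(code) == 3 and code[0] == "N" and code[1] in AUDIENCE:
        subtypes = POPULAR if code[1] == "P" else SPECIALISED
        if code[2] in subtypes:
            return ("Non-fiction", AUDIENCE[code[1]], subtypes[code[2]])
    # Anything else is not part of the scheme.
    raise ValueError("unknown class code: " + repr(code))
```

Note that the letter S is ambiguous on its own (Specialised in second position, Science in third), which is why the sketch decodes by position rather than by letter.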

2.3.4 Revision description

The revision description takes the form of a series of changes. It is structured as follows:

<revisionDesc>
     <change>
          <date></date>
          <name></name>
          <what></what>
     </change>
</revisionDesc>

In other words, this is a list of changes specifying the date of the change, the person responsible for the change, and the nature of the change.

2.4 Text units

The corpus texts are segmented into the following main units: text, division (where applicable), paragraph, s-unit, and word. Words are simply marked by spacing as in ordinary written text. The other units are explicitly tagged.

2.4.1 Text

Where complete texts are encoded, these have the structure recommended by the TEI guidelines:

<text>
     <body>
     </body>
</text>

In the case of text extracts from books, only (part of) the body is included. The encoded text starts with the body of the main text, including headings, and ends with the nearest chapter or section division after the required number of words for the text extract has been reached. If the nearest chapter or section division extends considerably beyond the required number of words, the encoded text ends with the nearest paragraph.
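
The choice of end point can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used by the project: "considerably beyond" is modelled by a hypothetical tolerance parameter, "nearest paragraph" is taken to mean the paragraph in which the target is reached, and the function and parameter names are invented for the example.

```python
from itertools import accumulate


def choose_extract_end(paragraph_lengths, division_ends, target, tolerance):
    """Return the index of the last paragraph to include in the extract.

    paragraph_lengths: word count of each paragraph, in text order
    division_ends: indices of paragraphs that close a chapter or section
    target: required number of words
    tolerance: hypothetical limit for "considerably beyond" the target
    """
    running = list(accumulate(paragraph_lengths))
    # Paragraph in which the required number of words is reached.
    reached = next((i for i, total in enumerate(running) if total >= target), None)
    if reached is None:
        return len(running) - 1  # text is shorter than the target: keep it all
    # Nearest division boundary at or after that paragraph.
    division = next((i for i in sorted(division_ends) if i >= reached), None)
    if division is not None and running[division] <= target + tolerance:
        return division
    return reached  # division too far away: stop at the paragraph instead
```

For example, with five paragraphs of 4,000, 4,000, 4,000, 4,000 and 9,000 words and a target of 12,000, a division ending after the fourth paragraph is accepted if the tolerance is 5,000 words but rejected if it is 3,000.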

The end of a text extract is marked by an <omit> tag; see 2.13.2.

2.4.2 Divisions

Most written texts include some sort of segmentation in terms of parts, chapters, sections, etc. According to the TEI guidelines, these units are tagged as numbered or unnumbered divisions. This corpus uses numbered divisions, where a lower number indicates a higher level. The type of division is described by an attribute. Example structure:

<body>
     <div1 type=part id=NN1.1>
          <div2 type=chapter id=NN1.1.1>
               <div3 type=section id=NN1.1.1.1></div3>
          </div2>
     </div1>
</body>

Each unit has an identifier which is built up by successively adding to the identifier of the text (in this case text NN1: cf. 2.1 above).

Low-level divisions in the text which are only marked by a blank line, asterisks, or the like, are not tagged as divisions. The tag <blankline> is inserted at the appropriate point in the text. This may be taken to signal a major paragraph break.

The front and the back of the texts are not tagged.

2.4.3 Paragraphs

Divisions primarily contain a sequence of paragraphs (in addition, there may be headings, notes, etc.). Continuing our example above, these are marked as follows:

<div3 type=section id=NN1.1.1.1>
     <p id=NN1.1.1.1.p1></p>
</div3>

Each paragraph has an identifier which adds yet another layer to the immediately superordinate identifier.
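
The way identifiers are built up can be shown with a short Python sketch. The function names are hypothetical; the unit letters p, s, and h are those used in this manual for paragraphs, s-units (2.4.4), and headings (2.5).

```python
def division_id(text_id, path):
    """Join a text identifier and a list of division numbers, e.g. NN1 + [1, 1, 1]."""
    return ".".join([text_id] + [str(number) for number in path])


def unit_id(text_id, path, kind, number):
    """Identifier of a paragraph (p), s-unit (s), or heading (h) within a division."""
    if kind not in ("p", "s", "h"):
        raise ValueError("kind must be 'p', 's', or 'h'")
    return "{}.{}{}".format(division_id(text_id, path), kind, number)
```

For instance, unit_id("NN1", [1, 1, 1], "p", 1) yields NN1.1.1.1.p1, the paragraph identifier used in the example above.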

Paragraphs are identified as sections of texts marked by indentation, a blank line, or a combination of the two. Lists are marked as paragraphs or sequences of paragraphs; see 2.10.

2.4.4 S-units

Paragraphs are divided into orthographic sentences, here called s-units to underline that they are not necessarily sentences in a grammatical sense. They are tagged as follows:

<p id=NN1.1.1.1.p1>
     <s id=NN1.1.1.1.s1 corresp=NN1T.1.1.1.1.s1></s>
     <s id=NN1.1.1.1.s2 corresp=NN1T.1.1.1.1.s2></s>
</p>

S-units are numbered within the nearest division, as shown above. After alignment, each s-unit in the core corpus has a "corresp" attribute containing a reference to the corresponding unit(s) in the parallel text. S-units in the supplementary corpus have no corresp attribute.

An s-unit always opens after a paragraph start and ends before an end-of-paragraph marker. S-units are split within paragraphs where a mark of end punctuation (.?! or ... marking ellipsis) is followed by a word beginning with a capital initial (ignoring intervening parentheses, dashes, and quotation marks). No split is made after a colon or semi-colon, even if it is followed by a word beginning with a capital initial (unless there is an end-of-paragraph marker).

S-units are not allowed to nest, i.e. they cannot be contained within each other. If there is an included sentence, e.g. within parentheses or between dashes, it is not coded separately, but is part of the s-unit it is included in. S-units may contain embedded poems, intra-sentential quotations, etc.

The division into s-units is complicated in some cases involving abbreviations and direct speech. Examples:

     <s>Dr. Smith, St. George</s>
     <s>"Hurry up!" Wolfram interrupted.</s>
     <s>"Why didn't you come straight to me?" I asked her.</s>

No split is made in such cases, where the capital does not mark the beginning of an s-unit, but rather the nature of the word.
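
The basic splitting rule can be approximated by a regular expression, as in the Python sketch below. It is a rough illustration only: the abbreviation list is a hypothetical stand-in for a much longer one, the names are invented, and the sketch does not solve the direct speech cases. It would wrongly split "Hurry up!" Wolfram interrupted. into two units, so such cases still need manual checking.

```python
import re

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"Dr.", "St.", "Mr.", "Mrs."}

# End punctuation (plus closing quotes or parentheses), white space, and then,
# ignoring opening quotes, parentheses, and &mdash;, a capital initial.
BOUNDARY = re.compile(r"""([.?!]["')]*)\s+(?=(?:["'(]|&mdash;)*[A-ZÆØÅ])""")


def split_s_units(paragraph):
    """Split one paragraph into s-units; colons and semi-colons never split."""
    units, start = [], 0
    for match in BOUNDARY.finditer(paragraph):
        candidate = paragraph[start:match.end(1)]
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # the full stop marks an abbreviation, not a boundary
        units.append(candidate.strip())
        start = match.end()
    tail = paragraph[start:].strip()
    if tail:
        units.append(tail)
    return units
```

Applied to "Dr. Smith arrived. He sat down.", the sketch keeps "Dr. Smith arrived." together and opens a new s-unit at "He".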

Headings, epigraphs, notes, and poems embedded in the text are not split into s-units.

2.4.5 Words

As pointed out above, words are not tagged, but are simply marked by spacing as in ordinary written text. The exception is that contractions are split into two words (in order to facilitate alignment). Examples:

       can't     ca n't
       I'll      I 'll
       it's      it 's
       d'you     d' you

In the early stages of the project words are not grammatically annotated, with a couple of exceptions:

       let's     let 's&pron;
       soon's    soon 's&subord;

The 's is here disambiguated by the following entity reference, which may be regarded as a grammatical tag.
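
The splitting of contractions can be sketched with a handful of substitution rules in Python. This is an illustration under assumptions rather than the project's own procedure: the rule list is hypothetical and incomplete, and 's is only split after a small, invented set of host words, because a pattern alone cannot tell a contracted 's from a genitive (which is not split).

```python
import re

# Each rule is a (pattern, replacement) pair, applied in order.
_RULES = [
    (re.compile(r"\b(let)'s\b", re.IGNORECASE), r"\1 's&pron;"),
    (re.compile(r"\b(soon)'s\b", re.IGNORECASE), r"\1 's&subord;"),
    (re.compile(r"\b(\w+)n't\b", re.IGNORECASE), r"\1 n't"),
    (re.compile(r"\b(\w+)'(ll|re|ve|d|m)\b", re.IGNORECASE), r"\1 '\2"),
    # Assumed host words for contracted 's; genitives are left alone.
    (re.compile(r"\b(it|he|she|that|there|what|who)'s\b", re.IGNORECASE), r"\1 's"),
    (re.compile(r"\b(d|y)'(\w+)", re.IGNORECASE), r"\1' \2"),
]


def split_contractions(text):
    """Insert a space inside contractions, following the examples in 2.4.5."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text
```

The sketch reproduces the examples above (ca n't, I 'll, it 's, d' you) and leaves a genitive such as week's untouched; a name such as D'Artagnan would be split wrongly and would have to be excluded by hand.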

2.5 Headings and other openers

Headings may occur at the beginning of a division or between paragraphs. They are marked by the tag <head>. Examples:

     <head id=NN1.1.h1>Part 1</head>
     <head id=NN1.1.1.h1>1 Mind in myth</head>

The "enumerator" is encoded as part of the head, as in these examples. Headings carry an "id" which is built up according to the same principle as the "id" of paragraphs and s-units, i.e. they are numbered within the nearest <div> but using "h1, h2, etc." rather than "p1, p2, etc." and "s1, s2, etc.". See 2.4.3-4.

Where there is more than one heading at a particular point, the tag <head> may be repeated. The typographical rendition of the heading is regularly left unmarked, but it can be specified by a "rend" attribute; see 2.7.1.

Running heads at the top of pages are not encoded.

Epigraphs at the beginning of divisions have the following structure:

<epigraph>
     <quote></quote>
     <bibl></bibl>
</epigraph>

As regards the encoding of other opening elements, see the TEI guidelines.

2.6 Punctuation

The punctuation is regularly left as in the original text. Some problems of detail are taken up below.

2.6.1 Full stop

The full stop is retained both as a marker of abbreviation and when marking the end of an orthographic sentence. The two uses are disambiguated by the tagging of s-units (see 2.4.4).

The marking of ellipsis by successive full stops is regularized; any spaces before or between the dots are removed.
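
A minimal Python sketch of this regularization is given below. The function name is hypothetical, and the sketch assumes that an ellipsis consists of three or more full stops; it removes white space before and between the dots but does not change their number.

```python
import re

# Three or more full stops, possibly separated and preceded by white space.
_ELLIPSIS = re.compile(r"\s*\.(?:\s*\.){2,}")


def regularise_ellipsis(text):
    """Remove any spaces before or between the dots of an ellipsis."""
    return _ELLIPSIS.sub(lambda match: re.sub(r"\s+", "", match.group(0)), text)
```

Ordinary full stops are not affected, since a single stop followed by a word does not match the pattern.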

2.6.2 Hyphen

Line-end (soft) hyphens are removed where they are not part of the regular spelling of the word. In cases of doubt, guidance should be sought elsewhere in the same text or in dictionaries. If doubt still remains, a hyphen should be retained rather than removed.

2.6.3 Dash

Dashes are marked by an entity reference (&mdash;). No distinction is made between different types of dashes.

2.6.4 Quotation marks

Quotation marks are regularized to single and double quotes. At a later stage in the project the various uses of quotation marks may be distinguished and marked according to the TEI conventions. See further 2.7.8 below.

2.6.5 Apostrophe

The apostrophe is left as it is. In the encoded text it cannot be distinguished from a single quotation mark. This is of less importance, as the two regularly appear in different contexts; the quotation mark at the beginning or end of words, the apostrophe within words (apart from genitives ending in -s' and split contractions; cf. 2.4.5). The ambiguity may be removed at a later stage (cf. 2.6.4).

2.7 Highlighting and quotation

No attempt is made to capture the full typography of the original text. Variation between upper and lower case is reproduced as in the original text. Use of typographical highlighting is marked where it is judged to be significant for the interpretation of the text.

2.7.1 Typographical highlighting

Typographical highlighting is marked by a "rend" (=rendition) attribute, if it applies to a whole element: a paragraph or an s-unit, as in:

     <p rend=italic>
     <s rend=bold>

Where there is no applicable element, the tag <hi> is used:

     I <hi rend=italic>hate</hi> it.

The TEI guidelines propose the tag <emph> for linguistically emphatic or stressed sections of the text. The TEI tag <hi> is preferred in the present corpus, to avoid some problems in identifying the purpose of typographical highlighting.

Where part of a text is highlighted typographically because it is identified as foreign, it is preferable to use the tagging presented in the next section (though the "rend" attribute can be used in addition).

2.7.2 Foreign words and expressions

Foreign words and expressions are marked by a "lang" attribute. This is simple if the foreign element carries a tag:

     <head lang=fr>
     <s lang=la>

Where there is no applicable element, the tag <foreign> is used:

     He was tried <foreign lang=la>in absentia</foreign>

Some possible values of the "lang" attribute are:

   
     de   German
     en   English
     es   Spanish
     fr   French
     gr   Greek
     la   Latin
     no   Norwegian
     sv   Swedish

Foreign words and expressions are only marked where they are clearly recognizable as foreign (by being identifiable as separate units or being reproduced as typographically distinct from the surrounding text). The "lang" attribute can of course be used in the cases taken up next. Long passages in a foreign language are replaced by an <omit> tag; see 2.13.2.

2.7.3 Language mention

Words and expressions which are mentioned rather than used are normally marked by italics or quotation marks. These are tagged <mentioned>, as in:

     <mentioned rend=italic>She</mentioned> is a personal pronoun.
     <mentioned lang=de>"Singen"</mentioned> is a strong verb.

The rendition is marked by an attribute and/or by retaining quotation marks.

2.7.4 Terms

Highlighted terms are tagged <term>, possibly accompanied by a <gloss>, as in:

     Apical sounds are produced with the <term>apex</term>
     <gloss>'tip of the tongue'</gloss>.

2.7.5 Titles

Titles of books, newspapers, magazines, films, songs, paintings, etc. are tagged <title>, as in:

     Have you read <title>Paradise Lost</title>?

Titles are only tagged if they are typographically highlighted in some way, e.g. by italic, bold or underscore.

2.7.6 Names

Names of persons, ships, boats, buildings, etc. are tagged <name>, as in:

     I went on board <name>Tumble</name> and set sail.

Names are only tagged if they are typographically highlighted in some way, e.g. by italic, bold or underscore. The "type" attribute is optional, and is usually not inserted at this stage.

Names of places, organizations, etc. are usually not tagged.

2.7.7 Quotations

Quotations from extraneous sources are tagged <quote> if they comprise one or more complete s-units. Quotations within s-units are not tagged, but are usually surrounded by double quotation marks, as in:

     The Apostle Paul said concerning some that "By good words
     and fair speeches they deceived the heart of the simple."

Foreign quotations are marked by a "lang" attribute. Long foreign quotations are omitted and replaced by an <omit> tag; see 2.13.2.

Direct speech in fiction is left unmarked and is simply shown by quotation marks. At a later stage direct speech may be tagged as in this example:

     <q>"Let's go,"</q> she said.

Before this tagging, direct speech may not be identifiable, as it is not always indicated by quotation marks. Missing quotation marks can be inserted using the <add> tag; see 2.13.2.

2.7.8 Use of single (') vs double quotation marks (")

All single quotation marks (') are converted to double quotation marks in direct speech and marked text (e.g. quotations within an s-unit).

     <s>"I do n't know how he stays so thin."</s>
     <s>She used her "meeting voice".</s>

The single quotation mark (') is only used in contractions (She 's, y' enjoy) and to mark the genitive (next week's Sunday newspapers' review section). Quotations within quotations are tagged <qq>. This also applies to marked text within quotations or direct speech.

     <p><s>"The finger got stuck inside his nose," Matilda
     said, "and he had to go around like that for a week.</s>
     <s>People kept saying to him, <qq>Stop picking your
     nose</qq>, and he could n't do anything about it.</s>
     <s>He looked an awful fool."</s></p>
     <s>"Lately he 's discovered <qq>breakfast
     meetings</qq>.</s> <s>Now he gorges and guzzles all
     day.</s> <s>I do n't know how he stays so thin."</s>

2.8 Linguistically distinct material

The marking of foreign elements has already been dealt with (see 2.7.2). It may be essential to mark other linguistically distinct material, such as dialect words or idiosyncratic spellings. These are tagged <distinct>, with an attribute indicating the type of deviance. Examples:

     <distinct type=nonstand>Mister Carlyle sure give it to
     yuh, he finds out!</distinct>
     Why do we not treat <distinct type=nonceword>bunkraptcy</distinct> precisely as we
     treat bankruptcy?

The main value used for the "type" attribute in the present project is "nonstand", indicating deviance of different kinds: dialect, slang, idiosyncratic spelling, etc. If such features are pervasive in the text, this is noted in the header (under <notesStmt>), and each individual case is not marked.

2.9 Notes

Notes in the source text are tagged <note> and are inserted at the place in the text marked by the reference to the note. Attributes include "resp" and "place". Example:

     <note resp=auth place=foot>Unless otherwise specified,
     all remarks about bilingualism apply as well to
     multilingualism, the practice of using alternately three
     or more languages.</note>

Values of the "resp" attribute used in the project are: auth (author), ed (editor), tr (translator), tag (tagger). References to notes are omitted. Notes are not counted as included in the text proper, and are not split into s-units. In special cases it may be desirable to omit notes. They are then replaced by an <omit> tag. See 2.13.2.

2.10 Lists

Lists which contain very little ordinary language text (e.g. lists of references) are omitted and replaced by an <omit> tag; see 2.13.2. Other lists are treated as paragraphs or sequences of paragraphs (the latter in case each list item is set out typographically as a paragraph). S-units are used for subdivision, as for ordinary paragraphs.

2.11 Figures, diagrams, and tables

Figures, diagrams, and tables are left out and replaced by an <omit> tag. See 2.13.2.

2.12 Embedded texts

Poems, songs, etc. that are embedded in a prose text are tagged <poem>. The internal structure is not specified. Verse lines are reproduced with a line break between each. There is a blank line between stanzas. Poems are included in the nearest s-unit. There is no internal division into s-units.

In some cases it may be preferable to leave out a poem and replace it by an <omit> tag. See 2.13.2.

Embedded texts in prose are simply reproduced as part of the main text. Ordinary paragraph and s-unit marking is used. Frequently they will be tagged as quotations; see 2.7.7.

2.13 Editorial comment

The mechanisms for editorial comment are those recommended by the TEI guidelines for simple editorial changes.

2.13.1 Correction and regularization

Correction is marked as shown by this example:

     ... to render that service to poor <corr sic=poele resp=tag>people</corr>

Where it is apparent that there is a typographical error, the main text is corrected and the original reading is given as a value of a "sic" attribute. A "resp" attribute should be used to specify the person responsible for the correction (normally "tag" for "tagger"; cf. 2.9). The tag <sic> is used where there is no straightforward correction, but it is apparent that the text is inaccurate. A suggested correction may be given as a value of a "corr" attribute. A "resp" attribute should be used to specify the person responsible for the correction. Repeated wrong spelling of words throughout a text is noted in the <notesStmt>, and not tagged using the <corr> tag on each occasion. Beyond correction of obvious typographical errors, the language of the corpus texts is not normalized or regularized.

2.13.2 Addition, deletion, and omission

Omission of passages in the text may be marked by an <omit> tag; see 2.4.1, 2.7.2, 2.7.7, 2.9, 2.10, 2.11, 2.12. The tag has the following attributes:

desc: describing the omitted text
reason: giving the reason for the omission
extent: indicating the extent of the omission
resp: specifying the person responsible for the omission

The "desc" and "resp" attributes should normally be used. Sample "desc" values include: table, figure, foreign text.

Addition and deletion in the main text are avoided, though they can be indicated by <add> and <del> tags. An example of the use of the <add> tag is the insertion of a missing quotation mark; cf. 2.7.7.
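
For taggers who insert <omit> tags by program, a small helper might look like the Python sketch below. It is an assumption-laden illustration: the function names are hypothetical, values containing spaces are quoted by analogy with the quoted corresp value shown in 2.17, and since the manual does not say whether <omit> takes an end-tag, only the start-tag is produced.

```python
import re

# Values made up only of letters, digits, full stops, and hyphens are left unquoted.
_PLAIN_VALUE = re.compile(r"[A-Za-z0-9.\-]+")


def sgml_attribute(name, value):
    """Format one attribute, quoting the value only where necessary."""
    value = str(value)
    if _PLAIN_VALUE.fullmatch(value):
        return "{}={}".format(name, value)
    quote = '"' if "'" in value else "'"
    return "{}={}{}{}".format(name, quote, value, quote)


def omit_tag(desc, resp="tag", reason=None, extent=None):
    """Build an <omit> start-tag; desc and resp are the attributes normally used."""
    attributes = [("desc", desc), ("reason", reason), ("extent", extent), ("resp", resp)]
    parts = [sgml_attribute(name, value) for name, value in attributes if value is not None]
    return "<omit " + " ".join(parts) + ">"
```

Here omit_tag("table") produces <omit desc=table resp=tag>, while a description such as "foreign text" is quoted because it contains a space.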

2.14 Special characters

Special characters are encoded as entity references, e.g.

&scaron;  š
&pound;   £
&mdash;   —

Entity references specific to the project are listed in the project entity file (ENPC.ENT) (see Appendix 1). All others are found in one of the public entity sets that come with TEI P3, e.g. ISOpub.ENT.

NB! Accented and special characters used in Western European languages (de, en, fr, no) are not encoded as entity references at this stage. They are, therefore, system dependent.
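
When the texts are to be displayed or searched, entity references for special characters can be expanded again. The Python sketch below is an illustration only: the function name and the small table are hypothetical, covering just the ISO entity names scaron, pound, and mdash, and any reference not in the table (for instance a grammatical tag such as &pron;) is left untouched.

```python
import re

# Illustrative table; a full one would be derived from the entity files.
ENTITIES = {"scaron": "\u0161", "pound": "\u00a3", "mdash": "\u2014"}

_ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.\-]*);")


def expand_entities(text, table=ENTITIES):
    """Replace known entity references by characters; keep unknown ones as they are."""
    return _ENTITY_REF.sub(lambda match: table.get(match.group(1), match.group(0)), text)
```

Leaving unknown references in place means that project-specific entities used as grammatical tags survive the conversion and can be handled separately.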

2.15 Page breaks

Page breaks in the source text are kept to make it easier to refer back to the source. They are tagged <pb n= >, i.e. with the number as the value of an attribute. The placement of <pb> is normalized and is always given at the beginning of the relevant page. If there is a page break in the middle of a hyphenated word in the original text, <pb> is placed after the relevant word in the encoded text.

2.16 Reference system

A reference system is built up using the identifiers of the text units. See 2.1 (text), 2.4.2 (division), 2.4.3 (paragraph), 2.4.4 (s-unit), 2.5 (heading).

2.17 Links

Links between parallel texts are indicated by attributes of s-units, as shown in 2.4.4. Example:

<s id=DL2.1.s18 corresp='DL2T.1.s18 DL2T.1.s19'>At once,
feeling her advantage, she said, "Do n't forget you 've been
living soft for four years."</s>
<s id=DL2T.1.s18 corresp=DL2.1.s18>Hun hadde fått et lite
overtak og fulgte det opp.</s>
<s id=DL2T.1.s19 corresp=DL2.1.s18>"Ikke glem at du har levd
godt i fire år nå."</s>
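
The links can be read and checked by program, as in the following Python sketch. It is an illustration under assumptions: the function names are hypothetical, the pattern expects the id attribute to come directly before corresp (as in the example), and a regular expression is used because the unquoted attribute values are not well-formed XML.

```python
import re

# Start-tag of an aligned s-unit; the corresp value may be quoted or unquoted.
S_TAG = re.compile(
    r"""<s\s+id=([^\s>]+)\s+corresp=(?:'([^']*)'|"([^"]*)"|([^\s>]+))\s*>"""
)

SAMPLE = """<s id=DL2.1.s18 corresp='DL2T.1.s18 DL2T.1.s19'>At once,
feeling her advantage, she said, "Do n't forget you 've been
living soft for four years."</s>
<s id=DL2T.1.s18 corresp=DL2.1.s18>Hun hadde fått et lite
overtak og fulgte det opp.</s>
<s id=DL2T.1.s19 corresp=DL2.1.s18>"Ikke glem at du har levd
godt i fire år nå."</s>
"""


def read_links(sgml):
    """Map each s-unit identifier to the list of identifiers in its corresp value."""
    links = {}
    for match in S_TAG.finditer(sgml):
        targets = match.group(2) or match.group(3) or match.group(4)
        links[match.group(1)] = targets.split()
    return links


def one_way_links(links):
    """Return (source, target) pairs where the target does not point back."""
    return [(source, target)
            for source, targets in links.items()
            for target in targets
            if source not in links.get(target, [])]
```

For the example above, read_links(SAMPLE) maps DL2.1.s18 to the two Norwegian s-units, and one_way_links reports no pairs, since each translated unit points back to the original.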

2.18 Analytic coding

In the earlier stages of the project there will be no linguistic annotation, with a few exceptions; see 2.4.5.