The main purpose of this manual is to describe the structure and explain the coding of the English-Norwegian Parallel Corpus.
The aim of the English-Norwegian Parallel Corpus (ENPC) project is to produce a computer corpus for use in contrastive analysis and translation studies. There is a core corpus consisting of original texts and translations (Norwegian to English and English to Norwegian), and a supplementary corpus consisting of English and Norwegian texts matched by genre.
Figure 1 shows the schematic structure of the corpus. The core corpus is indicated by solid boxes, the supplementary by broken ones. On the basis of this structure several kinds of studies are possible:
Figure 1: The structure of the corpus
The core corpus contains original texts and their translations (English to Norwegian and Norwegian to English). In order to include material by a range of translators, the texts of the core corpus are limited to text extracts of some 10,000 - 15,000 words. The core corpus contains both fictional and non-fictional texts.
The supplementary corpus contains original text extracts only. Both fiction and non-fiction are included.
The texts of the corpus have been selected on the basis of the taxonomy of figure 2. Each text is categorised according to this taxonomy. For example, a novel intended for children has the code FC, and a popularised book on the Vikings has the code NPS.
Fiction (F)
    Children (C)
    Detective (D)
    General (G)
Non-fiction (N)
    Popular (P)
        Belles lettres (B)
        Information (I)
        Science (S)
        Miscellaneous (M)
    Specialised (S)
        Acts (A)
        Reports (R)
        Science (S)
        Miscellaneous (M)
Figure 2: Taxonomy of text categories
The texts of the core corpus are mostly extracts from books. The extracts are between 10,000 and 15,000 words long (30 - 40 pages), and are taken from the beginning of the texts. The front matter, prefaces, forewords, list of contents, etc., are not included in the extracts. In some cases, introductions have been left out as well, e.g. introductions by scholars to works of fiction.
To be allowed to store and use the corpus we have been subjected to strict copyright conditions, and the corpus can only be used for research. No commercial use is permitted. Use of the corpus is also limited to the institutions mentioned in the letters of permission. In Norway these are the Department of British and American Studies, University of Oslo, and the Computing Centre for the Humanities, University of Bergen. Scholars and students outside these institutions can only gain access to the corpus by visiting one of these places.
The coding of the texts is in broad agreement with the TEI guidelines for electronic texts, as presented in Sperberg-McQueen and Burnard (1994). Textual features are marked by tags enclosed within angle brackets. For example, a heading is marked by a start-tag <head> and an end-tag </head>. Tags may have attributes, which provide an identifier for the element or characterise it in some other way, e.g. <p id=p1> to identify a particular paragraph or <div type=chapter> to mark a chapter. Some tags do not enclose text, e.g. <pb n=2>, which marks a page break at a particular point in the text. So-called entity references (bounded by & and ;) can be used for a variety of purposes, e.g. to represent characters which are not available or to carry a grammatical tag. The formal specification of the tags, attributes, and entity references that may occur in a particular type of document is called a document type definition (DTD).
The document type definition for the texts in the corpus differs in some respects from the TEI model. The differences are, however, mainly additions to the TEI model; a few new tags and entities have been introduced. These tags and entities can be found in the files ENPC.DTD and ENPC.ENT respectively. Together with ENPC.TXT, which invokes the appropriate TEI tag sets, they constitute the complete ENPC tag set (see Appendix 1).
The overall structure of an ENPC text is shown by this example:
<tei.2 id=AT1>
  <teiHeader type=text>
  </teiHeader>
  <text>
  </text>
</tei.2>
In other words, there are two main parts: a header and the main text. Every text has a unique identifier, here AT1 (indicating text 1 by Anne Tyler). The corresponding coding for the translation would be: <tei.2 id=AT1T>
The value of the identifier of the translated text is identical to that of the original, with the addition of a letter (T) marking it as a translation.
A distinction is made between two levels:
level 1 (minimum coding): header, coding of main text structure (divisions, headings, paragraphs, s-units). Attributes for "rendition" may be omitted.
level 2: additional coding as outlined in this chapter
The aim is to code as many texts as possible according to level 2. The markup level of a text is specified in the encoding description of the header (see 2.3.2) if it is other than level 2.
Each text is described by a header which has four main parts, in accordance with the TEI guidelines: a file description, an encoding description, a profile description, and a revision description. These are tagged as follows:
<teiHeader>
  <fileDesc></fileDesc>
  <encodingDesc></encodingDesc>
  <profileDesc></profileDesc>
  <revisionDesc></revisionDesc>
</teiHeader>
Header and main text structure
<tei.2 id=AT1>
<teiHeader type=text>
  <fileDesc>
    <titleStmt>
      <title>The Accidental Tourist: Extract in machine-readable form</title>
      <author>Anne Tyler</author>
      <respStmt>
        <resp>tagger</resp>
        <name>BHL</name>
      </respStmt>
    </titleStmt>
    <extent>12,000 words from beginning of text</extent>
    <publicationStmt><distributor>English-Norwegian Parallel Corpus (ENPC) Project</distributor></publicationStmt>
    <notesStmt><note resp=tag></note></notesStmt>
    <sourceDesc>
      <biblStruct>
        <monogr>
          <author>Anne Tyler</author>
          <respStmt>
            <resp></resp>
            <name></name>
          </respStmt>
          <title>The Accidental Tourist</title>
          <imprint>
            <pubPlace>New York</pubPlace>
            <publisher>Alfred A. Knopf</publisher>
            <date>1985</date>
          </imprint>
        </monogr>
      </biblStruct>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>
    <p>Modified TEI P3. See the ENPC project manual.</p>
  </encodingDesc>
  <profileDesc>
    <langUsage><language>AmE</language></langUsage>
    <textClass><classCode>FG</classCode></textClass>
  </profileDesc>
</teiHeader>
<text>
  <body>
    <div1 type= id= >
      <div2 type= id= >
        <p id= >
          <s id= corresp= ></s>
        </p>
      </div2>
    </div1>
  </body>
</text>
</tei.2>
Note that the <titleStmt> describes the machine-readable file, while the source text is specified in the <sourceDesc>. The title in the <titleStmt> should indicate that this is a machine-readable version and should not be identical to the title of the source text. The file description also specifies author, tagger, translator, publication information and the extent of the text extract.
Irregularities, e.g. omissions, of the electronic text are noted in the <notesStmt> (see 2.13.1 and 2.13.2).
The TEI encoding description may include a project description, editorial declarations (on correction, normalization, etc.), information on sampling, reference systems, and any classification schemes. In our case the encoding description can be very brief; it chiefly consists of a reference to the manual for the corpus, the markup level, and any additional comments on special features of encoding applying to the individual text.
In the early stages of the project the encoding description is limited to an indication of markup level and a description in prose of any special characteristics of the text.
The profile description is of particular interest in the encoding of corpora, in that it makes it possible to describe each text in a very detailed manner. The present project will chiefly use the following main parts of the TEI profile description:
<langUsage><language> where the language/dialect of the text is described;
<textClass><classCode> where the text is classified in terms of a classification scheme;
The description under <langUsage><language> is in terms of labels like: American English (AmE), Australian English (AuE), British English (BrE), Canadian English (CaE), New Zealand English (NZE), etc. This section may also include observations on special linguistic features of the text (cf. 2.8 below).
The classification under <textClass><classCode> is in terms of the following scheme (see also 1.3):
Fiction:
    Children (FC)
    Detective (FD)
    General (FG)
Non-fiction:
    Popular:
        Belles lettres (biography, memoirs) (NPB)
        Information (information for the general public) (NPI)
        Science (history, biology, etc.) (NPS)
        Miscellaneous (NPM)
    Specialised:
        Acts (NSA)
        Reports (official reports) (NSR)
        Science (history, biology, etc.) (NSS)
        Miscellaneous (NSM)
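For users who process headers automatically, the classification scheme can be held in a small lookup table. The following Python sketch is illustrative only: the codes and labels are those of the scheme above, while the names CLASS_CODES and describe_class_code are assumptions made for the example and are not part of the ENPC tag set.

```python
# Class codes and labels as given in the classification scheme.
# The dictionary and function names are hypothetical.
CLASS_CODES = {
    "FC": ("Fiction", "Children"),
    "FD": ("Fiction", "Detective"),
    "FG": ("Fiction", "General"),
    "NPB": ("Non-fiction", "Popular", "Belles lettres"),
    "NPI": ("Non-fiction", "Popular", "Information"),
    "NPS": ("Non-fiction", "Popular", "Science"),
    "NPM": ("Non-fiction", "Popular", "Miscellaneous"),
    "NSA": ("Non-fiction", "Specialised", "Acts"),
    "NSR": ("Non-fiction", "Specialised", "Reports"),
    "NSS": ("Non-fiction", "Specialised", "Science"),
    "NSM": ("Non-fiction", "Specialised", "Miscellaneous"),
}


def describe_class_code(code: str) -> str:
    """Expand a <classCode> value into its path in the taxonomy."""
    try:
        return " > ".join(CLASS_CODES[code])
    except KeyError:
        raise ValueError(f"unknown ENPC class code: {code!r}") from None
```

With this table, the value FG found in the header of The Accidental Tourist expands to "Fiction > General".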
The revision description takes the form of a series of changes. It is structured as follows:
<revisionDesc>
  <change>
    <date></date>
    <name></name>
    <what></what>
  </change>
</revisionDesc>
In other words, this is a list of changes specifying the date of the change, the person responsible for the change, and the nature of the change.
The corpus texts are segmented into the following main units: text, division (where applicable), paragraph, s-unit, and word. Words are simply marked by spacing as in ordinary written text. The other units are explicitly tagged.
Where complete texts are encoded, these have the structure recommended by the TEI guidelines:
<text>
  <body>
  </body>
</text>
In the case of text extracts from books, only (part of) the body is included. The encoded text starts with the body of the main text, including headings, and ends with the nearest chapter or section division after the required number of words for the text extract has been reached. If the nearest chapter or section division extends considerably beyond the required number of words, the encoded text ends with the nearest paragraph.
The end of a text extract is marked by an <omit> tag; see 2.13.2.
Most written texts include some sort of segmentation in terms of parts, chapters, sections, etc. According to the TEI guidelines, these units are tagged as numbered or unnumbered divisions. This corpus uses numbered divisions, where a lower number indicates a higher level. The type of division is described by an attribute. Example structure:
<body>
  <div1 type=part id=NN1.1>
    <div2 type=chapter id=NN1.1.1>
      <div3 type=section id=NN1.1.1.1></div3>
    </div2>
  </div1>
</body>
Each unit has an identifier which is built up by successively adding to the identifier of the text (in this case text NN1: cf. 2.1 above).
Low-level divisions in the text which are only marked by a blank line, asterisks, or the like, are not tagged as divisions. The tag <blankline> is inserted at the appropriate point in the text. This may be taken to signal a major paragraph break.
The front and the back of the texts are not tagged.
Divisions primarily contain a sequence of paragraphs (in addition, there may be headings, notes, etc.). Continuing our example above, these are marked as follows:
<div3 type=section id=NN1.1.1.1>
  <p id=NN1.1.1.1.p1></p>
</div3>
Each paragraph has an identifier which adds yet another layer to the immediately superordinate identifier.
Paragraphs are identified as sections of texts marked by indentation, a blank line, or a combination of the two. Lists are marked as paragraphs or sequences of paragraphs; see 2.10.
Paragraphs are divided into orthographic sentences, here called s-units to underline that they are not necessarily sentences in a grammatical sense. They are tagged as follows:
<p id=NN1.1.1.1.p1>
  <s id=NN1.1.1.1.s1 corresp=NN1T.1.1.1.s1></s>
  <s id=NN1.1.1.1.s2 corresp=NN1T.1.1.1.s2></s>
</p>
S-units are numbered within the nearest division, as shown above. After alignment, each s-unit in the core corpus has a "corresp" attribute containing a reference to the corresponding unit(s) in the parallel text. S-units in the supplementary corpus have no corresp attribute.
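The way identifiers are built up can be summarised in a short Python sketch. It is not part of the project software; the function names are hypothetical, and parallel_id assumes that identifiers of original texts never end in the letter T, as is the case for the examples in this manual (AT1, NN1). Note that the actual value of "corresp" is determined by the alignment and need not carry the same number as the s-unit itself.

```python
def division_id(parent_id: str, number: int) -> str:
    """Identifier of a division: the parent identifier plus the division number."""
    return f"{parent_id}.{number}"


def unit_id(division: str, kind: str, number: int) -> str:
    """Identifier of a paragraph ('p') or s-unit ('s'), numbered within a division."""
    if kind not in ("p", "s"):
        raise ValueError(f"unsupported unit kind: {kind!r}")
    return f"{division}.{kind}{number}"


def parallel_id(identifier: str) -> str:
    """Switch between an original identifier and its translation counterpart.

    Only the text identifier (the part before the first full stop) changes:
    'NN1.1.1.1.s1' becomes 'NN1T.1.1.1.s1', and vice versa.
    Assumption: original text identifiers do not end in 'T'.
    """
    text_id, sep, rest = identifier.partition(".")
    other = text_id[:-1] if text_id.endswith("T") else text_id + "T"
    return other + sep + rest


# Build the identifiers used in the examples above, starting from text NN1.
section = division_id(division_id(division_id("NN1", 1), 1), 1)
first_paragraph = unit_id(section, "p", 1)
first_s_unit = unit_id(section, "s", 1)
```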
An s-unit always opens after a paragraph start and ends before an end-of-paragraph marker. S-units are split within paragraphs where a mark of end punctuation (.?! or ... marking ellipsis) is followed by a word beginning with a capital initial (ignoring intervening parentheses, dashes, and quotation marks). No split is made after a colon or semicolon, even when it is followed by a word beginning with a capital initial (unless there is an end-of-paragraph marker).
S-units are not allowed to nest, i.e. they cannot be contained within each other. If there is an included sentence, e.g. within parentheses or between dashes, it is not coded separately, but is part of the s-unit it is included in. S-units may contain embedded poems, intra-sentential quotations, etc.
The division into s-units is complicated in some cases involving abbreviations and direct speech. Examples:
<s>Dr. Smith, St. George</s>
<s>"Hurry up!" Wolfram interrupted.</s>
<s>"Why didn't you come straight to me?" I asked her.</s>
No split is made in such cases, where the capital does not mark the beginning of an s-unit, but rather the nature of the word.
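The splitting rule can be approximated by a simple heuristic, sketched here in Python. This is an illustration under stated assumptions, not the procedure actually used in the project: the abbreviation list is a hypothetical and incomplete sample, dashes are not handled, and the sketch does not recognise direct speech followed by a reporting clause beginning with a capital (as in the Wolfram example above), so its output would still need manual checking.

```python
import re

# Hypothetical, incomplete list of abbreviations that must not end an s-unit.
ABBREVIATIONS = {"Dr.", "St.", "Mr.", "Mrs."}

# End punctuation (. ? ! or ...), optional closing quotes or parentheses,
# whitespace, and then (looking ahead) optional opening quotes or
# parentheses followed by a capital initial.
BOUNDARY = re.compile(r"""(?:\.\.\.|[.?!])["')]*\s+(?=["'(]*[A-ZÆØÅ])""")


def split_s_units(paragraph: str) -> list[str]:
    """Split one paragraph into candidate s-units."""
    units = []
    start = 0
    for match in BOUNDARY.finditer(paragraph):
        candidate = paragraph[start:match.end()].strip()
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # the full stop marks an abbreviation, not a boundary
        units.append(candidate)
        start = match.end()
    tail = paragraph[start:].strip()
    if tail:
        units.append(tail)
    return units
```

Because colons and semicolons are not in the set of end punctuation, no split is made after them, in line with the rule stated above.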
Headings, epigraphs, notes, and poems embedded in the text are not split into s-units.
As pointed out above, words are not tagged, but are simply marked by spacing as in ordinary written text. The exception is that contractions are split into two words (in order to facilitate alignment). Examples:
can't     ca n't
I'll      I 'll
it's      it 's
d'you     d' you
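A rough Python sketch of this splitting is given below. It is an assumption-laden illustration rather than the project's own tool: the list of words after which 's is split (S_HOSTS) is a hypothetical sample, chosen because 's after an arbitrary noun is more likely to be a genitive, which is left unsplit.

```python
import re

# Hypothetical sample of words after which 's is a contracted verb or pronoun.
S_HOSTS = ("it", "he", "she", "that", "there", "what", "who", "let")

NEGATION = re.compile(r"(?<=\w)(n't)\b", re.IGNORECASE)
CLITIC = re.compile(r"(?<=\w)('(?:ll|re|ve|d|m))\b", re.IGNORECASE)
S_CLITIC = re.compile(r"\b(" + "|".join(S_HOSTS) + r")('s)\b", re.IGNORECASE)
PROCLITIC = re.compile(r"\b(d')(?=\w)", re.IGNORECASE)


def split_contractions(text: str) -> str:
    """Insert a space so that each contraction becomes two words."""
    text = NEGATION.sub(r" \1", text)      # can't  -> ca n't
    text = CLITIC.sub(r" \1", text)        # I'll   -> I 'll
    text = S_CLITIC.sub(r"\1 \2", text)    # it's   -> it 's
    text = PROCLITIC.sub(r"\1 ", text)     # d'you  -> d' you
    return text
```

Genitives such as "next week's" are left untouched by this sketch, since "week" is not in the sample list.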
In the early stages of the project words are not grammatically annotated, with a couple of exceptions:
let's     let 's&pron;
soon's    soon 's&subord;
The -s is here disambiguated by the following entity reference, which may be regarded as a grammatical tag.
Headings may occur at the beginning of a division or between paragraphs. They are marked by the tag <head>. Examples:
<head id=NN1.1.h1>Part 1</head>
<head id=NN1.1.1.h1>1 Mind in myth</head>
The "enumerator" is encoded as part of the head, as in these examples. Headings carry an "id" which is built up according to the same principle as the "id" of paragraphs and s-units, i.e. they are numbered within the nearest <div> but using "h1, h2, etc." rather than "p1, p2, etc." and "s1, s2, etc.". See 2.4.3-4.
Where there is more than one heading at a particular point, the tag <head> may be repeated. The typographical rendition of the heading is regularly left unmarked, but it can be specified by a "rend" attribute; see 2.7.1.
Running heads at the top of pages are not encoded.
Epigraphs at the beginning of divisions have the following structure:
<epigraph>
  <quote></quote>
  <bibl></bibl>
</epigraph>
As regards the encoding of other opening elements, see the TEI guidelines.
The punctuation is regularly left as in the original text. Some problems of detail are taken up below.
The full stop is retained both as a marker of abbreviation and when marking the end of an orthographic sentence. The two uses are disambiguated by the tagging of s-units (see 2.4.4).
The marking of ellipsis by successive full stops is regularized; any spaces before or between the dots are removed.
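As an illustration, this regularization can be expressed as a single substitution. The Python sketch below is one possible reading of the rule, not the project's own procedure: it assumes that a run of two or more full stops separated only by spaces or tabs is an ellipsis, and it keeps the number of dots found in the source.

```python
import re

# Two or more full stops, separated (and optionally preceded) by spaces or tabs.
SPACED_DOTS = re.compile(r"[ \t]*\.(?:[ \t]*\.)+")


def regularize_ellipsis(text: str) -> str:
    """Remove spaces before and between the dots of an ellipsis."""
    return SPACED_DOTS.sub(lambda m: re.sub(r"[ \t]+", "", m.group(0)), text)
```

A single full stop followed by a new sentence is not affected, since the pattern requires at least two dots with nothing but spacing between them.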
Line-end (soft) hyphens are removed where they are not part of the regular spelling of the word. In cases of doubt, guidance should be sought elsewhere in the same text or in dictionaries. If doubt still remains, a hyphen should be retained rather than removed.
Dashes are marked by an entity reference (&mdash;). No distinction is made between different types of dashes.
Quotation marks are regularized to single and double quotes. At a later stage in the project the various uses of quotation marks may be distinguished and marked according to the TEI conventions. See further 2.7.8 below.
The apostrophe is left as it is. In the encoded text it cannot be distinguished from a single quotation mark. This is of less importance, as the two regularly appear in different contexts: the quotation mark at the beginning or end of words, the apostrophe within words (apart from genitives ending in -s' and split contractions; cf. 2.4.5). The ambiguity may be removed at a later stage (cf. 2.6.4).
No attempt is made to capture the full typography of the original text. Variation between upper and lower case is reproduced as in the original text. Use of typographical highlighting is marked where it is judged to be significant for the interpretation of the text.
Typographical highlighting is marked by a "rend" (=rendition) attribute, if it applies to a whole element: a paragraph or an s-unit, as in:
<p rend=italic>
<s rend=bold>
Where there is no applicable element, the tag <hi> is used:
I <hi rend=italic>hate</hi> it.
The TEI guidelines propose the tag <emph> for linguistically emphatic or stressed sections of the text. The TEI tag <hi> is preferred in the present corpus, to avoid some problems in identifying the purpose of typographical highlighting.
Where part of a text is highlighted typographically because it is identified as foreign, it is preferable to use the tagging presented in the next section (though the "rend" attribute can be used in addition).
Foreign words and expressions are marked by a "lang" attribute. This is simple if the foreign element carries a tag:
<head lang=fr>
<s lang=la>
Where there is no applicable element, the tag <foreign> is used:
He was tried <foreign lang=la>in absentia</foreign>
Some possible values of the "lang" attribute are:
de    German
en    English
es    Spanish
fr    French
gr    Greek
la    Latin
no    Norwegian
sv    Swedish
Foreign words and expressions are only marked where they are clearly recognizable as foreign (by being identifiable as separate units or being reproduced as typographically distinct from the surrounding text). The "lang" attribute can of course be used in the cases taken up next. Long passages in a foreign language are replaced by an <omit> tag; see 2.13.2.
Words and expressions which are mentioned rather than used are normally marked by italics or quotation marks. These are tagged <mentioned>, as in:
<mentioned rend=italic>She</mentioned> is a personal pronoun.
<mentioned lang=de>"Singen"</mentioned> is a strong verb.
The rendition is marked by an attribute and/or by retaining quotation marks.
Highlighted terms are tagged <term>, possibly accompanied by a <gloss>, as in:
Apical sounds are produced with the <term>apex</term> <gloss>'tip of the tongue'</gloss>.
Titles of books, newspapers, magazines, films, songs, paintings, etc. are tagged <title>, as in:
Have you read <title>Paradise Lost</title>?
Titles are only tagged if they are typographically highlighted in some way, e.g. by italic, bold or underscore.
Names of persons, ships, boats, buildings, etc. are tagged <name>, as in:
I went on board <name>Tumble</name> and set sail.
Names are only tagged if they are typographically highlighted in some way, e.g. by italic, bold or underscore. The "type" attribute is optional, and is usually not inserted at this stage.
Names of places, organizations, etc. are usually not tagged.
Quotations from extraneous sources are tagged <quote> if they comprise one or more complete s-units. Quotations within s-units are not tagged, but are usually surrounded by double quotation marks, as in:
The Apostle Paul said concerning some that "By good words and fair speeches they deceived the heart of the simple."
Foreign quotations are marked by a "lang" attribute. Long foreign quotations are omitted and replaced by an <omit> tag; see 2.13.2.
Direct speech in fiction is left unmarked and is simply shown by quotation marks. At a later stage direct speech may be tagged as in this example:
<q>"Let's go,"</q> she said.
Before this tagging, direct speech may not be identifiable, as it is not always indicated by quotation marks. Missing quotation marks can be inserted using the <add> tag; see 2.13.2.
All single quotation marks (') are converted to double quotation marks in direct speech and marked text (e.g. quotations within an s-unit).
<s>"I do n't know how he stays so thin."</s> <s>She used her "meeting voice".</s>
The single quotation mark (') is only used in contractions (She 's, y' enjoy) and to mark the genitive (next week's Sunday newspapers' review section). Quotations within quotations are tagged <qq>. This also applies to marked text within quotations or direct speech.
<p><s>"The finger got stuck inside his nose," Matilda said, "and he had to go around like that for a week.</s> <s>People kept saying to him, <qq>Stop picking your nose</qq>, and he could n't do anything about it.</s> <s>He looked an awful fool."</s></p>
<s>"Lately he 's discovered <qq>breakfast meetings</qq>.</s> <s>Now he gorges and guzzles all day.</s> <s>I do n't know how he stays so thin."</s>
The marking of foreign elements has already been dealt with (see 2.7.2). It may be essential to mark other linguistically distinct material, such as dialect words or idiosyncratic spellings. These are tagged <distinct>, with an attribute indicating the type of deviance. Examples:
<distinct type=nonstand>Mister Carlyle sure give it to yuh, he finds out!</distinct>
Why do we not treat <distinct type=nonceword>bunkraptcy</distinct> precisely as we treat bankruptcy?
The main value used for the "type" attribute in the present project is "nonstand", indicating deviance of different kinds: dialect, slang, idiosyncratic spelling, etc. If such features are pervasive in the text, this is noted in the header (under <notesStmt>), and each individual case is not marked.
Notes in the source text are tagged <note> and are inserted at the place in the text marked by the reference to the note. Attributes include "resp" and "place". Example:
<note resp=auth place=foot>Unless otherwise specified, all remarks about bilingualism apply as well to multilingualism, the practice of using alternately three or more languages.</note>
Values of the "resp" attribute used in the project are: auth (author), ed (editor), tr (translator), tag (tagger). References to notes are omitted. Notes are not counted as included in the text proper, and are not split into s-units. In special cases it may be desirable to omit notes. They are then replaced by an <omit> tag. See 2.13.2.
Lists which contain very little ordinary language text (e.g. lists of references) are omitted and replaced by an <omit> tag; see 2.13.2. Other lists are treated as paragraphs or sequences of paragraphs (the latter in case each list item is set out typographically as a paragraph). S-units are used for subdivision, as for ordinary paragraphs.
Figures, diagrams, and tables are left out and replaced by an <omit> tag. See 2.13.2.
Poems, songs, etc. that are embedded in a prose text are tagged <poem>. The internal structure is not specified. Verse lines are reproduced with a line break between each. There is a blank line between stanzas. Poems are included in the nearest s-unit. There is no internal division into s-units.
In some cases it may be preferable to leave out a poem and replace it by an <omit> tag. See 2.13.2.
Embedded texts in prose are simply reproduced as part of the main text. Ordinary paragraph and s-unit marking is used. Frequently they will be tagged as quotations; see 2.7.7.
The mechanisms for editorial comment are those recommended by the TEI guidelines for simple editorial changes.
Correction is marked as shown by this example:
... to render that service to poor <corr sic=poele resp=tag>people</corr>
Where it is apparent that there is a typographical error, the main text is corrected and the original reading is given as a value of a "sic" attribute. A "resp" attribute should be used to specify the person responsible for the correction (normally "tag" for "tagger"; cf. 2.9). The tag <sic> is used where there is no straightforward correction, but it is apparent that the text is inaccurate. A suggested correction may be given as a value of a "corr" attribute. A "resp" attribute should be used to specify the person responsible for the correction. Repeated wrong spelling of words throughout a text is noted in the <notesStmt>, and not tagged using the <corr> tag on each occasion. Beyond correction of obvious typographical errors, the language of the corpus texts is not normalized or regularized.
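Since the corrected form is in the text and the original reading is in the "sic" attribute, either reading can be recovered mechanically. The Python sketch below shows one way of doing this with regular expressions; the function names are hypothetical, and the sketch assumes that <corr> elements are not nested and that attribute values are either unquoted or enclosed in quotation marks.

```python
import re

CORR = re.compile(r"<corr\s+([^>]*)>(.*?)</corr>", re.DOTALL)
SIC = re.compile(r"""\bsic=(?:'([^']*)'|"([^"]*)"|([^\s>]+))""")


def original_reading(text: str) -> str:
    """Replace each corrected form by the source reading in its sic attribute."""
    def restore(match):
        attributes, corrected = match.group(1), match.group(2)
        sic = SIC.search(attributes)
        if sic is None:
            return corrected  # no original reading recorded
        return next(value for value in sic.groups() if value is not None)
    return CORR.sub(restore, text)


def corrected_reading(text: str) -> str:
    """Keep the corrected forms and drop the <corr> tags."""
    return re.sub(r"</?corr\b[^>]*>", "", text)
```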
Omission of passages in the text may be marked by an <omit> tag; see 2.4.1, 2.7.2, 2.7.7, 2.9, 2.10, 2.11, 2.12. The tag has the following attributes:
desc:    describing the omitted text
reason:  giving the reason for the omission
extent:  indicating the extent of the omission
resp:    specifying the person responsible for the omission
The "desc" and "resp" attributes should normally be used. Sample "desc" values include: table, figure, foreign text.
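The following Python sketch shows how an <omit> tag with these attributes might be assembled. It rests on assumptions not stated in this manual: that <omit> is written without an end-tag (like <pb>), and that attribute values containing spaces are enclosed in single quotation marks (as in the "corresp" example in the section on links between parallel texts). The function names and the sample "extent" value are hypothetical.

```python
def sgml_attribute(name: str, value: str) -> str:
    """Render one attribute, quoting the value only if it contains whitespace."""
    if any(ch.isspace() for ch in value):
        return f"{name}='{value}'"
    return f"{name}={value}"


def omit_tag(desc, resp="tag", reason=None, extent=None) -> str:
    """Build an <omit> tag; 'desc' and 'resp' are the attributes normally used."""
    attributes = [("desc", desc), ("reason", reason), ("extent", extent), ("resp", resp)]
    rendered = " ".join(
        sgml_attribute(name, value) for name, value in attributes if value is not None
    )
    return f"<omit {rendered}>"
```

For a table left out by the tagger this gives <omit desc=table resp=tag>.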
Addition and deletion in the main text are avoided, though they can be indicated by <add> and <del> tags. An example of the use of the <add> tag is the insertion of a missing quotation mark; cf. 2.7.7.
Special characters are encoded as entity references, e.g.
&scaron;    š
&pound;     £
&mdash;     —
Entity references specific to the project are listed in the project entity file (ENPC.ENT) (see Appendix 1). All others are found in one of the public entity sets that come with TEI P3, e.g. ISOpub.ENT.
NB! Accented and special characters used in Western European languages (de, en, fr, no) are not encoded as entity references at this stage. They are, therefore, system dependent.
Page breaks in the source text are kept to make it easier to refer back to the source. They are tagged <pb n= >, i.e. with the number as the value of an attribute. The placement of <pb> is normalized and is always given at the beginning of the relevant page. If there is a page break in the middle of a hyphenated word in the original text, <pb> is placed after the relevant word in the encoded text.
A reference system is built up using the identifiers of the text units. See 2.1 (text), 2.4.2 (division), 2.4.3 (paragraph), 2.4.4 (s-unit), 2.5 (heading).
Links between parallel texts are indicated by attributes of s-units, as shown in 2.4.4. Example:
<s id=DL2.1.s18 corresp='DL2T.1.s18 DL2T.1.s19'>At once, feeling her advantage, she said, "Do n't forget you 've been living soft for four years."</s>
<s id=DL2T.1.s18 corresp=DL2.1.s18>Hun hadde fått et lite overtak og fulgte det opp.</s>
<s id=DL2T.1.s19 corresp=DL2.1.s18>"Ikke glem at du har levd godt i fire år nå."</s>
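The links can be read back from the encoded texts with a few lines of code. The Python sketch below is illustrative only and makes two assumptions: that <s> start-tags can be located with a regular expression rather than a full SGML parser, and that links are expected to be reciprocal, as they are in the example above. The function names are hypothetical.

```python
import re

S_TAG = re.compile(r"<s\s+([^>]*)>")
ATTRIBUTE = re.compile(r"""(\w+)=(?:'([^']*)'|"([^"]*)"|([^\s>]+))""")


def read_links(sgml: str) -> dict:
    """Map each s-unit id to the list of ids named in its corresp attribute."""
    links = {}
    for tag in S_TAG.finditer(sgml):
        attributes = {}
        for name, single, double, bare in ATTRIBUTE.findall(tag.group(1)):
            attributes[name] = single or double or bare
        if "id" in attributes:
            links[attributes["id"]] = attributes.get("corresp", "").split()
    return links


def asymmetric_links(original: str, translation: str) -> list:
    """Return (source, target) pairs whose target does not point back to the source."""
    links = {**read_links(original), **read_links(translation)}
    return [
        (source, target)
        for source, targets in links.items()
        for target in targets
        if source not in links.get(target, [])
    ]
```

Applied to the three s-units above, read_links yields one English unit linked to two Norwegian units, and asymmetric_links returns an empty list because every link is answered by a link in the opposite direction.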
In the earlier stages of the project there will be no linguistic annotation, with a few exceptions; see 2.4.5.