PoW Corpus Manual
A reformatted PoW corpus manual will be available from
A Short Handbook to the Polytechnic of Wales Corpus
Centre for Computer Analysis of Language and Speech (CCALAS)
School of Computer Studies
University of Leeds
Leeds LS2 9JT
Tel: (0113) 233 5460
This booklet is intended to accompany the machine readable version of the
Polytechnic of Wales (PoW) Corpus, to be distributed through the International
Computer Archive of Modern English (ICAME) at Bergen. It aims to introduce the
reader to the corpus notation and format, and to list the systemic functional
grammar codes which have been used in the hand parsing of the corpus (Appendix
1). A very brief description of the corpus was supplied to the Lancaster
Preliminary Survey of Machine-Readable Language Corpora [Taylor and Leech 89]
and this is also included as Appendix 2. Papers related to the compilation of
the corpus, and its subsequent use for computational linguistic research in
the COMMUNAL project at Leeds are given in the references section. Queries
regarding the corpus and the grammar can be addressed to the author, or to Dr
Robin Fawcett, one of its original compilers, at The Computational Linguistics
Unit, SESJP, Aberconway Building, University of Wales College of Cardiff,
Cardiff CF1 3XA, Wales, UK. (firstname.lastname@example.org). Any
suggestions for additions or improvements to this handbook are most welcome,
and should be addressed to the author in Leeds.
The PoW corpus is distributed from two places:
The Oxford Text Archive (http://ota.ahds.ac.uk) organised by Lou Burnard.
ICAME in Bergen, Norway (email@example.com) organised by Knut Hofland.
The corpus was originally collected between 1978 and 1984 for a child language
development project to study the use of various syntactico-semantic constructs
in children between the ages of six and twelve. A sample of approximately 120
children in this age range from the Pontypridd area in South Wales was
selected, and divided into four cohorts of 30, each within three months of the
ages 6, 8, 10, and 12. These cohorts were subdivided by sex (B,G) and
socio-economic class (A,B,C,D). The latter was achieved using details of
i) `highest' occupation of both the parents of the child, or one in
ii) educational level of the parents.
The children were selected in order to minimise any Welsh or other second
language influence. The above subdivision resulted in small homogeneous cells
of three children. Recordings were made of a play session with a Lego brick
building task for each cell, and of an individual interview with the same
"friendly" adult for each child, in which the child's favourite games or TV
programmes were discussed.
The first 10 minutes of each play session, commencing at the point where normal
peer group interaction began (ie: when the microphone was ignored), were
transcribed by 15 trained transcribers. The interviews were treated likewise.
Transcription conventions were adopted from those used in the Survey of Modern
English Usage at University College London, and a similar project at Bristol.
Intonation contours were added by a phonetician to produce a hard copy
version, and the resulting transcripts published in four volumes [Fawcett and
Perkins 80]. A short report on the project was also published [Fawcett 80].
Ten trained analysts were then employed to manually parse the transcribed
texts, using Fawcett's version of Systemic-Functional Grammar (SFG), the main
architect of which is Michael Halliday. The SFG used in the analysis handles
phenomena such as raising, dummy subject clauses and ellipsis. Despite
thorough checking, some inconsistencies remain in the text owing to several
people working on different parts of the corpus. The grammar used in this hand
parsing process is described in more detail below. The parsed version is
available in machine readable form but does not contain any prosodic
Availability and Conditions
The resulting parsed corpus consists of approximately 65,000 words
in 11,396 (sometimes very long) lines, each containing a parse tree. The
corpus of parse trees fills 1.1 Mb. There are 184 files, each with a reference
header which identifies the age, sex and social class of the child, and
whether the text is from a play session or an interview. The corpus is also
available in wrap-round form with a maximum line length of 80 characters,
where one parse tree may take up several lines. The four-volume transcripts
can be supplied by the British Library Inter-Library Loans System.
NB: Earlier papers quote the size of the corpus as being approximately 100,000
words. The latest automatic extraction of a wordlist from the machine readable
corpus shows it to be just over 65,000 words, but this figure can only be
approximate. Noise in the original typing of the corpus in the form of
omissions of category labels, or of the spaces between such labels and the
words in the text, makes it difficult to give an accurate figure. The
difference between the two totals is almost certainly the difference between
the total for the recorded spoken texts, and the total for those which have
The following conditions apply to the distribution of the Polytechnic of
Wales Corpus from ICAME:
a) The original source of the corpus should be mentioned in any documents
published which derive from the data in the corpus in any way, and copies of
such documents should be sent to ICAME and Dr Robin Fawcett at the address
given in the introduction.
b) The corpus is made available to specialist scholars for scientific
linguistic research purposes only, and is not to be used for commercial
purposes without the prior agreement of Dr Fawcett.
c) The corpus will not be further distributed or reproduced in part or whole
for any purpose other than scholarly research, and will only be supplied to
a third party with the prior written permission of ICAME.
d) If these conditions are not complied with, any tape(s) of the corpus
(including backup copies) must be returned to ICAME at the Norwegian Computing
Centre for the Humanities, Bergen, Norway.
Systemic-Functional Grammar Categories
The grammatical theory on which the manual parsing is based is Robin Fawcett's
development of a Hallidayan Systemic-Functional Grammar, described informally
but in detail in [Fawcett 81]. The grammar is traditionally formalised in a
system network of semantic choices (systems), and a set of realisation rules
to be used in natural language generation.
From the point of view of natural language analysis, grammars formalised for
parsing can be extracted from the corpus automatically in the form of phrase
structure rules or a recursive transition network [Atwell and Souter 88,
Souter 89a, 89b].
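To make the idea of extracting phrase structure rules concrete, the following
Python sketch collects mother-daughter rules from a single tree. It is an
illustration only and is not the program suite described in the cited papers.
The nested (label, daughters) representation, the function names and the choice
to count lexical rules alongside syntactic ones are all assumptions made for
this example. The tree is a hand-built version of sentence 13 in Figure 1.

```python
from collections import Counter


def daughter_label(daughter):
    """A daughter is either a word (str) or a (label, daughters) pair."""
    return daughter if isinstance(daughter, str) else daughter[0]


def extract_rules(tree, counts=None):
    """Count every rule 'mother -> daughters' found in one tree."""
    if counts is None:
        counts = Counter()
    label, daughters = tree
    counts[(label, tuple(daughter_label(d) for d in daughters))] += 1
    for d in daughters:
        if not isinstance(d, str):      # recurse into non-lexical daughters
            extract_rules(d, counts)
    return counts


# Hand-built tree for "YEAH I GOT KERPLUNK" (sentence 13 in Figure 1).
SENTENCE_13 = (
    "Z", [
        ("CL", [("F", ["YEAH"])]),
        ("CL", [
            ("S", [("NGP", [("HP", ["I"])])]),
            ("M", ["GOT"]),
            ("C", [("NGP", [("HN", ["KERPLUNK"])])]),
        ]),
    ],
)

if __name__ == "__main__":
    for (mother, daughters), n in extract_rules(SENTENCE_13).items():
        print(mother, "->", " ".join(daughters), "x", n)
```

Summing such counters over every tree in the corpus would give rule
frequencies. Because the trees are flat below the clause, the set of distinct
clause-level rules obtained in this way can be expected to be large.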
The terminology of SFG is quite complicated at first sight, but I will attempt
to introduce it clearly below. A syntax tree is characterised by having two
alternating types of category labels. The first are called elements of
structure, such as Subject (S), Complement (C), Adjunct (A), head (h),
modifier (mo) and qualifier (q). Note that, in a hand-analysis, capital
letters are used for elements of clause structure, and lower case letters for
elements of group (and cluster) structure. In the machine-readable version of
the corpus, capitals are used throughout. Elements of structure are typically
filled by the second type of category, ie: units. Elements of clause structure
are filled by subordinate clauses, by groups (cf phrases in TG or GPSG)
such as nominal group (ngp), prepositional group (pgp) and quantity-quality
group (qqgp), or by clusters such as genitive cluster (gc). Terminal elements of
structure are expounded by lexical items. The top-level symbol is Z (sigma)
and is invariably filled by one or more clauses (Cl). Trees tend to be fairly
flat, but richly labelled, immediately below the clause level, notably because
of the absence of a Predicate or Verb Phrase constituent. This has a direct
effect on the size and shape of the formal grammar which can be extracted from
the parsed corpus. Some areas have a very elaborate description, eg: there are
15 types of adjuncts, six types of modifiers, nine different determiners, and
ten auxiliaries. Other categories are relatively simple, eg: main-verb (M),
head (h), and apex (ax). (The apex occurs in a quantity-quality group, and is
typically expounded by an adverb or adjective). A list of all the categories
used in the parsing of the corpus is given in Appendix 1, with details of
whether the symbol is used as a non-terminal or terminal category, and some
example lexical items which expound the terminal categories.
The tree notation employs numbers rather than the more traditional bracketed
form to define mother-daughter relationships, in order to capture
discontinuous units. The number directly preceding a group of symbols refers
to their mother. The mother is itself found immediately preceding the first
occurrence of that number in the tree. So, in the example section of a corpus
file given below in Figure 1, the first tree shows a sentence (Z) consisting
of two daughter clauses (Cl), as each clause is preceded by the number "1",
and the Z-symbol is found immediately before the first occurrence of the
number "1". The long lines have been folded manually for ease of reading. The
first number in each tree is a sentence reference, and I have edited the file
below to show these with a right bracket ")" symbol, which does not appear in
the actual corpus. I also include below (Figure 2) a few hand-drawn syntax
trees which correspond to sentences from Figure 1. All alphabetic characters
are in upper case. The only lower case alphabetical characters are in the
sentence references, which have occasionally been subdivided into 24a, 24b
etc, where what was initially analysed as one sentence was, on checking,
reanalysed as two (or more).
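The numbering scheme described above can be sketched in Python as follows. This
is a minimal reading of the notation, not a tool supplied with the corpus, and
it rests on several assumptions: symbols that follow one another without an
intervening number are taken to form a chain (each one filled or expounded by
the next), square-bracketed annotations are skipped, no word in the text
consists solely of digits, and the sentence reference appears in the "1)" form
used in Figure 1. The names Node, parse_pow_tree and to_brackets are
hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                  # category symbol or word
    children: list = field(default_factory=list)


def parse_pow_tree(line):
    """Build a tree from one (unfolded) line of numbered PoW notation."""
    tokens = line.split()
    # Drop a sentence reference such as "1)" or "24a)" if one is present.
    if tokens and re.fullmatch(r"\d+[a-z]*\)", tokens[0]):
        tokens = tokens[1:]

    mothers = {}      # number -> the node that number refers to
    root = None
    attach_to = None  # node the next symbol becomes a daughter of
    last = None       # most recently read symbol

    for tok in tokens:
        if tok.startswith("["):
            continue  # assumption: ignore [NV:..], [FS:..], [RP:..] etc.
        if tok.isdigit():
            if tok not in mothers:
                if last is None:
                    raise ValueError("number with no preceding symbol: " + tok)
                mothers[tok] = last   # first occurrence: mother precedes it
            attach_to = mothers[tok]
            continue
        node = Node(tok)
        if attach_to is None:
            root = node
        else:
            attach_to.children.append(node)
        attach_to = node              # following symbol hangs below this one
        last = node
    return root


def to_brackets(node):
    """Render a tree in conventional bracketed form."""
    if not node.children:
        return node.label
    inner = " ".join(to_brackets(c) for c in node.children)
    return "(" + node.label + " " + inner + ")"


if __name__ == "__main__":
    line = "13) Z 1 CL F YEAH 1 CL 2 S NGP HP I 2 M GOT 2 C NGP HN KERPLUNK"
    print(to_brackets(parse_pow_tree(line)))
```

Note that the bracketed rendering groups daughters under their mother, so for a
discontinuous unit it no longer reflects the original word order. This is the
information the numbered notation is designed to preserve.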
Occasionally when the correct analysis for a structure is uncertain, the one
given is followed by a question mark. Cases where unclear recordings have made
word identification difficult are treated similarly. Apart from the
grammatical categories and the words themselves, the only other symbols in the
tree are three types of bracketing:
i) square [NV...], [UN...], [RP...], [FS...], for non-verbal,
unclear/unfinished, repetition, false start, etc.
ii) round (...) for ellipsis of items recoverable from previous text.
iii) angle <...> for ellipsis of items not so recoverable, eg: in rapid
Filenames indicate precisely which age (6,8,10,12), social class (A,B,C,D),
sex (B,G) and recording situation (play-session (PS) or interview (I)) is
involved, followed by the child's initials. Hence, the text sample below is
from file 6ABICJ, involving a six year old, of social class A, of male sex, in
an interview, with initials CJ.
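The filename convention can be decoded mechanically. The sketch below assumes
that the components always appear in the order shown in the example (age,
social class, sex, recording situation, initials), that the name carries no
file extension, and that initials consist only of letters. The function name
and the returned dictionary keys are hypothetical.

```python
import re

# age, social class, sex, recording situation, child's initials
FILENAME_RE = re.compile(r"^(6|8|10|12)([ABCD])([BG])(PS|I)([A-Z]+)$")


def parse_pow_filename(name):
    """Split a PoW filename such as '6ABICJ' into its coded components."""
    match = FILENAME_RE.match(name.upper())
    if match is None:
        raise ValueError("not a recognised PoW filename: " + name)
    age, social_class, sex, situation, initials = match.groups()
    return {
        "age": int(age),
        "social_class": social_class,
        "sex": sex,              # B or G, as coded in the corpus
        "situation": situation,  # PS = play session, I = interview
        "initials": initials,
    }


if __name__ == "__main__":
    print(parse_pow_filename("6ABICJ"))
```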
Figure 1: A Sample Section of a PoW Corpus File
**** 58 1 1 1 0 59
1) [FS:Y...] Z 1 CL F YEAH 1 CL 2 S NGP 3 DD THAT 3 HP ONE 2 OM 'S 2 C NGP 4
DQ A 4 H RACING-CAR
2) Z CL 1 S NGP 2 DD THAT 2 HP ONE 1 OM 'S 1 C NGP 3 DQ A 3 MO QQGP AX LITTLE 3
3) [HZ:WELL] Z 1 CL 2 S NGP HP I [RP:I] 2 AI JUST 2 HAD 2 C NGP 3 DQ A 3 MO
QQGP AX LITTLE 3 H THINK 1 CL 4 & THEN 4 S NGP HP I 4 M THOUGHT 4 C CL 5 BM OF
5 M MAKING 5 C NGP 6 DD THIS 6 HP ONE
4) Z 1 CL 2 S NGP HP I 2 AI JUST 2 M FINISHED 2 C NGP 3 DD THAT 3 HP ONE 1 CL 4
& AND 4 S NGP HN FRANCIS 4 M HAD 4 C NGP 5 DD THE 5 H IDEA 5 Q CL 6 BM OF 6 M
MAKING 6 C NGP 7 DQ A 7 RACING-CAR
5) [FS:THEN-I] Z CL 1 & THO 1 S NGP HP I 1 M MADE 1 C NGP DD THIS
6) Z CL 1 & THEN 1 S NGP HP FRANCIS 1 OX WAS 1 AI JUST 1 X GOING-TO 1 M MAKE 1
C NGP HP ONE 1 A CL 2 B WHEN 2 S NGP H YOU 2 M CAME 2 CM QQGP AX BACK 2 CM
QQGP AX IN
7) [NV:MM] Z 1 CL F NO [FS:FRAN...] 1 CL 2 S NGP HP WE 2 M HAD 2 C NGP 3 DQ AN
3 H IDEA 3 Q CL 4 BM OF 4 M MAKING 4 C NGP 5 DQ FOUR 5 H THINGS
8) Z 1 CL F YEAH 1 CL 2 S NGP HP I 2 M PLAYED 2 C PGP 3 P WITH 3 CV NGP HP IT 2
A PGP 4 P AT 4 CV NGP H HOME
9) Z CL F YEAH
10) [FS:I] [FS:I] Z 1 CL F NO 1 CL 2 S NGP HP I 2 OX 'VE 2 AI JUST 2 M GOT 2 C
NGP 3 DQ ONE 3 MO QQGP AX BIG 3 H TIN [FS:OF?] 3 Q QQGP 4 AX FULL 4 SC PGP 5 P
OF 5 CV NGP HP IT
11) [NV:ER] Z CL 1 (S) 1 (M) 1 C NGP 2 DQ NGP 3 DQ ALL 3 H SORTS 2 VO OF 2 H
12) Z 1 CL 2 S NGP HP I 2 M MAKE 2 C NGP H CARS 2 A QQGP AX ALWAYS 1 CL 3 &
AND 3 A SOMETIMES 3 S NGP HP I 3 M MAKE 3 C NGP H HOUSES 1 CLUN & AND
13) Z 1 CL F YEAH 1 CL 2 S NGP HP I 2 M GOT 2 C NGP HN KERPLUNK
14) [FS:IT] [FS:IT] [NV:UM] Z 1 CL 2 S NGP HP YOU 2 M PUT 2 C NGP H STRAWS 2 C
PGP 3 PM INTA 3 CV NGP 4 DQ A [RP:A] [RP:A] 4 MOTH NGP H GLASS 4 H TUB 4 Q PGP
5 P WITH 5 CV NGP 6 H HOLES 6 Q PGP 7 P IN 7 (CV) 1 CL 8 & THEN 8 S NGP HP YOU
8 M PUT 8 C NGP 9 DD THE 9 H STRAWS 8 C PGP 10 PM IN 10 CV NGP 11 DD THE 11 H
HOLES 1 CL 12 & THEN 12 S NGP HP YOU 12 M PUT 12 C NGP 13 DD THE 13 H MARBLES
12 CM QQGP AX DOWN 1 CL 14 & AND 14 (S) 14 M PULL 14 C NGP 15 DQ A 15 H STRAW
14 CM QQGP AX OUT 14 A CL 16 I TO 16 M SEE 16 C CL 17 B IF 17 S NGP 18 DQ A 18
H MARBLE 17 M GOES 17 C PGP 19 P INTO 19 CV NGP 20 DQ A 20 H POINT
15) Z CL 1 S NGP HP I 1 ON DUN 1 M NO 1 (C)
16) [NV:ER] [NV:ER] Z CL 1 S NGP HP I 1 M PLAY 1 C PGP 2 P WITH 2 CV NGP 3 DD
MY 3 H BIKE
17) Z 1 CL 2 S NGP HP I 2 M PLAY 2 C PGP 3 P WITH [FS:MY-CHIP] 3 CV NGP 4 DD
MY [RP:MY] 4 MO QQGP AX BIG 4 H TIPPER-LORRY 1 CL 5 & AND 5 S NGP HP I [RP:I]
5 M CALL 5 C PGP 6 PM FOR 6 CV NGP HN DAVID
18) Z 1 CL F YEAH [FS:HE'S-ONE-MY] [FS:HE'S-ROUND] 1 CL 2 S NGP HP HE 2 OM 'S
2 C PGP 3 P IN 3 CV NGP 4 DD MY 4 H CLASS
19) [NV:OH] [FS:WE-JUST] Z CL 1 S NGP HP WE 1 M PLAY 1 C PGP 2 P AT 2 CV 3 NGP
H FOOTBALL [HZ:AND-STUFF] 3 NGP 4 & AND 4 H CRICKET
20) [NV:ER] [FS:WE-PLAY-S...] Z CL 1 S NGP HP WE 1 M PLAY 1 C 2 NGP H FIREMEN
2 NGP 3 & AND 3 H POLICE
I would like to thank Robin Fawcett (Cardiff) for his kind help in proof
reading this document, and Tim O'Donoghue (Leeds) for his assistance in
producing parse trees for the paper version.
References
Atwell, Eric Steven and Clive Souter, (1988) Experiments with a very large
To appear in Proceedings of the 15th International Conference on Literary and
Linguistic Computing (ALLC). Jerusalem, June 5-9 1988.
Atwell, Eric Steven, Clive Souter and Tim O'Donoghue, (1988) Prototype
Parser 1. COMMUNAL Report No. 17, CCALAS, School of Computer Studies,
Fawcett, Robin P., (1980) Language Development in Children 6-12: Interim
Linguistics 18 pp 953-958.
Fawcett, Robin P., (1981) Some Proposals for Systemic Syntax.
Department of Behavioural and Communication Studies, Polytechnic of Wales.
Fawcett, Robin P. and Michael R. Perkins, (1980) Child Language Transcripts
With a preface, in 4 volumes. Department of Behavioural and Communication
Studies, Polytechnic of Wales.
Fawcett, Robin P., (1988) A note on the relationship between the syntactic
categories used in (1) the analysis of the Polytechnic of Wales Corpus and (2)
generation and analysis in the COMMUNAL project. (personal communication)
Souter, Clive, (1989a) The COMMUNAL Project: Extracting a grammar from
the Polytechnic of Wales Corpus. ICAME Journal No. 13, April 1989,
pp 20-27. Norwegian Computing Centre for the Humanities, Bergen University.
Souter, Clive, (1989b) Systemic-Functional Grammars and Corpora.
Research Report 89.12, School of Computer Studies, University of Leeds. To
appear in a forthcoming volume of Aarts and Meijs (eds), "Corpus Linguistics"
Souter, Clive and Eric Atwell, (1988a) Constraints on Legal Syntactic
Configurations. COMMUNAL Report No. 14, CCALAS, School of Computer
Studies, Leeds University.
Souter, Clive and Eric Atwell, (1988b) Morphological Analysis. COMMUNAL
Report No. 16, CCALAS, School of Computer Studies, Leeds University.
Taylor, Lita and Geoffrey Leech, (1989) Lancaster Preliminary Survey of
Machine-Readable Corpora. ICAME, The Norwegian Computing Centre for the
Humanities, P.O. Box 53, Universitet, N-5027 Bergen, Norway.
Appendix 1: Systemic-Functional Grammar categories in the PoW Corpus
(The % symbol is used here as a field separator for the table.
NT means non-terminal category, T means terminal)
Name of Category%Symbol in PoW%NT/T%Examples (for Terminals)
TEXT AND SENTENCE
Sentence%Z (for sigma)%NT%-
Adjunct (= Experiential Adjunct)%A%NT/T%really, mostly
Discourse organizational Adjunct%Ad%NT/T%first-of-all, anyway
Feedback-seeking Adjunct%Af%NT/T%look, right, you know
Inferential Adjunct%Ai%NT/T%just, only
Logical Adjunct%Al%NT/T%really, though, as well
Replacement Logical Adjunct%Alrepl%NT%-
Modal Adjunct%Am%NT/T%maybe, probably
Metalingual Adjunct%Aml%NT/T%say, I mean
Negative Adjunct%An%NT/T%never, neither
Politeness Adjunct%Ap%NT/T%there, please
Tag Adjunct%Atg%NT/T%is it, isn't it
Wh-Adjunct%Awh%NT/T%how, when, where, why
Binder%B%NT/T%because, cos, if, so, when
Main-verb-completing Complement%Cm%NT/T%across, in, on, up
Formula%F%NT/T%alright, yes, no, pardon, what
Main verb%M%T%builds, kicked, went
Operator%O%T%did, does, do, let's
Modal Operator%Om%T%'ll, 'd, 'm, are, can, could, is
Negative Modal Operator%Omn%T%can't, couldn't, isn't, won't
Negative Operator%On%T%didn't, doesn't, don't
Auxiliary Operator%OX%T%'m, 're, 've, have, was
Negative Auxiliary Operator%OXn%T%haven't, wasn't
Dummy it Subject%Sit%T%it
Dummy there Subject%Sth%T%there
Auxiliary%X%T%be, going to, have, used
Modal/Necessity Auxiliary%Xm%T%better, got to, have to
Negative Modal Auxiliary%Xmn%T%mustn't
Negative Auxiliary%Xn%T%don't, hadn't, haven't
unfinished nominal group%ngpun%NT%-
deictic determiner (also in qqgp)%dd%NT/T%the, this, that, her, my
wh-deictic determiner%ddwh%T%what, which
quantifying determiner (also in qqgp)%dq%NT/T%a, an, one, four, any, all
negative quantifying determiner%dqn%NT/T%no, none
wh-quantifying determiner%dqwh%T%how many, how much
ordinative determiner%do%NT/T%first, sixth, last
modifier (= experiential modifier)%mo%NT%-
comparison modifier%moc%NT/T%other, else, same, different
quantifying modifier%moq%NT/T%five, only, ten
thing modifier%moth%NT/T%plastic, square, table
head (i.e. 'common noun')%h%T%brick, books, men
('proper') name head%hn%T%America, Alf, Barry-Island, Batman
pronoun head%hp%T%anything, he, her, him, I, it
negative pronoun head%hpn%T%no-one, nobody, nothing
situation head%hsit%NT/T%painting, reading
wh-pronoun head%hwh%T%what, which, who
unfinished prepositional group%pgpun%NT%-
preposition%p%NT/T%on, in, up, under
Main-verb-completing preposition%pm%T%about, after, at, for, into
unfinished quantity-quality group%qqgpun%NT%-
temperer (also in pgp)%t%NT/T%a bit, about, all, over, very
apex%ax%NT/T%always, away, back, big, black
tempering apex%axt%T%biggest, better, higher, smaller
wh-apex%axwh%NT/T%how, where, why, when
finisher%fi%NT/T%of all, together
ELEMENTS OCCURRING IN MORE THAN ONE UNIT NOT SPECIFIED ABOVE
Linker%&%T%and, and then, but, or, so, then
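Since the table above uses % as a field separator, its rows can be read with a
few lines of Python. The sketch below is illustrative only: the function name
and the dictionary keys are hypothetical, and it assumes that any line which
does not split into exactly four fields is a section heading. The column
header line also has four fields, so the caller should skip it.

```python
def parse_category_line(line):
    """Parse one %-separated row of the Appendix 1 table.

    Returns None for lines that do not have exactly four fields
    (assumed here to be section headings).
    """
    fields = [f.strip() for f in line.split("%")]
    if len(fields) != 4:
        return None
    name, symbol, kind, examples = fields
    kinds = kind.split("/")
    return {
        "name": name,
        "symbol": symbol,
        "non_terminal": "NT" in kinds,
        "terminal": "T" in kinds,
        # "-" marks a category with no example lexical items
        "examples": [] if examples == "-"
                    else [e.strip() for e in examples.split(",")],
    }


if __name__ == "__main__":
    print(parse_category_line("Main verb%M%T%builds, kicked, went"))
    print(parse_category_line("TEXT AND SENTENCE"))
```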
Appendix 2: Brief Description of the Corpus
Date of Compilation: 1978-84
Location: Polytechnic of Wales, Pontypridd, S. Wales.
Compiled by: Dr. Robin P. Fawcett and Dr. Michael R. Perkins
Type of Data:
Spoken corpus, recordings transcribed using conventions from SMEU at UCL,
and those of a similar project at Bristol, with pitch movements marked by
Fully hand parsed, using a Systemic Functional Grammar developed by
Fawcett, with rich syntactico-semantic categories, capable of
handling raising, dummy subject clauses, ellipsis, replacement strings.
Parse trees stored in a numerical format (not standard bracketed) to
capture discontinuities in syntactic structures.
Children's English from Pontypridd, S.Wales. Informal register.
The subjects were screened to exclude those with strong second language
influence (Welsh or otherwise). 120 children aged between 6 and 12 (all within 3
months either side of their 6th, 8th, 10th or 12th birthday), divided equally
according to sex, age, and socio-economic class established by profession and
highest educational level of parents. Small cells of 3 children were recorded
at play with Lego bricks, and each child also interviewed by the same
`friendly' adult on his/her favourite games and TV programmes.
65,000 words approximately, in 11,396 lines. 1 parsed sentence per line,
hence some very long lines. (also available in 80 chars wrap round format)
1.1 Mb. storage.
194 files, each with a reference to age, social class, sex, play session or
interview, and child's initials. (each file is a sample of a single child's
speech in a play session or an interview).
Only the parsed corpus is available in machine readable form; the recorded
tapes and 4-volume transcripts with intonation contours are available in hard
copy from the British Library Inter-Library Loans System. Original recordings
are available from:
Dr Robin Fawcett,
Computational Linguistics Unit,
University of Wales College of Cardiff.
DAT versions of the original recordings are also being used at Leeds,
Sheffield and Reading Universities.
Original reason for collection:
Psycholinguistic research into development of children's English between ages
of 6 and 12, investigating the growing use of a variety of syntactico-semantic
Current research (1987-9):
COMMUNAL project; Natural Language Processing at UWCC and Leeds University
Extracting machine-readable systemic functional grammars and lexicons for use
in parsing. Suites of programs developed to achieve this, including converting
the corpus into bracketed form. The grammar used for the hand parsing in the
corpus was not formalised in terms of phrase-structure rules, or RTNs, but in
system networks of semantic/functional features and their realisation rules
more suitable for NL generation than parsing.