Re: Corpora: CLAWS sentence enumeration

Paul Rayson (paul@comp.lancs.ac.uk)
Tue, 28 Jul 1998 15:40:40 +0100 (BST)

Kristine,

> Particularly, we wonder about the 'c="0000037 002"' component. For
> instance, are we correct in assuming that 002 refers to the first sentence
> in a turn? If so, how are the following sentences within the same turn
> numbered? And what about the '0000037' part?

This looks like the reference number inserted by CLAWS at the start of each
line in its vertical output format. There is a reference number for each word
in this format. The first part '0000037' refers to the line number in the
untagged input file, and the second part increments by 1 for punctuation and by
10 for other items.
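
If it helps when processing the files, the two parts of the reference can be
split apart mechanically. The following is only a rough Python sketch and not
part of CLAWS itself: the names parse_ref and ClawsRef are hypothetical, and it
assumes the two numeric fields are separated by whitespace, as in your example.

```python
import re
from typing import NamedTuple


class ClawsRef(NamedTuple):
    """Hypothetical container for the two parts of a reference number."""
    line: int      # line number in the untagged input file
    position: int  # counter for the item within that line


# Assumption: two runs of digits separated by whitespace, e.g. "0000037 002".
REF_PATTERN = re.compile(r"^(\d+)\s+(\d+)$")


def parse_ref(ref: str) -> ClawsRef:
    """Split a reference such as '0000037 002' into its two numeric parts."""
    match = REF_PATTERN.match(ref.strip())
    if match is None:
        raise ValueError(f"not a recognisable reference number: {ref!r}")
    return ClawsRef(line=int(match.group(1)), position=int(match.group(2)))
```

With your example, parse_ref("0000037 002") gives line 37 and position 2.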

> We also wonder about the numbers in the overlap tags (<ptr t= >). As we
> understand it, the example above is an illustration of correct enumeration
> (the reason we are asking this, is that we have seen instances where the
> numbering is different).

I assume these follow the same rules as in the encoding scheme for the British
National Corpus.

> We have tried to get this information both from Lancaster and the internet,

Who did you contact at Lancaster?

Regards,
Paul.