Re: Corpora: Corpora and XML

Lou Burnard (lou.burnard@computing-services.oxford.ac.uk)
Thu, 30 Sep 1999 10:26:41 +0100 (BST)

On Wed, 29 Sep 1999, Ted E. Dunning wrote:

|
|
|I beg to differ.
|
|I believe that this is a valid SGML document (given an appropriate DTD
|which I don't provide, but can be surmised).

I also beg to differ! Without a DTD this is *not* a valid SGML
document and should not be thought of as such. This is precisely
because in the absence of a DTD it's impossible to tell whether <page>
in your example is an empty tag (in which case the document is valid),
or not (in which case it is not).

|I have provided
|end-markers in order to be very explicit about the non-nested nature
|of concurrent markup.

Ah! Actually that makes the document invalid SGML anyway -- concurrent
markup requires a different syntax from that used here. You can't just
bung tags from discrete hierarchies together into the same document
and expect the parser to sort them out for you, alas.

There is exhaustive discussion of the much vaunted SGML "multiple
hierarchy problem" in the TEI Guidelines chapter on non-hierarchic
structures, available at http://www.hcu.ox.ac.uk/TEI/P4beta/NH.htm

All the solutions discussed there are equally applicable to XML,
mutatis mutandis, except use of CONCUR which is explicitly disallowed
by XML: some of them have even been implemented (e.g. Henry Thompson's
work on "standoff markup" uses the pointer solution).

|
| <book>
| <frontspiece>Who and when</frontspiece>
| <page n=1>
| <p>
| This is the first paragraph on the first page.
| </p>
| <p>
| The second paragraph extends
| </page>
| <page n=2>
| onto the second page.
| </p>
| <p>
| The third paragraph is entirely on the second page.
| </p>
| </page>
| <endmatter>Index and stuff goes here</endmatter>
| </book>
|
|Here we have concurrent markup which has two concurrent, well-nested
|structures (shown here in XML form):
|
| <book><frontspiece/><page/><page/><endmatter/></book>
|
|and
|
| <book><p/><p/><p/></book>

Err, I'm confused here! Why are you using empty tags?

|
|Now, my SGML is a bit rusty (it started out that way, I should hasten
|to add), but I am pretty sure that this structure can be represented
|in SGML and cannot be directly converted to XML.

|What you *can* do is
|cheat by using empty elements to mark page boundaries. By doing this,
|you lose the syntactic guarantees about where pages must start and
|end.

If you are constrained to only one hierarchy then any other hierarchy
can only be represented at the points where its boundaries
("milestones" in TEIspeak) are visible in that hierarchy. I don't see
why using empty elements should be regarded as cheating.
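
As a rough illustration of the single-hierarchy constraint, here is a
minimal Python sketch using only the standard library's
xml.etree.ElementTree. It assumes a cut-down version of the example
with quoted attribute values, and the milestone element name `pb` and
the helper `is_well_formed` are hypothetical names chosen for this
sketch. It checks well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET

# Overlapping markup: </page> arrives while <p> is still open,
# so the two hierarchies cross each other.
OVERLAPPING = """<book>
<page n="1">
<p>The second paragraph extends
</page>
<page n="2">
onto the second page.
</p>
</page>
</book>"""

# The same text with empty "milestone" elements marking page boundaries.
# The element name pb is an assumption made for this illustration.
MILESTONES = """<book>
<pb n="1"/>
<p>The second paragraph extends
<pb n="2"/>
onto the second page.
</p>
</book>"""


def is_well_formed(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print("overlapping:", is_well_formed(OVERLAPPING))
print("milestones: ", is_well_formed(MILESTONES))
```

The parser rejects the overlapping form at the first crossing end-tag
and accepts the milestone form, where only the paragraph hierarchy is
expressed by nesting.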

Here's one way of doing your example in XML syntax using ID/IDREF
linkage as a means of providing the "syntactic guarantees" you asked
for:

|
| <book>
| <frontspiece>Who and when</frontspiece>
| <pagestart id="p1" end="ep1"/>
| <p>
| This is the first paragraph on the first page.
| </p>
| <p>
| The second paragraph extends
| <pageend id="ep1" start="p1"/>
| <pagestart id="p2" end="ep2"/>
| onto the second page.
| </p>
| <p>
| The third paragraph is entirely on the second page.
| </p>
| <pageend id="ep2" start="p2"/>
| <endmatter>Index and stuff goes here</endmatter>
| </book>
|
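
The linkage in the example above could be checked by an application
along the following lines. This is a minimal Python sketch, not a
definitive implementation: it uses the standard library's
xml.etree.ElementTree, and the helper names `check_links`, `walk` and
`page_texts` are hypothetical. Note the assumption involved: a
validating parser with these attributes declared as ID and IDREF would
only confirm that each reference matches some ID in the document, so
the check that start and end point back at each other is done here in
application code.

```python
import xml.etree.ElementTree as ET

DOC = """<book>
<frontspiece>Who and when</frontspiece>
<pagestart id="p1" end="ep1"/>
<p>
This is the first paragraph on the first page.
</p>
<p>
The second paragraph extends
<pageend id="ep1" start="p1"/>
<pagestart id="p2" end="ep2"/>
onto the second page.
</p>
<p>
The third paragraph is entirely on the second page.
</p>
<pageend id="ep2" start="p2"/>
<endmatter>Index and stuff goes here</endmatter>
</book>"""


def check_links(root):
    """Return a list of problems in the pagestart/pageend cross-references."""
    starts = {e.get("id"): e for e in root.iter("pagestart")}
    ends = {e.get("id"): e for e in root.iter("pageend")}
    problems = []
    for sid, start in starts.items():
        end = ends.get(start.get("end"))
        if end is None:
            problems.append(f"{sid}: end={start.get('end')!r} matches no pageend")
        elif end.get("start") != sid:
            problems.append(f"{sid}: pageend {end.get('id')} does not point back")
    for eid, end in ends.items():
        if end.get("start") not in starts:
            problems.append(f"{eid}: start={end.get('start')!r} matches no pagestart")
    return problems


def walk(elem):
    """Yield ("start", element) and ("text", string) events in document order."""
    yield ("start", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)


def page_texts(root):
    """Collect the text lying between each pagestart and the next pageend."""
    pages, current = {}, None
    for kind, value in walk(root):
        if kind == "start" and value.tag == "pagestart":
            current = value.get("id")
            pages[current] = []
        elif kind == "start" and value.tag == "pageend":
            current = None
        elif kind == "text" and current is not None:
            pages[current].append(value)
    # Normalise whitespace so each page reads as a single line of text.
    return {pid: " ".join("".join(chunks).split()) for pid, chunks in pages.items()}


root = ET.fromstring(DOC)
print(check_links(root))
for page_id, text in page_texts(root).items():
    print(page_id, "->", text)
```

For the document as given, the list of problems is empty, and the text
recovered for page p1 ends with "The second paragraph extends" while
page p2 begins with "onto the second page.", so the page hierarchy
survives even though only the paragraph hierarchy is nested.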

----------------------------------------------------------------
Lou Burnard http://users.ox.ac.uk/~lou
----------------------------------------------------------------