[Corpora-List] Summary: Custom tagging validator

From: Przemek Kaszubski (przemka@amu.edu.pl)
Date: Mon Nov 21 2005 - 21:52:28 MET

On 5 November I announced the following request:

"I'm looking for a flexible tool that would validate files tagged by my
students. The tags follow the <tag>tagged_text</tag> convention but are
not linked to any DTD, and entirely my own. I'd like to be able to test
quickly if my students spelled the tag names correctly, closed the tags,
applied the < and > symbols etc. The tagging scheme is simple (sth like
10-12 tags in all), with no embedding or special properties."

Well, it turned out that what I really needed was a tool for checking
mainly well-formedness and some validity, given our very simple tagging
scheme with only two levels of nesting. As the student project I am
coordinating expands, we may need to put some of the more robust
validation tools in place. Meanwhile, we have settled on the simple
solution of combining a browser's (Firefox) XML parsing facility with
simple file editing, as suggested by Rafał L. Górski and Mark P. Line.
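
A browser-style well-formedness check of this kind can also be scripted.
The following is a minimal sketch only, not the procedure described
above. It assumes that the student file has no single top-level element
(hence the synthetic <root> wrapper) and no XML declaration; the tag
names in ALLOWED_TAGS and the function name are invented for
illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag inventory: the real scheme's tag names are not given here.
ALLOWED_TAGS = {"err", "corr"}


def check_tagged_text(text, allowed=ALLOWED_TAGS):
    """Return a list of problem descriptions (empty if none were found)."""
    # Assumption: the file is a text fragment, so wrap it in a synthetic
    # root element to give the XML parser a single top-level element.
    wrapped = "<root>" + text + "</root>"
    try:
        root = ET.fromstring(wrapped)
    except ET.ParseError as exc:
        # The parser stops at the first well-formedness error it meets.
        line, column = exc.position
        return ["not well-formed at line %d, column %d" % (line, column)]

    # Well-formed: now check every element name against the inventory.
    problems = []
    for elem in root.iter():
        if elem is not root and elem.tag not in allowed:
            problems.append("unknown tag <%s>" % elem.tag)
    return problems


if __name__ == "__main__":
    for sample in ("He <err>go</err> home",
                   "He <err>go</er> home",
                   "He <foo>go</foo> home"):
        print(sample, "->", check_tagged_text(sample))
```

Like a browser, the parser reports only the first well-formedness error,
so a file has to be corrected and re-checked until it parses cleanly.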

The other suggestions, which I may need to consider in the near future,
are briefly reported below in chronological order. I have added short
comments, labelled PK, for those looking for information on each tool's
applicability to my immediate purpose.

Many thanks to all who replied!

-----------

Lou Burnard suggested any XML validator, such as xsltproc

PK: available with the Cygwin packages libxml2 and libxslt; however, it
apparently assumes well-formedness

-----------

Kiril Simov suggested the CLaRK system
(http://www.bultreebank.org/clark/index.html)

PK: I found this too complex for the task, especially for my
computer-unsavvy students

-----------

David Graff provided me with a Perl script/filter, and generally
advised developing a customized editing tool for students as a
more reliable solution

PK: the Perl script supports neither nesting nor attributes, both of
which I, sadly, added to the files after the original post:

#!/usr/bin/perl

# Simple script to check for certain error conditions involving
# strings enclosed within angle brackets:
# - for each "<tag>", the next angle-bracketed string must be "</tag>"
# - tag names are purely alphanumeric, with no attributes
# - tags do not embed

# For a given input of tagged text, the output is a listing of tags found
# and their frequency of occurrence, along with any warnings about
# violations of the above conditions.

use strict;

die "Usage: $0 tagged_file.txt\n" if ( @ARGV == 0 and -t STDIN );

my $text = do { local $/; <> }; # read entire file into $text
my @segs = split( m{(</?\w+>)}, $text ); # split into data and tags

my $linenum = 1;
my $expect = '';
my %taghist;

for ( @segs ) {
    if ( /^<(\w+)>$/ ) { # this is an open-tag
        my $tag = uc $1;
        $taghist{"$tag open"}++;
        if ( $expect ) { # true if we're expecting a close-tag
            warn "found <$tag>, expecting </$expect> at line $linenum\n";
        }
        $expect = $tag;
    }
    elsif ( m{^</(\w+)>$} ) { # this is a close-tag
        my $tag = uc $1;
        $taghist{"$tag close"}++;
        if ( $tag ne $expect ) { # this close tag is wrong
            my $wanted = ( $expect ) ? "</$expect>" : "an open tag";
            warn "found </$tag>, expecting $wanted at line $linenum\n";
        }
        $expect = '';
    }
    elsif ( 0 == tr/<>// ) { # text with no angle-brackets
        $linenum += tr/\n//;
    }
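    else { # angle bracket(s) that are not part of a valid tag
        my @lines = split /\n/, $_, -1; # -1 keeps trailing empty pieces
        for my $i ( 0 .. $#lines ) {
            warn "bad angle bracket(s) at line $linenum\n" if ( $lines[$i] =~ /[<>]/ );
            $linenum++ if ( $i < $#lines ); # count newlines, not pieces
        }
    }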
}

printf( "%5d %s\n", $taghist{$_}, $_ ) for ( sort keys %taghist );

__END__

-----------

Valentin Jijkoun: suggested xmllint (Linux) and xmlvalid (web-based,
http://www.stg.brown.edu/service/xmlvalid/)

PK: the latter requires a DTD, which I wanted to avoid

----------

Ken Beesley: suggested RELAX NG with the Jing validator
(http://www.thaiopensource.com/relaxng/jing.html), which requires a
schema (http://relaxng.org/compact-tutorial-20030326.html), and provided
a short tutorial on the system

PK: requires a schema

----------

Ken Litkowski: kindly sent me his XML tools!

PK: they can do so much more...

-------

Jin-Dong: well-formedness: xmlwf; validity: xmllint (both available with
Cygwin, also open source implementations available)

-----------

Mario Barcala: suggested rxp (Linux)

-------------

Rafał L. Górski: use IE or Mozilla, alternatively Altova XMLSpy (home
edition free, http://www.altova.com/download_spy_home.html)

PK: the latter tool may be overkill, but supposedly is efficient

----------

Chinedu Uchechukwu (Bamberg): use Butterfly XML (open-source XML editor,
Java)

PK: editor + parser, looks promising

-----------

Steven Bird: provided a Python script

PK: one can provide tags, and it will look for errors (attributes
probably unsupported)

----snip----
# Simple Python script to check a text file containing embedded XML tags
# Errors detected:
# - unbalanced tags: <a>afd</a> lakjf<a>
# - mismatched tags: <a>lakf</b>
# - illegal tags: <a>kafsd</a> lajf <x>lawq</x>

import sys, re

# check usage
if len(sys.argv) != 2:
    print("Usage: %s filename" % sys.argv[0])
    sys.exit(1)

# read file into string
text = open(sys.argv[1]).read()

# the permissible tags, associated regexps
tags = ("a", "b")
legal_tag = re.compile(r"</?(?:%s)>" % "|".join(tags))
any_tag = re.compile(r"</?.*?>")

# get the sequence of legal tags, ignoring everything else
tag_seq = legal_tag.findall(text)

# check this sequence consists of paired begin-end tags
if len(tag_seq) % 2 != 0:
    print("Unbalanced tags")
    sys.exit(1)
for i in range(0, len(tag_seq), 2):  # step through the tags two at a time
    begin, end = tag_seq[i], tag_seq[i+1]
    # "<a>"[1:] and "</a>"[2:] both give "a>" when the pair matches
    if begin[1:] != end[2:]:
        print("Mismatched tags: %s, %s" % (begin, end))
        sys.exit(1)

# remove all legal tags and report any others
residue = legal_tag.sub("", text)
tag_seq = any_tag.findall(residue)
if tag_seq:
    print("Illegal tags: " + " ".join(tag_seq))
    sys.exit(1)

print("Correct use of tags: " + " ".join(tags))
----snip----
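
Neither script above handles nesting or attributes, both of which the
tagged files now use. A stack-based check is one way to cover them. The
following is a rough sketch under stated assumptions, not part of any of
the replies: the function name, the max_depth parameter and the loose
attribute pattern (name="value" with double quotes) are all invented for
illustration.

```python
import re

# Matches <name>, <name attr="value"> and </name>.
# Assumption: attribute values are always written in double quotes.
TAG_RE = re.compile(r'<(/?)([A-Za-z][\w-]*)((?:\s+[\w-]+="[^"<>]*")*)\s*>')


def check_nesting(text, allowed, max_depth=2):
    """Return a list of (line_number, message) pairs for tagging errors."""
    errors = []
    stack = []  # currently open tags, as (name, line_number)
    pos = 0     # end of the previous tag

    def line_of(index):
        return text.count("\n", 0, index) + 1

    def scan_gap(start, end):
        # Angle brackets in the text between tags are not part of any tag.
        for stray in re.finditer(r"[<>]", text[start:end]):
            errors.append((line_of(start + stray.start()), "stray angle bracket"))

    for m in TAG_RE.finditer(text):
        scan_gap(pos, m.start())
        pos = m.end()
        closing, name, attrs = m.group(1) == "/", m.group(2), m.group(3)
        line = line_of(m.start())
        if name not in allowed:
            errors.append((line, "unknown tag name '%s'" % name))
        if closing and attrs:
            errors.append((line, "close tag </%s> has attributes" % name))
        if not closing:
            stack.append((name, line))
            if len(stack) > max_depth:
                errors.append((line, "<%s> nested deeper than %d levels" % (name, max_depth)))
        elif stack and stack[-1][0] == name:
            stack.pop()  # close tag matches the innermost open tag
        else:
            expected = "</%s>" % stack[-1][0] if stack else "an open tag"
            errors.append((line, "found </%s>, expecting %s" % (name, expected)))
    scan_gap(pos, len(text))
    for name, line in stack:
        errors.append((line, "<%s> is never closed" % name))
    return errors
```

Unlike an XML parser, this keeps going after the first error, so one run
lists every problem it can find, each with a line number. A mismatched
close tag is reported twice: once where it occurs, and once for the open
tag that is left unclosed.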

-- 
Dr Przemyslaw Kaszubski
+48 61 8293515
http://elex.amu.edu.pl/ifa/staff/kaszubski.html
PICLE LEARNER CORPUS ONLINE:
http://www.staff.amu.edu.pl/~przemka/picle.html
COMPREHENSIVE CORPORA BIBLIOGRAPHY:
http://www.staff.amu.edu.pl/~przemka
MY SEMINARS:
http://www.staff.amu.edu.pl/~przemka/seminars.htm
ACADEMIC WRITING PAGE (FULL-TIME PROGRAMME):
http://www.staff.amu.edu.pl/~przemka/IFA_writing
=======================================
School of English (IFA)
Adam Mickiewicz University
http://elex.amu.edu.pl/ifa
=======================================

This archive was generated by hypermail 2b29 : Mon Nov 21 2005 - 22:15:32 MET