[Corpora-List] Summary: Custom tagging validator

From: Przemek Kaszubski (przemka@amu.edu.pl)
Date: Mon Nov 21 2005 - 21:52:28 MET


    On 5 November I announced the following request:

    "I'm looking for a flexible tool that would validate files tagged by my
    students. The tags follow the <tag>tagged_text</tag> convention but are
    not linked to any DTD, and entirely my own. I'd like to be able to test
    quickly if my students spelled the tag names correctly, closed the tags,
    applied the < and > symbols etc. The tagging scheme is simple (sth like
    10-12 tags in all), with no embedding or special properties."

    Well, it turned out that what I really needed was a tool for checking
    mostly well-formedness and some validity, given our very simple tagging
    scheme with only two levels of nesting. As the student project I am
    coordinating expands, we may need to put some of the more robust
    validation tools in place. Meanwhile, we have settled on the simple
    solution of combining a browser's (Firefox) XML parsing facility with
    simple file editing, as suggested by Rafał L. Górski and Mark P. Line.
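
The browser check can also be approximated in a script. What follows is only a minimal Python sketch under stated assumptions, not the procedure described above: it assumes the tagged files have no root element of their own (so the text is wrapped in a hypothetical <root> element), and ALLOWED_TAGS is a placeholder inventory rather than the project's real tag set. The function name check_tagged_text is likewise made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag inventory; replace with the project's real tag names.
ALLOWED_TAGS = {"np", "vp", "err"}


def check_tagged_text(text, allowed=ALLOWED_TAGS):
    """Return a list of problem descriptions (empty if none were found)."""
    # Assumption: the file has no single root element, so one is added here.
    # No newline is inserted, so reported line numbers still match the file;
    # columns on line 1 are shifted by len("<root>").
    wrapped = "<root>" + text + "</root>"
    try:
        root = ET.fromstring(wrapped)
    except ET.ParseError as err:
        # The parser stops at the first well-formedness error it meets.
        return ["not well-formed: %s" % err]
    problems = []
    for elem in root.iter():
        # Every element except the artificial root must use a known tag name.
        if elem is not root and elem.tag not in allowed:
            problems.append("unknown tag <%s>" % elem.tag)
    return problems
```

As with the browser, only the first well-formedness error is reported, so the file has to be corrected and rechecked until the list comes back empty. Unescaped & characters in the text will also be reported, since a real XML parser is doing the work.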

    The other suggestions, which I may need to consider in the near future,
    are briefly reported below in chronological order. I have added brief
    comments, labelled PK, for those looking for information on each tool's
    applicability to my immediate purpose.

    Many thanks to all that replied!

    -----------

    Lou Burnard suggested any XML validator, such as xsltproc

    PK: available with the Cygwin packages libxml2 and libxslt; however, it
    apparently assumes well-formedness

    -----------

    Kiril Simov suggested CLaRK system
    (http://www.bultreebank.org/clark/index.html)

    PK: I thought this too complex for the task, especially for my
    computer-unsavvy students

    -----------

    David Graff: provided me with a Perl script/filter, and generally
    advised developing a customized editing tool for students as a more
    reliable solution

    PK: the Perl script supports neither nesting nor attributes, both of
    which I, sadly, added to the files after the original post:

    #!/usr/bin/perl

    # Simple script to check for certain error conditions involving
    # strings enclosed within angle brackets:
    # - for each "<tag>", the next angle-bracketed string must be "</tag>"
    # - tag names are purely alphanumeric, with no attributes
    # - tags do not embed

    # For a given input of tagged text, the output is a listing of tags found
    # and their frequency of occurrence, along with any warnings about
    # violations of the above conditions.

    use strict;

    die "Usage: $0 tagged_file.txt\n" if ( @ARGV == 0 and -t STDIN );

    my $text = do { local $/; <> }; # read entire file into $text
    my @segs = split( m{(</?\w+>)}, $text ); # split into data and tags

    my $linenum = 1;
    my $expect = '';
    my %taghist;

    for ( @segs ) {
        if ( /^<(\w+)>$/ ) { # this is an open-tag
            my $tag = uc $1;
            $taghist{"$tag Open"}++;
            if ( $expect ) { # true if we're expecting a close-tag
                warn "found <$tag>, expecting </$expect> at line $linenum\n";
            }
            $expect = $tag;
        }
        elsif ( m{^</(\w+)>$} ) { # this is a close-tag
            my $tag = uc $1;
            $taghist{"$tag close"}++;
            if ( $tag ne $expect ) { # this close tag is wrong
                my $wanted = ( $expect ) ? "</$expect>" : "an open tag";
                warn "found </$tag>, expecting $wanted at line $linenum\n";
            }
            $expect = '';
        }
        elsif ( 0 == tr/<>// ) { # text with no angle-brackets
            $linenum += tr/\n//;
        }
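        else { # angle bracket(s) that are not part of a valid tag
            my @lines = split /\n/, $_, -1; # -1 keeps trailing empty fields
            for my $i ( 0 .. $#lines ) {
                warn "bad angle bracket(s) at line $linenum\n" if ( $lines[$i] =~ /[<>]/ );
                $linenum++ if ( $i < $#lines ); # advance only past real newlines
            }
        }
    }

    # a tag still open at end of input is also an error
    warn "reached end of input, still expecting </$expect>\n" if ( $expect );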

    printf( "%5d %s\n", $taghist{$_}, $_ ) for ( sort keys %taghist );

    __END__
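
Since nesting and attributes are what the script above leaves out, here is a rough Python sketch of how a stack could cover both. It is not David Graff's method, only an illustration under assumptions: the function name check_nested_tags and the max_depth limit of 2 (taken from the "two nestings" mentioned earlier) are hypothetical, attribute text is accepted loosely without being validated, and self-closing tags are not handled.

```python
import re

# Matches <name>, <name attr="value"> and </name>; attribute text is accepted
# loosely and not validated.
TAG_RE = re.compile(r"<(/?)(\w+)(\s[^<>]*)?>")


def check_nested_tags(text, max_depth=2):
    """Return warnings for mismatched, unclosed, or too deeply nested tags."""
    warnings = []
    stack = []  # (tag name, line number) for each tag currently open
    for match in TAG_RE.finditer(text):
        closing, name, _attrs = match.groups()
        # Line number of the tag = newlines before it, plus one.
        line = text.count("\n", 0, match.start()) + 1
        if not closing:
            stack.append((name, line))
            if len(stack) > max_depth:
                warnings.append("<%s> nested too deeply at line %d" % (name, line))
        elif stack and stack[-1][0] == name:
            stack.pop()  # close tag matches the innermost open tag
        else:
            wanted = "</%s>" % stack[-1][0] if stack else "an open tag"
            warnings.append("found </%s>, expecting %s at line %d" % (name, wanted, line))
    # Anything left on the stack was opened but never closed.
    for name, line in stack:
        warnings.append("<%s> opened at line %d is never closed" % (name, line))
    return warnings
```

A mismatched close tag is reported but does not pop the stack, so one mistake can trigger a follow-up "never closed" warning; fixing the first reported problem and rerunning is the intended workflow.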

    -----------

    Valentin Jijkoun: suggested xmllint (Linux) and xmlvalid (web-based,
    http://www.stg.brown.edu/service/xmlvalid/)

    PK: the latter requires DTD, which I wanted to avoid

    ----------

    Ken Beesley: suggested RELAX NG with the Jing validator
    (http://www.thaiopensource.com/relaxng/jing.html), which requires a
    schema (http://relaxng.org/compact-tutorial-20030326.html), and provided
    a short tutorial on the system

    PK: requires a schema definition (RELAX NG's counterpart of a DTD)

    ----------

    Ken Litkowski: kindly sent me his XML tools!

    PK: they can do so much more...

    -------

    Jin-Dong: well-formedness: xmlwf; validity: xmllint (both available with
    Cygwin, also open source implementations available)

    -----------

    Mario Barcala: suggested rxp (Linux)

    -------------

    Rafał L. Górski: use IE or Mozilla, alternatively Altova XMLSpy (home
    edition free, http://www.altova.com/download_spy_home.html)

    PK: the latter tool may be overkill, but supposedly is efficient

    ----------

    Chinedu Uchechukwu (Bamberg): use Butterfly XML (open-source XML
    editor, written in Java)

    PK: editor + parser, looks promising

    -----------

    Steven Bird: provided python script

    PK: one can provide tags, and it will look for errors (attributes
    probably unsupported)

    ----snip----
    # Simple Python script to check a text file containing embedded XML tags
    # Errors detected:
    # - unbalanced tags: <a>afd</a> lakjf<a>
    # - mismatched tags: <a>lakf</b>
    # - illegal tags: <a>kafsd</a> lajf <x>lawq</x>

    import sys, re

    # check usage
    if len(sys.argv) != 2:
        print("Usage: %s filename" % sys.argv[0])
        sys.exit(1)

    # read file into string
    text = open(sys.argv[1]).read()

    # the permissible tags, associated regexps
    tags = ("a", "b")
    legal_tag = re.compile(r"</?(?:%s)>" % "|".join(tags))
    any_tag = re.compile(r"</?.*?>")

    # get the sequence of legal tags, ignoring everything else
    tag_seq = legal_tag.findall(text)

    # check this sequence consists of paired begin-end tags
    if len(tag_seq) % 2 != 0:
        print("Unbalanced tags")
        sys.exit(1)
    for i in range(0, len(tag_seq), 2):  # step through (begin, end) pairs
        begin, end = tag_seq[i], tag_seq[i+1]
        if begin[1:] != end[2:]:
            print("Mismatched tags: %s, %s" % (begin, end))
            sys.exit(1)

    # remove all legal tags and report any others
    residue = legal_tag.sub("", text)
    tag_seq = any_tag.findall(residue)
    if tag_seq:
        print("Illegal tags: " + " ".join(tag_seq))
        sys.exit(1)

    print("Correct use of tags: " + " ".join(tags))
    ----snip----
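
Because the original request was mainly about misspelled tag names, the "illegal tags" report could be extended to suggest the tag a student probably meant. The following is a hedged sketch, not part of Steven Bird's script: the TAGS inventory and the function name suggest_corrections are hypothetical, and the 0.6 similarity cutoff is simply the default used by difflib.get_close_matches in the Python standard library.

```python
import difflib
import re

# Hypothetical tag inventory; replace with the real tag names.
TAGS = ("noun", "verb", "adj")


def suggest_corrections(text, tags=TAGS):
    """Map each unknown tag name found in text to its closest legal names."""
    # Collect tag names from open and close tags, ignoring any attributes.
    found = set(re.findall(r"</?(\w+)[^<>]*>", text))
    suggestions = {}
    for name in sorted(found - set(tags)):
        # An empty list means no legal tag is similar enough to suggest.
        suggestions[name] = difflib.get_close_matches(name, tags, n=1, cutoff=0.6)
    return suggestions
```

For example, a file containing <vreb>sat</vreb> would yield a suggestion of "verb" for the unknown name "vreb", while a name unlike any legal tag maps to an empty list.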

    -- 
    Dr Przemyslaw Kaszubski
    +48 61 8293515
    http://elex.amu.edu.pl/ifa/staff/kaszubski.html
    



