Re: [Corpora-List] Custom tagging validator

From: Steven Bird (sb@csse.unimelb.edu.au)
Date: Tue Nov 08 2005 - 12:16:34 MET

    Here's a simple Python script to quickly check that a file contains
    balanced, matched XML-style tags taken from a fixed set. It avoids
    the need for a DTD and an artificial root element. For more on Python
    for NLP, see nltk.sourceforge.net. -Steven Bird

    ----snip----
    # Simple Python script to check a text file containing embedded XML tags
    # Errors detected:
    # - unbalanced tags: <a>afd</a> lakjf<a>
    # - mismatched tags: <a>lakf</b>
    # - illegal tags: <a>kafsd</a> lajf <x>lawq</x>

    import sys, re

    # check usage
    if len(sys.argv) != 2:
        print("Usage: %s filename" % sys.argv[0])
        sys.exit(1)

    # read file into string
    text = open(sys.argv[1]).read()

    # the permissible tags, associated regexps
    tags = ("a", "b")
    legal_tag = re.compile(r"</?(?:%s)>" % "|".join(tags))
    any_tag = re.compile(r"</?.*?>")

    # get the sequence of legal tags, ignoring everything else
    tag_seq = legal_tag.findall(text)

    # check this sequence consists of paired begin-end tags
    if len(tag_seq) % 2 != 0:
        print("Unbalanced tags")
        sys.exit(1)
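    # step through the tags two at a time: (begin, end), (begin, end), ...
    for i in range(0, len(tag_seq), 2):
        begin, end = tag_seq[i], tag_seq[i+1]
        # "<a>"[1:] and "</a>"[2:] both give "a>" when the pair matches
        if begin[1:] != end[2:]:
            print("Mismatched tags: %s, %s" % (begin, end))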
            sys.exit(1)

    # remove all legal tags and report any others
    residue = legal_tag.sub("", text)
    tag_seq = any_tag.findall(residue)
    if tag_seq:
        print("Illegal tags: %s" % " ".join(tag_seq))
        sys.exit(1)

    print("Correct use of tags: %s" % " ".join(tags))
    ----snip----
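
    The same three checks can also be wrapped in a function that collects
    every problem instead of exiting at the first one, which is handy when
    checking many student files in one run. This is only a sketch in
    Python 3: the name check_tags, the returned list of messages, and the
    use of re.escape on the tag names are assumptions and are not part of
    the script above.

```python
import re

def check_tags(text, tags=("a", "b")):
    """Return a list of error messages; an empty list means no problems found."""
    # same two patterns as the script: the permitted tags, and anything tag-like
    legal_tag = re.compile(r"</?(?:%s)>" % "|".join(map(re.escape, tags)))
    any_tag = re.compile(r"</?.*?>")
    errors = []

    # the sequence of legal tags, ignoring everything else
    tag_seq = legal_tag.findall(text)
    if len(tag_seq) % 2 != 0:
        errors.append("Unbalanced tags")

    # compare each complete (begin, end) pair; a trailing odd tag is skipped
    for i in range(0, len(tag_seq) - 1, 2):
        begin, end = tag_seq[i], tag_seq[i + 1]
        if begin[1:] != end[2:]:
            errors.append("Mismatched tags: %s, %s" % (begin, end))

    # remove all legal tags and report whatever tag-like text is left
    residue = legal_tag.sub("", text)
    illegal = any_tag.findall(residue)
    if illegal:
        errors.append("Illegal tags: " + " ".join(illegal))
    return errors

# the three error cases from the comments at the top of the script
for sample in ("<a>afd</a> lakjf<a>",
               "<a>lakf</b>",
               "<a>kafsd</a> lajf <x>lawq</x>"):
    print(sample, "->", check_tags(sample))
```

    Like the script, this assumes tags are never nested, so the legal tags
    must simply alternate begin, end, begin, end.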

    On 11/8/05, neduchi@netscape.net <neduchi@netscape.net> wrote:
    > Hallo,
    > Please have a look at this free xml editor: http://www.butterflyxml.org/
    >
    > Maybe it will be of help.
    >
    > Chinedu Uchechukwu
    > Otto-Friedrich-Universität, Bamberg
    >
    >
    > -----Original Message-----
    > From: Przemek Kaszubski <przemka@amu.edu.pl>
    > To: CORPORA@uib.no
    > Sent: Sat, 05 Nov 2005 17:32:21 +0100
    > Subject: [Corpora-List] Custom tagging validator
    >
    > Dear Members,
    >
    > I'm looking for a flexible tool that would validate files tagged by my
    > students. The tags follow the <tag>tagged_text</tag> convention but are
    > not linked to any DTD, and entirely my own. I'd like to be able to test
    > quickly if my students spelled the tag names correctly, closed the
    > tags, applied the < and > symbols etc. The tagging scheme is simple
    > (something like 10-12 tags in all), with no embedding or special properties.
    >
    > Does anyone know of a tool or script of this kind, or perhaps
    > developed one?
    >
    > Thank you for any help,
    >
    > Przemek
    >
    > -- Dr Przemyslaw Kaszubski
    > +48 61 8293515
    > http://elex.amu.edu.pl/ifa/staff/kaszubski.html
    >
    > PICLE LEARNER CORPUS ONLINE:
    > http://www.staff.amu.edu.pl/~przemka/picle.html
    >
    > COMPREHENSIVE CORPORA BIBLIOGRAPHY:
    > http://www.staff.amu.edu.pl/~przemka
    >
    > MY SEMINARS:
    > http://www.staff.amu.edu.pl/~przemka/seminars.htm
    >
    > ACADEMIC WRITING PAGE (FULL-TIME PROGRAMME):
    > http://www.staff.amu.edu.pl/~przemka/IFA_writing
    >
    > =======================================
    > School of English (IFA)
    > Adam Mickiewicz University
    > http://elex.amu.edu.pl/ifa
    > =======================================



    This archive was generated by hypermail 2b29 : Tue Nov 08 2005 - 12:29:15 MET