Re: Corpora: sgml detagger

From: Vlado Keselj (vkeselj@cs.uwaterloo.ca)
Date: Tue Apr 16 2002 - 22:21:47 MET DST

    On Tue, 16 Apr 2002, Alexander S. Yeh wrote:

    > The script below will work for most tags, but may fail in the following
    > more complicated cases:
    >
    > 1. A tag is spread out over more than 1 line (usual cases: comment tags,
    > tags with attribute/value pairs).
    >
    > 2. A tag has an attribute value that has a ">" in it.
    >
    > 3. A comment tag has a ">" embedded in it.
    >
    > I have encountered these in html files of journal articles gotten off
    > the web. Thanks.
    >
    > -Alex Yeh

    True.

    Actually, writing a correct and general SGML detagger would be a *very*
    difficult task. How a document is parsed depends on its DTD, which
    can define a very flexible syntax. The difficulty of writing a general
    SGML parser was one of the main reasons XML was created.

    However, removing comments and tags from an HTML, XML, or typical SGML
    document should not be such a difficult task. I just wrote a script to do
    it, and it is appended below. Please report any bugs that you find.

    Note that it follows the strict rules for HTML (SGML) comments, which may
    be counter-intuitive, and I would not bet that all browsers (not to
    mention users) observe them. The rules say that a comment declaration is
    either <!> or starts with <!--. If it starts with <!--, the comment
    runs up to the next --. After that -- and possibly some whitespace, we
    can either close the declaration with > or start a new comment with --.
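
    As a rough illustration of that rule, here is a minimal sketch that
    checks whether a string is, as a whole, one strict comment declaration.
    The helper name is_strict_comment is hypothetical and is not part of the
    script below; the regular expression is only my reading of the rule as
    stated above, not a full SGML parser.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not part of the
# detagger script): true if $s is exactly one strict comment declaration.
sub is_strict_comment {
    my ($s) = @_;
    return $s =~ /\A<!              # declaration open
                  (?: --            # a comment starts with two dashes
                      (?:(?!--).)*  # comment text, up to the next two dashes
                      --            # which end the comment
                      \s*           # optional whitespace after a comment
                  )*                # zero or more comments
                  >\z/xs ? 1 : 0;
}

# The examples listed in the script header: three valid, two invalid.
for my $s ('<!>',
           '<!-- cm -->',
           '<!-- comment 1 ---- comment2 -- -- c3 -- >',
           '<!-- comment 1 -- ERR -->',
           '<!-- comment 1 -- -->') {
    printf "%-45s %s\n", $s, is_strict_comment($s) ? 'valid' : 'invalid';
}
```

    The first three strings are reported as valid and the last two as
    invalid, which matches the examples in the script header.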

    Vlado

    #!/usr/bin/perl
    # 2002 Vlado Keselj <vkeselj@cs.uwaterloo.ca>
    # Version: 0.1
    # The newest version can be found at:
    # http://vlado.keselj.net/srcperl/
    #
    # Strips tags and comments from HTML (or typical SGML/XML) input.
    # Warning: Follows strict HTML syntax for comments (which may be
    # counter-intuitive), e.g., valid comments are:
    # <!> <!-- cm --> <!-- comment 1 ---- comment2 -- -- c3 -- >
    # and invalid comments are:
    # <!-- comment 1 -- ERR --> <!-- comment 1 -- --> NOT FINISHED

    $state = 'normal';

    while (<>) {
        while (length $_) {    # a bare ($_) test would stop early on a leftover "0"
            if ($state eq 'normal') {
                if (/^([^<]*)<!>/) { print $1; $_ = $'; }
                elsif (/^([^<]*)<!--/) {
                    print $1; $_ = $'; $state = 'comment';
                }
                elsif (/^([^<]*)</) {
                    print $1; $_ = $'; $state = 'tag';
                }
                else { print; $_ = ''; }
            }
            elsif ($state eq 'comment') {
                if (/--/) { $_ = $'; $state = 'betweencomments'; }
                else { $_ = '' }
            }
            elsif ($state eq 'betweencomments') {
                if (/^\s*>/) { $_ = $'; $state = 'normal' }
                elsif (/^\s*--/) { $_= $'; $state = 'comment'; }
                elsif (/^\s*$/) { $_ = '' }
                else { die "IMPROPER HTML COMMENT" }
            }
            elsif ($state eq 'tag') {
                if (/^[^>\"\']*([>\'\"])/) {
                    $_ = $';
                    if ($1 eq '>') { $state = 'normal' }
                    else { $state = 'quote'; $quote = $1; }
                }
                else { $_ = '' }
            }
            elsif ($state eq 'quote') {
                if (/$quote/) { $_ = $'; $state = 'tag' }
                else { $_ = '' }
            }
            else { die "UNKNOWN STATE ($state)" }
        }
    }
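
    To try the three cases Alex listed without creating input files, the
    same state machine can be wrapped in a function that works on one
    string. This is only a sketch under my own assumptions: the name detag
    is hypothetical, the state 'betweencomments' is shortened to 'between',
    and s/// replaces the $' bookkeeping used in the script above.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not the script above): the same states, applied to
# one in-memory string instead of line-by-line input.
sub detag {
    my ($text) = @_;
    my ($out, $state, $quote) = ('', 'normal', '');
    while (length $text) {
        if ($state eq 'normal') {
            if    ($text =~ s/\A([^<]*)<!>//)  { $out .= $1 }
            elsif ($text =~ s/\A([^<]*)<!--//) { $out .= $1; $state = 'comment' }
            elsif ($text =~ s/\A([^<]*)<//)    { $out .= $1; $state = 'tag' }
            else                               { $out .= $text; $text = '' }
        }
        elsif ($state eq 'comment') {
            # Skip comment text up to and including the next "--".
            if ($text =~ s/\A.*?--//s) { $state = 'between' }
            else                       { $text = '' }
        }
        elsif ($state eq 'between') {
            if    ($text =~ s/\A\s*>//)  { $state = 'normal' }
            elsif ($text =~ s/\A\s*--//) { $state = 'comment' }
            elsif ($text =~ /\A\s*\z/)   { $text = '' }
            else                         { die "IMPROPER HTML COMMENT" }
        }
        elsif ($state eq 'tag') {
            # Skip to the end of the tag or to the start of a quoted value.
            if ($text =~ s/\A[^>"']*([>"'])//) {
                if ($1 eq '>') { $state = 'normal' }
                else           { $state = 'quote'; $quote = $1 }
            }
            else { $text = '' }
        }
        else {    # 'quote': skip to the matching quote character
            if ($text =~ s/\A.*?\Q$quote\E//s) { $state = 'tag' }
            else                               { $text = '' }
        }
    }
    return $out;
}

# 1. A tag spread over more than one line.
print detag("<a\n  href='x.html'>link</a>\n");    # prints "link" and a newline
# 2. An attribute value containing ">".
print detag('<img alt="a > b">text'), "\n";       # prints "text"
# 3. A comment containing ">".
print detag('A<!-- x > y -->B'), "\n";            # prints "AB"
```

    Because a quoted attribute value and a comment each get their own state,
    a ">" inside either one does not end the tag early.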


