Re: Corpora: sgml detagger

From: Vlado Keselj (vkeselj@cs.uwaterloo.ca)
Date: Tue Apr 16 2002 - 22:21:47 MET DST

Next message: Ute Römer: "Re: Corpora: Historical background of Corpus Linguistics"

Previous message: David Graff: "Re: Corpora: sgml detagger"
In reply to: Alexander S. Yeh: "Re: Corpora: sgml detagger"
Next in thread: Vlado Keselj: "Re: Corpora: sgml detagger"
Next in thread: David Graff: "Re: Corpora: sgml detagger"
Reply: Vlado Keselj: "Re: Corpora: sgml detagger"

On Tue, 16 Apr 2002, Alexander S. Yeh wrote:

> The script below will work for most tags, but may fail in the following
> more complicated cases:
>
> 1. A tag is spread out over more than 1 line (usual cases: comment tags,
> tags with attribute/value pairs).
>
> 2. A tag has an attribute value that has a ">" in it.
>
> 3. A comment tag has a ">" embedded in it.
>
> I have encountered these in html files of journal articles gotten off
> the web. Thanks.
>
> -Alex Yeh

True.
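
The three cases quoted above can be reproduced with a short sketch. The
naive_detag routine below is a hypothetical stand-in for a simple one-line
substitution; it is not the script from earlier in the thread, which is
not reproduced here. It only assumes that input is processed one line at
a time and that a tag is anything between "<" and the next ">".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical naive detagger (an assumption for illustration only):
# deletes everything from "<" up to the next ">" within one string.
sub naive_detag {
    my ($line) = @_;
    $line =~ s/<[^>]*>//g;
    return $line;
}

# Case 1: a tag spread over two lines, fed in line by line.
# Neither half is a complete <...> pair, so pieces of the tag leak out.
my @split_tag = ("<a href=\"x.html\"\n", "   class=\"ref\">link</a>\n");
print "case 1: ", join('', map { naive_detag($_) } @split_tag);

# Case 2: an attribute value containing ">".
# The match stops at the ">" inside the quotes.
print "case 2: ", naive_detag(qq{<img alt="a > b" src="p.gif">text\n});

# Case 3: a comment containing ">".
# The match stops at the ">" inside the comment.
print "case 3: ", naive_detag("<!-- if x > y -->text\n");
```

In each case part of the markup survives in the output, which is why the
script below tracks state across lines and treats quotes and comments
separately.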

Actually, writing a correct and general SGML detagger would be a *very*
difficult task. The actual document processing depends on a DTD, which
can define a very flexible syntax. The difficulty of writing a general
SGML parser was one of the main reasons for coming up with XML.

However, removing comments and tags from an HTML, XML, or typical SGML
document should not be such a difficult task. I just wrote a script to do
it, and it is appended below. Please report any bugs that you find.

Note that it follows the strict rules for HTML (SGML) comments, which may
be counter-intuitive, and I would not bet that all browsers (not to
mention users) observe them. The rules say that a comment is either <!>
or starts with <!--. If it starts with <!--, its text runs up to the
next --. After that -- and possibly some whitespace, we can either finish
the comment tag with > or start a new comment with --.
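
As a rough illustration of these rules, here is a minimal sketch that
expresses them as a single regular expression. The names $STRICT_COMMENT
and strip_strict_comments are hypothetical and do not appear in the script
below, which uses a state machine instead. The sketch assumes the whole
document is held in one string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed regex rendering of the rules above:
#   <!                      comment declaration opens
#   (?: -- text -- \s* )*   zero or more comments; text may not contain "--"
#   >                       comment declaration closes
my $STRICT_COMMENT = qr/<!(?:--(?:(?!--).)*--\s*)*>/s;

# Remove every strictly valid comment declaration from a string.
sub strip_strict_comments {
    my ($text) = @_;
    $text =~ s/$STRICT_COMMENT//g;
    return $text;
}

my @samples = (
    "a<!>b",                     # empty comment declaration
    "a<!-- x > y -->b",          # ">" inside a comment is allowed
    "a<!-- one -- -- two -->b",  # two comments in one declaration
    "a<!-- one -- two -->b",     # improper: "two" lies between comments
);
print strip_strict_comments($_), "\n" for @samples;
```

The first three samples reduce to "ab". The fourth is left unchanged,
because after the closing -- of the first comment the rules allow only
whitespace, > or another --. The script below treats that situation as
an error and dies instead of passing the text through.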

Vlado

#!/usr/bin/perl
# 2002 Vlado Keselj <vkeselj@cs.uwaterloo.ca>
# Version: 0.1
# The newest version can be found at:
# http://vlado.keselj.net/srcperl/
#
# Cleans HTML tags.
# Warning: Follows strict HTML syntax for comments (which may be
# counter-intuitive), e.g., valid comments are:
# <!>    NOT FINISHED

$state = 'normal';

while (<>) {
    while ($_ ne '') {    # plain "while ($_)" would drop a trailing "0"
        if ($state eq 'normal') {
            if (/^([^<]*)<!>/) { print $1; $_ = $'; }
            elsif (/^([^<]*)<!--/) {
                print $1; $_ = $'; $state = 'comment';
            }
            elsif (/^([^<]*)</) {
                print $1; $_ = $'; $state = 'tag';
            }
            else { print; $_ = ''; }
        }
        elsif ($state eq 'comment') {
            if (/--/) { $_ = $'; $state = 'betweencomments'; }
            else { $_ = '' }
        }
        elsif ($state eq 'betweencomments') {
            if (/^\s*>/) { $_ = $'; $state = 'normal' }
            elsif (/^\s*--/) { $_= $'; $state = 'comment'; }
            elsif (/^\s*$/) { $_ = '' }
            else { die "IMPROPER HTML COMMENT" }
        }
        elsif ($state eq 'tag') {
            if (/^[^>\"\']*([>\'\"])/) {
                $_ = $';
                if ($1 eq '>') { $state = 'normal' }
                else { $state = 'quote'; $quote = $1; }
            }
            else { $_ = '' }
        }
        elsif ($state eq 'quote') {
            if (/$quote/) { $_ = $'; $state = 'tag' }
            else { $_ = '' }
        }
        else { die "UNKNOWN STATE ($state)" }
    }
}

This archive was generated by hypermail 2b29 : Tue Apr 16 2002 - 22:17:49 MET DST