Re: Corpora: sgml detagger

From: David Graff (graff@unagi.cis.upenn.edu)
Date: Tue Apr 16 2002 - 22:11:36 MET DST

    > Does anybody know an easy way to remove [sgml] tags and save the texts
    > as 'raw' .txt files?
    > Maybe a PERL script?

    Perl is very good for this. If you're confident that _all_ the text
    data in the sgml files (i.e. everything that is not an sgml tag) is
    usable for down-stream processing, then this perl script would work
    (even when sgml tags span multiple lines):

    #!/usr/bin/perl

    # undefine the input record separator
    # (entire content of input file will be fetched in a single read;
    # an empty string would select paragraph mode instead):

    undef $/;

    # assume that command line args are file names to be converted;
    # for each input file, read it and write "file_name.raw"

    foreach $file ( @ARGV ) {
        open( IN, "<", $file ) or do { warn "can't open $file: $!\n"; next; };
        $data = <IN>;
        close IN;
        (defined $data) or do { warn "can't read data from $file\n"; next; };

        $data =~ s/<[^>]+>//g; # remove tags (strings bounded by "<...>")
        $data =~ s/\n\s+/\n/g; # remove blank lines and leading whitespace (not essential)

        open( OUT, ">", "$file.raw" ) or do { warn "can't write $file.raw: $!\n"; next; };
        print OUT $data or warn "can't write data to $file.raw\n";
        close OUT or die "error trying to close $file.raw\n";
    }

    __END__
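
    To see why the tag-removal pattern copes with tags that span several
    lines, here is a minimal sketch that applies the same two substitutions
    to a string instead of a file. The helper name strip_tags and the sample
    markup are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_tags is a hypothetical helper name; it applies the same two
# substitutions as the script above to a string instead of a file.
sub strip_tags {
    my ($data) = @_;
    $data =~ s/<[^>]+>//g;   # [^>] also matches newlines, so multi-line tags go too
    $data =~ s/\n\s+/\n/g;   # drop blank lines and leading whitespace
    return $data;
}

# Made-up sample: the opening <DOC ...> tag is split across two lines.
my $sample = "<DOC\n  id=\"x1\">\n<HEAD>A headline</HEAD>\n\n"
           . "<TEXT>\nSome text.\n</TEXT>\n</DOC>\n";

print strip_tags($sample);
```

    The negated character class [^>] matches any character except ">",
    newline included, which is what lets a single substitution remove a tag
    that is broken across lines.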

    However, it is not uncommon for sgml files to contain tags whose data
    content is not human language; for example, you might find markup like
    the following:

    <DOC>
    <DOCNO> AP891231-0001 </DOCNO>
    <FILEID>AP-NR-12-31-89 2359EDT</FILEID>
    <FIRST>r a PM-MonkeyBusiness 12-31 0269</FIRST>
    <SECOND>PM-Monkey Business,0276</SECOND>
    <HEAD>Yacht That Took Gary Hart On Famous Cruise Suffered From Fame</HEAD>
    <DATELINE>DENVER (AP) </DATELINE>
    <TEXT>
       Monkey Business, the yacht that helped sink Gary
    Hart's presidential aspirations in 1988, is for sale, and its
    ...

    (This example is drawn from an sgml file in the TIPSTER corpus.) The
    point is that you might want to filter out more than just the sgml tags,
    if your down-stream process is going to treat everything that remains as
    language data.

    If the sgml markup makes it easy to identify what portion(s) you want to
    keep, then a couple of additions to the Perl script above would suffice --
    e.g. for the TIPSTER case, you could add these two lines just before the
    line that removes all the tags:

       $data =~ s/^.*<TEXT>//s; # remove everything up to/including the last <TEXT>
       $data =~ s%</TEXT>.*%%s; # remove </TEXT> and everything after it
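
    Because ".*" is greedy, the two substitutions above keep only one TEXT
    region per file. If a file holds several documents, each with its own
    TEXT element (an assumption about your data, not something the excerpt
    above shows), a non-greedy match can collect every region. A sketch,
    with the hypothetical helper name text_regions and made-up sample data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# text_regions is a hypothetical helper: it returns the content of every
# <TEXT>...</TEXT> region in $data, in order of appearance.
sub text_regions {
    my ($data) = @_;
    # .*? is non-greedy, so each match stops at the nearest </TEXT>;
    # the /s modifier lets "." match newlines inside a region.
    my @regions = $data =~ m{<TEXT>(.*?)</TEXT>}gs;
    return @regions;
}

# Made-up two-document sample in the style of the example above.
my $sample = join "",
    "<DOC>\n<DOCNO> D1 </DOCNO>\n<TEXT>\nFirst story.\n</TEXT>\n</DOC>\n",
    "<DOC>\n<DOCNO> D2 </DOCNO>\n<TEXT>\nSecond story.\n</TEXT>\n</DOC>\n";

my $data = join "", text_regions($sample);
$data =~ s/<[^>]+>//g;   # then remove any tags left inside the regions
print $data;
```

    In the main script this would replace the two added lines: assign the
    joined regions back to $data before the line that removes all the tags.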

    Depending on where your sgml files came from -- and if you have the DTD
    that they are supposed to be based on -- it may be a good idea to
    validate the tagging first, using a standard sgml parser, like nsgmls;
    it's hard to create any kind of useful sgml filter when there are
    mistakes in the tagging.

    For that matter, it's probably easier/safer to write a filter that works
    on the output of an sgml parser, rather than the sgml file.
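
    As a rough sketch of that last idea: nsgmls prints a line-oriented
    format (ESIS) in which "(NAME" opens an element, ")NAME" closes it, and
    lines starting with "-" carry character data, with "\n" standing for a
    record end and "\\" for a backslash. The filter below keeps only the
    data inside one named element. The helper names esis_text and unescape
    are hypothetical, the sample lines are hand-written rather than real
    parser output, and the sketch assumes element names arrive in upper
    case, which depends on the SGML declaration in use.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn one ESIS escape (the part after the backslash) back into text.
sub unescape {
    my ($c) = @_;
    return "\\" if $c eq "\\";   # escaped backslash
    return "\n" if $c eq "n";    # record end
    return ""   if $c eq "|";    # SDATA entity bracket: dropped here
    return chr(oct($c));         # three-digit octal character code
}

# esis_text is a hypothetical helper: given ESIS lines, return the
# character data found inside elements named $want.
sub esis_text {
    my ($want, @lines) = @_;
    my $depth = 0;               # > 0 while inside a $want element
    my $out   = "";
    foreach my $line (@lines) {
        chomp $line;
        if ($line =~ /^\((.+)/) {
            $depth++ if $1 eq $want;
        }
        elsif ($line =~ /^\)(.+)/) {
            $depth-- if $1 eq $want && $depth > 0;
        }
        elsif ($depth > 0 && $line =~ /^-(.*)/) {
            my $text = $1;
            $text =~ s/\\(\\|n|\||[0-7][0-7][0-7])/unescape($1)/ge;
            $out .= $text;
        }
    }
    return $out;
}

# Hand-written sample in ESIS style (not real nsgmls output).
my @esis = ( "(DOC\n", "(DOCNO\n", "- D1 \n", ")DOCNO\n",
             "(TEXT\n", "-First line\\nSecond line\n", ")TEXT\n",
             ")DOC\n", "C\n" );

print esis_text("TEXT", @esis), "\n";
```

    To use this on real data you would feed it the lines that nsgmls writes
    to standard output for your file, instead of the @esis sample. Since the
    parser has already resolved the markup, the filter never needs to guess
    where a tag begins or ends.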

    Best regards,

            Dave Graff


