Re: Corpora: sgml detagger

From: Vlado Keselj (vkeselj@cs.uwaterloo.ca)
Date: Tue Apr 16 2002 - 22:40:28 MET DST


    On Tue, 16 Apr 2002, David Graff wrote:

    >
    > > Does anybody know an easy way to remove [sgml] tags and save the texts
    > > as 'raw' .txt files?
    > > Maybe a PERL script?
    >
    > Perl is very good for this. If you're confident that _all_ the text
    > data in the sgml files (i.e. everything that is not an sgml tag) is
    > usable for down-stream processing, then this perl script would work
    > (even when sgml tags span multiple lines):

    You also have to assume that:
     - no quoted string inside a tag contains a > sign (e.g., <div id="><">),
     - no comment contains a > sign, and
     - each file is small enough to fit in memory.
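
    The first two assumptions can be relaxed a little. As a rough sketch
    (not a replacement for a real sgml parser), the pattern below skips
    over quoted attribute values and removes <!-- ... --> comments before
    stripping tags. The name strip_tags is made up for illustration, and
    a stray "<" in the running text would still confuse it:

```perl
use strict;
use warnings;

# Remove comments and tags from a string that holds a whole document.
# Assumes comments have the simple form <!-- ... --> and that attribute
# values are quoted with matching single or double quotes.
sub strip_tags {
    my ($text) = @_;

    # Drop comments first, so a ">" inside a comment cannot end a "tag".
    # /s lets "." match newlines; .*? stops at the nearest "-->".
    $text =~ s/<!--.*?-->//gs;

    # A tag is "<", then any mix of plain characters and quoted strings,
    # then ">". A ">" inside quotes is consumed as part of the string.
    $text =~ s/<(?:[^>"']|"[^"]*"|'[^']*')*>//g;

    return $text;
}
```

    It still reads each file into memory in one piece, so the third
    assumption stays.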

    Vlado

    >
    > #!/usr/bin/perl
    >
    > # undefine the input record separator ("slurp" mode, so the
    > # entire content of input file will be fetched in a single read;
    > # note that $/ = "" would select paragraph mode instead):
    >
    > undef $/;
    >
    > # assume that command line args are file names to be converted;
    > # for each input file, read it and write "file_name.raw"
    >
    > foreach $file ( @ARGV ) {
    > open( IN, $file ) or do { warn "can't open $file\n"; next; };
    > $data = <IN>;
    > close IN;
    > (defined $data) or do { warn "can't read data from $file\n"; next; };
    >
    > $data =~ s/<[^>]+>//g; # remove tags (strings bounded by "<...>")
    > $data =~ s/\n\s+/\n/g; # remove blank lines and leading whitespace (not essential)
    >
    > open( OUT, ">$file.raw" ) or do { warn "can't write $file.raw\n"; next; };
    > print OUT $data or warn "can't write data to $file.raw\n";
    > close OUT or die "error trying to close $file.raw\n";
    > }
    >
    > __END__
    >
    > However, it is not uncommon for sgml files to contain tags whose data
    > content is not human language; for example, you might find markup like
    > the following:
    >
    > <DOC>
    > <DOCNO> AP891231-0001 </DOCNO>
    > <FILEID>AP-NR-12-31-89 2359EDT</FILEID>
    > <FIRST>r a PM-MonkeyBusiness 12-31 0269</FIRST>
    > <SECOND>PM-Monkey Business,0276</SECOND>
    > <HEAD>Yacht That Took Gary Hart On Famous Cruise Suffered From Fame</HEAD>
    > <DATELINE>DENVER (AP) </DATELINE>
    > <TEXT>
    > Monkey Business, the yacht that helped sink Gary
    > Hart's presidential aspirations in 1988, is for sale, and its
    > ...
    >
    > (This example is drawn from an sgml file in the TIPSTER corpus.) The
    > point is that you might want to filter out more than just the sgml tags,
    > if your down-stream process is going to treat everything that remains as
    > language data.
    >
    > If the sgml markup makes it easy to identify what portion(s) you want to
    > keep, then a couple of additions to the Perl script above would suffice --
    > e.g. for the TIPSTER case, you could add these two lines just before the
    > line that removes all the tags:
    >
    > $data =~ s/^.*<TEXT>//s; # remove everything up to/including <TEXT>
    > $data =~ s%</TEXT>.*%%s; # remove </TEXT> and everything after it
    >
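
    One caveat on the two lines above: "^.*<TEXT>" is greedy, so if a
    file holds several <DOC> entries (an assumption about your data),
    only the last <TEXT> region survives. A minimal sketch that keeps
    every region instead, using a hypothetical helper named text_regions:

```perl
use strict;
use warnings;

# Return the content of every <TEXT>...</TEXT> region in $data.
# The non-greedy .*? stops at the nearest </TEXT>; the /s flag lets
# "." match newlines, so a region may span multiple lines.
sub text_regions {
    my ($data) = @_;
    my @regions = $data =~ m{<TEXT>(.*?)</TEXT>}gs;
    return @regions;
}

# Usage: join the regions, then strip any tags left inside them.
my $sample = "<DOC><HEAD>h1</HEAD><TEXT>\nfirst\n</TEXT></DOC>\n"
           . "<DOC><HEAD>h2</HEAD><TEXT>\nsecond\n</TEXT></DOC>\n";
my $kept = join "", text_regions($sample);
$kept =~ s/<[^>]+>//g;
print $kept;
```

    In the script above this would take the place of the two
    substitutions, with the joined regions assigned back to $data.
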
    > Depending on where your sgml files came from -- and if you have the DTD
    > that they are supposed to be based on -- it may be a good idea to
    > validate the tagging first, using a standard sgml parser, like nsgmls;
    > it's hard to create any kind of useful sgml filter when there are
    > mistakes in the tagging.
    >
    > For that matter, it's probably easier/safer to write a filter that works
    > on the output of an sgml parser, rather than the sgml file.
    >
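
    A rough sketch of such a filter, assuming the parser output is the
    line-oriented ESIS format that nsgmls writes: "(NAME" opens an
    element, ")NAME" closes it, and "-..." carries character data, with
    "\n" for a record end, "\\" for a backslash, "\|" around SDATA and
    "\nnn" for an octal character code. Please check this against your
    parser's documentation. The names esis_text and unescape are
    invented here:

```perl
use strict;
use warnings;

# Turn one ESIS escape code (the part after the backslash) into text.
sub unescape {
    my ($code) = @_;
    return "\\" if $code eq "\\";   # escaped backslash
    return "\n" if $code eq 'n';    # record end
    return ''   if $code eq '|';    # SDATA bracket, dropped here
    return chr oct $code;           # three-digit octal character code
}

# Collect character data from ESIS lines, keeping only data that
# occurs inside an element named $keep (e.g. "TEXT").
sub esis_text {
    my ($keep, @lines) = @_;
    my $depth = 0;                  # > 0 while inside a $keep element
    my $out   = '';
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^\((.*)/) {           # start of an element
            $depth++ if $depth || $1 eq $keep;
        } elsif ($line =~ /^\)/) {          # end of an element
            $depth-- if $depth;
        } elsif ($depth && $line =~ /^-(.*)/) {   # character data
            my $data = $1;
            $data =~ s/\\(\\|n|\||[0-7]{3})/unescape($1)/ge;
            $out .= $data;
        }
        # every other line type (attributes etc.) is ignored
    }
    return $out;
}
```

    With the parser output piped in, the call would be something like
    print esis_text('TEXT', <STDIN>). Note that the parser may fold
    element names to upper case.
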
    > Best regards,
    >
    > Dave Graff
    >
    >
    >



    This archive was generated by hypermail 2b29 : Tue Apr 16 2002 - 22:40:22 MET DST