Re: Corpora: sgml detagger

From: Vlado Keselj (vkeselj@cs.uwaterloo.ca)
Date: Tue Apr 16 2002 - 22:40:28 MET DST

In reply to: David Graff: "Re: Corpora: sgml detagger"

On Tue, 16 Apr 2002, David Graff wrote:

>
> > Does anybody know an easy way to remove [sgml] tags and save the texts
> > as 'raw' .txt files?
> > Maybe a PERL script?
>
> Perl is very good for this. If you're confident that _all_ the text
> data in the sgml files (i.e. everything that is not an sgml tag) is
> usable for down-stream processing, then this perl script would work
> (even when sgml tags span multiple lines):

You also have to assume that:
- no quoted string inside a tag contains a > sign (e.g., <div id="><">),
- no comment contains a > sign, and
- each file is small enough to fit in memory.
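
To illustrate the first two assumptions, here is a minimal sketch of a substitution that tolerates > inside quoted attribute values and inside comments. The name strip_tags is hypothetical and is not part of the quoted script, and the pattern treats a comment simply as <!-- ... -->, which is a simplification of real SGML comment syntax. It does not address the third assumption: the data is still held in memory as one string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_tags is a hypothetical helper name, used only for this sketch.
sub strip_tags {
    my ($data) = @_;
    $data =~ s{
        <!--.*?-->                                # a whole comment, even if it contains >
      | < (?: "[^"]*" | '[^']*' | [^>"'] )+ >     # a tag; quoted strings may contain >
    }{}gsx;
    return $data;
}

# prints: onetwothree
print strip_tags(qq{<p>one<div id="><">two</div><!-- a > b -->three</p>\n});
```

An unbalanced quote inside a tag would still confuse this pattern, so it is no substitute for validating the markup first.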

Vlado

>
> #!/usr/bin/perl
>
> # undefine the input record separator ("slurp" mode), so that the
> # entire content of an input file is fetched in a single read
> # (setting $/ to "" would select paragraph mode instead):
>
> undef $/;
>
> # assume that command line args are file names to be converted;
> # for each input file, read it and write "file_name.raw"
>
> foreach $file ( @ARGV ) {
>     open( IN, "<", $file ) or do { warn "can't open $file: $!\n"; next; };
>     $data = <IN>;
>     close IN;
>     (defined $data) or do { warn "can't read data from $file\n"; next; };
>
>     $data =~ s/<[^>]+>//g;  # remove tags (strings bounded by "<...>")
>     $data =~ s/\n\s+/\n/g;  # remove blank lines and leading whitespace (not essential)
>
>     open( OUT, ">", "$file.raw" ) or do { warn "can't write $file.raw: $!\n"; next; };
>     print OUT $data or warn "can't write data to $file.raw\n";
>     close OUT or die "error trying to close $file.raw\n";
> }
>
> __END__
>
> However, it is not uncommon for sgml files to contain tags whose data
> content is not human language; for example, you might find markup like
> the following:
>
> <DOC>
> <DOCNO> AP891231-0001 </DOCNO>
> <FILEID>AP-NR-12-31-89 2359EDT</FILEID>
> <FIRST>r a PM-MonkeyBusiness 12-31 0269</FIRST>
> <SECOND>PM-Monkey Business,0276</SECOND>
> <HEAD>Yacht That Took Gary Hart On Famous Cruise Suffered From Fame</HEAD>
> <DATELINE>DENVER (AP) </DATELINE>
> <TEXT>
> Monkey Business, the yacht that helped sink Gary
> Hart's presidential aspirations in 1988, is for sale, and its
> ...
>
> (This example is drawn from an sgml file in the TIPSTER corpus.) The
> point is that you might want to filter out more than just the sgml tags,
> if your down-stream process is going to treat everything that remains as
> language data.
>
> If the sgml markup makes it easy to identify what portion(s) you want to
> keep, then a couple additions to the Perl script above would suffice --
> e.g. for the TIPSTER case, you could add these two lines just before the
> line that removes all the tags:
>
> $data =~ s/^.*<TEXT>//s; # remove everything up to/including <TEXT>
> $data =~ s%</TEXT>.*%%s; # remove </TEXT> and everything after it
>
> Depending on where your sgml files came from -- and if you have the DTD
> that they are supposed to be based on -- it may be a good idea to
> validate the tagging first, using a standard sgml parser, like nsgmls;
> it's hard to create any kind of useful sgml filter when there are
> mistakes in the tagging.
>
> For that matter, it's probably easier/safer to write a filter that works
> on the output of an sgml parser, rather than the sgml file.
>
> Best regards,
>
> Dave Graff
>
>
>
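
One more caveat about the two TIPSTER lines quoted above: the first pattern is greedy, so if a file holds more than one <TEXT> element, only the last one survives. A minimal sketch of collecting every <TEXT> section instead follows; the name text_sections and the sample string are hypothetical, and it assumes that <TEXT> elements are not nested and that the tags are spelled exactly this way.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# text_sections is a hypothetical helper name, used only for this sketch.
sub text_sections {
    my ($data) = @_;
    # non-greedy match, so each <TEXT>...</TEXT> pair is captured separately
    my @sections = $data =~ m{<TEXT>(.*?)</TEXT>}gs;
    return @sections;
}

# made-up sample with two documents in one file
my $sample = "<DOC><HEAD>h1</HEAD><TEXT>\nfirst\n</TEXT></DOC>\n"
           . "<DOC><HEAD>h2</HEAD><TEXT>\nsecond\n</TEXT></DOC>\n";

print join( "", text_sections($sample) );
```

Any tags remaining inside the collected sections can then be removed with the tag-stripping substitution as before.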
