Re: Corpora: sgml detagger

From: Michael Betsch (Michael.Betsch@uni-tuebingen.de)
Date: Wed Apr 17 2002 - 09:44:54 MET DST

Next message: Josephine Lo: "Corpora: Spontaneous speech corpora"

Previous message: Steven Bird: "Corpora: ACL Anthology and ACL Anthology Fund"
In reply to: Tine & Colleen: "Corpora: sgml detagger"
Next in thread: William H. Fletcher: "Re: Corpora: sgml detagger"

It will probably be easier to use an existing SGML parser than to
write a script that can really identify _all_ possible tags and
remove them.

The (freely available) parser onsgmls writes all data content in its
output on lines of its own, each prefixed by a "-". So you can simply
run onsgmls on your SGML files and retain only those lines that start
with "-" (using 'grep -e "^-"'); then you can easily remove the
leading "-" with perl or something similar. This assumes that all
data content is text you actually want, and not e.g. JavaScript,
which you will probably not want to include in your corpus.
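
The filtering step can also be done in one small script instead of grep plus perl. The following is only a rough sketch in Python: the function names are made up for illustration, and it assumes the parser output uses the nsgmls-style escapes "\n" (record end) and "\\" (literal backslash) inside data lines. Other escapes, such as octal character codes, are left untouched.

```python
import re

# Escapes assumed from the nsgmls-style output format:
# backslash + backslash -> one backslash, backslash + "n" -> record end.
_ESCAPES = {"\\\\": "\\", "\\n": "\n"}


def unescape(data):
    """Decode the two escapes handled here; leave everything else as-is."""
    return re.sub(r"\\\\|\\n", lambda m: _ESCAPES[m.group(0)], data)


def esis_data_lines(esis_text):
    """Return the data content of parser output, one string per "-" line."""
    chunks = []
    for line in esis_text.splitlines():
        # Data lines start with "-"; element and attribute lines
        # start with other characters such as "(", ")" or "A".
        if line.startswith("-"):
            chunks.append(unescape(line[1:]))
    return chunks


# Hand-written sample of what the parser output might look like.
sample = "(DOC\nAID CDATA x\n(P\n-Hello world\\nsecond line\n)P\n)DOC\nC\n"
text = "".join(esis_data_lines(sample))
```

To use it on real files, you would feed the output of onsgmls to esis_data_lines(), for instance by reading it from standard input. Dropping script content would need an extra check on the enclosing element, which this sketch does not do.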

_______________________________________________________________________
Dr. Michael Betsch                    privat:
SFB 441, Projekt B1
Nauklerstraße 35                      Rappenberghalde 27
72074 Tübingen                        72070 Tübingen
Tel. 07071/29-77161                   Tel. 07071/51917
email: Michael.Betsch@uni-tuebingen.de
_______________________________________________________________________


This archive was generated by hypermail 2b29 : Wed Apr 17 2002 - 09:50:06 MET DST