Re: [Corpora-List] jumk java

From: Michael Betsch (michael.betsch@uni-tuebingen.de)
Date: Mon Jun 27 2005 - 07:11:06 MET DST


    > I've had this problem on several occasions - I convert html files to txt and
    > strip out the html as best I can (this last time I used beautifulsoup) only
    > to find large chunks of what appears to be java code still perched inside
    > many of the texts.
    >
    > I've tried writing code to strip it out, but it is pretty resistant. At
    > present I'm looking for duplicate chunks of it and will try to use these as
    > templates to erase the stuff but it is not a happy process and is certain to
    > leave non-duplicate occurrences.

    (You mean JavaScript, not Java.)

    It is difficult to strip the HTML tags first and then look for specific
    content afterwards. JavaScript in an HTML file is enclosed in

    <script type="text/javascript"> (javascript) </script>

    so scripts are easy to find and remove before the HTML tags are stripped,
    but not afterwards.
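
    As a rough illustration of removing scripts while the tags are still
    present, here is a minimal Python sketch using only the standard library's
    html.parser module. The names TextExtractor and html_to_text, and the
    choice to skip "style" as well as "script", are assumptions made for this
    example and not part of any existing tool.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring everything inside the skipped elements."""

    def __init__(self, skip=("script", "style")):
        super().__init__(convert_charrefs=True)
        self.skip = set(skip)   # element names whose content is dropped
        self.depth = 0          # > 0 while inside a skipped element
        self.chunks = []        # text pieces found outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.skip:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def html_to_text(html, skip=("script", "style")):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor(skip)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    page = (
        '<html><head><script type="text/javascript">'
        'var x = 1; if (x < 2) { document.write("hi"); }'
        "</script></head><body><p>Hello <b>world</b></p></body></html>"
    )
    print(html_to_text(page))
```

    The skip argument decides which elements are dropped together with their
    content, so the script is discarded as a unit instead of being hunted for
    after the tags have gone.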

    For instance, you can use a program that understands HTML to do the
    HTML-to-text conversion. Lynx can "dump" the text:

    lynx -dump html-file(s) > textfile

    or any other SGML-to-SGML conversion will do, provided it lets you specify
    how particular SGML elements are treated.

    Michael Betsch


