Re: [Corpora-List] junk java

From: Alexander S. Yeh (asy@mitre.org)
Date: Mon Jun 27 2005 - 23:27:05 MET DST


    Michael Betsch wrote:
    >>I've had this problem on several occasions - I convert html files to txt and
    >>strip out the html as best I can (this last time I used beautifulsoup) only
    >>to find large chunks of what appears to be java code still perched inside
    >>many of the texts.
    >>
    >>I've tried writing code to strip it out, but it is pretty resistant. At
    >>present I'm looking for duplicate chunks of it and will try to use these as
    >>templates to erase the stuff but it is not a happy process and is certain to
    >>leave non-duplicate occurrences.
    >
    >
    > (You mean javascript scripts)

    Possibly related: when I tried to convert html to txt a few years ago, I
    kept finding large comment tags that ran across several lines (with new
    lines inside the comment tag). It turned out that these tags had
    javascript embedded within them. Embedding the javascript within a
    comment tag meant that a browser which could not deal with javascript
    would just ignore it.

    To strip out such tags, somebody wrote a tag stripper that could handle
    tags where the tag start ("<") and tag end (">") were not on the same line.
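
A minimal sketch of a tag stripper along these lines, assuming Python and its standard re module. This is not the tool mentioned above; the name strip_tags and the sample text are hypothetical. Comments are removed first because javascript inside a comment can itself contain ">", which would end a plain "<...>" match too early.

```python
import re

# Comments first: their bodies may contain ">" (e.g. in embedded javascript),
# so a plain "<...>" pattern would stop too early. DOTALL lets "." match newlines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Any other tag; [^>] also matches newlines, so a tag may span several lines.
TAG = re.compile(r"<[^>]*>")

def strip_tags(html):
    """Remove comments and tags, including ones that span several lines."""
    without_comments = COMMENT.sub("", html)
    return TAG.sub("", without_comments)

sample = """<p>Some text
<!-- hide from old browsers
if (a > b) { document.write("x"); }
// -->
more text</p>"""
print(strip_tags(sample))
```

A regular expression like this is only a rough approximation of html parsing; it will misfire on a ">" inside an attribute value, for example.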

    -Alex Yeh

    >
    > It is difficult to strip html tags first and then look for specific
    > content. Javascript scripts in an html file are tagged with
    >
    > <script type="text/javascript"> (javascript) </script>
    >
    > so they can be easily seen and removed before html tags are cut, but not
    > after that moment.
    >
    > For instance, you can use a program that understands html for the
    > conversion html => text. Lynx can "dump" the text:
    >
    > lynx -dump html-file(s) > textfile
    >
    > or any other sgml-to-sgml conversion will do, if it allows you to
    > specify a treatment for specific sgml elements.
    >
    > Michael Betsch
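
The point in the quoted message, that script elements should be dropped while the tags are still present, can be sketched with a parser that understands html. This is only an illustration, assuming Python's standard html.parser module; the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping everything inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []        # text fragments kept so far
        self.skip_depth = 0    # > 0 while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script element.
        if self.skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

page = ('<p>Before</p>'
        '<script type="text/javascript">var x = "<b>";</script>'
        '<p>After</p>')
print(html_to_text(page))
```

Because the script body is discarded while the parser still knows where the element begins and ends, nothing of it is left behind to be mistaken for text later.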


