Re: [Corpora-List] jumk java

From: Andy Roberts (andyr@comp.leeds.ac.uk)
Date: Mon Jun 27 2005 - 00:16:01 MET DST


    Jerry,

    I've found JTidy (http://jtidy.sourceforge.net/) to be extremely simple
    to use. It's a Java port of HTML Tidy that cleans up HTML documents and
    exposes them as a DOM tree, from which the plain text content can be
    extracted.
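
    The leftover code is quite possibly the contents of <script> blocks,
    which survive when only the tags themselves are removed. Below is a
    rough sketch of skipping those blocks during extraction. It uses
    Python's standard html.parser module rather than JTidy, and the names
    TextExtractor and html_to_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    # Elements whose contents are code or styling, not document text.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (illustrative helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><script>function f() { return 1; }</script></head>"
        "<body><p>Hello <b>world</b></p></body></html>"
    )
    print(html_to_text(sample))
```

    This only deals with code that sits inside <script> or <style>
    elements. Code that appears in the page as ordinary text would still
    need separate handling.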

    Andy

    On Sun, 26 Jun 2005 j_kurjian@hotmail.com wrote:

    > Hi all,
    >
    > I've had this problem on several occasions - I convert HTML files to text and
    > strip out the HTML as best I can (this last time I used BeautifulSoup), only
    > to find large chunks of what appears to be Java code still perched inside
    > many of the texts.
    >
    > I've tried writing code to strip it out, but it is pretty resistant. At
    > present I'm looking for duplicate chunks of it and will try to use these as
    > templates to erase the stuff but it is not a happy process and is certain to
    > leave non-duplicate occurrences.
    >
    > Has anyone else had this problem? Has anyone satisfactorily managed to
    > overcome it?
    >
    > Jerry
    >


