Re: [Corpora-List] jumk java

From: Andy Roberts (andyr@comp.leeds.ac.uk)
Date: Mon Jun 27 2005 - 00:16:01 MET DST

Next message: Marios Stamoulos: "Re: [Corpora-List] jumk java"

Previous message: j_kurjian@hotmail.com: "[Corpora-List] jumk java"
In reply to: j_kurjian@hotmail.com: "[Corpora-List] jumk java"
Next in thread: Marios Stamoulos: "Re: [Corpora-List] jumk java"
Reply: Marios Stamoulos: "Re: [Corpora-List] jumk java"

Jerry,

I've found JTidy (http://jtidy.sourceforge.net/) to be extremely simple
to use. It's a Java port of HTML Tidy: it cleans up malformed HTML and
exposes the result as a DOM, from which you can extract the plain text
content.

Andy

On Sun, 26 Jun 2005 j_kurjian@hotmail.com wrote:

> Hi all,
>
> I've had this problem on several occasions - I convert HTML files to text and
> strip out the HTML as best I can (this last time I used BeautifulSoup) only
> to find large chunks of what appears to be Java code still perched inside
> many of the texts.
>
> I've tried writing code to strip it out, but it is pretty resistant. At
> present I'm looking for duplicate chunks of it and will try to use these as
> templates to erase the stuff but it is not a happy process and is certain to
> leave non-duplicate occurrences.
>
> Has anyone else had this problem? Has anyone satisfactorily managed to
> overcome it?
>
> Jerry
>
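
For illustration, here is a minimal sketch of one way to attack the problem described above. It assumes the leftover code is the content of <script> and <style> elements, which the original message does not confirm. It uses only Python's standard html.parser module; the names TextExtractor and html_to_text are made up for this example and do not come from any of the libraries mentioned in the thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    # Elements whose content is code or styling, not readable text (assumption).
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments with a space, then collapse runs of whitespace.
        # Note: this also puts a space where a tag splits a single word.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string, without script/style bodies."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text(open("page.html").read()) on each file would give the text with script and style bodies dropped before any further cleanup. Code that lives elsewhere (for example, pasted into the visible body of the page) would not be caught by this approach.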

This archive was generated by hypermail 2b29 : Mon Jun 27 2005 - 00:26:42 MET DST