Re: [Corpora-List] jumk java

From: Marios Stamoulos (m.stamoulos@ntlworld.com)
Date: Mon Jun 27 2005 - 00:36:48 MET DST

Next message: Michael Betsch: "Re: [Corpora-List] jumk java"

Previous message: Andy Roberts: "Re: [Corpora-List] jumk java"
In reply to: Andy Roberts: "Re: [Corpora-List] jumk java"
Next in thread: santinim@inwind.it: "Re:[Corpora-List] jumk java"
Next in thread: Michael Betsch: "Re: [Corpora-List] jumk java"

Hello,
If you are familiar with Java, I can point you to this:
http://javaalmanac.com/egs/javax.swing.text.html/GetText.html

A simple HTML parser using the standard Java classes ;) It saves you a lot of
time writing a nice parser of your own :D
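
For anyone who cannot reach the link, here is a rough sketch of the idea (not
the code from that page): feed the HTML to the Swing parser and collect only
the text callbacks. The class name HtmlTextExtractor and the method
extractText are made up for illustration, and the skipping of script and
style content is my own assumption about what you want for this problem.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, for illustration only.
public class HtmlTextExtractor {

    /** Returns the text of an HTML string, leaving out script and style content. */
    public static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // True while we are directly inside a <script> or <style> element.
            private boolean skipping = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Script and style hold raw text, so any other tag means we have left them.
                skipping = (tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                skipping = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (!skipping) {
                    // Separate text chunks with a single space.
                    if (text.length() > 0) {
                        text.append(' ');
                    }
                    text.append(data);
                }
            }
        };

        try (StringReader reader = new StringReader(html)) {
            // The last argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><script>var x = 1;</script></head>"
                + "<body><p>Hello world</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

Only handleText adds to the output, so comments and tag attributes are dropped
as well. To process a file, pass a FileReader to parse() instead of the
StringReader.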

enjoy!
Marios

----- Original Message -----
From: "Andy Roberts" <andyr@comp.leeds.ac.uk>
To: <j_kurjian@hotmail.com>
Cc: <CORPORA@UIB.NO>
Sent: Sunday, June 26, 2005 11:16 PM
Subject: Re: [Corpora-List] jumk java

> Jerry,
>
> I've found JTidy (http://jtidy.sourceforge.net/) to be extremely simple.
> It's a Java package which provides methods for extracting the plain
> content from HTML documents.
>
> Andy
>
> On Sun, 26 Jun 2005 j_kurjian@hotmail.com wrote:
>
> > Hi all,
> >
> > I've had this problem on several occasions - I convert html files to txt and
> > strip out the html as best I can (this last time I used beautifulsoup) only
> > to find large chunks of what appears to be java code still perched inside
> > many of the texts.
> >
> > I've tried writing code to strip it out, but it is pretty resistant. At
> > present I'm looking for duplicate chunks of it and will try to use these as
> > templates to erase the stuff but it is not a happy process and is certain to
> > leave non-duplicate occurrences.
> >
> > Has anyone else had this problem? Has anyone satisfactorily managed to
> > overcome it?
> >
> > Jerry
>


This archive was generated by hypermail 2b29 : Mon Jun 27 2005 - 07:28:07 MET DST