Re: [Corpora-List] jumk java

From: Marios Stamoulos (m.stamoulos@ntlworld.com)
Date: Mon Jun 27 2005 - 00:36:48 MET DST

    Hello,
    If you are familiar with Java, I could point you to this:
    http://javaalmanac.com/egs/javax.swing.text.html/GetText.html

    A simple HTML parser using standard Java classes ;) It saves you the time
    you would otherwise spend writing a parser of your own :D
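
    A minimal sketch of that approach, assuming the standard Swing HTML
    parser (javax.swing.text.html.parser.ParserDelegator) is available in
    your JDK. The class and method names (HtmlTextExtractor, extractText)
    are made up for illustration and are not taken from the linked page.
    The callback keeps only text nodes and skips anything reported inside
    <script> or <style>, which is where leftover script code usually
    comes from:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the visible text of an HTML string.
final class HtmlTextExtractor {

    static String extractText(String html) {
        final StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Greater than zero while we are inside <script> or <style>.
            private int skipDepth = 0;

            private boolean isSkipped(HTML.Tag t) {
                return t == HTML.Tag.SCRIPT || t == HTML.Tag.STYLE;
            }

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (isSkipped(t)) {
                    skipDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (isSkipped(t) && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Keep text only when it is not part of a script or stylesheet.
                if (skipDepth == 0) {
                    out.append(data).append(' ');
                }
            }
            // handleComment is deliberately not overridden, so comments
            // (and anything the parser reports as a comment) are dropped.
        };

        try (Reader reader = new StringReader(html)) {
            // Third argument: ignore any charset declared in the document.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }
}
```

    Calling HtmlTextExtractor.extractText(html) on a page's source should
    return its text with the script and style bodies left out. Spacing
    between text runs is approximate, so normalise whitespace afterwards
    if it matters for your corpus.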

    enjoy!
    Marios

    ----- Original Message -----
    From: "Andy Roberts" <andyr@comp.leeds.ac.uk>
    To: <j_kurjian@hotmail.com>
    Cc: <CORPORA@UIB.NO>
    Sent: Sunday, June 26, 2005 11:16 PM
    Subject: Re: [Corpora-List] jumk java

    > Jerry,
    >
    > I've found JTidy (http://jtidy.sourceforge.net/) to be extremely simple.
    > It's a Java package which provides methods for extracting the plain
    > content from HTML documents.
    >
    > Andy
    >
    > On Sun, 26 Jun 2005 j_kurjian@hotmail.com wrote:
    >
    > > Hi all,
    > >
    > > I've had this problem on several occasions - I convert html files to txt and
    > > strip out the html as best I can (this last time I used beautifulsoup) only
    > > to find large chunks of what appears to be java code still perched inside
    > > many of the texts.
    > >
    > > I've tried writing code to strip it out, but it is pretty resistant. At
    > > present I'm looking for duplicate chunks of it and will try to use these as
    > > templates to erase the stuff but it is not a happy process and is certain to
    > > leave non-duplicate occurrences.
    > >
    > > Has anyone else had this problem? Has anyone satisfactorily managed to
    > > overcome it?
    > >
    > > Jerry
    > >
    >


