[Corpora-List] junk java

From: j_kurjian@hotmail.com
Date: Sun Jun 26 2005 - 22:41:02 MET DST

Next message: j_kurjian@hotmail.com: "[Corpora-List] jumk java"

Previous message: Eric Atwell: "Re: [Corpora-List] semantic primitives"
Next in thread: j_kurjian@hotmail.com: "[Corpora-List] jumk java"
Reply: Michael Betsch: "Re: [Corpora-List] jumk java"

Hi all,

I've had this problem on several occasions - I convert HTML files to plain
text and strip out the markup as best I can (this last time I used
BeautifulSoup), only to find large chunks of what appears to be Java code
still sitting inside many of the texts.

I've tried writing code to strip it out, but it is pretty resistant. At
present I'm looking for duplicate chunks of it and will try to use these as
templates to erase the stuff, but it is not a happy process and is certain
to miss any occurrences that are not duplicated.

Has anyone else had this problem? Has anyone satisfactorily managed to
overcome it?
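
For reference, here is a minimal sketch of one possible approach. It rests on
an assumption that the post does not confirm: that the leftover code is
JavaScript (or CSS) sitting inside <script> and <style> elements, whose
contents are ordinary text nodes and so survive naive tag stripping. The
sketch uses Python's standard-library html.parser rather than BeautifulSoup,
and the names TextExtractor and html_to_text are hypothetical, chosen only
for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring everything inside <script>/<style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # text fragments kept for the output
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


sample = (
    '<html><head><script type="text/javascript"><!--\n'
    "function f(a, b) { if (a < b) { return 1; } }\n"
    "//--></script><style>p { color: red; }</style></head>"
    "<body><p>Hello <b>corpus</b> world.</p></body></html>"
)
print(html_to_text(sample))
```

If the stray code turns out to live somewhere else (for example inside
attributes, or in pages so malformed that the closing tags are missing), this
sketch will not catch it, and a template-based clean-up may still be needed
for the remainder.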

Jerry



This archive was generated by hypermail 2b29 : Sun Jun 26 2005 - 22:56:08 MET DST