Re:[Corpora-List] jumk java

From: santinim@inwind.it
Date: Mon Jun 27 2005 - 09:18:38 MET DST

    Hi,

    I use HTMASC, a very simple and practical utility
    that lets you specify several options (whether or not to keep Java code is one of them).
    It's shareware, and you can get a free trial.

    Best,

    Marina

    ---------- Initial Header -----------

    From : owner-corpora@lists.uib.no
    To : CORPORA@UIB.NO
    Cc :
    Date : Sun, 26 Jun 2005 20:41:02 +0000
    Subject : [Corpora-List] jumk java


    > Hi all,
    >
    > I've had this problem on several occasions - I convert HTML files to txt and
    > strip out the HTML as best I can (this last time I used BeautifulSoup) only
    > to find large chunks of what appears to be Java code still perched inside
    > many of the texts.
    >
    > I've tried writing code to strip it out, but it is pretty resistant. At
    > present I'm looking for duplicate chunks of it and will try to use these as
    > templates to erase the stuff but it is not a happy process and is certain to
    > leave non-duplicate occurrences.
    >
    > Has anyone else had this problem? Has anyone satisfactorily managed to
    > overcome it?
    >
    > Jerry
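
A minimal sketch of one way to approach the problem described above, under a loud assumption: that the leftover "java code" is JavaScript (or CSS) sitting inside <script> and <style> elements, whose contents survive when only the tags themselves are stripped. The thread does not confirm this. The sketch uses Python's standard html.parser module rather than BeautifulSoup or HTMASC, and the names TextExtractor and html_to_text are hypothetical, chosen for illustration only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping everything inside <script> and <style>.

    Assumption: the unwanted code lives inside these two elements.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []     # text pieces found outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the text of an HTML string without script or style contents."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><script>var x = 1; if (x < 2) { alert('hi'); }</script>"
        "<style>p { color: red; }</style></head>"
        "<body><p>Hello <b>corpus</b> world.</p></body></html>"
    )
    print(html_to_text(sample))
```

Because text pieces are joined with a space, a word broken up by inline markup would come out split in two; whether that matters depends on the pages being converted. Code that has been pasted into the page as ordinary text, outside any <script> element, would not be caught by this approach.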


