Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Tom Emerson (tree@basistech.com)
Date: Tue Aug 09 2005 - 22:23:57 MET DST


    Lou Burnard writes:
    > The other tool for this purpose which no-one has (so far) mentioned is
    > tidy -- http://tidy.sourceforge.net
    >
    > It will take almost any html and turn it into something usable very
    > fast; it's also very robust and there is a choice of APIs for
    > integrating it into your own production system.

    Just a warning to folks: while Tidy is good, it can get very confused
    by malformed HTML, and can crash horribly in ways that are non-trivial
    to debug. I've found that pages with broken embedded JavaScript cause
    lots of problems, as do pages in less common character encodings.
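
    One way to contain crashes like these is to run Tidy as a separate
    process instead of linking it into the harvester. This is only a
    sketch under assumptions, not something described in this thread:
    the helper names run_filter and tidy_or_none are hypothetical, a
    "tidy" executable is assumed to be on the PATH, and exit status 0
    or 1 (clean, or warnings only) is assumed to mean usable output.

```python
import subprocess


def run_filter(cmd, data, timeout=30.0, ok_codes=(0, 1)):
    """Feed `data` (bytes) to `cmd` on stdin and return its stdout.

    Returns None if the command is missing, exceeds `timeout`, is killed
    by a signal (negative return code on POSIX), or exits with a status
    outside `ok_codes`.
    """
    try:
        proc = subprocess.run(
            cmd,
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Hung child (already killed by subprocess.run) or missing binary.
        return None
    if proc.returncode not in ok_codes:
        return None
    return proc.stdout


def tidy_or_none(html_bytes, timeout=30.0):
    """Hypothetical wrapper: clean one page with the tidy CLI, or give up.

    Assumes `tidy -q -asxhtml` reads stdin and writes XHTML to stdout.
    """
    return run_filter(["tidy", "-q", "-asxhtml"], html_bytes, timeout=timeout)
```

    A page that makes Tidy crash or hang then costs one None result
    that can be logged and skipped, rather than the whole crawl.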

        -tree

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
     "You can't fake quality any more than you can fake a good meal." (W.S.B.)
    


