Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Tom Emerson (tree@basistech.com)
Date: Tue Aug 09 2005 - 22:23:57 MET DST

Next message: Mike Maxwell: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Previous message: Lou Burnard: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
In reply to: Lou Burnard: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Next in thread: Mike Maxwell: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Lou Burnard writes:
> The other tool for this purpose which no-one has (so far) mentioned is
> tidy -- http://tidy.sourceforge.net
>
> It will take almost any html and turn it into something usable very
> fast; it's also very robust and there is a choice of APIs for
> integrating it into your own production system

Just a warning to folks: while Tidy is good, it can get very confused
by bogus HTML, and when it does it can crash horribly in ways that are
non-trivial to debug. I've found that pages with bogus embedded
JavaScript can cause lots of problems, as can pages in less common
character encodings.
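
One way to keep such a crash from taking a whole crawl down with it is
to run the cleaner in a child process rather than through an in-process
API. The following is only a rough sketch under assumptions, not
anything from a production system: run_isolated and TIDY_CMD are
hypothetical names, the 10-second timeout is arbitrary, and the command
line assumes the tidy binary is on the PATH and accepts the -q,
-asxhtml and -utf8 options.

```python
import subprocess
import sys

# Hypothetical command line; assumes a `tidy` binary is installed.
TIDY_CMD = ["tidy", "-q", "-asxhtml", "-utf8"]


def run_isolated(cmd, html_bytes, timeout=10.0):
    """Feed html_bytes to cmd on stdin in a child process.

    Returns the child's stdout as bytes, or None if the child could not
    be started, hung past the timeout, crashed, or reported errors.
    """
    try:
        proc = subprocess.run(
            cmd,
            input=html_bytes,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing or not executable, or the child hung and was killed.
        return None
    # On POSIX a negative code means the child died from a signal (for
    # example a segfault). Tidy uses 0 for success, 1 for warnings and
    # 2 for errors, so anything above 1 is treated as a failure here.
    if proc.returncode < 0 or proc.returncode > 1:
        return None
    return proc.stdout


if __name__ == "__main__":
    cleaned = run_isolated(TIDY_CMD, sys.stdin.buffer.read())
    if cleaned is None:
        sys.exit("tidy failed on this page; skipping it")
    sys.stdout.buffer.write(cleaned)
```

A page that comes back as None can be logged and set aside for later
inspection, while the rest of the batch carries on.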

-tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
 "You can't fake quality any more than you can fake a good meal." (W.S.B.)


This archive was generated by hypermail 2b29 : Tue Aug 09 2005 - 22:40:57 MET DST