My approach is based on the HTML tags themselves, rather than the more
elaborate DOM- or regex-based solutions suggested in other responses to
this message. The problem in basic HTML is that <p> tags don't have to
be closed. But you can assume that if you've got an opening <p>, then
any prior one is now closed. So now you've got a stretch of material,
and you can examine it for any other tags, which almost always do have
closing tags, and remove those tags, and perhaps their contents as
well. This will get rid of links, <img> elements, etc. This is the
starting point for your algorithm, and you then refine it from there.
(One main problem with a <p> is that it may be embedded in a table, so
you have to decide what you want to do with tabular material.)
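Just to make the idea concrete, here is a rough sketch in Python (the
choice of language, and names like P_OPEN, DROP, and paragraphs(), are
purely illustrative; this is not a finished solution). Each opening <p>
starts a new paragraph and implicitly ends the previous one; elements
such as links and scripts are dropped along with their contents, and
any leftover tags are simply stripped:

    import re

    P_OPEN = re.compile(r'<p\b[^>]*>', re.I)
    P_CLOSE = re.compile(r'</p\s*>', re.I)
    # Elements removed together with their contents (links, scripts, styles).
    DROP = re.compile(r'<(a|script|style)\b[^>]*>.*?</\1\s*>', re.I | re.S)
    # Any tag left over (including void elements like <img>) is stripped.
    TAG = re.compile(r'</?[a-zA-Z][^>]*>')

    def paragraphs(html):
        """Yield the plain text of each paragraph. Every opening <p>
        implicitly closes the previous one, as basic HTML allows."""
        opens = list(P_OPEN.finditer(html))
        for i, m in enumerate(opens):
            end = opens[i + 1].start() if i + 1 < len(opens) else len(html)
            chunk = html[m.end():end]
            chunk = P_CLOSE.split(chunk)[0]   # honor an explicit </p>
            chunk = DROP.sub(' ', chunk)      # tags plus their contents
            chunk = TAG.sub(' ', chunk)       # strip whatever tags remain
            text = ' '.join(chunk.split())
            if text:
                yield text

    for text in paragraphs(open('article.html').read()):
        print(text)

From there you refine it, e.g. by deciding whether a <p> that turns up
inside a <table> should be kept or skipped.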
Clearly, basic HTML is the most difficult case; XHTML wouldn't have as
many problems. And then you start getting into all sorts of other web
pages. Unless you have the resources (both time and money) to devote to
a more elaborate solution, you can do surprisingly well with this
approach.
Ken
Helge Thomas Hellerud wrote:
> Hello,
>
> I want to extract the article text of an HTML page (for instance the text of
> a news article). But an HTML page contains much "noise", like menus and ads.
> So I want to ask if anyone knows a way to eliminate unwanted elements like
> menus and ads, and only extract the editorial article text?
>
> Of course, I can use Regex to look for patterns in the HTML code (by
> defining a starting point and an ending point), but the solution will be a
> hack that will not work if the pattern in the HTML page suddenly changes.
> So do you know how to extract the content without using such a hack?
>
> Thanks in advance.
>
> Helge Thomas Hellerud
--
Ken Litkowski                    TEL.: 301-482-0237
CL Research                      EMAIL: ken@clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA
Home Page: http://www.clres.com