Hello,
I want to extract the article text from an HTML page (for instance, the text
of a news article). But an HTML page contains a lot of "noise", such as menus
and ads. Does anyone know a way to eliminate unwanted elements like menus and
ads and extract only the editorial article text?
Of course, I can use a regex to look for patterns in the HTML code (by
defining a starting point and an ending point), but that solution is a
hack that will stop working if the pattern in the HTML page suddenly changes.
So do you know how to extract the content without using such a hack?
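To illustrate the kind of hack I mean, here is a minimal sketch in Python. The marker comments and the function name are made up for the example; a real page would need its own start and end patterns, and the extraction breaks as soon as they change.

```python
import re

# Hypothetical markers (assumptions for this example only).
START = r"<!-- article start -->"
END = r"<!-- article end -->"

def extract_article(html):
    """Return the text between the START and END markers, or None if absent."""
    match = re.search(START + r"(.*?)" + END, html, re.DOTALL)
    if match is None:
        # The markers are gone (e.g. the layout changed), so nothing is found.
        return None
    # Replace remaining tags with spaces, then collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", match.group(1))
    return " ".join(text.split())

page = """<div id="menu">Home | Sport</div>
<!-- article start --><p>The <b>editorial</b> text.</p><!-- article end -->
<div class="ad">Buy now</div>"""

print(extract_article(page))
```

If the site removes or renames the markers, the function simply returns None, which is exactly the fragility I would like to avoid.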
Thanks in advance.
Helge Thomas Hellerud
This archive was generated by hypermail 2b29 : Tue Aug 09 2005 - 12:12:41 MET DST