Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Ken Litkowski (ken@clres.com)
Date: Tue Aug 09 2005 - 17:00:55 MET DST

Next message: Rob Malouf: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Previous message: Min-Yen Kan: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
In reply to: Helge Thomas Hellerud: "[Corpora-List] Extracting only editorial content from a HTML page"
Next in thread: Rob Malouf: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

My approach works from the HTML tags themselves, rather than the more
elaborate DOMs and REs suggested in other responses to this message. The
problem in basic HTML is that a <p> doesn't have to be closed. But you
can assume that an opening <p> closes any prior one. That gives you a
stretch of material that you can examine for other tags, which almost
always have a closing tag, and you can remove those tags and perhaps
what's in them. This gets rid of links, <img> elements, etc. That is the
starting point for your algorithm, and you refine it from there. (One
main problem is that a <p> may be embedded in a table, so you have to
decide what you want to do with tabular material.)
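
A minimal Python sketch of the starting point described above, not a
finished extractor. The function name extract_paragraphs, the default
list of elements dropped together with their contents, and the use of
regular expressions for the splitting are all assumptions made for
illustration. It treats every opening <p> as closing the previous
paragraph, cuts a stretch short at an explicit </p> if there is one,
and then strips the remaining tags. It makes no decision about tables,
and a final unclosed <p> runs to the end of the page.

```python
import re
from html import unescape

# Matches <p> and <p class="..."> but not <pre> or <param>.
P_OPEN = re.compile(r"<p\b[^>]*>", re.IGNORECASE)
P_CLOSE = re.compile(r"</p\s*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")

# Hypothetical default: elements removed together with what is in them.
DROP_WITH_CONTENT = ("a", "script", "style")


def extract_paragraphs(html, drop=DROP_WITH_CONTENT):
    """Return the text of each <p> stretch, with other tags removed."""
    # Everything before the first <p> is discarded; each later opening
    # <p> implicitly closes the paragraph before it.
    stretches = P_OPEN.split(html)[1:]
    paragraphs = []
    for stretch in stretches:
        # An explicit </p>, when present, ends the stretch early.
        stretch = P_CLOSE.split(stretch, maxsplit=1)[0]
        # Remove selected elements along with their contents.
        for name in drop:
            pattern = rf"<{name}\b[^>]*>.*?</{name}\s*>"
            stretch = re.sub(pattern, " ", stretch,
                             flags=re.IGNORECASE | re.DOTALL)
        # Remove every remaining tag but keep the text inside it.
        stretch = ANY_TAG.sub(" ", stretch)
        # Decode entities and collapse runs of whitespace.
        text = " ".join(unescape(stretch).split())
        if text:
            paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    page = ("<div><a href='/'>Home</a></div>"
            "<p>First <b>bold</b> para <a href='x'>a link</a> here."
            "<p>Second <img src='a.gif'> para.</p><div>footer</div>")
    for para in extract_paragraphs(page):
        print(para)
```

Passing drop=() keeps the text inside links instead of discarding it,
which is the "perhaps what's in them" choice mentioned above.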

Clearly, basic HTML is the most difficult; XHTML wouldn't have as many
problems. And then you start getting into all sorts of other web
pages. Unless you have the resources (both time and money) to devote to
a more elaborate solution, this simple approach is worth trying; it can
do surprisingly well.

Ken

Helge Thomas Hellerud wrote:

> Hello,
>
> I want to extract the article text of a HTML page (for instance the text of
> a news article). But a HTML page contains much "noise", like menus and ads.
> So I want to ask if anyone know a way to eliminate unwanted elements like
> menus and ads, and only extract the editorial article text?
>
> Of course, I can use Regex to look for patterns in the HTML code (by
> defining a starting point and an ending point), but the solution will be a
> hack that will not work if the pattern in the HTML page suddenly is changed.
> So do you know how to extract the content without using such a hack?
>
> Thanks in advance.
>
> Helge Thomas Hellerud
>
>
>
>

-- 
Ken Litkowski                     TEL.: 301-482-0237
CL Research                       EMAIL: ken@clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com


This archive was generated by hypermail 2b29 : Tue Aug 09 2005 - 17:18:31 MET DST