Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Hal Daume III (hdaume@ISI.EDU)
Date: Tue Aug 09 2005 - 13:02:36 MET DST


    I looked at this a while ago; the solution I came up with is not perfect,
    but it seems to do a pretty good job, at least on news-like web pages. The
    key idea is to look for the subsequence of the page with the highest ratio
    of (# of words) to (# of HTML tags). You can do this fairly easily if you
    first tokenize the web page into sequences of HTML tags and sequences of
    non-tag text, and then use simple dynamic programming to find the longest
    contiguous stretch that's "mostly" words. The only thing this misses is
    web pages that look like "first paragraph of article <some ads> rest of
    article"; in those cases, the first paragraph is often lost. You could
    probably fix this heuristically, but I didn't.
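
    To make that concrete, here is a minimal sketch in Python of the kind of
    thing I mean (not exactly what I did; the tokenizer, the tag penalty of
    -3, and the function names are just illustrative and would need tuning):

        import re

        TAG_PENALTY = -3.0   # assumed weight per tag token; tune per site
        WORD_SCORE = 1.0     # weight per word token

        def tokenize(html):
            """Split HTML into ('tag', text) and ('word', text) tokens."""
            tokens = []
            for chunk in re.split(r'(<[^>]*>)', html):
                if not chunk.strip():
                    continue
                if chunk.startswith('<'):
                    tokens.append(('tag', chunk))
                else:
                    tokens.extend(('word', w) for w in chunk.split())
            return tokens

        def extract_main_text(html):
            """Return the text of the highest-scoring contiguous token run."""
            tokens = tokenize(html)
            scores = [WORD_SCORE if kind == 'word' else TAG_PENALTY
                      for kind, _ in tokens]

            # Maximum-sum contiguous subsequence (Kadane's algorithm):
            # the best-scoring run is the one that is "mostly" words.
            best_sum, best_start, best_end = float('-inf'), 0, 0
            cur_sum, cur_start = 0.0, 0
            for i, s in enumerate(scores):
                if cur_sum <= 0:
                    cur_start, cur_sum = i, s
                else:
                    cur_sum += s
                if cur_sum > best_sum:
                    best_sum, best_start, best_end = cur_sum, cur_start, i

            return ' '.join(text for kind, text in
                            tokens[best_start:best_end + 1] if kind == 'word')

    In practice you would also want to strip out <script> and <style> blocks
    before tokenizing, since their contents look like "words" to a splitter
    this naive.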

    On Tue, 9 Aug 2005, Helge Thomas Hellerud wrote:

    > Hello,
    >
    > I want to extract the article text of an HTML page (for instance the text
    > of a news article). But an HTML page contains a lot of "noise", like menus
    > and ads. So I want to ask if anyone knows a way to eliminate unwanted
    > elements like menus and ads and extract only the editorial article text.
    >
    > Of course, I can use regexes to look for patterns in the HTML code (by
    > defining a starting point and an ending point), but that solution is a
    > hack that will break if the pattern in the HTML page suddenly changes.
    > So do you know how to extract the content without using such a hack?
    >
    > Thanks in advance.
    >
    > Helge Thomas Hellerud
    >
    >

    -- 
     Hal Daume III                                   | hdaume@isi.edu
     "Arrest this man, he talks in maths."           | www.isi.edu/~hdaume
    


