Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Hal Daume III (hdaume@ISI.EDU)
Date: Tue Aug 09 2005 - 13:02:36 MET DST


    I looked at this a while ago; the solution I came up with is not perfect,
    but it seems to do a pretty good job, at least on news-like web pages. The
    key idea is to look for the subsequence of the page with the highest ratio
    of (# of words) to (# of HTML tags). You can do this fairly easily if you
    first tokenize the web page into sequences of HTML tags and sequences of
    non-tag text, and then use simple dynamic programming to find the longest
    contiguous stretch that's "mostly" words. The only thing this misses is
    web pages that look like "first paragraph of article <some ads> rest of
    article"; in those cases, the first paragraph is often lost. You could
    probably fix this heuristically, but I didn't.
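
    To make that concrete, here is a minimal sketch in Python of the kind of
    thing I mean (not exactly what I did; the tokenizer, the tag penalty of
    -3, and the function names are just illustrative and would need tuning):

        import re

        TAG_PENALTY = -3.0   # assumed weight per tag token; tune per site
        WORD_SCORE = 1.0     # weight per word token

        def tokenize(html):
            """Split HTML into ('tag', text) and ('word', text) tokens."""
            tokens = []
            for chunk in re.split(r'(<[^>]*>)', html):
                if not chunk.strip():
                    continue
                if chunk.startswith('<'):
                    tokens.append(('tag', chunk))
                else:
                    tokens.extend(('word', w) for w in chunk.split())
            return tokens

        def extract_main_text(html):
            """Return the text of the highest-scoring contiguous token run."""
            tokens = tokenize(html)
            scores = [WORD_SCORE if kind == 'word' else TAG_PENALTY
                      for kind, _ in tokens]

            # Maximum-sum contiguous subsequence (Kadane's algorithm):
            # the best-scoring run is the one that is "mostly" words.
            best_sum, best_start, best_end = float('-inf'), 0, 0
            cur_sum, cur_start = 0.0, 0
            for i, s in enumerate(scores):
                if cur_sum <= 0:
                    cur_start, cur_sum = i, s
                else:
                    cur_sum += s
                if cur_sum > best_sum:
                    best_sum, best_start, best_end = cur_sum, cur_start, i

            return ' '.join(text for kind, text in
                            tokens[best_start:best_end + 1] if kind == 'word')

    In practice you would also want to strip out <script> and <style> blocks
    before tokenizing, since their contents look like "words" to a splitter
    this naive.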

    On Tue, 9 Aug 2005, Helge Thomas Hellerud wrote:

    > Hello,
    >
    > I want to extract the article text of an HTML page (for instance the text
    > of a news article). But an HTML page contains a lot of "noise", like menus
    > and ads. So I want to ask if anyone knows a way to eliminate unwanted
    > elements like menus and ads and extract only the editorial article text.
    >
    > Of course, I can use regexes to look for patterns in the HTML code (by
    > defining a starting point and an ending point), but that solution is a
    > hack that will break if the pattern in the HTML page suddenly changes.
    > So do you know how to extract the content without using such a hack?
    >
    > Thanks in advance.
    >
    > Helge Thomas Hellerud
    >
    >

    -- 
     Hal Daume III                                   | hdaume@isi.edu
     "Arrest this man, he talks in maths."           | www.isi.edu/~hdaume
    


