[Corpora-List] Extracting only editorial content from a HTML page

From: Helge Thomas Hellerud (helgetho@stud.ntnu.no)
Date: Tue Aug 09 2005 - 11:43:12 MET DST


    Hello,

    I want to extract the article text from an HTML page (for instance, the
    text of a news article). But an HTML page contains a lot of "noise", such
    as menus and ads. So I want to ask whether anyone knows a way to eliminate
    these unwanted elements and extract only the editorial article text.

    Of course, I could use a regex to match patterns in the HTML code (by
    defining a starting point and an ending point), but that solution would be
    a hack that breaks as soon as the pattern in the HTML page changes.
    Do you know how to extract the content without resorting to such a hack?
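
To make the regex approach mentioned above concrete, here is a minimal Python sketch of it. The marker strings, the function name extract_between_markers, and the sample page are all hypothetical and chosen only for illustration; no real site is assumed to use them.

```python
import re

# Hypothetical start/end markers (an assumption for this sketch, not taken
# from any real site). A real page would need its own hand-picked markers.
START_MARKER = "<!-- article start -->"
END_MARKER = "<!-- article end -->"

# Hypothetical sample page: a menu, the article, and an ad.
SAMPLE_PAGE = """<html><body>
<div id="menu"><a href="/">Home</a> <a href="/sport">Sport</a></div>
<!-- article start -->
<h1>Headline</h1>
<p>First paragraph of the article.</p>
<!-- article end -->
<div id="ads">Buy now!</div>
</body></html>"""


def extract_between_markers(html, start=START_MARKER, end=END_MARKER):
    """Return the plain text found between the two markers, or None."""
    # Non-greedy match of everything between the starting and ending point.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    match = pattern.search(html)
    if match is None:
        # The page layout no longer contains the expected markers.
        return None
    # Crude tag stripping, then whitespace normalisation.
    text = re.sub(r"<[^>]+>", " ", match.group(1))
    return " ".join(text.split())


if __name__ == "__main__":
    print(extract_between_markers(SAMPLE_PAGE))
    # Simulate the site renaming its start marker: the extraction fails.
    changed = SAMPLE_PAGE.replace("<!-- article start -->", "<!-- story start -->")
    print(extract_between_markers(changed))
```

The second call shows the weakness in question: as soon as the page changes either marker, the function finds nothing, so the markers have to be maintained by hand for every site.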

    Thanks in advance.

    Helge Thomas Hellerud


