Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Min-Yen Kan (knmnyn@gmail.com)
Date: Tue Aug 09 2005 - 16:49:14 MET DST


    Hi Helge, all:

    In addition to all the tools that people have mentioned, I will add my
    own. We have developed a tool in Java, available through SourceForge,
    to help with this task and with others where some fragment of a web
    page needs to be identified and/or extracted. We have experimented
    with tagging and extracting the main text, navigation links, titles,
    headers, etc. from news stories on various sites on the web. Our
    software, PARCELS, also partially handles sites that use XHTML/CSS
    (e.g. <DIV> tags) to position text.

    You can find PARCELS on sourceforge at http://parcels.sourceforge.net

    It may be overkill for a simple problem, but if you need to extract
    the same type of information from multiple websites with different
    formats, this toolkit may be of help.

    Min-Yen Kan
    National University of Singapore

    On 8/9/05, Helge Thomas Hellerud <helgetho@stud.ntnu.no> wrote:
    > Hello,
    >
    > I want to extract the article text of an HTML page (for instance the text
    > of a news article). But an HTML page contains much "noise", such as menus
    > and ads. So I want to ask if anyone knows a way to eliminate unwanted
    > elements like menus and ads, and extract only the editorial article text?
    >
    > Of course, I can use a regex to look for patterns in the HTML code (by
    > defining a starting point and an ending point), but that solution is a
    > hack that will break if the pattern in the HTML page suddenly changes.
    > So do you know how to extract the content without using such a hack?
    >
    > Thanks in advance.
    >
    > Helge Thomas Hellerud
    >
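    For readers who want to see why the regex approach Helge describes is
    considered fragile, here is a minimal sketch in Python. The HTML snippet
    and the `class='article'` marker are hypothetical; real pages differ, and
    the extraction silently fails the moment the site's template changes:

    ```python
    import re

    # A toy page: menu and footer "noise" surrounding the editorial text.
    # The class names used as anchors here are invented for illustration.
    html = (
        "<html><body>"
        "<div class='menu'>Home | News | Ads</div>"
        "<div class='article'>The actual story text.</div>"
        "<div class='footer'>Copyright</div>"
        "</body></html>"
    )

    # The "hack": grab whatever sits between a fixed start and end marker.
    match = re.search(r"<div class='article'>(.*?)</div>", html, re.DOTALL)
    article = match.group(1) if match else None
    print(article)  # The actual story text.
    ```

    If the publisher renames the class or restructures the markup, `match`
    becomes `None` and the pattern must be rewritten by hand, which is
    exactly why template-independent tools are attractive here.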



    This archive was generated by hypermail 2b29 : Tue Aug 09 2005 - 16:53:00 MET DST