Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Min-Yen Kan (knmnyn@gmail.com)
Date: Tue Aug 09 2005 - 16:49:14 MET DST


    Hi Helge, all:

    In addition to all the tools that people have mentioned, I will add my
    own. We have developed a tool in Java, available through SourceForge,
    to help with this task and with others where some fragment of a web
    page needs to be identified and/or extracted. We have experimented
    with tagging and extracting the main text, navigation links, titles,
    headers, etc. from news stories on various sites on the web. Our
    software, PARCELS, also partially handles sites that use XHTML/CSS
    (e.g. <DIV> tags) to position text.

    You can find PARCELS on sourceforge at http://parcels.sourceforge.net

    It may be overkill for a simple problem, but if you need to extract
    the same type of information from multiple websites with different
    formats, this toolkit may be of help.

    Min-Yen Kan
    National University of Singapore

    On 8/9/05, Helge Thomas Hellerud <helgetho@stud.ntnu.no> wrote:
    > Hello,
    >
    > I want to extract the article text of an HTML page (for instance the text
    > of a news article). But an HTML page contains much "noise", such as menus
    > and ads. So I want to ask if anyone knows a way to eliminate unwanted
    > elements like menus and ads, and extract only the editorial article text?
    >
    > Of course, I can use a regex to look for patterns in the HTML code (by
    > defining a starting point and an ending point), but that solution is a
    > hack that will break if the pattern in the HTML page suddenly changes.
    > So do you know how to extract the content without using such a hack?
    >
    > Thanks in advance.
    >
    > Helge Thomas Hellerud
    >
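    For readers who want to see why the regex approach Helge describes is
    considered fragile, here is a minimal sketch in Python. The HTML snippet
    and the `class='article'` marker are hypothetical; real pages differ, and
    the extraction silently fails the moment the site's template changes:

    ```python
    import re

    # A toy page: menu and footer "noise" surrounding the editorial text.
    # The class names used as anchors here are invented for illustration.
    html = (
        "<html><body>"
        "<div class='menu'>Home | News | Ads</div>"
        "<div class='article'>The actual story text.</div>"
        "<div class='footer'>Copyright</div>"
        "</body></html>"
    )

    # The "hack": grab whatever sits between a fixed start and end marker.
    match = re.search(r"<div class='article'>(.*?)</div>", html, re.DOTALL)
    article = match.group(1) if match else None
    print(article)  # The actual story text.
    ```

    If the publisher renames the class or restructures the markup, `match`
    becomes `None` and the pattern must be rewritten by hand, which is
    exactly why template-independent tools are attractive here.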



    This archive was generated by hypermail 2b29 : Tue Aug 09 2005 - 16:53:00 MET DST