Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Alexander Schutz (goalscoringsuperstarhero@gmail.com)
Date: Tue Aug 09 2005 - 15:41:34 MET DST

Next message: Alex Clark: "[Corpora-List] Post doc in unsupervised learning/grammatical inference"

Previous message: Alex Murzaku: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
In reply to: Helge Thomas Hellerud: "[Corpora-List] Extracting only editorial content from a HTML page"
Next in thread: Niels Ott: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Next in thread: Min-Yen Kan: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
Reply: Niels Ott: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Helge,

Aidan Finn and Nick Kushmerick did some interesting research on how to
identify and extract the relevant parts (i.e. those containing plain
text) of a given web page.
The boilerplate removal tool worked quite well for me when I tested
it, and I've heard good things about it from other people, too.
Check out this link and follow "BTE":
http://www.smi.ucd.ie/hyppia/
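
To give a rough idea of how this kind of extraction can work, here is a
minimal Python sketch of the general approach as I understand it. It is
not the BTE code itself. The assumption is that the article body is the
contiguous stretch of the page that is dense in words and sparse in
tags, so each word scores +1, each tag scores -1, and we keep the
best-scoring run. The function name extract_body_text and the regex
tokenizer are made up for illustration.

```python
import re

# Assumption: a crude tokenizer is good enough for a sketch. A token is
# either a whole tag ("<...>") or a run of non-space text outside tags.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")


def extract_body_text(html):
    """Return the words of the most text-dense contiguous region of html."""
    # Script and style contents are never editorial text, so drop them first.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    tokens = TOKEN_RE.findall(html)

    # Maximum-sum contiguous run (Kadane's algorithm): words count +1,
    # tags count -1, so tag-heavy menus and footers are pushed outside.
    best_sum, best_start, best_end = 0, 0, 0
    cur_sum, cur_start = 0, 0
    for idx, tok in enumerate(tokens):
        score = -1 if tok.startswith("<") else 1
        if cur_sum <= 0:
            # A non-positive prefix cannot help, so restart the run here.
            cur_sum, cur_start = score, idx
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, idx + 1

    # Keep only the words inside the winning run.
    words = [t for t in tokens[best_start:best_end] if not t.startswith("<")]
    return " ".join(words)


if __name__ == "__main__":
    page = (
        '<html><body><ul><li><a href="#">Home</a></li>'
        '<li><a href="#">News</a></li></ul>'
        "<p>The quick brown fox jumps over the lazy dog today.</p>"
        '<div><a href="#">Ads</a></div></body></html>'
    )
    print(extract_body_text(page))
```

Unlike a hand-written regex with fixed start and end markers, this does
not depend on the layout of one particular site, although pages with
several separate text blocks would need something more careful.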

Best,
Alex

On 8/9/05, Helge Thomas Hellerud <helgetho@stud.ntnu.no> wrote:
> Hello,
>
> I want to extract the article text of an HTML page (for instance the text of
> a news article). But an HTML page contains a lot of "noise", like menus and ads.
> So I want to ask if anyone knows a way to eliminate unwanted elements like
> menus and ads, and extract only the editorial article text?
>
> Of course, I can use Regex to look for patterns in the HTML code (by
> defining a starting point and an ending point), but the solution will be a
> hack that will not work if the pattern in the HTML page suddenly changes.
> So do you know how to extract the content without using such a hack?
>
> Thanks in advance.
>
> Helge Thomas Hellerud
>
>
>

-- 
Alexander Schutz
Student of Computational Linguistics
University of Saarland, Germany


This archive was generated by hypermail 2b29 : Tue Aug 09 2005 - 15:56:16 MET DST