[Corpora-List] Extracting only editorial content from a HTML page

From: Helge Thomas Hellerud (helgetho@stud.ntnu.no)
Date: Tue Aug 09 2005 - 11:43:12 MET DST

Hello,

I want to extract the article text from an HTML page (for instance, the text
of a news article). But an HTML page contains a lot of "noise", such as menus
and ads. Does anyone know a way to eliminate these unwanted elements and
extract only the editorial article text?

Of course, I can use regular expressions to look for patterns in the HTML
code (by defining a starting point and an ending point), but that solution is
a hack that will stop working as soon as the pattern in the HTML page
changes. So do you know how to extract the content without using such a hack?
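
To make the hack concrete, here is a minimal Python sketch of the
start-point/end-point approach. The two marker comments and the function
name extract_article are hypothetical, chosen only for illustration; a real
site would need its own markers, which is exactly why the approach is brittle.

```python
import re
from typing import Optional

# Hypothetical markers: assumes the page happens to wrap the article
# in these two HTML comments. Any real site will differ.
START = "<!-- article start -->"
END = "<!-- article end -->"


def extract_article(html: str) -> Optional[str]:
    """Return the text between START and END with tags stripped, or None."""
    pattern = re.escape(START) + r"(.*?)" + re.escape(END)
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        # The markers are missing, e.g. because the page layout changed.
        return None
    # Crude tag removal; good enough for a sketch, not a real HTML parser.
    text = re.sub(r"<[^>]+>", " ", match.group(1))
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(text.split())
```

If the site renames or drops the markers, the function returns None, which
is the failure mode I would like to avoid.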

Thanks in advance.

Helge Thomas Hellerud
