Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Alex Murzaku (lissus@gmail.com)
Date: Tue Aug 09 2005 - 15:10:30 MET DST

Next message: Alexander Schutz: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Previous message: Lars Nygaard: "Re: [Corpora-List] Extracting only editorial content from a HTML page"
In reply to: Helge Thomas Hellerud: "[Corpora-List] Extracting only editorial content from a HTML page"
Next in thread: Alexander Schutz: "Re: [Corpora-List] Extracting only editorial content from a HTML page"

Since I was scraping text from a limited number of Albanian-language
websites, it was easy for me to search for text repeated on every page
coming from the same site. The repeated text (menus, headers and the
like) was removed. This meant that I had only one copy of a page that
still contained everything. One of the sites I was spidering changed
its format three times in two months, which generated quite a bit of
noise. The only way to get rid of it was to go back to regex, and in
the end I used regex only. As for the "sudden" changes, you could use
the absence of text repetition as a signal that the format has changed
and then modify the regex accordingly.
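
The repeated-text idea above could be sketched roughly as follows. This
is only an illustration under stated assumptions, not the code used for
the Albanian sites: it assumes the pages have already been reduced to
plain text, that a "line" is a reasonable unit of repetition, and that
the function names and the 0.8 / 0.5 thresholds are hypothetical values
you would tune per site.

```python
from collections import Counter


def strip_repeated_lines(pages, min_share=0.8):
    """Remove lines that occur on at least min_share of a site's pages.

    pages: list of plain-text page bodies from ONE site (assumption).
    Returns (cleaned_pages, boilerplate_lines).
    """
    counts = Counter()
    for text in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in text.splitlines() if ln.strip()})

    threshold = min_share * len(pages)
    boilerplate = {ln for ln, n in counts.items() if n >= threshold}

    cleaned = []
    for text in pages:
        kept = [ln.strip() for ln in text.splitlines()
                if ln.strip() and ln.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned, boilerplate


def layout_probably_changed(page, boilerplate, min_overlap=0.5):
    """Flag a new page that shares too little of the known repeated text.

    A low overlap suggests the site changed its format, so any
    site-specific regex should be reviewed.
    """
    if not boilerplate:
        return False
    lines = {ln.strip() for ln in page.splitlines()}
    overlap = len(lines & boilerplate) / len(boilerplate)
    return overlap < min_overlap
```

You would call strip_repeated_lines once per site on a batch of pages,
keep the returned set of repeated lines, and then run
layout_probably_changed on each newly spidered page to notice a format
change early.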

Good luck,

Alex

On 8/9/05, Helge Thomas Hellerud <helgetho@stud.ntnu.no> wrote:
> Hello,
>
> I want to extract the article text of a HTML page (for instance the text of
> a news article). But a HTML page contains much "noise", like menus and ads.
> So I want to ask if anyone know a way to eliminate unwanted elements like
> menus and ads, and only extract the editorial article text?
>
> Of course, I can use Regex to look for patterns in the HTML code (by
> defining a starting point and an ending point), but the solution will be a
> hack that will not work if the pattern in the HTML page suddenly is changed.
> So do you know how to extract the content without using such a hack?
>
> Thanks in advance.
>
> Helge Thomas Hellerud


This archive was generated by hypermail 2b29 : Tue Aug 09 2005 - 15:16:54 MET DST