Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Alex Murzaku (lissus@gmail.com)
Date: Tue Aug 09 2005 - 15:10:30 MET DST


    Since I was scraping text from a limited number of Albanian-language
    websites, it was easy for me to search for text repeated on every page
    coming from the same site. The repeated text was removed. This meant
    that I had only one copy of pages containing everything. One of the
    sites I was spidering changed its format three times in two months,
    which generated quite a bit of noise. The only way to get rid of it
    was to go back to regex, and in the end I used regex only. As for the
    "sudden" changes, you could use the absence of text repetition as a
    signal that the format has changed and then modify the regex
    accordingly.
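
    The repeated-text idea above can be sketched roughly as follows. This
    is only an illustration in Python, not the code used at the time: the
    function names, the line-level comparison, the 0.8 threshold, and the
    sample pages are all assumptions made for the example.

```python
from collections import Counter


def repeated_lines(pages, min_share=0.8):
    """Return lines that occur in at least min_share of the pages."""
    counts = Counter()
    for text in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in text.splitlines() if line.strip()})
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_boilerplate(text, boilerplate):
    """Drop every line that belongs to the site's repeated text."""
    kept = [line.strip() for line in text.splitlines()
            if line.strip() and line.strip() not in boilerplate]
    return "\n".join(kept)


def repetition_share(text, boilerplate):
    """Share of a page's non-empty lines that are known repeated text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line in boilerplate for line in lines) / len(lines)


# Hypothetical pages from one site, already reduced to plain text.
pages = [
    "Home | News | Sport\nFirst article text.\nCopyright Example Site",
    "Home | News | Sport\nSecond article text.\nCopyright Example Site",
    "Home | News | Sport\nThird article text.\nCopyright Example Site",
]
boilerplate = repeated_lines(pages)
print(strip_boilerplate(pages[0], boilerplate))

# A page in a new layout shares no lines with the old repeated text;
# a share near zero is the signal that the site format has changed.
new_page = "NEWS - SPORT - HOME\nFourth article text.\n(c) Example Site"
print(repetition_share(new_page, boilerplate))
```

    Comparing whole lines is the crudest possible choice; on real pages
    the unit of comparison and the threshold would need tuning per site.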

    Good luck,

    Alex

    On 8/9/05, Helge Thomas Hellerud <helgetho@stud.ntnu.no> wrote:
    > Hello,
    >
    > I want to extract the article text of an HTML page (for instance the text of
    > a news article). But an HTML page contains much "noise", like menus and ads.
    > So I want to ask if anyone knows a way to eliminate unwanted elements like
    > menus and ads, and only extract the editorial article text?
    >
    > Of course, I can use Regex to look for patterns in the HTML code (by
    > defining a starting point and an ending point), but the solution will be a
    > hack that will not work if the pattern in the HTML page suddenly changes.
    > So do you know how to extract the content without using such a hack?
    >
    > Thanks in advance.
    >
    > Helge Thomas Hellerud
    >
    >
    >


