Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Ken Litkowski (ken@clres.com)
Date: Tue Aug 09 2005 - 17:00:55 MET DST


    My approach works directly on the HTML tags, rather than on the more
    elaborate DOM parsers and regular expressions suggested in other
    responses to this message. The
    problem in basic HTML is that <p>'s don't have to be closed. But, you
    can assume that if you've got an opening <p>, then any prior one is now
    closed. So, now you've got a stretch of material and you can examine it
    for any other tags, which almost always have a closing tag, and remove
    those tags, and perhaps what's in them. This will get rid of links,
    <img> elements, etc. This is the starting point for your algorithm, and
    you then refine it from there. (One main problem with a <p> is that it
    may be embedded in a table, so you have to decide what you want to do
    with tabular material.)
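
    As a rough illustration of that starting point, here is a minimal
    Python sketch. It is only one possible reading of the idea, not a
    finished tool: the function name extract_paragraphs and the list of
    elements dropped together with their contents (DROP_WITH_CONTENT) are
    assumptions made for the example, and tables are not handled at all.

```python
import re
from html import unescape

# Elements removed together with their contents (an assumption for this
# sketch; adjust the list to taste).
DROP_WITH_CONTENT = ("script", "style", "a")

P_OPEN = re.compile(r"<p\b[^>]*>", re.IGNORECASE)
P_CLOSE = re.compile(r"</p\s*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def extract_paragraphs(html, drop=DROP_WITH_CONTENT):
    """Return the text of each <p> stretch, with other markup removed."""
    # Splitting on every opening <p> implicitly closes the previous one.
    # Whatever precedes the first <p> is discarded.
    chunks = P_OPEN.split(html)[1:]
    paragraphs = []
    for chunk in chunks:
        # An explicit </p>, if present, ends the paragraph early.
        chunk = P_CLOSE.split(chunk, maxsplit=1)[0]
        # Remove selected paired elements along with what is inside them.
        for tag in drop:
            chunk = re.sub(
                rf"<{tag}\b[^>]*>.*?</{tag}\s*>",
                "",
                chunk,
                flags=re.IGNORECASE | re.DOTALL,
            )
        # Remove every remaining tag (<img>, <b>, ...) but keep its text.
        text = ANY_TAG.sub("", chunk)
        # Decode entities and collapse runs of whitespace.
        text = " ".join(unescape(text).split())
        if text:
            paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    page = (
        "<div><a href='/'>Home</a></div>"
        "<p>First <b>bold</b> para <a href='x'>link</a> here."
        "<p>Second &amp; last<img src='a.png'></p><div>footer</div>"
    )
    for para in extract_paragraphs(page):
        print(para)
```

    Note that a final <p> with no closing tag will run to the end of the
    page in this sketch, so footers can leak in; that is one of the first
    things to refine.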

    Clearly, basic HTML is the most difficult case; XHTML wouldn't have as
    many problems. And then you start getting into all sorts of other web
    pages. Still, if you don't have the resources (both time and money) to
    devote to a more elaborate solution, this simple approach can do
    surprisingly well.

            Ken

    Helge Thomas Hellerud wrote:

    > Hello,
    >
    > I want to extract the article text of an HTML page (for instance the text of
    > a news article). But an HTML page contains much "noise", like menus and ads.
    > So I want to ask if anyone knows a way to eliminate unwanted elements like
    > menus and ads, and only extract the editorial article text?
    >
    > Of course, I can use Regex to look for patterns in the HTML code (by
    > defining a starting point and an ending point), but the solution will be a
    > hack that will not work if the pattern in the HTML page suddenly is changed.
    > So do you know how to extract the content without using such a hack?
    >
    > Thanks in advance.
    >
    > Helge Thomas Hellerud

    -- 
    Ken Litkowski                     TEL.: 301-482-0237
    CL Research                       EMAIL: ken@clres.com
    9208 Gue Road
    Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com
    


