Re: [Corpora-List] Extracting only editorial content from a HTML page

From: Alexander Schutz (goalscoringsuperstarhero@gmail.com)
Date: Tue Aug 09 2005 - 15:41:34 MET DST

    Helge,

    Aidan Finn and Nick Kushmerick did some interesting research on how to
    identify and extract the relevant parts of a given web page (i.e. the
    parts containing the plain running text).
    Their boilerplate removal tool, BTE, worked quite well for me when I
    tested it, and I've heard good things about it from other people, too.
    Check out this link and follow the BTE link from there:
    http://www.smi.ucd.ie/hyppia/
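
    To give a rough idea of how this kind of extraction can work without
    site-specific patterns: as far as I understand the published description,
    the approach treats the page as a sequence of tag tokens and word tokens
    and looks for the contiguous stretch with many words and few tags. The
    Python sketch below is only an illustration of that idea, not the actual
    BTE code; the names extract_body_text and _Tokenizer are made up for this
    example, and skipping script/style content is my own assumption.

```python
from html.parser import HTMLParser


class _Tokenizer(HTMLParser):
    """Flatten HTML into tokens: None for a tag, a string for a word.

    Hypothetical helper for this sketch, not part of any BTE release.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        self.tokens.append(None)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        self.tokens.append(None)

    def handle_data(self, data):
        # Assumption: script and style contents are never editorial text.
        if not self._skip_depth:
            self.tokens.extend(data.split())


def extract_body_text(html):
    """Return the words of the contiguous token span that is densest in text.

    Scoring each word as +1 and each tag as -1, the best span is the
    maximum-sum contiguous run, found in one pass (Kadane's algorithm).
    This is equivalent to maximising tags outside plus words inside the span.
    """
    parser = _Tokenizer()
    parser.feed(html)
    parser.close()
    tokens = parser.tokens

    best_sum, best_start, best_end = 0, 0, -1  # start with an empty span
    run_sum, run_start = 0, 0
    for k, tok in enumerate(tokens):
        if run_sum <= 0:
            # A non-positive prefix can never help; restart the run here.
            run_sum, run_start = 0, k
        run_sum += -1 if tok is None else 1
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, k

    span = tokens[best_start:best_end + 1]
    return " ".join(t for t in span if t is not None)
```

    Calling extract_body_text(page_source) on a news page should return
    mostly the article words, because menus and link lists are dense in tags.
    Since nothing depends on a particular start or end marker, a layout
    change is less likely to break it than a hand-written regex would be.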

    Best,
    Alex

    On 8/9/05, Helge Thomas Hellerud <helgetho@stud.ntnu.no> wrote:
    > Hello,
    >
    > I want to extract the article text of an HTML page (for instance the text of
    > a news article). But an HTML page contains a lot of "noise", like menus and ads.
    > So I want to ask if anyone knows a way to eliminate unwanted elements like
    > menus and ads, and extract only the editorial article text?
    >
    > Of course, I can use a regex to look for patterns in the HTML code (by
    > defining a starting point and an ending point), but the solution will be a
    > hack that will not work if the pattern in the HTML page suddenly changes.
    > So do you know how to extract the content without using such a hack?
    >
    > Thanks in advance.
    >
    > Helge Thomas Hellerud
    >
    >
    >

    -- 
    Alexander Schutz
    Student of Computational Linguistics
    University of Saarland, Germany
    


